DETAILED ACTION
This action is in response to the claims filed June 26, 2018. Claims 1-20 are pending and have been examined.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 16-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter. The claims do not fall within at least one of the four categories of patent eligible subject matter because the claims are directed to a transitory computer medium. Although ¶0070 of the specification states “the computer-readable storage media is non-transitory,” this embodiment is disclosed only as an example. The specification also includes the non-limiting statement in ¶0086: “The storage device 616 may include a computing-readable (or machine-readable) storage media.” Examiner recommends 

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Cao et al., “MobiRNN: Efficient Recurrent Neural Network Execution on Mobile GPU” (hereinafter Cao), in view of Yang et al., “A Systematic Approach to Blocking Convolutional Neural Networks” (hereinafter Yang), and further in view of Yan et al., “SERF: Efficient Scheduling for Fast Deep Neural Network Serving via Judicious Parallelism” (hereinafter Yan).

Regarding Claim 1
Cao teaches, A method for determining a computation schedule for a recurrent neural network (RNN), the method comprising: receiving a matrix multiplication (MM) directed-acyclic graph (DAG) that models for computations of the RNN; (Section 4 ¶01 and Figure 2 “our work focuses on Recurrent Neural Network models (in the form of LSTMs). We study the effect of offloading the models to the GPU on mobile devices.” Examiner notes that Figure 2 illustrates the subdividing of work into units received by the GPU cores. A neural network describes the multiplication dependencies and order in a directed graph as shown in Figure 1. This corresponds to a computation schedule.)
	Cao does not appear to explicitly teach generating a plurality of valid phased computation schedules for the RNN from the MM-DAG, wherein each of the valid phase computation schedule includes an ordering of MM operations; partitioning, for each of the plurality of valid phased computation schedules, each of the MM operations to a plurality of processor cores based on L3 cache to L2 cache data movement; executing, for each of the plurality of valid phased 
	However, Yang, when addressing issues related to segmenting neural networks into blocks to minimize memory flow, teaches generating a plurality of valid phased computation schedules for the RNN from the MM-DAG, wherein each of the valid phase computation schedule includes an ordering of MM operations; (Section 3.5 ¶01 “With our analysis of the optimal memory hierarchy and our memory model, we can compute the memory energy for any given [a plurality of] blocking string [MM operation blocks]… so finding a true optimum requires exhaustive search….Our initial optimizer simply enumerated all consistent parameter values in all possible strings and chose those with minimum energy” Section 3.1 ¶01 “there are no dependencies in this computation, the loops in the algorithm can be done in any order” Examiner notes that performing the exhaustive search is equivalent to generating a plurality of computation schedules for the neural network. When combined with Cao, the neural network is an RNN structured as an MM-DAG. Because the elements with no dependencies can be done in any order, the variation in order corresponds to a phased schedule, in which operations are performed in different phases in time.) partitioning, for each of the plurality of valid phased computation schedules, each of the MM operations to a plurality of processor cores based on L3 cache to L2 cache data movement; (Section 3.2 “Thus we will consider a memory for kernel coefficients KB (kernel buffer), input image data IB, and output data OB. Since these memories exist at multiple levels in the memory hierarchy, we use KB0, IB0, OB0, to indicate the kernel, input, and output memory that is closest to the compute unit” Section 4.1 “we will use the Xeon E5645 (Westmere) CPU as our base platform for evaluating memory statistics on a general processor. The system has 32KB L1 data cache, 256K L2 cache, 12MB L3 cache” Examiner notes that the multi-core CPU has various cache levels corresponding to the distance from the compute unit. The input, the output, and the kernel (the MM operations) are partitioned across the different cache levels in order to minimize data movement.)
	It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate a method for partitioning computations of a neural network that minimizes memory energy, as taught by Yang, into the disclosed invention of Cao.
One of ordinary skill in the art would have been motivated to make this modification in order to implement a method that improves power-limited systems (Conclusion Yang).
Cao/Yang does not appear to explicitly teach executing, for each of the plurality of valid phased computation schedules, the RNN based on the partitioning; and storing a final computation schedule based on the executing, wherein the final computation schedule is used for subsequent executions of the RNN, and wherein the plurality of valid phased computation schedules comprises the final computation schedule.
Moreover, Yan, when addressing issues related to DNN workload characterization and profiling, teaches executing, for each of the plurality of valid phased computation schedules, the RNN based on the partitioning; (Section 4B ¶01 “An easy but inefficient way to achieve the scheduling objective [of the RNN taught by Cao/Yang] is via exhaustive profiling: execute all possible parallelism configurations [valid phased computation schedules] for all possible loads and find the best parallel configuration for each load.”) and storing a final computation schedule based on the executing, wherein the final computation schedule is used for subsequent executions of the RNN, and wherein the plurality of valid phased computation schedules comprises the final computation schedule. (Section A Framework Overview ¶01 “This table only needs to be built once [storing the final computation], provided that DNN [RNN] workload characteristics and system hardware remain the same. The scheduler uses the current system load as index to search the configuration reference table, find and adapt to the best parallel configurations” Examiner notes that the table is representative of a plurality of parallelism configurations, which corresponds to the computation schedule. Further, Examiner notes that the limitation “the plurality of valid phased computation schedules comprises the final computation schedule,” under the broadest reasonable interpretation, reads as the final computation schedule being identical to the whole set of phased computation schedules that were generated.)
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate a method for scheduling and executing a DNN that minimizes data latency through parallel configurations, as taught by Yan, into the disclosed invention of Cao/Yang.
One of ordinary skill in the art would have been motivated to make this modification in order to implement a method that “efficiently identifies best parallel configurations to minimize average request latency and it dynamically adapts to varying loads” (Conclusion Yan).
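
For illustration only, the sequence of limitations discussed for Claim 1 (generate every valid ordering from a dependency graph, execute each ordering, and keep the best one for later use) can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Cao, Yang, Yan, or the application: the function names, the dictionary representation of the DAG, and the caller-supplied `run` function are all hypothetical.

```python
from itertools import permutations


def valid_schedules(deps):
    """Yield every ordering of the operations that respects the dependencies.

    deps maps each operation name to the set of operations it depends on
    (a hypothetical stand-in for an MM-DAG).
    """
    ops = sorted(deps)
    for order in permutations(ops):
        position = {op: i for i, op in enumerate(order)}
        # An ordering is valid only if every dependency comes first.
        if all(position[d] < position[op] for op in ops for d in deps[op]):
            yield order


def pick_final_schedule(deps, run):
    """Execute every valid schedule once and return the lowest-cost one.

    run(order) is a caller-supplied function (an assumption of this sketch)
    that executes the network under the given ordering and returns a cost,
    for example the elapsed time in seconds.
    """
    results = {order: run(order) for order in valid_schedules(deps)}
    return min(results, key=results.get)
```

In practice `run` would time a real execution; the returned ordering would then be stored and reused for later requests. Brute-force enumeration is only workable for very small graphs.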

Regarding Claim 2
	Cao/Yang/Yan teach the method of Claim 1 
	Further Yang teaches, wherein generating a plurality of valid phased computation schedules for the RNN comprises generating schedules with time-independent phases before time-dependent phases. (Section 3.5 ¶05 “We speed up the optimization by iteratively optimizing the blocking from lower memory levels to higher ones, corresponding to optimizing from inner to outer loops” Examiner notes that, in this context, the loops refer to the six-level nested loop of Algorithm 1. Inner loops correspond to time-independent phases because they represent low data/time dependency. In contrast, outer loops correspond to time-dependent phases because they represent high data/time dependency. In the context of recurrent neural networks, data dependencies are formulated as time dependencies of data flow through a network or DAG.)

Regarding Claim 3
	Cao/Yang/Yan teach the method of Claim 1 
	Further Cao teaches, wherein the partitioning further comprises mapping a partition of an MM operation to a single processor core, (Section 3.1 ¶01 “Work units are executed in parallel one in each of the available cores in the GPU. If there are more work units than cores then the units wait until one of the cores becomes available” Examiner notes that there is a one-to-one correspondence: each work unit is mapped to an available core.) wherein a weight matrix is reused over a sequence of MM operations, wherein the partition of the MM operation is part of the sequence of MM operations (Section 3.2 ¶02 “we also optimize memory allocations for variables [weight matrix] using RenderScript primitives that allow for reuse of previously allotted memory, thereby reducing unnecessary and frequent on-demand memory allocation” Examiner notes that, in the context of neural networks, variable reuse includes weight matrix reuse. Using previously allotted memory corresponds to a sequence of computations or MM operations, where an MM operation partition is distributed to cores as stated previously.) and wherein a part of the weight matrix is stored in an L2 cache of the single processor core. (Section 3.2 ¶02 “reuse of previously allotted memory [memory allotted to the cores GPU or CPU] thereby reducing unnecessary and frequent on-demand memory allocation… For example, since the dimension of the cell state(c) and hidden state(h) matrix is known as the model is fixed, they can be preallocated” Examiner notes that, as work units are allocated to cores, some elements of the weight filters associated with the cell state and hidden state do not need to be reallocated. Because of this, the data can be stored in core memory for reuse in future computations. This would correspond to an L2 cache, or memory privately available to the computational unit, in this case the CPU and GPU cores.)

Regarding Claim 4
	Cao/Yang/Yan teach the method of Claim 3
	Further Cao teaches, determining two MM operations in a phase have a shared input matrix; and fusing the two MM operations into a single MM operation. (Section 3.3 ¶01 “We also use known optimizations like combining inputs and weights, fuse point-wise operations” Examiner notes that a work unit phase that combines input and weight operations corresponds to fusing a plurality of operations into a single operation through combination.)
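
For illustration only, one common way to fuse two MM operations that share an input matrix is to concatenate their weight matrices column-wise, so that the products x·W1 and x·W2 become the single product x·[W1 | W2]. The sketch below assumes this approach; it is not asserted to be the technique used in Cao or in the application, and all helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def fuse_shared_input(w1, w2):
    """Concatenate two weight matrices column-wise.

    Both matrices must have the same number of rows, because they are
    multiplied by the same (shared) input matrix.
    """
    return [r1 + r2 for r1, r2 in zip(w1, w2)]


def split_columns(m, width):
    """Split a fused result back into the two original outputs."""
    return [row[:width] for row in m], [row[width:] for row in m]
```

A single call `matmul(x, fuse_shared_input(w1, w2))` followed by `split_columns` yields the same two outputs as the separate products, while the shared input `x` is read once.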

Regarding Claim 5
	Cao/Yang/Yan teach the method of Claim 4
	Further Yan teaches, determining a plurality of parallelism degrees for multiple MM operations in a first phase for a first phased computation schedule, (Section C Queueing-based Prediction Model ¶02 “We define the problem as predicting DNN request latency for any given parallel configuration under any given load [multiple MM operations] We denote parallelism configuration with (maximum service parallelism Cservice, inter-node parallelism Cinter, and intra-node parallelism Cintra)” Examiner notes that, for a plurality of loads or operations, the prediction model determines a parallelism configuration or degree of parallelism based on the DNN requests or computation schedule.) wherein the first phase computation schedule is executed with each of the plurality of parallelism levels. (Section A Impact of Parallelism on service time ¶02 “Figure 3 shows the DNN request service speedup for different degrees of intra-node, inter-node, and service parallelism. For intra-node parallelism, the speedup is close to linear up to 3 cores, but slows down beyond 4 cores” Examiner notes that, for each degree of parallelism, the computation schedule denoted by the DNN request service is executed in order to assess the speedup. The SERF method’s prediction model performs these actions in order to define the optimal configurations for the DNN request service.)

Regarding Claim 6
	Cao/Yang/Yan teach the method of Claim 5
	Further Yan teaches, wherein a selected degree of parallelism is less than the number of the plurality of processor cores. (Section A Impact of Parallelism on service time ¶02 “Figure 3 shows the DNN request service speedup for different degrees of intra-node, inter-node, and service parallelism. For intra-node parallelism, the speedup is close to linear up to 3 cores, but slows down beyond 4 cores” Section D Load-dependent Behavior “The left plot in Figure 7 shows the service time of using the 6 configurations under different loads, the middle plot in Figure 7 shows their waiting time, and the right plot in Figure 7 shows their latency… the ability to estimate the latency impact according to the load and a scheduler that can change the parallel configurations based on load are two necessary and important features” Examiner notes that, as is evident in Figure 3, it is not always expedient to use all available cores. If it were, the prediction model that determines the configuration would not need to search for the best configuration, as it would simply choose to use all cores.)

Regarding Claim 7
	Cao/Yang/Yan teach the method of Claim 1
	Further Yang teaches, wherein the partitioning minimizes total data movement from an L3 cache to an L2 cache of a processor core for the computations of the RNN. (Section 3.1 ¶03 “Using this representation of nested loops, the blocking problem is easy to state: Find the loop order string, and the size of each loop, which minimizes the memory energy” Section 5.1 ¶1-2 “In Figure 3, our blocking achieves the fewest L2 cache accesses on each of the five layers… In Figure 4, our blocking significantly reduces the L3 cache accesses for all benchmarks as well” Examiner notes that the blocking or partitioning presented is implemented in order to minimize memory energy or memory flow in L3 and L2 caches. Having fewer memory accesses in both caches corresponds to minimizing data movement between the caches.)

Regarding Claim 8
	Cao/Yang/Yan teach the method of Claim 1
	Further Yan teaches, receiving a request to execute the RNN; and executing the RNN with the final computation schedule. (Introduction ¶08 “We stress that SERF is not limited to the Adam architecture, but also applicable to serving systems based on other DNN frameworks [including RNN taught by Cao]… We show that our prediction model achieves high accuracy: the average error is less than 4% comparing to measurement results. SERF always correctly identifies best parallel configurations under a variety of benchmarks and system loads” Examiner notes that the SERF framework receives a DNN request and executes it in order to compute the average error. The best parallel configuration corresponds to the final computation schedule.)
	
Regarding Claim 9
	Cao/Yang/Yan teach the method of Claim 1
	Further Yan teaches, determining the fastest executing valid phase computation schedule based on the executing (Section A overview ¶01 “We choose to optimize average latency [determining the fastest executing schedule] because DNN requests are homogenous and have similar service time”) wherein the fastest executing valid phase computation schedule is the final computation schedule. (Section D Scheduler “The scheduler takes the current system load as input, searches the configuration reference table, finds and adapts to the best parallelism configuration” Examiner notes that the best configuration corresponds to the final computation schedule that is fastest or optimizes for average latency of the executions of the DNN requests.)

Regarding Claim 10
Cao teaches, A system for determining a computation schedule for a recurrent neural network (RNN), the system comprising: an electronic processor configured: receive a matrix multiplication (MM) directed-acyclic graph (DAG) for the RNN that models for computations of the RNN (Section 4 ¶01 and Figure 2 “our work focuses on Recurrent Neural Network models (in the form of LSTMs). We study the effect of offloading the models to the GPU on mobile devices.” Examiner notes that Figure 2 illustrates the subdividing of work into units received by the GPU cores. A neural network describes the multiplication dependencies and order in a directed graph as shown in Figure 1. This corresponds to a computation schedule. Section 3.1 “The CUDA programming model used in a desktop GPU [an electronic processor] provides a way to specify how to break down a large unit of computation”)
	Cao does not appear to explicitly teach generate a plurality of valid phased computation schedules for the RNN from the MM-DAG, wherein each of the valid phase computation schedule includes an ordering of MM operations; partition, for each of the plurality of valid phased computation schedules, each of the MM operations to a plurality of processor cores based on L3 cache to L2 cache data movement; cause execution, for each of the plurality of valid phased computation schedules, of the RNN based on the partitioning; and store a final computation schedule based on the execution, wherein the final computation schedule is used for future executions of the RNN, and wherein the plurality of valid phased computation schedules comprises the final computation schedule.  
However, Yang, when addressing issues related to segmenting neural networks into blocks to minimize memory flow, teaches generate a plurality of valid phased computation schedules for the RNN from the MM-DAG, wherein each of the valid phase computation schedule includes an ordering of MM operations; (Section 3.5 ¶01 “With our analysis of the optimal memory hierarchy and our memory model, we can compute the memory energy for any given [a plurality of] blocking string [MM operation blocks]… so finding a true optimum requires exhaustive search….Our initial optimizer simply enumerated all consistent parameter values in all possible strings and chose those with minimum energy” Section 3.1 ¶01 “there are no dependencies in this computation, the loops in the algorithm can be done in any order” Examiner notes that performing the exhaustive search is equivalent to generating a plurality of computation schedules for the neural network. When combined with Cao, the neural network is an RNN structured as an MM-DAG. Because the elements with no dependencies can be done in any order, the variation in order corresponds to a phased schedule, in which operations are performed in different phases in time.) partition, for each of the plurality of valid phased computation schedules, each of the MM operations to a plurality of processor cores based on L3 cache to L2 cache data movement; (Section 3.2 “Thus we will consider a memory for kernel coefficients KB (kernel buffer), input image data IB, and output data OB. Since these memories exist at multiple levels in the memory hierarchy, we use KB0, IB0, OB0, to indicate the kernel, input, and output memory that is closest to the compute unit” Section 4.1 “we will use the Xeon E5645 (Westmere) CPU as our base platform for evaluating memory statistics on a general processor. The system has 32KB L1 data cache, 256K L2 cache, 12MB L3 cache” Examiner notes that the multi-core CPU has various cache levels corresponding to the distance from the compute unit. The input, the output, and the kernel (the MM operations) are partitioned across the different cache levels in order to minimize data movement.)
	It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate a method for partitioning computations of a neural network that minimizes memory energy, as taught by Yang, into the disclosed invention of Cao.
One of ordinary skill in the art would have been motivated to make this modification in order to implement a method that improves power-limited systems by “reducing energy per operation… large potential energy savings available from directly blocking the computation and provide a method for finding efficient schedules” (Conclusion Yang).
Cao/Yang does not appear to explicitly teach cause execution, for each of the plurality of valid phased computation schedules, of the RNN based on the partitioning; and store a final computation schedule based on the execution, wherein the final computation schedule is used for future executions of the RNN, and wherein the plurality of valid phased computation schedules comprises the final computation schedule.  
Moreover, Yan, when addressing issues related to DNN workload characterization and profiling, teaches cause execution, for each of the plurality of valid phased computation schedules, of the RNN based on the partitioning; (Section 4B ¶01 “An easy but inefficient way to achieve the scheduling objective [of the RNN taught by Cao/Yang] is via exhaustive profiling: execute all possible parallelism configurations [valid phased computation schedules] for all possible loads and find the best parallel configuration for each load.”) and store a final computation schedule based on the execution, wherein the final computation schedule is used for future executions of the RNN, and wherein the plurality of valid phased computation schedules comprises the final computation schedule. (Section A Framework Overview ¶01 “This table only needs to be built once [storing the final computation], provided that DNN [RNN] workload characteristics and system hardware remain the same. The scheduler uses the current system load as index to search the configuration reference table, find and adapt to the best parallel configurations” Examiner notes that the table is representative of a plurality of parallelism configurations, which corresponds to the computation schedule. Further, Examiner notes that the limitation “the plurality of valid phased computation schedules comprises the final computation schedule,” under the broadest reasonable interpretation, reads as the final computation schedule being identical to the whole set of phased computation schedules that were generated.)
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate a method for scheduling and executing a DNN that minimizes data latency through parallel configurations, as taught by Yan, into the disclosed invention of Cao/Yang.
One of ordinary skill in the art would have been motivated to make this modification in order to implement a method that “efficiently identifies best parallel configurations to minimize average request latency and it dynamically adapts to varying loads” (Conclusion Yan).

Regarding Claim 11
	Cao/Yang/Yan teach the system of Claim 10 
	Further Yang teaches, wherein to generate a plurality of valid phased computation schedules for the RNN the electronic processor is configured to generate schedules with time-independent phases before time-dependent phases. (Section 3.5 ¶05 “We speed up the optimization by iteratively optimizing the blocking from lower memory levels to higher ones, corresponding to optimizing from inner to outer loops” Examiner notes that, in this context, the loops refer to the six-level nested loop of Algorithm 1. Inner loops correspond to time-independent phases because they represent low data/time dependency. In contrast, outer loops correspond to time-dependent phases because they represent high data/time dependency. In the context of recurrent neural networks, data dependencies are formulated as time dependencies of data flow through a network or DAG.)

Regarding Claim 12
	Cao/Yang/Yan teach the system of Claim 10 
	Further Cao teaches, wherein to partition the electronic processor is further configured to map a partition of an MM operation to a single processor core (Section 3.1 ¶01 “Work units are executed in parallel one in each of the available cores in the GPU. If there are more work units than cores then the units wait until one of the cores becomes available” Examiner notes that there is a one-to-one correspondence: each work unit is mapped to an available core.) wherein a weight matrix is reused over a sequence of MM operations, wherein the partition of the MM operation is part of the sequence of MM operations (Section 3.2 ¶02 “we also optimize memory allocations for variables [weight matrix] using RenderScript primitives that allow for reuse of previously allotted memory, thereby reducing unnecessary and frequent on-demand memory allocation” Examiner notes that, in the context of neural networks, variable reuse includes weight matrix reuse. Using previously allotted memory corresponds to a sequence of computations or MM operations, where an MM operation partition is distributed to cores as stated previously.) and wherein a part of the weight matrix is stored in an L2 cache of the single processor core. (Section 3.2 ¶02 “reuse of previously allotted memory [memory allotted to the cores GPU or CPU] thereby reducing unnecessary and frequent on-demand memory allocation… For example, since the dimension of the cell state(c) and hidden state(h) matrix is known as the model is fixed, they can be preallocated” Examiner notes that, as work units are allocated to cores, some elements of the weight filters associated with the cell state and hidden state do not need to be reallocated. Because of this, the data can be stored in core memory for reuse in future computations. This would correspond to an L2 cache, or memory privately available to the computational unit, in this case the CPU and GPU cores.)

Regarding Claim 13
	Cao/Yang/Yan teach the system of Claim 12
	Further Cao teaches, wherein the electronic processor is further configured to: determine two MM operations in a phase have a shared input matrix; and fuse the two MM operations into a single MM operation. (Section 3.3 ¶01 “We also use known optimizations like combining inputs and weights, fuse point-wise operations” Examiner notes that a work unit phase that combines input and weight operations corresponds to fusing a plurality of operations into a single operation through combination.)

Regarding Claim 14
	Cao/Yang/Yan teach the system of Claim 13
	Further Yan teaches, wherein the electronic processor is further configured to determine a plurality of parallelism degrees for multiple MM operations in a first phase for a first phased computation schedule (Section C Queueing-based Prediction Model ¶02 “We define the problem as predicting DNN request latency for any given parallel configuration under any given load [multiple MM operations] We denote parallelism configuration with (maximum service parallelism Cservice, inter-node parallelism Cinter, and intra-node parallelism Cintra)” Examiner notes that, for a plurality of loads or operations, the prediction model determines a parallelism configuration or degree of parallelism based on the DNN requests or computation schedule.) wherein the first phase computation schedule is executed with each of the plurality of parallelism levels. (Section A Impact of Parallelism on service time ¶02 “Figure 3 shows the DNN request service speedup for different degrees of intra-node, inter-node, and service parallelism. For intra-node parallelism, the speedup is close to linear up to 3 cores, but slows down beyond 4 cores” Examiner notes that, for each degree of parallelism, the computation schedule denoted by the DNN request service is executed in order to assess the speedup. The SERF method’s prediction model performs these actions in order to define the optimal configurations for the DNN request service.)

Regarding Claim 15
	Cao/Yang/Yan teach the system of Claim 14
	Further Yan teaches, wherein a selected degree of parallelism is less than the number of the plurality of processor cores. (Section A Impact of Parallelism on service time ¶02 “Figure 3 shows the DNN request service speedup for different degrees of intra-node, inter-node, and service parallelism. For intra-node parallelism, the speedup is close to linear up to 3 cores, but slows down beyond 4 cores” Section D Load-dependent Behavior “The left plot in Figure 7 shows the service time of using the 6 configurations under different loads, the middle plot in Figure 7 shows their waiting time, and the right plot in Figure 7 shows their latency… the ability to estimate the latency impact according to the load and a scheduler that can change the parallel configurations based on load are two necessary and important features” Examiner notes that, as is evident in Figure 3, it is not always expedient to use all available cores. If it were, the prediction model that determines the configuration would not need to search for the best configuration, as it would simply choose to use all cores.)

Regarding Claim 16
Cao teaches, A computer-readable storage medium storing computer-executable instructions (Introduction “In response we develop MobiRNN, a mobile specific optimization for RNNs that focusses on offloading deep learning tasks to the mobile GPU. Our approach to offloading is to use a mobile-specific parallelization framework RenderScript” Examiner notes that, in order to execute the described method in a mobile environment using RenderScript, the corresponding instructions must reside on associated memory.) for determining a computation schedule for a recurrent neural network (RNN), the stored instructions comprising: instructions to receive a matrix multiplication (MM) directed-acyclic graph (DAG) that models for computations of the RNN; (Section 4 ¶01 and Figure 2 “our work focuses on Recurrent Neural Network models (in the form of LSTMs). We study the effect of offloading the models to the GPU on mobile devices.” Examiner notes that Figure 2 illustrates the subdividing of work into units received by the GPU cores. A neural network describes the multiplication dependencies and order in a directed graph as shown in Figure 1. This corresponds to a computation schedule.)
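	Cao does not appear to explicitly teach instructions to generate a plurality of valid phased computation schedules for the RNN from the MM-DAG, wherein each of the valid phase computation schedule includes an ordering of MM operations; instructions to partition, for each of the plurality of valid phased computation schedules, each of the MM operations to a plurality of processor cores based on L3 cache to L2 cache data movement; instructions to execute, for each of the plurality of valid phased computation schedules, the RNN based on the partitioning; and instructions to store a final computation schedule based on the executing, wherein the final computation schedule is used for future executions of the RNN, and wherein the plurality of valid phased computation schedules comprises the final computation schedule.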
	However, Yang, when addressing issues related to segmenting neural networks into blocks to minimize memory flow, teaches instructions to generate a plurality of valid phased computation schedules for the RNN from the MM-DAG, wherein each of the valid phase computation schedule includes an ordering of MM operations; (Section 3.5 ¶01 “With our analysis of the optimal memory hierarchy and our memory model, we can compute the memory energy for any given [a plurality of] blocking string [MM operation blocks]… so finding a true optimum requires exhaustive search….Our initial optimizer simply enumerated all consistent parameter values in all possible strings and chose those with minimum energy” Section 3.1 ¶01 “there are no dependencies in this computation, the loops in the algorithm can be done in any order” Examiner notes that performing the exhaustive search is equivalent to generating a plurality of computation schedules for the neural network. When combined with Cao, the neural network is an RNN structured as an MM-DAG. Because the elements with no dependencies can be done in any order, the variation in order corresponds to a phased schedule, where operations are performed in different phases in time.) instructions to partition, for each of the plurality of valid phased computation schedules, each of the MM operations to a plurality of processor cores based on L3 cache to L2 cache data movement; (Section 3.2 “Thus we will consider a memory for kernel coefficients KB (kernel buffer), input image data IB, and output data OB. Since these memories exist at multiple levels in the memory hierarchy, we use KB0, IB0, OB0, to indicate the kernel, input, and output memory that is closest to the compute unit” Section 4.1 “we will use the Xeon E5645 (Westmere) CPU as our base platform for evaluating memory statistics on a general processor. The system has 32KB L1 data cache, 256K L2 cache, 12MB L3 cache” Examiner notes that the multi-core CPU has various cache levels corresponding to the distance from the compute unit. The input and output, as well as the kernel (MM operations), are partitioned across the different levels of cache in order to minimize data movement.)
	It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate a method for partitioning computations of a neural network that minimizes memory energy as taught by Yang to the disclosed invention of Cao.
(Conclusion Yang)
Cao/Yang do not appear to explicitly teach instructions to execute, for each of the plurality of valid phased computation schedules, the RNN based on the partitioning; and instructions to store a final computation schedule based on the executing, wherein the final computation schedule is used for future executions of the RNN, and wherein the plurality of valid phased computation schedules comprises the final computation schedule.
Moreover, Yan, when addressing issues related to DNN workload characterization and profiling, teaches instructions to execute, for each of the plurality of valid phased computation schedules, the RNN based on the partitioning; (Section 4B ¶01 “An easy but inefficient way to achieve the scheduling objective [of the RNN taught by Cao/Yang] is via exhaustive profiling: execute all possible parallelism configurations [valid phased computation schedules] for all possible loads and find the best parallel configuration for each load.”) and instructions to store a final computation schedule based on the executing, wherein the final computation schedule is used for future executions of the RNN, and wherein the plurality of valid phased computation schedules comprises the final computation schedule. (Section A Framework Overview ¶01 “This table only needs to be built once [storing the final computation], provided that DNN [RNN] workload characteristics and system hardware remain the same. The scheduler uses the current system load as index to search the configuration reference table, find and adapt to the best parallel configurations” Examiner notes that the table is representative of a plurality of parallelism configurations, which corresponds to the computation schedule. Further, the examiner notes that the limitation “the plurality of valid phased computation schedules comprises the final computation schedule,” under broadest reasonable interpretation, reads as the final computation schedule being identical to the whole set of phased computation schedules that were generated.)
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate a method for scheduling and executing a DNN that minimizes data latency through parallel configurations as taught by Yan to the disclosed invention of Cao/Yang.
One of ordinary skill in the art would have been motivated to make this modification in order to implement a method that “efficiently identifies best parallel…” (Conclusion Yan)
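
For illustration only, the "exhaustive profiling" approach quoted from Yan above (execute every candidate configuration, then keep the best one for later use) can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of Yan, Cao, Yang, or the application: the names `profile_schedules`, `measure_wall_time`, `candidates`, and `repeats` are hypothetical, each candidate schedule is assumed to be a zero-argument callable, and "best" is assumed to mean lowest measured wall-clock time.

```python
import time
from typing import Callable, Dict, Hashable, Tuple


def measure_wall_time(task: Callable[[], None]) -> float:
    """Return the wall-clock seconds taken by one call of task."""
    start = time.perf_counter()
    task()
    return time.perf_counter() - start


def profile_schedules(
    candidates: Dict[Hashable, Callable[[], None]],
    measure: Callable[[Callable[[], None]], float] = measure_wall_time,
    repeats: int = 3,
) -> Tuple[Hashable, Dict[Hashable, float]]:
    """Execute every candidate schedule and return (best_name, timings).

    Hypothetical helper: `candidates` maps a schedule name to a callable
    that runs the network under that schedule.
    """
    if not candidates:
        raise ValueError("at least one candidate schedule is required")
    timings = {}
    for name, task in candidates.items():
        # Keep the fastest of several runs to reduce timing noise.
        timings[name] = min(measure(task) for _ in range(repeats))
    # The schedule with the lowest measured time is the one to store.
    best = min(timings, key=timings.get)
    return best, timings
```

Under these assumptions, the returned `best` name plays the role of the stored final computation schedule, and the `timings` table plays the role of a one-time profiling result that is consulted on later executions rather than rebuilt.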

Regarding Claim 17
	Cao/Yang/Yan teach the computer-readable storage medium of Claim 16.
	Further, Yang teaches wherein the instructions to generate a plurality of valid phased computation schedules for the RNN comprise instructions to generate schedules with time-independent phases before time-dependent phases. (Section 3.5 ¶05 “We speed up the optimization by iteratively optimizing the blocking from lower memory levels to higher ones, corresponding to optimizing from inner to outer loops.” Examiner notes that, in context, the loops refer to the six-level nested loop of Algorithm 1. Inner loops correspond to time-independent phases because they represent low data/time dependency. In contrast, outer loops correspond to time-dependent phases because they represent high data/time dependency. In the context of recurrent neural networks, data dependencies are formulated as time dependencies of data flow through a network or DAG.)

Regarding Claim 18
	Cao/Yang/Yan teach the computer-readable storage medium of Claim 16.
	Further, Cao teaches wherein the instructions to partition comprise instructions to map a partition of an MM operation to a single processor core, (Section 3.1 ¶01 “Work units are executed in parallel one in each of the available cores in the GPU. If there are more work units than cores then the units wait until one of the cores becomes available” Examiner notes that there is a one-to-one correspondence. Each work unit is mapped to an available core.) wherein a weight matrix is reused over a sequence of MM operations, wherein the partition of the MM operation is part of the sequence of MM operations (Section 3.2 ¶02 “we also optimize memory allocations for variables [weight matrix] using RenderScript primitives that allow for reuse of previously allotted memory, thereby reducing unnecessary and frequent on-demand memory allocation” Examiner notes that, in the context of neural networks, variable reuse includes weight matrix reuse. Using previously allotted memory corresponds to a sequence of computations or MM operations, wherein an MM operation partition is distributed to cores as stated previously.) and wherein a part of the weight matrix is stored in an L2 cache of the single processor core. (Section 3.2 ¶02 “reuse of previously allotted memory [memory allotted to the cores GPU or CPU] thereby reducing unnecessary and frequent on-demand memory allocation… For example, since the dimension of the cell state(c) and hidden state(h) matrix is known as the model is fixed, they can be preallocated” Examiner notes that, as work units are allocated to cores, some elements of the weight filters associated with the cell state and hidden states do not need to be reallocated. Because of this, the data can be stored on the core memory for reuse in future computations. This would correspond to L2 cache, or memory privately available to the computational unit, in this case the CPU and GPU cores.)

Regarding Claim 19
	Cao/Yang/Yan teach the computer-readable storage medium of Claim 18.
	Further, Cao teaches wherein the stored instructions further comprise: instructions to determine two MM operations in a phase have a shared input matrix; and instructions to fuse the two MM operations into a single MM operation. (Section 3.3 ¶01 “We also use known optimizations like combining inputs and weights, fuse point-wise operations” Examiner notes that a work unit phase that combines input and weight operations corresponds to fusing a plurality of operations into a single operation through combination.)

Regarding Claim 20
	Cao/Yang/Yan teach the computer-readable storage medium of Claim 19.
	Further, Yan teaches wherein the stored instructions further comprise instructions to determine a plurality of parallelism degrees for multiple MM operations in a first phase for a first phased computation schedule, (Section C Queueing-based Prediction Model ¶02 “We define the problem as predicting DNN request latency for any given parallel configuration under any given load [multiple MM operations]. We denote parallelism configuration with (maximum service parallelism Cservice, inter-node parallelism Cinter, and intra-node parallelism Cintra)” Examiner notes that, for a plurality of loads or operations, the prediction model determines a parallelism configuration or degree of parallelism based on the DNN requests or computation schedule.) wherein the first phase computation schedule is executed with each of the plurality of parallelism levels. (Section A Impact of Parallelism on service time ¶02 “Figure 3 shows the DNN request service speedup for different degrees of intra-node, inter-node, and service parallelism. For intra-node parallelism, the speedup is close to linear up to 3 cores, but slows down beyond 4 cores” Examiner notes that the computation schedule denoted by the DNN request service is executed at each degree of parallelism in order to assess the speedup. The SERF method’s prediction model performs these actions in order to define the optimal configurations for the DNN request service.)


Conclusion
Prior art
US document ID US 20210064997 A1: a memory management method that generates a GPU compute schedule to minimize the latency between the CPU and GPU.
Chen “Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks”: minimizes data movement between processing engines in convolutional neural network configurations using local data reuse of filter weights.
Gao et al. “TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory”: a neural network accelerator that moves computations to parallel local memory processors to decrease bandwidth pressure and increase energy efficiency through an exhaustive search scheduling scheme.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to JOHNATHAN R GERMICK whose telephone number is (571)272-8363. The examiner can normally be reached on Monday-Friday 7:30 am – 4:00 pm (EST).

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://portal.uspto.gov/external/portal. Should you have questions about access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

/J.R.G./Examiner, Art Unit 2122
/ERIC NILSSON/Primary Examiner, Art Unit 2122