DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Amendments
Claims 1-20 are pending and have been examined. Claims 1, 8, 14, 17, and 19-20 have been amended.
Note: The amended claims are not in compliance with 37 CFR 1.121 because claim 3 is missing a status label. Claim 3 is being examined as an original claim.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1-2, 7-8, 11-17, and 19 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Woo (U.S. 10,019,668).

Regarding CLAIM 1, Woo teaches: A computer-implemented method, the method comprising: 
obtaining, as input for inferencing of one or more deep neural networks, (C. 4, L. 6-10 teaches inferencing: “A neural network having multiple layers can be used to compute inferences. For example, given an input, the neural network can compute an inference for the input. The neural network computes this inference by processing the input through each of the layers of the neural network.” Additionally, C. 5, L. 10-17 teaches inferencing: “In some implementations, when performing inference computations, rather than using the off-chip memory, the external controller can use the on-chip memory of the hardware circuit to store inputs and parameters. In response to receiving controls signals from at least one controller of the system, the hardware circuit accesses the on-chip memory and uses the stored inputs and parameters to perform neural network computations.”)
(i) an inferencing model and (An inferencing model is taught by C. 1, L. 18-19: “convolutional neural network layers”. Obtaining an inferencing model for inferencing is interpreted as retrieving data values such as weights for the neural network, as taught by C. 6, L. 10-16:  “In some implementations, input values can be pre-loaded to activation memory 104 and parameter/weight values can be pre-loaded to parameter memory 106 using data values received by circuit 100 from an external or higher level control device associated with a neural network computing system.”) 
(ii) one or more resource constraints; (Obtaining resource constraints as inputs for inferencing is taught by circuit 100 determining total capacity of memory of a hardware circuit. C. 14, L. 53-59: “In some implementations, determining a partitioning of neural network layers into a sequence of superlayers includes: … ii) circuit 100 determining a particular aggregate input activation and parameter capacity of a memory of a hardware circuit;” C. 14, L. 65 discloses a total available on-chip memory: “a storage capacity… of on-chip memory may be 500 megabyte (MB).” The inferencing is taught by C. 14, L. 48-52: “Hence, neural network layers can be partitioned into a sequence of superlayers so as to not exceed a threshold storage capacity of on-chip memory when a hardware circuit of circuit 100 processes one or more batches of neural network inputs.”)
computing, based at least in part on the obtained input, a set of statistics pertaining to resource utilization for each of multiple layers in the one or more deep neural networks (A set of statistics is interpreted as at least one statistic. Woo teaches the statistic of a working set, defined as, “a size parameter that indicates an amount of memory needed to process the one or more inputs through each of the layers in the superlayer” (C. 2 L. 22-30) and where “Circuit 100 can then determine an amount of memory required to store respective sets of parameters for each layer of a neural network” (C. 14, L. 52 - C. 15, L. 5)); 
determining, for each respective one of the multiple layers of the one or more deep neural networks, a corresponding batch size, wherein the determining is based at least in part on (i) the obtained input and (ii) the computed set of statistics, and wherein the batch size determined for a first one of the multiple layers is different than the batch size determined for a second one of the multiple layers; and (C. 15, L. 13-17: “For respective layers A, B, C, circuit 100 can determine a particular size parameter for inputs of working sets to be processed by respective layers and a corresponding batch size for the working set.” Corresponding batch sizes indicates different batch sizes.)
using the determined batch sizes for inferencing the multiple layers of the one or more deep neural networks; (C. 4, L. 6-10 teaches inferencing, and C. 15, L. 13-17 teaches using the determined batch sizes.)
wherein the method is carried out by at least one computing device. (“a CPU or GPU” C. 4 L. 55)
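For illustration only, and not as a characterization of Woo's disclosure or of the claimed method, the relationship between a memory constraint, a per-layer memory statistic, and a per-layer batch size can be sketched as follows. The 500 MB capacity is the figure quoted from Woo above; the layer names, parameter sizes, per-sample sizes, and the function `per_layer_batch_sizes` are hypothetical values and names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    param_mb: float       # memory needed for the layer's parameters (hypothetical)
    per_sample_mb: float  # working memory needed per input sample (hypothetical)


def per_layer_batch_sizes(layers, capacity_mb):
    """Return, for each layer, the largest batch size that fits in the
    memory left over once every layer's parameters are stored."""
    params_total = sum(layer.param_mb for layer in layers)
    available_mb = capacity_mb - params_total
    if available_mb <= 0:
        raise ValueError("parameters alone exceed the memory capacity")
    return {
        layer.name: int(available_mb // layer.per_sample_mb) for layer in layers
    }


# Resource constraint: 500 MB of on-chip memory.
# Statistics: per-layer parameter size and per-sample working memory.
layers = [Layer("A", 100, 4), Layer("B", 100, 10), Layer("C", 100, 25)]
sizes = per_layer_batch_sizes(layers, capacity_mb=500)
print(sizes)  # {'A': 50, 'B': 20, 'C': 8}
```

In this sketch the parameters occupy 300 MB, leaving 200 MB for inputs, so layers with different per-sample footprints receive different batch sizes.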

Regarding CLAIM 2, Woo teaches: The computer-implemented method of claim 1, wherein the inferencing model comprises a feed forward model. (Woo teaches convolutional neural network layers at C. 1, L. 18-19, which are interpreted as a feed forward model.)

Regarding CLAIM 7, Woo teaches: The computer-implemented method of claim 1, wherein the one or more resource constraints comprises at least one of (i) total available memory, (ii) maximum latency for inferencing, and (iii) maximum energy for inferencing. (Woo at C. 14, L. 65 teaches (i) total available memory: “a storage capacity… of on-chip memory may be 500 megabyte (MB).” Examiner is only required to cite prior art teaching one of resource constraints (i), (ii), and (iii).)

	Regarding CLAIM 8, Woo teaches: The computer-implemented method of claim 1, wherein the set of statistics comprises at least one of (i) amount of working memory, (ii) input and activation size for each sample, (iii) time to process a layer for each of multiple permissible batch sizes that are based at least in part on the one or more resource constraints, and (iv) energy to process a layer for each of multiple permissible batch sizes that are based at least in part on the one or more resource constraints. (Woo teaches (i), the statistic of a working set, defined as, “a size parameter that indicates an amount of memory needed to process the one or more inputs through each of the layers in the superlayer” (C. 2 L. 22-30) and where “Circuit 100 can then determine an amount of memory required to store respective sets of parameters for each layer of a neural network” (C. 14, L. 52 - C. 15, L. 5). Examiner is only required to cite prior art teaching one of the set of statistics (i), (ii), (iii), and (iv).)

Regarding CLAIM 11, Woo teaches: The computer-implemented method of claim 1, wherein said determining decreases one or more energy values associated with the inferencing of the one or more deep neural networks. (Woo teaches “energy optimization by the hardware circuit” at C. 3 L. 12-

Regarding CLAIM 12, Woo teaches: The computer-implemented method of claim 1, wherein said determining decreases one or more latency values associated with the inferencing of the one or more deep neural networks. (Woo teaches that “external communications can… increase system latency” (C. 10 L. 55), and so the “use of this on-chip storage and other local resources can serve to minimize external communications by the hardware circuit during processing of inputs through layers of a neural network” (C. 10 L. 42-46), thereby resulting in decreased latency for circuit 100) 

Regarding CLAIM 13, Woo teaches: The computer-implemented method of claim 1, wherein said determining decreases one or more memory values associated with the inferencing of the one or more deep neural networks. (A memory value is interpreted as a working set size (C. 2 L. 22-30). Woo teaches at C. 12 L. 43-46: “For example, at least with regard to batch processing at layer B for batch element 1, alternating between different batch elements can reduce a maximum working set size of layer B to 10 units, instead of the maximum working set size of 16 units required when using the conventional scheduling policy described above.”)

	Regarding CLAIM 14, Woo teaches: A computer program product comprising a computer readable storage medium (“computer storage devices” C. 2 L. 62) having program instructions embodied therewith, the program instructions executable by a computing device to cause the computing device to: (instructions executed by processor is taught by C. 2 L. 67 – C. 3 L. 2)
obtain, as input for inferencing of one or more deep neural networks, (C. 4, L. 6-10 teaches inferencing: “A neural network having multiple layers can be used to compute inferences. For example, given an input, the neural network can compute an inference for the input. The neural network computes this inference by processing the input through each of the layers of the neural network.” Additionally, C. 5, L. 10-17 teaches inferencing: “In some implementations, when performing inference computations, rather than using the off-chip memory, the external controller can use the on-chip memory of the hardware circuit to store inputs and parameters. In response to receiving controls signals from at least one controller of the system, the hardware circuit accesses the on-chip memory and uses the stored inputs and parameters to perform neural network computations.”)
(i) an inferencing model and (An inferencing model is taught by C. 1, L. 18-19: “convolutional neural network layers”. Obtaining an inferencing model for inferencing is interpreted as retrieving data values for the neural network, as taught by C. 6, L. 10-16:  “In some implementations, input values can be pre-loaded to activation memory 104 and parameter/weight values can be pre-loaded to parameter memory 106 using data values received by circuit 100 from an external or higher level control device associated with a neural network computing system.”) 
(ii) one or more resource constraints; (Obtaining resource constraints as inputs for inferencing is taught by circuit 100 determining total capacity of memory of a hardware circuit — C. 14, L. 53-59: “In some implementations, determining a partitioning of neural network layers into a sequence of superlayers includes: … ii) circuit 100 determining a particular aggregate input activation and parameter capacity of a memory of a hardware circuit;” C. 14, L. 65 discloses a total available on-chip memory: “a storage capacity… of on-chip memory may be 500 megabyte (MB).” — The inferencing is taught by C. 14, L. 48-52: “Hence, neural network layers can be partitioned into a sequence of superlayers so as to not exceed a threshold storage capacity of on-chip memory when a hardware circuit of circuit 100 processes one or more batches of neural network inputs.”)
compute, based at least in part on the obtained input, a set of statistics pertaining to resource utilization for each of multiple layers in the one or more deep neural networks; (A set of statistics is interpreted as at least one statistic. Woo teaches the statistic of a working set, defined as, “a size parameter that indicates an amount of memory needed to process the one or more inputs through each of the layers in the superlayer” (C. 2 L. 22-30) and where “Circuit 100 can then determine an amount of memory required to store respective sets of parameters for each layer of a neural network” (C. 14, L. 52 - C. 15, L. 5))
determine, for each respective one of the multiple layers of the one or more deep neural networks, a corresponding batch size, wherein the determining is based at least in part on (i) the obtained input and (ii) the computed set of statistics, and wherein the batch size determined for a first one of the multiple layers is different than the batch size determined for a second one of the multiple layers; and (C. 15, L. 13-17: “For respective layers A, B, C, circuit 100 can determine a particular size parameter for inputs of working sets to be processed by respective layers and a corresponding batch size for the working set.” Corresponding batch sizes indicates different batch sizes.)
apply the determined batch sizes for inferencing the multiple layers of the one or more deep neural networks. (C. 4, L. 6-10 teaches inferencing, and C. 15, L. 13-17 teaches using the determined batch sizes.)

	Regarding CLAIM 15, Woo teaches: The computer program product of claim 14, wherein the inferencing model comprises a feed forward model. (Woo teaches convolutional neural network layers at C. 1, L. 18-19, which are interpreted as a feed forward model)

Regarding CLAIM 16, Woo teaches: The computer program product of claim 14, wherein the one or more resource constraints comprises at least one of (i) total available memory, (ii) maximum latency for inferencing, and (iii) maximum energy for inferencing. (Woo at C. 14, L. 65 teaches (i) total available memory: “a storage capacity… of on-chip memory may be 500 megabyte (MB).” Examiner is only required to cite prior art teaching one of resource constraints (i), (ii), and (iii).)

Regarding CLAIM 17, Woo teaches: The computer program product of claim 14, wherein the set of statistics comprises at least one of (i) amount of working memory, (ii) input and activation size for each sample, (iii) time to process a layer for each of multiple permissible batch sizes that are based at least in part on the one or more resource constraints, and (iv) energy to process a layer for each of multiple permissible batch sizes that are based at least in part on the one or more resource constraints. (Woo teaches (i), the statistic of a working set, defined as, “a size parameter that indicates an amount of memory needed to process the one or more inputs through each of the layers in the superlayer” (C. 2 L. 22-30) and where “Circuit 100 can then determine an amount of memory required to store respective sets of parameters for each layer of a neural network” (C. 14, L. 52 - C. 15, L. 5). Examiner is only required to cite prior art teaching one of the set of statistics (i), (ii), (iii), and (iv).)

	Regarding CLAIM 19, Woo teaches: A system comprising: a memory (Fig. 1, 102 and 104); and at least one processor (“a CPU or GPU” C. 4 L. 55) operably coupled to the memory and configured for: 
obtaining, as input for inferencing of one or more deep neural networks, (C. 4, L. 6-10 teaches inferencing: “A neural network having multiple layers can be used to compute inferences. For example, given an input, the neural network can compute an inference for the input. The neural network computes this inference by processing the input through each of the layers of the neural network.” Additionally, C. 5, L. 10-17 teaches inferencing: “In some implementations, when performing inference computations, rather than using the off-chip memory, the external controller can use the on-chip memory of the hardware circuit to store inputs and parameters. In response to receiving controls signals from at least one controller of the system, the hardware circuit accesses the on-chip memory and uses the stored inputs and parameters to perform neural network computations.”)
(i) an inferencing model and (An inferencing model is taught by C. 1, L. 18-19: “convolutional neural network layers”. Obtaining an inferencing model for inferencing is interpreted as retrieving data values for the neural network, as taught by C. 6, L. 10-16:  “In some implementations, input values can be pre-loaded to activation memory 104 and parameter/weight values can be pre-loaded to parameter memory 106 using data values received by circuit 100 from an external or higher level control device associated with a neural network computing system.”) 
(ii) one or more resource constraints; (Obtaining resource constraints as inputs for inferencing is taught by circuit 100 determining total capacity of memory of a hardware circuit — C. 14, L. 53-59: “In some implementations, determining a partitioning of neural network layers into a sequence of superlayers includes: … ii) circuit 100 determining a particular aggregate input activation and parameter capacity of a memory of a hardware circuit;” C. 14, L. 65 discloses a total available on-chip memory: “a storage capacity… of on-chip memory may be 500 megabyte (MB).” — The inferencing is taught by C. 14, L. 48-52: “Hence, neural network layers can be partitioned into a sequence of superlayers so as to not exceed a threshold storage capacity of on-chip memory when a hardware circuit of circuit 100 processes one or more batches of neural network inputs.”)
computing, based at least in part on the obtained input, a set of statistics pertaining to resource utilization for each of multiple layers in the one or more deep neural networks; (A set of statistics is interpreted as at least one statistic. Woo teaches the statistic of a working set, defined as, “a size parameter that indicates an amount of memory needed to process the one or more inputs through each of the layers in the superlayer” (C. 2 L. 22-30) and where “Circuit 100 can then determine an amount of memory required to store respective sets of parameters for each layer of a neural network” (C. 15 L. 2-5))
determining, for each respective one of the multiple layers of the one or more deep neural networks, a corresponding batch size, wherein the determining is based at least in part on (i) the obtained input and (ii) the computed set of statistics, and wherein the batch size determined for a first one of the multiple layers is different than the batch size determined for a second one of the multiple layers; and (C. 15, L. 13-17: “For respective layers A, B, C, circuit 100 can determine a particular size parameter for inputs of working sets to be processed by respective layers and a corresponding batch size for the working set.” Corresponding batch sizes indicates different batch sizes.)
using the determined batch sizes for inferencing the multiple layers of the one or more deep neural networks. (C. 4, L. 6-10 teaches inferencing, and C. 15, L. 13-17 teaches using the determined batch sizes.)

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Woo in view of Molchanov et al. (“Pruning Convolutional Neural Networks for Resource Efficient Inference”).

Regarding CLAIM 3, Woo teaches: The computer-implemented method of claim 1,
However, Woo does not explicitly teach: wherein the inferencing model comprises a compressed model generated through weight-based pruning.
Molchanov teaches: wherein the inferencing model comprises a compressed model generated through weight-based pruning. (“Pruning by magnitude of kernel weights is perhaps the simplest possible criterion.” p. 3, § 2.2 Criteria for Pruning, ¶ Minimum weight. By default, pruning weights compresses the model.)
Molchanov is in the field of optimizing neural networks for inferencing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have compressed the model in Woo’s system by pruning its weights using the method of Molchanov’s system. A motivation for this combination is given by Molchanov: “The motivation to apply this type of pruning is that a convolutional kernel with low L2 norm detects less important features than those with a high norm” (¶ Minimum weight).
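For illustration only, the minimum-weight criterion quoted above (ranking convolutional kernels by the L2 norm of their weights and removing those with the lowest norm) can be sketched as follows. This is a simplified sketch and not Molchanov's implementation; the function names, the flattened kernel representation, and the example weights are hypothetical.

```python
import math


def l2_norm(kernel):
    """L2 norm of a kernel given as a flat list of weights."""
    return math.sqrt(sum(w * w for w in kernel))


def prune_lowest_norm(kernels, n_prune):
    """Return the indices of the kernels kept after removing the
    n_prune kernels with the smallest L2 norm."""
    order = sorted(range(len(kernels)), key=lambda i: l2_norm(kernels[i]))
    pruned = set(order[:n_prune])
    return [i for i in range(len(kernels)) if i not in pruned]


# Four hypothetical kernels, each flattened to three weights.
kernels = [
    [0.9, -0.8, 0.7],
    [0.01, 0.02, -0.01],
    [0.5, 0.4, -0.6],
    [0.0, 0.05, 0.0],
]
kept = prune_lowest_norm(kernels, n_prune=2)
print(kept)  # [0, 2]
```

Removing the two low-norm kernels leaves fewer parameters to store, which is the sense in which pruning compresses the model.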


Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Woo in view of Zhu et al. (“Trained ternary quantization”).

Regarding CLAIM 4, Woo teaches: The computer-implemented method of claim 1, 
However, Woo does not explicitly teach: wherein the inferencing model comprises a compressed model generated through at least one of (i) quantization and (ii) weight sharing.
But Zhu teaches: wherein the inferencing model comprises a compressed model generated through at least one of (i) quantization and (ii) weight sharing. (Zhu teaches (i) quantization: “We highlight our trained quantization method that can learn both ternary values and ternary assignments. During inference, only ternary values (2-bit weights) and scaling factors are needed” (Abstract). By default, quantizing weights compresses the model).
Zhu is in the field of optimizing neural networks for inferencing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have quantized the weights in Woo’s system to ternary values as in Zhu’s system, with a motivation to shrink the models (“our models are nearly 16x smaller than full-precision models” Zhu, Abstract).
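For illustration only, the effect of ternary quantization at inference time (each weight stored as one of three codes, plus scaling factors) can be sketched as follows. In Zhu the scaling factors and assignments are learned during training; this sketch instead assumes a fixed threshold and fixed scaling factors, and the function name and all numeric values are hypothetical.

```python
def ternarize(weights, threshold, w_pos, w_neg):
    """Map each full-precision weight to a ternary code in {-1, 0, +1}
    and to the value that code represents at inference time."""
    codes = [1 if w > threshold else -1 if w < -threshold else 0 for w in weights]
    values = [w_pos if c == 1 else -w_neg if c == -1 else 0.0 for c in codes]
    return codes, values


# Hypothetical full-precision weights and (assumed, not learned) parameters.
codes, values = ternarize([0.8, -0.05, -0.6, 0.1], threshold=0.2, w_pos=0.7, w_neg=0.5)
print(codes)   # [1, 0, -1, 0]
print(values)  # [0.7, 0.0, -0.5, 0.0]
```

Each code fits in 2 bits, compared with 32 bits for a full-precision weight, which is consistent with the roughly 16x reduction Zhu reports.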

Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Woo in view of Han (“EIE: Efficient Inference Engine on Compressed Deep Neural Network”).

Regarding CLAIM 5, Woo teaches: The computer-implemented method of claim 1, 
However, Woo does not explicitly teach: wherein the inferencing model comprises a compressed model generated through relative indexing.
But Han teaches: wherein the inferencing model comprises a compressed model generated through relative indexing. (Han, Fig. 3 below shows memory layout for relative indexed CSC (compressed sparse column) format.)


[Figure: Han, Fig. 3, memory layout for the relative indexed CSC format (media_image1.png)]


Han is in the field of optimizing neural networks for inferencing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have compressed Woo’s neural network using Han’s relative row indexing. A motivation is to efficiently operate on compressed DNN models (Han p. 244, col. 1, first full paragraph) and to keep the weight matrix in sparse form instead of converting back to dense form (Han p. 244, end of col. 2).
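For illustration only, relative indexing of one sparse column can be sketched as follows: each non-zero weight is stored together with the count of zeros that precede it, and a filler zero is stored whenever that count would exceed the largest value the index field can hold. The function names are hypothetical, and the 4-bit index width (maximum run of 15) is an assumption of this sketch.

```python
def encode_column(col, max_run=15):
    """Encode a column as (v, z): v holds the stored values and z holds the
    number of zeros preceding each stored value. A filler zero is stored
    when a run of zeros would exceed max_run."""
    v, z = [], []
    run = 0
    for w in col:
        if w == 0:
            if run == max_run:
                v.append(0)       # filler entry so the index stays in range
                z.append(max_run)
                run = 0
            else:
                run += 1
        else:
            v.append(w)
            z.append(run)
            run = 0
    return v, z


def decode_column(v, z, length):
    """Rebuild the dense column from its relative-indexed form."""
    col = [0] * length
    pos = -1
    for val, zeros in zip(v, z):
        pos += zeros + 1
        col[pos] = val
    return col


column = [0, 0, 1, 2] + [0] * 18 + [3]
v, z = encode_column(column)
print(v, z)  # [1, 2, 0, 3] [2, 0, 15, 2]
```

Only four value/index pairs are stored for the 23-entry column, and `decode_column(v, z, len(column))` recovers the original column.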

Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over Woo in view of Reagen et al. (“Weightless: Lossy Weight Encoding For Deep Neural Network Compression”).

Regarding CLAIM 6, Woo teaches: The computer-implemented method of claim 1, 
However, Woo does not explicitly teach: wherein the inferencing model comprises a compressed model generated through encoding.
But Reagen teaches: wherein the inferencing model comprises a compressed model generated through encoding. (End of p. 2: “Weightless is a lossy encoding scheme based around Bloomier filters… We then show how to encode neural network weights using this data structure and propose a set of augmentations to make it an effective compression strategy for deep neural networks.” Reagen teaches this in § 3.1, ¶Decoding and ¶Encoding. By default, encoding weights compresses the model.)
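Reagen is in the field of optimizing neural networks for inferencing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have encoded the weights of the neural network in Woo’s system using the encoding scheme of Reagen’s system, with a motivation to compactly store the weights. (“We propose using the Bloomier filter to compactly store weights in a neural network”, p. 4 §3.2, ¶1)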

Claims 9, 10 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Woo in view of Gao et al. (“Low latency RNN inference with cellular batching”).

Regarding CLAIM 9, Woo teaches: The computer-implemented method of claim 1, 
However, Woo does not explicitly teach: wherein said determining comprises determining a sequence of variable batch sizes corresponding to the multiple layers of the one or more deep neural networks.
	But Gao teaches: wherein said determining comprises determining a sequence of variable batch sizes corresponding to the multiple layers of the one or more deep neural networks. (“We perform microbenchmarks using various input batch sizes” (p. 12, col. 2)) 
Gao is in the field of batching for improving neural network inferencing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated the cellular batching method on recurrent neural networks from Gao’s system into the batching in Woo’s system, with a motivation to achieve high throughput values and low latency simultaneously. (“We propose the technique of cellular batching, which improves both the latency and throughput of RNN inference.” (Gao, Abstract))


Regarding CLAIM 10, Woo teaches: The computer-implemented method of claim 1, 
However, Woo does not explicitly teach: wherein said determining increases one or more throughput values associated with the inferencing of the one or more deep neural networks.
	But Gao teaches: wherein said determining increases one or more throughput values associated with the inferencing of the one or more deep neural networks. (Throughput is interpreted as requests per second (req/s) as used by Gao. “The inference throughput of BatchMaker for TreeLSTM is 4× and 1.8× that of TensorFlow Fold and DyNet, respectively” (Gao, col. 1, end of section 1))
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated the cellular batching method on recurrent neural networks from Gao’s system into the batching in Woo’s system, with a motivation to achieve high throughput values and low latency simultaneously. (“We propose the technique of cellular batching, which improves both the latency and throughput of RNN inference.” (Gao, Abstract))

Regarding CLAIM 18, Woo teaches: The computer program product of claim 14, 
However, Woo does not explicitly teach: wherein said determining comprises determining a sequence of variable batch sizes corresponding to the multiple layers of the one or more deep neural networks.
But Gao teaches: wherein said determining comprises determining a sequence of variable batch sizes corresponding to the multiple layers of the one or more deep neural networks. (“We perform microbenchmarks using various input batch sizes” (p. 12, col. 2)) 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated the cellular batching method on recurrent neural networks from Gao’s system into the batching in Woo’s system, with a motivation to achieve high throughput values and low latency simultaneously. (“We propose the technique of cellular batching, which improves both the latency and throughput of RNN inference.” (Gao, Abstract))

Claim 20 is rejected under 35 U.S.C. 103 as being unpatentable over Woo in view of Canziani et al. (“An Analysis of Deep Neural Network Models for Practical Applications”), Molchanov et al. (“Pruning Convolutional Neural Networks for Resource Efficient Inference”), and Ambrose et al. (US 20170344882 A1).

Regarding CLAIM 20, Woo teaches: A computer-implemented method, the method comprising: 
obtaining, as input for inferencing of one or more deep neural networks, (C. 4, L. 6-10 teaches inferencing: “A neural network having multiple layers can be used to compute inferences. For example, given an input, the neural network can compute an inference for the input. The neural network computes this inference by processing the input through each of the layers of the neural network.” Additionally, C. 5, L. 10-17 teaches inferencing: “In some implementations, when performing inference computations, rather than using the off-chip memory, the external controller can use the on-chip memory of the hardware circuit to store inputs and parameters. In response to receiving controls signals from at least one controller of the system, the hardware circuit accesses the on-chip memory and uses the stored inputs and parameters to perform neural network computations.”)
 (i) an inferencing model, wherein the inferencing model comprises a feed forward model, and (An inferencing model is taught by C. 1, L. 18-19: “convolutional neural network layers” (interpreted as a feed forward model). Obtaining an inferencing model for inferencing is interpreted as retrieving data values for the neural network, as taught by C. 6, L. 10-16:  “In some implementations, input values can be pre-loaded to activation memory 104 and parameter/weight values can be pre-loaded to parameter memory 106 using data values received by circuit 100 from an external or higher level control device associated with a neural network computing system.”) 
(ii) constraints comprising (a) total available memory (Obtaining constraints as inputs for inferencing is taught by circuit 100 determining total capacity of memory of a hardware circuit — C. 14, L. 53-59: “In some implementations, determining a partitioning of neural network layers into a sequence of superlayers includes: … ii) circuit 100 determining a particular aggregate input activation and parameter capacity of a memory of a hardware circuit;” C. 14, L. 65 discloses a total available on-chip memory: “a storage capacity… of on-chip memory may be 500 megabyte (MB).” — The inferencing is taught by C. 14, L. 48-52: “Hence, neural network layers can be partitioned into a sequence of superlayers so as to not exceed a threshold storage capacity of on-chip memory when a hardware circuit of circuit 100 processes one or more batches of neural network inputs.”)
computing, based at least in part on the obtained input, a set of statistics pertaining to resource utilization for each of multiple layers in the one or more deep neural networks, wherein the set of statistics comprises 
(i) amount of working memory, (Woo teaches (i), the statistic of a working set, defined as, “a size parameter that indicates an amount of memory needed to process the one or more inputs through each of the layers in the superlayer” (C. 2 L. 22-30) and where “Circuit 100 can then determine an amount of memory required to store respective sets of parameters for each layer of a neural network” (C. 14, L. 52 - C. 15, L. 5)) 
(ii) input size (“Thus, in this example, circuit 100 determines that aggregate memory usage for respective sets of parameters for layers A, B, and C is 300 MB, leaving 200 MB of available on-chip memory for use in storing inputs. For respective layers A, B, C, circuit 100 can determine a particular size parameter for inputs of working sets to be processed by respective layers and a corresponding batch size for the working set.” (C. 15 L. 10-17))
determining, for each respective one of the multiple layers of the one or more deep neural networks, a corresponding batch size, wherein the determining is based at least in part on (i) the obtained input and (ii) the computed set of statistics, and wherein the batch size determined for a first one of the multiple layers is different than the batch size determined for a second one of the multiple layers; and (C. 15, L. 13-17: “For respective layers A, B, C, circuit 100 can determine a particular size parameter for inputs of working sets to be processed by respective layers and a corresponding batch size for the working set.” Corresponding batch sizes indicates different batch sizes.)
using the determined batch sizes for inferencing the multiple layers of the one or more deep neural networks; (C. 4, L. 6-10 teaches inferencing, and C. 15, L. 13-17 teaches using the determined batch sizes.)
wherein the method is carried out by at least one computing device. (“a CPU or GPU” C. 4 L. 55).
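For illustration only, the memory arithmetic in the passages quoted from Woo (500 MB of on-chip capacity, 300 MB of aggregate parameters for layers A, B, and C, and 200 MB left for inputs) can be sketched as below. The function name, the even split of the 300 MB across the three layers, and the per-input sizes are hypothetical values chosen for the example and do not appear in Woo; the sketch is not Woo's actual procedure.

```python
def max_batch_size(capacity_mb, param_mb_per_layer, input_mb_per_item):
    """Largest batch whose inputs fit in the memory left after parameters.

    Hypothetical helper: the name and signature are assumptions for
    illustration and are not taken from Woo.
    """
    remaining_mb = capacity_mb - sum(param_mb_per_layer)
    if remaining_mb <= 0 or input_mb_per_item <= 0:
        return 0
    return int(remaining_mb // input_mb_per_item)


# Figures quoted from Woo: 500 MB capacity, 300 MB aggregate parameters.
capacity_mb = 500
# Assumed even split of the 300 MB aggregate across layers A, B, C.
param_mb = {"A": 100, "B": 100, "C": 100}
# Assumed per-input working-set sizes in MB (not from Woo).
input_mb = {"A": 5, "B": 4, "C": 2}

batch_sizes = {
    layer: max_batch_size(capacity_mb, list(param_mb.values()), size)
    for layer, size in input_mb.items()
}
print(batch_sizes)
```

Under these assumed per-input sizes, each layer receives a different batch size from the same 200 MB of remaining memory.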
However, Woo does not explicitly teach the constraints comprising (b) maximum latency for inferencing, and (c) maximum energy for inferencing; set of statistics comprising (ii) activation size, (iii) time to process a layer for each of multiple batch sizes, and (iv) energy to process a layer for each of the multiple batch sizes;
But Canziani teaches: constraints comprising (b) maximum latency for inferencing, and (Canziani at p. 4 §3.5 states: “there is a linear relationship between operations count and inference time per image. Therefore, at design time, we can pose a constraint on the number of operation to keep processing speed in a usable range for real-time applications or resource-limited deployments.” In this context, Canziani’s constraint is a minimum number of operations. Canziani teaches that a minimum number of operations is proportional to a minimum inference time. Where processing speed is interpreted as being inversely proportional to inference time, a constraint on the minimum number of operations is directly proportional to a constraint on a maximum latency for inferencing.)
 (c) maximum energy for inferencing; (Canziani, p. 6, first paragraph: “an upper bound in accuracy even for an energetic constraint, which could possibly be an essential designing factor for a 
a set of statistics comprising (iv) energy to process a layer for each of the multiple batch sizes (Canziani Fig. 4, where net power consumption is an average energy, i.e., an energy statistic, consumed while processing layers. Fig. 4 shows net power consumption for each of multiple batch sizes. Broadly, an energy statistic can be the energy constraint as recited in limitation (c) (p. 6 first and last paragraphs).)
	Canziani is in the field of inferencing neural networks. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated the teachings of Canziani’s system into Woo’s system. The combination further constrains Woo’s neural network by a certain latency and energy as taught by Canziani. The combination would also consider the energy to process, in Woo, a layer for each of the multiple batch sizes. A motivation for this combination is to further optimize neural networks. (“The purpose of this paper is to stress the importance of these figures, which are essential hard constraints for the optimisation of these networks in practical deployments and applications.” (Canziani, p. 1))
	However, the combination of Woo and Canziani does not explicitly teach: set of statistics comprising (ii) activation size and (iii) time to process a layer for each of multiple batch sizes.
But Molchanov teaches: set of statistics comprising (ii) activation size (“If an activation value (an output feature map) is small then this feature detector is not important for the prediction task at hand. We may evaluate this by the mean activation… or by the standard deviation of the activation” (end of p. 3))
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated the teachings of Molchanov’s system into the combination of Woo and Canziani (Molchanov § 2).
	However, the combination of Woo, Canziani, and Molchanov does not explicitly teach: set of statistics comprising (iii) time to process a layer for each of multiple batch sizes.
	But Ambrose teaches: set of statistics comprising (iii) time to process a layer for each of multiple batch sizes ([0153] Extended Cost Estimations [0154] The following are specific example formulations to estimate the memory size and execution time for different scheduling schemes, depending upon the location of the data. [0155] The minimum on-chip shared memory size required depends on the scheduling scheme and whether input/output data is stored in on-chip memory or external memory… “layer execution time = input FM processing pipeline latency*inFM/numPU” [0183])
	Ambrose is in the field of scheduling schemes for executing convolutional neural networks. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated the teachings of Ambrose’s system into the combined system of Woo, Canziani, and Molchanov by accounting for the layer execution time for Woo’s neural networks using the methods taught by Ambrose, with a motivation to estimate the memory size and execution time for different scheduling schemes. (Ambrose [0153])
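For illustration only, the formula quoted from Ambrose [0183] (layer execution time = input FM processing pipeline latency*inFM/numPU) can be evaluated as sketched below. The function names, the example values, and in particular the linear scaling of execution time with batch size are assumptions made for this example; Ambrose's quoted formula does not itself include a batch-size term.

```python
def layer_execution_time(pipeline_latency, in_fm, num_pu):
    """Formula quoted from Ambrose [0183]:
    input FM processing pipeline latency * inFM / numPU.
    """
    if num_pu <= 0:
        raise ValueError("num_pu must be positive")
    return pipeline_latency * in_fm / num_pu


def time_per_batch_size(pipeline_latency, in_fm, num_pu, batch_sizes):
    """Tabulate a layer's execution time for several batch sizes.

    Assumption (not from Ambrose): time grows linearly with batch size.
    """
    per_input = layer_execution_time(pipeline_latency, in_fm, num_pu)
    return {b: b * per_input for b in batch_sizes}


# Hypothetical values in arbitrary time units.
print(layer_execution_time(2.0, 64, 4))
print(time_per_batch_size(2.0, 64, 4, [1, 2, 4]))
```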

Response to Arguments
	The following is a response to the claim amendments and remarks filed 07/26/2021 and the telephone interview with Applicant on 07/21/2021.

Claim Rejections under 35 U.S.C. § 112: The 35 USC 112 rejections of claims 8 and 17 are withdrawn due to the claim amendments and remarks.

Claim Rejections under 35 U.S.C. § 101: The 35 USC 101 rejection of claims 1-20 is withdrawn due to the claim amendments and remarks. Specifically, the limitations reciting “using the determined batch sizes for inferencing” in independent claims 1, 19, and 20 and the limitation reciting “apply the determined batch size for inferencing” in independent claim 14 integrate the judicial exception into a practical application in Step 2A Prong Two.

Claim Rejections under 35 U.S.C. § 103: The Applicant's arguments filed 07/26/2021 have been fully considered, but they are not persuasive. Under the broadest reasonable interpretation, the independent claim limitation “determining, for each respective one of the multiple layers of the one or more deep neural networks, a corresponding batch size… and wherein the batch size determined for a first one of the multiple layers is different than the batch size determined for a second one of the multiple layers” is taught by Woo (U.S. 10,019,668) at C. 15, L. 13-17, as indicated in the prior art claim rejections in this Office action.
The claim rejection is maintained.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Boesch et al. (U.S. 10,417,364) discloses different layers of a neural network using different batch sizes at C. 38, L. 12-23.
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Asher Jablon whose telephone number is (571)270-7648. The examiner can normally be reached Monday - Friday, 9:00 am - 6:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Abdullah Kawsar can be reached on (571)270-3169. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/ASHER JABLON/

/ABDULLAH AL KAWSAR/Supervisory Patent Examiner, Art Unit 2127