Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Amendments 
Claims 1, 4, 7, 14, 16, and 19-20 are amended. Claims 8 and 17 are canceled. Claims 21-22 are new. Claims 1-7, 9-16, and 18-22 are pending and have been considered.
Examiner’s note: Claim 19 recites the same features as the method of claim 1 but omits the limitation “for each of the multiple layers” recited in claim 1, line 6.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 1-2, 7, 9-16, 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Young (US 20160342890 A1, cited in IDS filed 03/20/2019) in view of Woo et al. (US 10019668 B1, cited in PTO-892 filed 04/27/2021) and Rouhani et al. (US 20200027016 A1).

	Regarding Claim 1, Young teaches: A computer-implemented method, the method comprising: 
obtaining, as input for inferencing of one or more deep neural networks, an inferencing model and 
computing … wherein, for each of the multiple layers, 
determining, for each respective one of the multiple layers of the one or more deep neural networks, a corresponding one of the multiple batch sizes, wherein the determining is based at least in part on the obtained input 
using the determined batch sizes for inferencing the multiple layers of the one or more deep neural networks; (Last sentence of ¶ [0064] and all of ¶ [0066])
wherein the method is carried out by at least one computing device. (¶ [0030] teaches a systolic array and ¶ [0059] teaches CPU and GPU)
Young teaches resource constraints of processing speed and memory access speed in ¶ [0047], and a clock rate of the memory storing the weight inputs, a number of arithmetic units inside the circuit, and a number of channels in memory in ¶ [0049], but Young does not explicitly teach that these resource constraints are used to compute input and output activation sizes for each of multiple permissible batch sizes. Therefore, Young does not explicitly teach: obtaining, as input for inferencing of one or more deep neural networks, an inferencing model and one or more resource constraints; 
computing, based at least in part on the obtained input, a set of statistics pertaining to resource utilization for each of multiple layers in the one or more deep neural networks, wherein, for each of the multiple layers, the set of statistics comprises: input and output activation sizes for each of multiple batch sizes; an amount of working memory; a time to process a layer for each of the multiple batch sizes; and energy to process each of the multiple batch sizes, wherein the multiple batch sizes are based at least in part on the one or more resource constraints, and wherein the time and the energy are determined by executing the one or more deep neural networks with each of the multiple batch sizes; 
wherein the determining a corresponding batch size is based at least in part on the computed set of statistics
	But Woo teaches: obtaining, as input for inferencing of one or more deep neural networks, an inferencing model and one or more resource constraints; (Woo teaches a neural network for inferencing at C. 7, L. 63-66. Obtaining resource constraints as inputs for inferencing is taught by a circuit 100 determining total capacity of memory of a hardware circuit — C. 14, L. 53-59: “In some implementations, determining a partitioning of neural network layers into a sequence of superlayers includes: … ii) circuit 100 determining a particular aggregate input activation and parameter capacity of a memory of a hardware circuit;” C. 14, L. 65 discloses a total available on-chip memory: “a storage capacity… of on-chip memory may be 500 megabyte (MB).”)
computing, based at least in part on the obtained input, a set of statistics pertaining to resource utilization for each of multiple layers in the one or more deep neural networks, wherein, for each of the multiple layers, the set of statistics comprises: input and output activation sizes for each of multiple batch sizes; an amount of working memory; … wherein the multiple batch sizes are based at least in part on the one or more resource constraints, and (These limitations are taught by the following citations. C. 8, L. 10-14 and 17-20, and C. 9, L. 31-37, teach that a working set includes input and output activation sizes based on an amount of memory, because the memory space reserved for the output of a layer corresponds to the memory space reserved for the input of the next layer. C. 14, L. 65 to C. 15, L. 13 teaches computing that 200 MB of the 500 MB of on-chip memory is available for storing inputs.)
wherein the determining a corresponding batch size is based at least in part on the computed set of statistics (C. 9, L. 12-22 teaches processing two batches 212, 214 simultaneously. C. 9, L. 51-C. 10, L. 14 teaches that processing both batches simultaneously exceeds available memory resources. C. 12, L. 20-31 teaches scheduling a single batch based on available memory resources. Evidence that a single batch is processed is found at C. 17, L. 20.)
However, neither Young nor Woo explicitly teaches: the set of statistics comprises: a time to process a layer for each of the multiple batch sizes; and energy to process each of the multiple batch sizes,
wherein the time and the energy are determined by executing the one or more deep neural networks with each of the multiple batch sizes; 
	But Rouhani teaches: the set of statistics comprises: amount of working memory, a time to process a layer for each of the multiple batch sizes; and energy to process each of the multiple batch sizes (All limitations are taught by [0043] and [0045]-[0049]. Specifically, the last 4 lines of [0045] and paragraph [0047] after equation (7) disclose a memory size constraint $M_u$; a time to process a batch of size $b_s$, which is the runtime constraint $T_u$; and an energy to process a batch of size $b_s$, which is the power constraint $P_u$. Regarding the feature of a layer, ¶ [0034], lines 1-6, teaches using a deep belief network, which includes layers.)
wherein the time and the energy are determined by executing the one or more deep neural networks with each of the multiple batch sizes; (¶ [0048], lines 9-14, “In some example embodiments…”)
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated Rouhani’s memory, runtime, and power constraints into Young/Woo’s system. A motivation for the combination is to accommodate the constraints of the hardware implementing machine learning models. (Rouhani, [0032])
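For readers less familiar with the claim 1 language, the following is a minimal Python sketch of one possible reading of the per-layer statistics and batch-size determination limitations. It is not taken from the application, Young, Woo, or Rouhani. All names (`LayerStats`, `permissible_batch_sizes`, `profile_layer`, `choose_batch_size`, `toy_execute`), the power-of-two candidate batch sizes, the 4-byte element size, the throughput-maximizing selection rule, and the toy cost model are hypothetical assumptions; the `execute` callback stands in for actually running a layer and reading a timer and power meter.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical illustration only; not the applicant's or any reference's method.

@dataclass
class LayerStats:
    batch_size: int
    input_activation_bytes: int
    output_activation_bytes: int
    working_memory_bytes: int
    time_s: float    # time to process the layer at this batch size
    energy_j: float  # energy to process the layer at this batch size


def permissible_batch_sizes(in_elems: int, out_elems: int, weight_elems: int,
                            mem_limit_bytes: int, bytes_per_elem: int = 4) -> List[int]:
    """Power-of-two batch sizes whose weights plus activations fit in memory (assumed rule)."""
    sizes, b = [], 1
    while (weight_elems + b * (in_elems + out_elems)) * bytes_per_elem <= mem_limit_bytes:
        sizes.append(b)
        b *= 2
    return sizes


def profile_layer(in_elems: int, out_elems: int, weight_elems: int,
                  batch_sizes: List[int],
                  execute: Callable[[int], Tuple[float, float]],
                  bytes_per_elem: int = 4) -> List[LayerStats]:
    """Collect statistics per batch size; `execute` returns (seconds, joules) for batch size b."""
    stats = []
    for b in batch_sizes:
        in_bytes = in_elems * b * bytes_per_elem
        out_bytes = out_elems * b * bytes_per_elem
        working = in_bytes + out_bytes + weight_elems * bytes_per_elem
        t, e = execute(b)  # obtained by running the layer at batch size b
        stats.append(LayerStats(b, in_bytes, out_bytes, working, t, e))
    return stats


def choose_batch_size(stats: List[LayerStats], mem_limit_bytes: int,
                      max_latency_s: float, max_energy_j: float) -> Optional[int]:
    """Pick the feasible batch size with the highest throughput (samples per second)."""
    feasible = [s for s in stats
                if s.working_memory_bytes <= mem_limit_bytes
                and s.time_s <= max_latency_s
                and s.energy_j <= max_energy_j]
    if not feasible:
        return None
    return max(feasible, key=lambda s: s.batch_size / s.time_s).batch_size


# Toy cost model standing in for real measurements (assumed numbers).
def toy_execute(b: int) -> Tuple[float, float]:
    t = 0.010 + 0.001 * b  # fixed overhead plus per-sample cost, in seconds
    return t, 5.0 * t      # energy in joules at an assumed constant 5 W


sizes = permissible_batch_sizes(1000, 1000, 1_000_000, 4_100_000)
stats = profile_layer(1000, 1000, 1_000_000, sizes, toy_execute)
print(sizes, choose_batch_size(stats, 4_100_000, 0.015, 1.0))  # [1, 2, 4, 8] 4
```

Under these assumptions, calling `choose_batch_size` separately for each layer yields one batch size per layer, and those values may differ from layer to layer.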

	Regarding Claim 2, the combination of Young, Woo, and Rouhani teaches: The computer-implemented method of claim 1,
Young teaches: wherein the inferencing model comprises a feed forward model. (¶ [0017] to ¶ [0019], line 4; ¶ [0066] and the corresponding Fig. 6)

	Regarding Claim 7, the combination of Young, Woo, and Rouhani teaches: The computer-implemented method of claim 1, 
However, Young does not explicitly teach: wherein the one or more resource constraints comprises total available memory, maximum latency for inferencing, and maximum energy for inferencing.
	But Woo teaches: wherein the one or more resource constraints comprises total available memory, (C. 14, L. 65 discloses a total available on-chip memory: “a storage capacity… of on-chip memory may be 500 megabyte (MB).”)
However, neither Young nor Woo explicitly teaches: wherein the one or more resource constraints comprises… maximum latency for inferencing, and maximum energy for inferencing.
But Rouhani teaches: wherein the one or more resource constraints comprises total available memory, maximum latency for inferencing, and maximum energy for inferencing. (The last 4 lines of [0045] and paragraph [0047] after equation 7, where total available memory is taught by the memory constraint, maximum latency for inferencing is taught by the runtime constraint and maximum energy for inferencing is taught by the power constraint.)

Regarding Claim 9, the combination of Young, Woo, and Rouhani teaches: The computer-implemented method of claim 1, 
Young teaches: wherein said determining comprises determining a sequence of variable batch sizes corresponding to the multiple layers of the one or more deep neural networks. (¶ [0061] - [0064] and the corresponding Fig. 6 teach determining a batch size for each layer from the set of multiple permissible batch sizes. Layer 1, Layer 2, Layer 3, and Layer 6 have different batch sizes from one another. Layers 4 and 5 have the same batch size as Layer 3.)

Regarding Claim 10, the combination of Young, Woo, and Rouhani teaches: The computer-implemented method of claim 1, 
Young teaches: wherein said determining increases one or more throughput values associated with the inferencing of the one or more deep neural networks. (¶ [0008], lines 7-12 and ¶ [0046], lines 1-6)

Regarding Claim 11, the combination of Young, Woo, and Rouhani teaches: The computer-implemented method of claim 1, 
Young teaches: wherein said determining decreases one or more energy values associated with the inferencing of the one or more deep neural networks. (¶ [0008], fourth-to-last line, teaches that the circuit avoids stalling, which decreases energy consumption.)

Regarding Claim 12, the combination of Young, Woo, and Rouhani teaches: The computer-implemented method of claim 1,
Young teaches: wherein said determining decreases one or more latency values associated with the inferencing of the one or more deep neural networks. (¶ [0048], last 2 lines, teach that the circuit minimizes memory access wait time.)

Regarding Claim 13, the combination of Young, Woo, and Rouhani teaches: The computer-implemented method of claim 1, 
Young teaches: wherein said determining decreases one or more memory values associated with the inferencing of the one or more deep neural networks. (¶ [0048], last 2 lines, teach that the circuit minimizes memory access wait time. A memory access wait time fits within the broadest reasonable interpretation of a “memory value associated with the inferencing”.)

Claims 14-15 and 18 are directed to a product that implements the same features as the method of claims 1-2 and 9, respectively, and are therefore rejected for at least the same reasons therein. Claims 14-15 and 18 also recite the limitation: “A computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computing device”. Young teaches this limitation on p. 8 in claim 19, lines 1-4.

Regarding Claim 16, the combination of Young, Woo, and Rouhani teaches: The computer program product of claim 14, 
However, Young does not explicitly teach: wherein the one or more resource constraints comprises at least one of total available memory, maximum latency for inferencing, and maximum energy for inferencing.
	But Woo teaches: wherein the one or more resource constraints comprises at least one of total available memory, (C. 14, L. 65 discloses a total available on-chip memory: “a storage capacity… of on-chip memory may be 500 megabyte (MB).”)
However, neither Young nor Woo explicitly teaches: wherein the one or more resource constraints comprises at least one of… maximum latency for inferencing, and maximum energy for inferencing.
But Rouhani teaches: wherein the one or more resource constraints comprises at least one of total available memory, maximum latency for inferencing, and maximum energy for inferencing. (The last 4 lines of [0045] and paragraph [0047] after equation 7, where total available memory is taught by the memory constraint, maximum latency for inferencing is taught by the runtime constraint and maximum energy for inferencing is taught by the power constraint.)

Claim 19 is directed to a system that implements the same features as the method of claim 1 and is therefore rejected for at least the same reasons therein. Claim 19 also recites the limitation: “A system comprising: a memory; and at least one processor operably coupled to the memory and configured for”. Young ¶ [0030] teaches a systolic array and ¶ [0059] teaches a CPU and a GPU, all of which are processors. Young teaches memory on p. 8 in claim 19, lines 1-4, where a computer is understood to comprise at least one processor.

Regarding Claim 20, Young teaches: A computer-implemented method, the method comprising: 
obtaining, as input for inferencing of one or more deep neural networks, an inferencing model, wherein the inferencing model comprises a feed forward model, and constraints 
computing, based at least in part on the obtained input, a set of statistics pertaining to resource utilization for each of multiple layers in the one or more deep neural networks, wherein the set of statistics comprises 
determining, for each respective one of the multiple layers of the one or more deep neural networks, a corresponding one of the multiple batch sizes, wherein the determining is based at least in part on the obtained input and the computed set of statistics, and wherein the batch size determined for a first one of the multiple layers is different than the batch size determined for a second one of the multiple layers; and (¶ [0061] - [0064] and the corresponding Fig. 6 teach determining a batch size for each layer from the set of multiple permissible batch sizes based on the weight reuse value. The weight reuse value incorporates a clock rate. Layer 1, Layer 2, Layer 3, and Layer 6 have different batch sizes from one another. Layers 4 and 5 have the same batch size as Layer 3.)
using the determined batch sizes for inferencing the multiple layers of the one or more deep neural networks; (Last sentence of ¶ [0064] and all of ¶ [0066])
wherein the method is carried out by at least one computing device. (¶ [0030] teaches a systolic array and ¶ [0059] teaches CPU and GPU)
Young teaches multiple batch sizes. However, Young does not explicitly teach: obtaining constraints comprising: total available memory; maximum latency for inferencing; and maximum energy for inferencing;
wherein the set of statistics comprises amount of working memory, input activation size and output activation size for each of multiple batch sizes, time to process a layer for each of the multiple batch sizes, and energy to process a layer for each of the multiple batch sizes, wherein the time to process a layer and the energy to process a layer are determined by executing the one or more deep neural networks with each of the multiple batch sizes;
	But Woo teaches: obtaining constraints comprising: total available memory; (Obtaining resource constraints as inputs for inferencing is taught by a circuit 100 determining total capacity of memory of a hardware circuit — C. 14, L. 53-59: “In some implementations, determining a partitioning of neural network layers into a sequence of superlayers includes: … ii) circuit 100 determining a particular aggregate input activation and parameter capacity of a memory of a hardware circuit;” C. 14, L. 65 discloses a total available on-chip memory: “a storage capacity… of on-chip memory may be 500 megabyte (MB).”)
computing a set of statistics comprises amount of working memory, input activation size and output activation size for each of multiple batch sizes, (These limitations are taught by the following citations. C. 8, L. 10-14 and 17-20, and C. 9, L. 31-37, teach that a working set includes input and output activation sizes based on an amount of memory, because the memory space reserved for the output of a layer corresponds to the memory space reserved for the input of the next layer. C. 14, L. 65 to C. 15, L. 13 teaches computing that 200 MB of the 500 MB of on-chip memory is available for storing inputs.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated Woo’s available memory constraint into Young’s list of resource constraints in ¶ [0049], which are used to determine a weight reuse value, and it would have been obvious to have computed input and output activation sizes in Young’s networks based on this weight reuse value. A motivation for the combination is that improving the scheduling of inputs and outputs in a neural network based on resource constraints maximizes efficient use of available on-chip resources, reduces external communications, and leads to an increase in available system bandwidth and an overall decrease in energy consumption by system components. (Woo, C. 11, L. 13-20)
However, neither Young nor Woo explicitly teaches: obtaining constraints comprising: maximum latency for inferencing; and maximum energy for inferencing;
wherein the set of statistics comprises time to process a layer for each of the multiple batch sizes, and energy to process a layer for each of the multiple batch sizes, wherein the time to process a layer and the energy to process a layer are determined by executing the one or more deep neural networks with each of the multiple batch sizes;
But Rouhani teaches: obtaining constraints comprising: total available memory; maximum latency for inferencing; and maximum energy for inferencing; (All limitations are taught by [0043] and [0045]-[0049]. Specifically, the last 4 lines of [0045] and paragraph [0047] after equation (7) disclose a memory size constraint $M_u$; a time to process a batch of size $b_s$, which is the runtime constraint $T_u$; and an energy to process a batch of size $b_s$, which is the power constraint $P_u$.)
wherein the set of statistics comprises amount of working memory, time to process a layer for each of the multiple batch sizes, and energy to process a layer for each of the multiple batch sizes, (All limitations are taught by [0043] and [0045]-[0049]. Specifically, the last 4 lines of [0045] and paragraph [0047] after equation (7) disclose a memory size constraint $M_u$; a time to process a batch of size $b_s$, which is the runtime constraint $T_u$; and an energy to process a batch of size $b_s$, which is the power constraint $P_u$. Regarding the feature of a layer, ¶ [0034], lines 1-6, teaches using a deep belief network, which includes layers.)
wherein the time to process a layer and the energy to process a layer are determined by executing the one or more deep neural networks with each of the multiple batch sizes; (¶ [0048], lines 9-14, “In some example embodiments…”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated Rouhani’s memory, runtime, and power constraints into Young/Woo’s system. A motivation for the combination is to accommodate the constraints of the hardware implementing machine learning models. (Rouhani, [0032])

Claims 3-6 are rejected under 35 U.S.C. 103 as being unpatentable over Young (US 20160342890 A1, cited in IDS filed 03/20/2019) in view of Woo et al. (US 10019668 B1, cited in PTO-892 filed 04/27/2021), Rouhani et al. (US 20200027016 A1), and Han et al. (“Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding,” arXiv version 5, cited in the IDS filed 03/20/2018 as NPL doc. 11).

Regarding Claim 3, the combination of Young, Woo, and Rouhani teaches: The computer-implemented method of claim 1,
In C. 6, L. 55-57, Woo teaches compressing activation data by storing only non-zero activation values in memory. However, neither Young, Woo, nor Rouhani explicitly teaches: wherein the inferencing model comprises a compressed model generated through weight-based pruning. 
But Han teaches: wherein the inferencing model comprises a compressed model generated through weight-based pruning. (P. 2, last paragraph, last 6 lines)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have compressed Young/Woo/Rouhani’s neural network by pruning connections with weights below a threshold. A motivation for the combination is to reduce the number of parameters. (Han, p. 2, last sentence)

Regarding Claim 4, the combination of Young, Woo, and Rouhani teaches: The computer-implemented method of claim 1,
In C. 6, L. 55-57, Woo teaches compressing activation data by storing only non-zero activation values in memory. However, neither Young, Woo, nor Rouhani explicitly teaches: wherein the inferencing model comprises a compressed model generated through at least one of quantization and weight sharing.
But Han teaches: wherein the inferencing model comprises a compressed model generated through at least one of quantization and weight sharing. (Han teaches simultaneous network quantization and weight sharing on p. 3, § 3, first 2 paragraphs.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have compressed Young/Woo/Rouhani’s neural network by performing quantization and weight sharing. A motivation for the combination is to reduce the number of bits required to represent each weight. (Han, p. 3, § 3, first sentence)

Regarding Claim 5, the combination of Young, Woo, and Rouhani teaches: The computer-implemented method of claim 1,
However, neither Young, Woo, nor Rouhani explicitly teaches: wherein the inferencing model comprises a compressed model generated through relative indexing.
But Han teaches: wherein the inferencing model comprises a compressed model generated through relative indexing. (P. 3, second paragraph, and Fig. 2 and its caption)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have compressed Young/Woo/Rouhani’s neural network through relative indexing. A motivation for the combination is to achieve further compression. (Han, p. 3, second paragraph)

Regarding Claim 6, the combination of Young, Woo, and Rouhani teaches: The computer-implemented method of claim 1,
However, neither Young, Woo, nor Rouhani explicitly teaches: wherein the inferencing model comprises a compressed model generated through encoding.
But Han teaches: wherein the inferencing model comprises a compressed model generated through encoding. (P. 5, § 4 teaches Huffman coding.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have compressed Young/Woo/Rouhani’s neural network through Huffman coding. A motivation for the combination is to save on network storage. (Han, p. 5, § 4, last sentence)

Claims 21-22 are rejected under 35 U.S.C. 103 as being unpatentable over Young (US 20160342890 A1, cited in IDS filed 03/20/2019) in view of Woo et al. (US 10019668 B1, cited in PTO-892 filed 04/27/2021), Rouhani et al. (US 20200027016 A1), and Bianco et al. (“Benchmark Analysis of Representative Deep Neural Network Architectures”).

Regarding Claim 21, the combination of Young, Woo, and Rouhani teaches: The computer-implemented method of claim 1, 
Although Young and Woo generally teach multiple batch sizes, neither Young nor Woo explicitly teaches: wherein the determining comprises maintaining at least two dynamic program tables for each given one of the multiple batch sizes, wherein: a first one of the at least two dynamic program tables comprises a time to perform inferencing operations for a given sample across the multiple layers of the one or more deep neural networks, wherein each of the multiple layers uses a batch size that is not greater than the given batch size; and a second one of the at least two dynamic program tables comprises a time to perform inferencing operations for a given sample across the multiple layers of the one or more deep neural networks, wherein one of the multiple layers uses a batch size equal to the given batch size, and each of the other multiple layers uses a batch size that is not greater than the given batch size.
But Rouhani teaches: a second one of the at least two dynamic program tables comprises a time to perform inferencing operations for a given sample across the multiple layers of the one or more deep neural networks, wherein one of the multiple layers uses a batch size equal to the given batch size, and each of the other multiple layers uses a batch size that is not greater than the given batch size. (A dynamic program table amounts to data about a batch size. See [0043] and [0045]-[0049], where a time to process a batch of size $b_s$ is the runtime constraint $T_u$. Regarding the feature of a layer, ¶ [0034], lines 1-6, teaches using a deep belief network, which includes layers.)
However, none of Young, Woo, or Rouhani explicitly teaches all of the features: wherein the determining comprises maintaining at least two dynamic program tables for each given one of the multiple batch sizes, wherein: a first one of the at least two dynamic program tables comprises a time to perform inferencing operations for a given sample across the multiple layers of the one or more deep neural networks, wherein each of the multiple layers uses a batch size that is not greater than the given batch size; and a second one of the at least two dynamic program tables comprises a time to perform inferencing operations for a given sample across the multiple layers of the one or more deep neural networks, wherein one of the multiple layers uses a batch size equal to the given batch size, and each of the other multiple layers uses a batch size that is not greater than the given batch size.
But Bianco teaches: wherein the determining comprises maintaining at least two dynamic program tables for each given one of the multiple batch sizes, wherein: (The at least two dynamic program tables read on Bianco's tables of throughput for multiple batch sizes, one for each of an NVIDIA Titan X Pascal workstation and an NVIDIA Jetson TX1 embedded system. See p. 1, col. 2, in the paragraph starting “The aim of this work”, lines 1-7; p. 3, § 4.5; p. 4, § 5.3; and table 1 and its corresponding caption on p. 5.)
a first one of the at least two dynamic program tables comprises a time to perform inferencing operations for a given sample across the multiple layers of the one or more deep neural networks, wherein each of the multiple layers uses a batch size that is not greater than the given batch size; and (P. 4, § 5.3 and table 1(a) and its corresponding caption on p. 5. The feature of multiple layers is taught by p. 3, § 3, col. 1, lines 6-9.)
a second one of the at least two dynamic program tables comprises a time to perform inferencing operations for a given sample across the multiple layers of the one or more deep neural networks, wherein one of the multiple layers uses a batch size equal to the given batch size, and each of the other multiple layers uses a batch size that is not greater than the given batch size. (P. 4, § 5.3 and table 1(b) and its corresponding caption on p. 5. The feature of multiple layers is taught by p. 3, § 3, col. 1, lines 6-9.)
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated Bianco’s tables of inference time into the combined system of Young, Woo, and Rouhani. A motivation for the combination is to provide an immediate and comprehensive tool for guiding the selection of the appropriate architecture responding to resource constraints in practical deployments and applications. (P. 7, col. 1, lines 1-5)

	Claim 22 is directed to a system that implements the same features as the method of claim 21 and is therefore rejected for at least the same reasons therein.

Response to Arguments
Applicant's arguments filed 10/13/2022 have been fully considered but they are not persuasive. 
Applicant’s argument #1 (Top of remarks p. 10): “For example, there is no disclosure or suggestion in Woo of ‘computing, based at least in part on the obtained input ... output activation sizes,’ as recited by claim 1, and the Office Action does not appear to address these features of claim 1. Rather, the Office Action merely cites Woo at C. 14, L. 65 to C. 15, L. 13, which relates to memory available for storing inputs.” 
Examiner’s response #1: Examiner respectfully disagrees. The Office Action mailed 07/13/2022 cites Woo C. 8, L. 10-14, 17-20 and C. 14, L. 65 to C. 15, L. 13 for the limitation indicated by the Applicant. Woo C. 9, L. 31-37 further teaches that the memory space reserved for the output of a layer corresponds to the memory space reserved for the input of the next layer. Woo clearly teaches the claim 1 limitation “computing, based at least in part on the obtained input, a set of statistics pertaining to resource utilization for each of multiple layers in the one or more deep neural networks, wherein, for each of the multiple layers, the set of statistics comprises: input and output activation sizes for each of multiple batch sizes”.
	Young teaches: A computer-implemented method, the method comprising: obtaining, as input for inferencing of one or more deep neural networks, an inferencing model and 
Woo teaches: obtaining, as input for inferencing of one or more deep neural networks, an inferencing model and one or more resource constraints; computing, based at least in part on the obtained input, a set of statistics pertaining to resource utilization for each of multiple layers in the one or more deep neural networks, wherein, for each of the multiple layers, the set of statistics comprises: input and output activation sizes for each of multiple batch sizes; an amount of working memory; …  wherein the multiple batch sizes are based at least in part on the one or more resource constraints, and wherein the determining a corresponding batch size is based at least in part on the computed set of statistics
Rouhani teaches: the set of statistics comprises: amount of working memory, a time to process a layer for each of the multiple batch sizes; and energy to process each of the multiple batch sizes, wherein the time and the energy are determined by executing the one or more deep neural networks with each of the multiple batch sizes;
The combination of Young, Woo, and Rouhani teaches each and every limitation of claim 1.

Applicant’s argument #2 (Remarks p. 11): “A combination of cited references fails to disclose or suggest amended claim 20 for similar reasons as noted above with respect to claim 1. Accordingly, reconsideration and withdrawal of the §103 rejection of claim 20 are respectfully requested.”
Examiner’s response #2: Applicant’s arguments with respect to claim 20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Young teaches: A computer-implemented method, the method comprising: obtaining, as input for inferencing of one or more deep neural networks, an inferencing model, wherein the inferencing model comprises a feed forward model, and constraints; computing, based at least in part on the obtained input, a set of statistics pertaining to resource utilization for each of multiple layers in the one or more deep neural networks, wherein the set of statistics comprises 
	Woo teaches: obtaining constraints comprising: total available memory; computing a set of statistics comprising amount of working memory, input activation size, and output activation size for each of multiple batch sizes;
Rouhani teaches: obtaining constraints comprising: total available memory; maximum latency for inferencing; and maximum energy for inferencing; wherein the set of statistics comprises amount of working memory, time to process a layer for each of the multiple batch sizes, and energy to process a layer for each of the multiple batch sizes, wherein the time to process a layer and the energy to process a layer are determined by executing the one or more deep neural networks with each of the multiple batch sizes;
The combination of Young, Woo, and Rouhani teaches each and every limitation of claim 20.

Applicant’s argument #3 (Remarks p. 11-12): “Therefore, new claims 21 and 22 are patentable over the cited references over and above their respective dependence from amended claims 1 and 19.”
Examiner’s response #3: Applicant’s arguments with respect to claims 21-22 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
The combination of Young, Woo, and Rouhani teaches: The computer-implemented method of claim 1, 
Rouhani teaches: a second one of the at least two dynamic program tables comprises a time to perform inferencing operations for a given sample across the multiple layers of the one or more deep neural networks, wherein one of the multiple layers uses a batch size equal to the given batch size, and each of the other multiple layers uses a batch size that is not greater than the given batch size.
Bianco teaches: wherein the determining comprises maintaining at least two dynamic program tables for each given one of the multiple batch sizes, wherein: a first one of the at least two dynamic program tables comprises a time to perform inferencing operations for a given sample across the multiple layers of the one or more deep neural networks, wherein each of the multiple layers uses a batch size that is not greater than the given batch size; and a second one of the at least two dynamic program tables comprises a time to perform inferencing operations for a given sample across the multiple layers of the one or more deep neural networks, wherein one of the multiple layers uses a batch size equal to the given batch size, and each of the other multiple layers uses a batch size that is not greater than the given batch size.
The combination of Young, Woo, Rouhani, and Bianco teaches each and every limitation of claims 21-22.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. “AdaBatch: Adaptive Batch Sizes for Training Deep Neural Networks” to Devarakonda et al. teaches adaptive batch sizes.
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Asher H. Jablon whose telephone number is (571)270-7648. The examiner can normally be reached Monday - Friday, 9:00 am - 6:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Abdullah Al Kawsar can be reached on (571)270-3169. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/A.H.J./Examiner, Art Unit 2127                                                                                                                                                                                                        


/ABDULLAH AL KAWSAR/Supervisory Patent Examiner, Art Unit 2127