Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Objections
Claim 15 is objected to because of the following informality:  Claim 15 is identical to claim 14.  Applicant is advised that should claim 14 be found allowable, claim 15 will be objected to under 37 CFR 1.75 as being a substantial duplicate thereof. When two claims in an application are duplicates or else are so close in content that they both cover the same thing, despite a slight difference in wording, it is proper after allowing one claim to object to the other as being a substantial duplicate of the allowed claim. See MPEP § 608.01(m).

Claim Rejections - 35 USC § 101
101 Rejection
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1-20 are rejected under 35 USC § 101 because the claimed invention is directed to non-statutory subject matter.

Regarding Claim 1:  Claim 1 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 1 is directed to a method, which falls within the process category, one of the statutory categories.
Step 2A Prong One Analysis:  Claim 1 recites a computer-implemented method of optimizing neural networks, which, under its broadest reasonable interpretation, is a series of mental processes and mathematical calculations.  For example, but for the generic computer components language, the above limitations in the context of this claim encompass neural network processing, including the following: 
	calculating, by the processor, performance metrics for executing the predefined convolution processing on the computing device, as functions of the predefined parameters, as proxy estimates of performance of different possible design choices to implement the predefined convolution processing (mathematical calculation).  
Therefore, claim 1 recites an abstract idea which is a judicial exception.
Step 2A Prong Two Analysis:  Claim 1 recites the additional elements “A computing device” and the “processor”. However, these additional elements are computer components recited at a high level of generality, such that they amount to no more than mere instructions to apply the judicial exception using a generic computer component.  An additional element that merely recites the words “apply it” (or an equivalent) with the judicial exception, or merely includes instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea, does not integrate the judicial exception into a practical application.  Claim 1 also recites additional insignificant extra-solution activity of inputting or gathering data. Claim 1 also recites additional elements “convolution mapping” and “convolution processing” which 
Step 2B Analysis:  Claim 1 does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the lack of integration of the abstract idea into a practical application, the additional elements recited in claim 1 amount to no more than mere instructions to apply the judicial exception using a generic computer component.
For the reasons above, claim 1 is rejected as being directed to non-patentable subject matter under §101. This rejection applies equally to independent claims 10 and 19, which recite a system and a computer program product, respectively, as well as to dependent claims 2-9, 11-18, and 20.  The additional limitations of the dependent claims are addressed briefly below:
Dependent claim 2 recites additional insignificant extra-solution activity “wherein the possible convolution mappings are mappings onto a predetermined accelerator architecture configuration,” which amounts to selection of a data type.
Dependent claims 3 and 11 recite additional mental processes of observation, evaluation, and judgement “determining an optimal configuration for implementing the predefined convolution processing.”
Dependent claims 4 and 12 recite additional insignificant extra-solution activity of data gathering “receiving input data defining one or more constraints” as well as additional mental processes of observation, “identifying invalid convolution mapping options based on the constraints.”
Dependent claims 5, 13, and 19 recite additional mental processes of observation, evaluation, and judgement “determining an optimal convolution mapping.”
Dependent claims 6, 14, and 15 recite additional generic computer components and selection of data source “as implemented on a computer different from the computing device that will execute the predefined convolution processing.”
Dependent claims 7 and 16 recite additional generic computer components “remote server” and “cloud service”.  Additionally, the claims recite insignificant extra-solution activity of selection of data source “as implemented on one of: a server remote from the computing device; and as a cloud service.”
Dependent claims 8 and 17 recite additional generic computer components “software tool”. Additionally, the claims recite insignificant extra-solution activity of selection of data source “as implemented as a software tool on the computing device that will execute the predefined convolution processing.”
Dependent claims 9 and 18 recite additional generic computer components “machine-readable instructions”. 




Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-5, 8-13, and 17-19 are rejected under 35 U.S.C. 103 as being unpatentable over Kung (“Mapping Systolic Arrays onto 3D Circuit Structures: Accelerating Convolutional Neural Network Inference”, 2018) in view of Wang (“Enhanced Efficiency 3D Convolution Based on Optimal FPGA Accelerator”, 2017). 

Regarding claim 1, Kung teaches A method for improving performance of a predefined Deep Neural Network (DNN) convolution processing on a computing device, the method comprising: ([Abstract] "—In recent years, numerous designs 
inputting parameters as input data into a processor on a computer that formalizes a design space exploration of a convolution mapping on a predefined computer architecture that will execute the predefined convolution processing, ([p. 2 §B] "For a given convolutional layer, we use a wide systolic array to cover the entire width of the input channels and some subset of features (a block of rows in the feature matrix), as depicted in Figure 2 (a)" See FIG. 13 and §A on p. 6 for predefined computer architecture that will execute the predefined convolution processing.) 
wherein the parameters are predefined as guided by a specification for the predefined convolution processing to be implemented by the convolution mapping and by a microarchitectural specification for the processor that will execute the predefined convolution processing ([p. 2 §B] "Figure 3 summarizes the overall approach. A wide systolic array (a) is broken into a series of subarrays, the partition scheme (b) in Figure 3, each of which operating on a subset of the input channels of the large CNN layer" Partition scheme interpreted as synonymous with microarchitectural specification.  Input channels interpreted as synonymous with input parameters.). 
However, Kung does not explicitly teach calculating, by the processor, performance metrics for executing the predefined convolution processing on the computing device, as functions of the predefined parameters, as proxy estimates of performance of different possible design choices to implement the predefined convolution processing.  

Wang, in the same field of endeavor, teaches calculating, by the processor, performance metrics for executing the predefined convolution processing on the computing device, as functions of the predefined parameters, as proxy estimates of performance of different possible design choices to implement the predefined convolution processing. ([p. 6913 §IV] "This section experimentally demonstrates the performance of the implemented accelerator. First, we show the resource usage of the accelerator under different pixel precisions. Then, we present the performance of the designed structure, and compare it with other proposed technologies." See Table IV for different input parameters and the relative performance.). 

Kung and Wang are both directed towards 3D mappings of convolution operations to systolic arrays.  Therefore, Kung and Wang are analogous art in the same field of endeavor. It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of Kung with the teachings of Wang by implementing the 3D convolution operations in Wang with the 3D systolic array in Kung. Kung teaches that arranging the systolic array in a “2.5D” or 3D layout significantly improves throughput, and Wang teaches 

Regarding claim 2, the combination of Kung and Wang teaches The method of claim 1, wherein the possible convolution mappings are mappings onto a predetermined accelerator architecture configuration. (Wang [p. 6912] "FIGURE 6. The implementation architecture for the optimized 3D convolution accelerator, presenting 3 or 4 channel feature maps at one time depending on the length of the 3D convolution kernel." FIGURE 6 shows a predetermined architecture configuration for the possible convolution mappings). 
The motivation for combination taught in claim 1 also applies to claim 2. 

Regarding claim 3, the combination of Kung and Wang teaches The method of claim 2, further comprising determining an optimal configuration for implementing the predefined convolution processing. (Wang [p. 6910 §II] "In this paper, a 3D convolution operator based on optimal FPGA accelerator is presented. The optimal operation is achieved by taking advantages of the Intermediate Data Delay Lines (IDDL) to avoid pixels loading repetition." Optimal operation interpreted as 
The motivation for combination taught in claim 1 also applies to claim 3.

Regarding claim 4, the combination of Kung and Wang teaches The method of claim 1, further comprising: receiving input data defining one or more constraints; and (Wang, see FIGURE 2, "Pseudo code of the valid 3D convolution."  The loop bound "row<M-P+1" is interpreted as synonymous with a constraint as described in the instant specification.)
identifying invalid convolution mapping options based on the constraints. (Wang, FIGURE 2.  Determining valid mappings based on the constraints implicitly teaches determining invalid mappings based on the constraints.  It would be obvious to one of ordinary skill in the art that the pseudo-code for determining the valid mappings does so by determining which mappings are not invalid, which requires determining which mappings are invalid.).
The motivation for combination taught in claim 1 also applies to claim 4.

Regarding claim 5, the combination of Kung and Wang teaches The method of claim 1, further comprising determining an optimal convolution mapping. (Wang [p. 6910 §II] "Three convolution styles, valid, same and full, operate based on different boundary processing methods [20]. Here, we use a valid 3D convolution operation, whose convolution kernel is only allowed to visit the domains where the kernel is contained entirely within the input feature maps. For this 3D convolution style, 
The motivation for combination taught in claim 1 also applies to claim 5.

Regarding claim 8, the combination of Kung and Wang teaches The method of claim 1, as implemented as a software tool on the computing device that will execute the predefined convolution processing. (Wang [p. 6913 §IV] "Figure 9 shows the photograph of the realized accelerator using the Xilinx ZC706 developing board. This board is composed of the Xilinx ZYNQ, itself consists of a Kintex-7 FPGA and a dual ARM Cortex-A9 Processor, and 1GB DDR3 memory with the frequency bandwidth up to 4.2GB/s. The results are achieved using the Xilinx Vivado-2016.1 developing software." Xilinx Vivado interpreted as synonymous with software tool on the computing device that will execute the predefined convolution processing.). 
The motivation for combination taught in claim 1 also applies to claim 8.

Regarding claim 9, the combination of Kung and Wang teaches The method of claim 1, as embodied as a set of machine-readable instructions on a non-transitory memory device. (Wang [p. 6913 §IV] "Figure 9 shows the photograph of the realized accelerator using the Xilinx ZC706 developing board. This board is composed of the Xilinx ZYNQ, itself consists of a Kintex-7 FPGA and a dual ARM Cortex-A9 Processor, and 1GB DDR3 memory with the frequency bandwidth up to 
The motivation for combination taught in claim 1 also applies to claim 9.

Regarding claim 10, with the exception of the additional element "2-dimensional (2D) systolic processor array," claim 10 effectively mirrors claim 1.  The additional element is explicitly taught by Kung ([p. 7 § V] "We claim that these partitioning schemes are naturally supported by 3D circuit structures...we support this claim with empirical results on wire length for 2D and 2.5D layouts of systolic arrays implemented on a 2.5D FPGA. Additionally, we demonstrate the effectiveness of cross-layer pipelining when each layer is implemented using the PA scheme, leading to a 10x reduction in inference runtime over the baseline P scheme as shown in Figure 15. We hope that these results encourage further work to explore the landscape of efficient systolic array designs in conjunction with pipelining computation over entire CNNs as well as 2.5D and 3D implementations for both training and inference"). 
The motivation for combination taught in claim 1 also applies to claim 10.

Claims 11-13 and 17-18 are substantially similar to claims 3-5 and 8-9.  Therefore, the rejections applied to claims 3-5 and 8-9 also apply to claims 11-13 and 17-18.

Regarding claim 19, Kung teaches An apparatus, comprising: a processor; and ([Abstract] "systolic arrays can process partitioned data channels in parallel with reduced data skew for lowered inference latency" Systolic array interpreted as systolic processor array.)
a memory device accessible by the processor, the memory device storing a set of instructions that permit the processor to execute (Wang [p. 6913 §IV] "Figure 9 shows the photograph of the realized accelerator using the Xilinx ZC706 developing board. This board is composed of the Xilinx ZYNQ, itself consists of a Kintex-7 FPGA and a dual ARM Cortex-A9 Processor, and 1GB DDR3 memory with the frequency bandwidth up to 4.2GB/s. The results are achieved using the Xilinx Vivado-2016.1 developing software." Xilinx Vivado interpreted as synonymous with software tool on the computing device that will execute the predefined convolution processing.)
onto a plurality of processing elements connected as a 2-dimensional (2D) systolic processor array, the method comprising: ([p. 7 § V] "We claim that these partitioning schemes are naturally supported by 3D circuit structures...we support this claim with empirical results on wire length for 2D and 2.5D layouts of systolic arrays implemented on a 2.5D FPGA. Additionally, we demonstrate the effectiveness of cross-layer pipelining when each layer is implemented using the PA scheme, leading to a 10x reduction in inference runtime over the baseline P scheme as shown in Figure 15. We hope that these results encourage further work to explore the landscape of efficient systolic array designs in conjunction with pipelining computation over entire CNNs as well as 2.5D and 3D implementations for both training and inference").
inputting parameter values into a processor on a computer from a microarchitecture specification that defines configuration aspects of the processing elements ([p. 2 §B] "For a given convolutional layer, we use a wide systolic array to cover the entire width of the input channels and some subset of features (a block of rows in the feature matrix), as depicted in Figure 2 (a)" See FIG. 13 and §A on p. 6 for predefined computer architecture that will execute the predefined convolution processing.)
inputting parameter values into the processor from a specification that defines a convolution processing ([p. 2 §B] "Figure 3 summarizes the overall approach. A wide systolic array (a) is broken into a series of subarrays, the partition scheme (b) in Figure 3, each of which operating on a subset of the input channels of the large CNN layer" Partition scheme interpreted as synonymous with microarchitectural specification.  Input channels interpreted as synonymous with input parameters.).
determining an optimal mapping onto the 2D systolic processor array for the convolution processing. ([p. 7 § V] "We claim that these partitioning schemes are naturally supported by 3D circuit structures...we support this claim with empirical results on wire length for 2D and 2.5D layouts of systolic arrays implemented on a 2.5D FPGA. Additionally, we demonstrate the effectiveness of cross-layer pipelining when each layer is implemented using the PA scheme, leading to a 10x reduction in inference runtime over the baseline P scheme as shown in Figure 15. We hope that these results encourage further work to explore the landscape of efficient systolic array designs in conjunction with pipelining computation over entire CNNs as well as 2.5D and 3D implementations for both training and inference"). 
However, Kung does not explicitly teach a method of optimizing a mapping of convolutional layers of a Deep Neural Network (DNN),
calculating, by the processor, performance metrics for executing the convolution processing on the 2D systolic processor array, as functions of the predefined parameters, as proxy estimates of performance of different possible design choices to implement the convolution processing; 
inputting one or more constraints that permit the processor to eliminate invalid design choices. 

Wang, in the same field of endeavor, teaches a method of optimizing a mapping of convolutional layers of a Deep Neural Network (DNN) ([p. 6910 §II] "Three convolution styles, valid, same and full, operate based on different boundary processing methods [20]. Here, we use a valid 3D convolution operation, whose convolution kernel is only allowed to visit the domains where the kernel is contained entirely within the input feature maps. For this 3D convolution style, the number of the output feature maps is (S-R+1), with the size of (M-P+1) × (N-Q+1). Figure 2 presents the pseudo code of the valid 3D convolution." Valid mapping interpreted as synonymous with optimal convolution mapping.  See also §III for how valid mapping is used explicitly in the optimization scheme.)
calculating, by the processor, performance metrics for executing the convolution processing on the 2D systolic processor array, as functions of the predefined parameters, as proxy estimates of performance of different possible design choices to implement the convolution processing; ([p. 6913 §IV] "This 
inputting one or more constraints that permit the processor to eliminate invalid design choices (See FIGURE 2. Pseudo code of the valid 3D convolution.  With respect to the instant specification, constraint "row<M-P+1" interpreted as synonymous. Determining valid mappings based on the constraints implicitly teaches eliminating invalid mappings based on the constraints.  It would be obvious to one of ordinary skill in the art that the pseudo-code for determining the valid mappings does so by determining which mappings are not invalid, which requires determining which mappings are invalid.). 
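
For illustration only, the bounds quoted from Wang above can be sketched as follows. This is a minimal sketch and not Wang's pseudo code: it assumes that M, N, and S denote the rows, columns, and depth of the input feature maps and that P, Q, and R denote the corresponding kernel dimensions, an assignment inferred from the quoted expressions rather than stated in the passages cited here. The function names are hypothetical.

```python
def valid_conv3d_output_shape(M, N, S, P, Q, R):
    """Return (maps, rows, cols) for a 'valid' 3D convolution.

    Assumed symbol meanings: input is M x N x S, kernel is P x Q x R.
    """
    if P > M or Q > N or R > S:
        raise ValueError("kernel must fit entirely within the input")
    # Quoted result: (S-R+1) output maps, each of size (M-P+1) x (N-Q+1).
    return (S - R + 1, M - P + 1, N - Q + 1)


def is_valid_position(row, col, depth, M, N, S, P, Q, R):
    """True when a kernel anchored at (row, col, depth) stays inside the input.

    The row test mirrors the quoted loop bound "row<M-P+1"; the column and
    depth tests are the analogous bounds and are an assumption of this sketch.
    """
    return (0 <= row < M - P + 1
            and 0 <= col < N - Q + 1
            and 0 <= depth < S - R + 1)


if __name__ == "__main__":
    print(valid_conv3d_output_shape(8, 8, 5, 3, 3, 3))
    print(is_valid_position(6, 0, 0, 8, 8, 5, 3, 3, 3))
```

Under these assumptions, any kernel position failing the bound check is an invalid option, which is the sense in which the bound is read as a constraint in the rejection above.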

Kung and Wang are both directed towards 3D mappings of convolution operations to systolic arrays.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of Kung with the teachings of Wang by implementing the 3D convolution operations in Wang with the 3D systolic array in Kung. Kung teaches that arranging the systolic array in a “2.5D” or 3D layout significantly improves throughput, and Wang teaches that by limiting pixel operation repetition and simplifying elemental operations the system can be significantly optimized ([p. 6915 §V] “We presented an efficient 3D convolution operator based on the FPGA accelerator. The proposed structure 

Regarding claim 20, the combination of Kung and Wang teaches The apparatus of claim 19, wherein the method is implemented as a software tool that automatically configures an optimal configuration for performing the convolution processing. (Wang [p. 6913 §IV] "Figure 9 shows the photograph of the realized accelerator using the Xilinx ZC706 developing board. This board is composed of the Xilinx ZYNQ, itself consists of a Kintex-7 FPGA and a dual ARM Cortex-A9 Processor, and 1GB DDR3 memory with the frequency bandwidth up to 4.2GB/s. The results are achieved using the Xilinx Vivado-2016.1 developing software." Xilinx Vivado interpreted as synonymous with software tool on the computing device that will execute the predefined convolution processing.). 

Claims 6, 7, and 14-16 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Kung and Wang, and further in view of Li (“A Network-Centric Hardware/Algorithm Co-Design to Accelerate Distributed Training of Deep Neural Networks”, 2018).

Regarding claim 6, the combination of Kung and Wang teaches The method of claim 1.  However, the combination of Kung and Wang does not explicitly teach as implemented on a computer different from the computing device that will execute the predefined convolution processing.  

Li, in the same field of endeavor, teaches The method of claim 1, as implemented on a computer different from the computing device that will execute the predefined convolution processing. ([p. 1] "Fig. 1: (a) State-of-the-art hierarchical distributed training. (b) INCEPTIONN’s distributed training algorithm in the conventional hierarchy. (c) Hierarchical use of INCEPTIONN’s distributed algorithm" The aggregator nodes in FIG. 1 are interpreted as synonymous with the computing device to execute the processing.  The worker nodes shown in FIG. 1 are interpreted as synonymous with a computer different from the computing device.). 

Li, Kung, and Wang are all directed towards improving performance of convolutional neural networks.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of Kung and Wang with the teachings of Li by aggregating tasks from a server to worker nodes. While Kung and Wang are mainly directed towards optimizing tasks for inference, Li is mainly directed towards distributed training.  Li explicitly teaches the advantage of implementing the taught system with known inference systems on systolic arrays ([p. 11 §IX] “Google proposes the TPU [22], which is an accelerator with the systolic array architecture for the inference of neural networks”) by teaching how the disclosed invention overcomes known deficiencies ([p. 11 §IX] “These ML training accelerators are either single-node solutions or accelerators deployed on the 

Regarding claim 7, the combination of Kung, Wang, and Li teaches The method of claim 6, as implemented on one of: a server remote from the computing device; and as a cloud service. (Li [p. 1] "Fig. 1: (a) State-of-the-art hierarchical distributed training. (b) INCEPTIONN’s distributed training algorithm in the conventional hierarchy. (c) Hierarchical use of INCEPTIONN’s distributed algorithm" Aggregator node interpreted as synonymous with server remote from computing device.). 
The motivation for the combination of Kung, Wang, and Li used in claim 6 also applies to claim 7.

Claims 14-16 are substantially similar to claims 6-7.  Therefore, the rejections applied to claims 6-7 also apply to claims 14-16.


Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: Franca-Neto (US20190244086A1).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SIDNEY VINCENT BOSTWICK whose telephone 
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang can be reached on (571)270-7092.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/SB/Examiner, Art Unit 2124
/LUIS A SITIRICHE/Primary Examiner, Art Unit 2126