Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Remarks
This Office Action is responsive to Applicants' Amendment filed on June 15, 2022, in which claims 1, 6, and 13-17 are currently amended. Claims 2-5, 7-12, and 18-24 are canceled. Claims 25-34 are newly added.  Claims 1, 6, 13-17, and 25-34 are currently pending. 

Specification
Applicant's amendments to the specification are acknowledged. Examiner’s objections to the specification are hereby withdrawn, as necessitated by Applicant’s amendments to the specification.

Information Disclosure Statement
The information disclosure statement (IDS) submitted on February 15, 2022 is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.

Response to Arguments
	The claim interpretation related to claims 1-21 under 35 U.S.C. § 112(f) has been maintained without traverse. 
The rejections to claims 1-21 under 35 U.S.C. § 112(b) are hereby withdrawn, as necessitated by applicant's amendments and remarks made to the rejections.
Applicant’s arguments with respect to the rejection of claims 1-21 under 35 U.S.C. 103, based on the amendment, have been considered but are not deemed persuasive.  
Applicant’s arguments with respect to the rejection of claims 1-21 under 35 U.S.C. 101, based on the amendment, have been considered but are not deemed persuasive.  The claims are seen as primarily mathematical calculations and mental processes applied at a high level of generality to particular fields of technology (neural networks and graphics processors).  This interpretation is outlined in more detail below. 
With respect to Applicant’s arguments that Caulfield alone does not teach the amended limitations of claim 1, Examiner asserts that the new claim reflects rolling claims 2-5 into the independent claim, and that a combination of Caulfield, Park, Paltashev, and Yan was used to teach those claims.  The rejection is maintained using a new combination of these references.  With respect to Applicant’s arguments that these references cannot be combined to teach the amended limitation, Examiner respectfully disagrees.  The references are analogous art related to heterogeneous processing systems, such that it would be obvious to one of ordinary skill in the art to combine the teachings to make a more commercially viable heterogeneous processing system.
Applicant's arguments fail to comply with 37 CFR 1.111(b) because they amount to a general allegation that the claims define a patentable invention without specifically pointing out how the language of the claims patentably distinguishes them from the references.

Claim Rejections - 35 USC § 101
101 Rejection
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1, 6, 13-17, and 25-34 are rejected under 35 USC § 101 because the claimed invention is directed to non-statutory subject matter.

Regarding Claim 1:  Claim 1 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 1 is directed to an apparatus, which is a machine, one of the statutory categories.
Step 2A Prong One Analysis:  Claim 1 recites a computer implemented method of processing neural networks, which, under its broadest reasonable interpretation is a series of mental processes.  For example, but for the generic computer components language, the above limitations in the context of this claim encompass neural network processing, including the following: 
the CU and the GrPU are configured to perform a compute operation implemented via a dynamically compiled shader (mathematical calculation)
Therefore, claim 1 recites an abstract idea which is a judicial exception.
Step 2A Prong Two Analysis:  Claim 1 recites additional elements “processor”, “a local memory to store one or more graph representations”, and “graph processing unit (GrPU) to accelerate computations of the one or more graph representations”. However, these additional features are computer components recited at a high level of generality, such that they amount to no more than mere instructions to apply the judicial exception using a generic computer component.  An additional element that merely recites the words “apply it” (or an equivalent) with the judicial exception, or merely includes instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea, does not integrate the judicial exception into a practical application.  Claim 1 also recites additional elements “compute logic to accelerate neural network computations”, “the GrPU supports multiple function pointers and threads to accelerate traversal of the one or more graph representations”, and “wherein the compute logic further comprises a compilation unit (CU) to compile shader kernel”, which amount to generally linking the judicial exception to a particular technology or field of use.  It would be implicit in the field of computer graphics that performing computations through a shader (such as a compute shader) would require compiling said shader.  Therefore, claim 1 is directed to a judicial exception.
Step 2B Analysis:  Claim 1 does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the lack of integration of the abstract idea into a practical application, the additional elements recited in claim 1 amount to no more than mere instructions to apply the judicial exception using a generic computer component.
For the reasons above, claim 1 is rejected as being directed to non-patentable subject matter under §101. This rejection applies equally to independent claim 26 as well as dependent claims 6, 13-17, and 25-34. The additional limitations of the dependent claims are addressed briefly below:
Dependent claim 6 recites additional mathematical calculation “wherein the GPU is configured to perform non-uniform quantization for the neural network”.
Dependent claim 13 recites additional generic computer components “circuitry” as well as additional elements “to accelerate application of an activation function for an operation associated with the neural network”, which amount to generally linking the judicial exception to a particular field or technology.  Accelerating the neural network is recited at a high level of generality.
Dependent claim 14 recites additional insignificant extra-solution activity “a fetch stage to receive input values” (gathering data), and “a writeback stage to pack and prepare results to be outputted” (outputting data). Claim 14 also recites additional mathematical calculations “an execute stage to perform computation operations on the input value”.
Dependent claim 15 recites additional mental processes “to analyze and identify” which amounts to evaluation and judgement.
Dependent claim 16 recites additional generic computer components “first execute stage circuitry to implement a first set of activation functions; and second execute stage circuitry to implement a second set of activation functions”. 
Dependent claim 17 recites additional insignificant extra-solution activity “wherein the writeback is configured to: receive results from the first execute stage circuitry and the second execute stage circuitry; and output the results in a format associated with an output tensor”, which amounts to gathering and outputting data (See Mayo, 566 U.S. at 79, 101 USPQ2d at 1968; OIP Techs., Inc. v. Amazon.com, Inc., 788 F.3d 1359, 1363, 115 USPQ2d 1090, 1092-93 (Fed. Cir. 2015) (presenting offers and gathering statistics amounted to mere data gathering)).
Dependent claim 25 recites additional observation, evaluation, and judgement “detect a condition associated with input data of a neural network computation;” as well as additional elements “compile a modified shader that is configures the GPU to perform a modified neural network computation” which amounts to generally linking the judicial exception to a particular field or technology.  Claim 25 also recites additional mathematical calculations “perform the modified neural network computation via the compiled modified shader.”
Dependent claim 27 recites additional mathematical calculations “to process an input image above a resolution threshold via the neural network” and “wherein to process the input image includes to divide the input image into two or more image batches and process the two or more image batches via the heterogenous processor.”
Dependent claim 28 recites additional mathematical calculations “the heterogenous processor is configured to process the two or more image batches” as well as additional elements “in parallel via two or more cores”, which is well-understood, routine, and conventional (Squyres [p. 70 §2.1] "For large classes of image processing tasks, the input image data required to compute a given portion of the output is spatially localized. In the simplest case, an output image is computed simply by independently processing single pixels of the input image. More dependent generally, a neighborhood (or window) of pixels from the input image is used to compute an output pixel. Hence, the output pixels can be computed independently and in parallel. This high degree of natural parallelism can be easily exploited by parallel algorithms. In fact, many image processing routines can achieve near linear speedup with the addition of processing nodes (over a reasonable number of nodes)."). 
Dependent claim 29 recites additional mathematical calculations “process a first image batch via the CPU core”, “process a second image batch via the GPU core”, and “process a third image batch via the accelerator”.
Dependent claim 30 recites additional mathematical calculations “wherein at least one core of the heterogenous processor is configured to perform non-uniform quantization for the neural network.”
Dependent claim 31 recites additional elements “wherein at least one core of the heterogenous processor includes circuitry to accelerate application of an activation function for an operation associated with the neural network”, which amounts to generally linking the judicial exception to a particular field or technology.
Dependent claim 32 recites additional generic computer components “circuitry” as well as additional insignificant extra-solution activity “a fetch stage to receive input values” (gathering data), and “a writeback stage to pack and prepare results to be outputted” (outputting data). Claim 32 also recites additional mathematical calculations “an execute stage to perform computation operations on the input value”.
Dependent claim 33 recites additional observation, evaluation, and judgement “to analyze and identify a first operation to be implemented via first execute stage circuitry and a second operation to be implemented via second execution stage circuitry”.
Dependent claim 34 recites additional mathematical calculations “first execute stage circuitry to implement a first set of activation functions; and second execute stage circuitry to implement a second set of activation functions” as well as insignificant extra-solution activity of gathering data “the writeback stage is configured to: receive results from the first execute stage circuitry and the second execute stage”.

Therefore, when considering the elements separately and in combination, they do not add significantly more to the inventive concept. Accordingly, claims 1, 6, 13-17, and 25-34 are rejected under 35 U.S.C. § 101. 
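
For illustration only, the spatial locality described in the Squyres passage cited for claim 28 can be sketched as follows. This is a minimal sketch and not an implementation drawn from the application or from any cited reference. The function names (split_rows, invert_band, process_in_parallel), the per-pixel inversion operation, and the use of Python threads in place of processor cores are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def split_rows(image, n_batches):
    """Split a 2D image (a list of pixel rows) into contiguous row bands."""
    height = len(image)
    band = max(1, -(-height // n_batches))  # ceiling division, at least 1 row
    return [image[i:i + band] for i in range(0, height, band)]

def invert_band(band):
    """Hypothetical per-pixel operation: invert 8-bit intensities."""
    return [[255 - px for px in row] for row in band]

def process_in_parallel(image, n_batches=2):
    """Process each band independently, then stitch the bands back together."""
    batches = split_rows(image, n_batches)
    with ThreadPoolExecutor(max_workers=n_batches) as pool:
        results = list(pool.map(invert_band, batches))  # map preserves band order
    return [row for band in results for row in band]

image = [[0, 10], [20, 30], [40, 50]]
print(process_in_parallel(image, n_batches=2))
```

Because each output pixel depends only on its own input pixel, the bands share no state and can be handed to separate workers without coordination.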

Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

The following is a quotation of the first paragraph of pre-AIA  35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.

Claim 27 is rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention. Regarding dividing an input image based on exceeding an image resolution threshold, the specification is silent on a resolution threshold, how it is determined, and what the resolution threshold might be.  Dividing an image based on exceeding an image resolution threshold is seen as introducing new matter.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: 
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


	Claims 1, 13-17, 25-26, and 31-34 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Caulfield (US20190180499A1) and Yan (US20130173894A1), and in further view of Paltashev (US20180114290A1).

	Regarding claim 1, Caulfield teaches An apparatus to facilitate compute optimization, comprising: at least one processor to perform operations to implement a neural network; and ([¶0085] "a system, such as the system shown in FIG. 2, may be additionally provided with one or more hardware accelerators to implement and/or utilize convolutional neural networks (CNNs)...Neural network classifiers can run either exclusively using the hardware (HW) convolutional neural network (CNN) accelerator 207 or in a combination of processors and HW CNN accelerator 207")
	compute logic including circuitry configured to accelerate neural network computations. ([¶0085] "Neural network classifiers can run either exclusively using the hardware (HW) convolutional neural network (CNN) accelerator 207 or in a combination of processors and HW CNN accelerator 207")
	the circuitry comprising: a local memory to store one or more graph representations ([¶0080] "The apparatus depicted in FIG. 2 may include a host system composed on host CPU 200 and associated host memory 201" [¶0145] "FIG. 37 is a diagram showing how 2D Path-Finding on a 2D 2×2 bitmap can be accelerated in accordance with some embodiments...This approach prunes branches from the graph search algorithm" [¶0148] "In some instances, the volumetric data structure 4008 may be pre-loaded onto the local memory" FIG. 37 shows a bitmap for performing pathfinding which is a graph traversal method.  Therefore the bitmap is interpreted as synonymous with a graph representation. Caulfield also teaches that the path-finding algorithm may be scaled into three dimensions, and explicitly teaches that volumetric graph representations may be stored in memory.).
	and graph processing unit (GrPU) to accelerate computations of the one or more graph representations ([¶0145] "FIG. 37 is a diagram showing how 2D Path-Finding on a 2D 2×2 bitmap can be accelerated in accordance with some embodiments...This approach prunes branches from the graph search algorithm").
	and threads to accelerate traversal of the one or more graph representations; ([¶0239] "FIG. 64 is an example illustration of a processor according to an embodiment. Processor 6400 is an example of a type of hardware device that can be used in connection with the implementations above...Processor 6400 may be a single-threaded core or, for at least one embodiment, the processor 6400 may be multi-threaded in that it may include more than one hardware thread context (or “logical processor”) per core.").
	However, Caulfield does not explicitly teach wherein the GrPU supports multiple function pointers and compilation unit (CU) configured to compile shader kernels 
	and wherein the CU and the GrPU are configured to perform a compute operation implemented via a dynamically compiled shader 
	and the dynamically compiled shader is dynamically compiled and executed in response to a detected condition.  

Yan, in the same field of endeavor, teaches wherein the GrPU supports multiple function pointers ([¶0059] "In one embodiment, the CPU vtable may include function pointers such as vfunc1 and vfunc2 and the GPU vtable may include function pointers such as vfunc1' and vfunc2'."). 

Yan and Caulfield are both directed towards heterogeneous processor systems.  Therefore, Yan and Caulfield are analogous art in the same field of endeavor.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the function pointer mapping taught in Yan with the multi-purpose processor in Caulfield. The combination would have been obvious because a person of ordinary skill in the art would be able to determine from Caulfield that a graphics processor can be capable of both processing graphics and neural networks, and that Yan teaches examples and advantages of shared virtual memory including storing function pointers on the GPU ([¶0060] “In one embodiment, the CPU vtable may include function pointers such as vfunc1 and vfunc2 and the GPU vtable may include function pointers such as vfunc1′ and vfunc2′. In one embodiment, the function pointers (vfunc1 and vfunc2) and (vfunc1′ and vfunc2′) may be different. In one embodiment, saving the CPU vtable and the GPU vtable in the shared non-coherent region 860 may enable the CPU 110 and the GPU 180 to, respectively, see the CPU vtable and the GPU vtable at the same address location”).  
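
For illustration only, the arrangement quoted from Yan, in which a CPU vtable and a GPU vtable hold different function pointers yet are visible at the same address location, can be sketched as below. This is a simplified model and not Yan's implementation. The address 0x1000, the dictionary standing in for the shared region, and the helper call_virtual are hypothetical names introduced for the example. Only vfunc1 and vfunc1' come from the quoted passage.

```python
def vfunc1_cpu(x):
    """Stand-in for vfunc1, the CPU-side implementation."""
    return x + 1

def vfunc1_gpu(x):
    """Stand-in for vfunc1', the GPU-side implementation of the same virtual function."""
    return x + 1

# Hypothetical model of a shared region: one address, two device-specific vtables.
SHARED_REGION = {
    0x1000: {
        "cpu": {"vfunc1": vfunc1_cpu},
        "gpu": {"vfunc1": vfunc1_gpu},
    },
}

def call_virtual(address, device, name, *args):
    """Look up the vtable for a device at the shared address and dispatch through it."""
    vtable = SHARED_REGION[address][device]
    return vtable[name](*args)

print(call_virtual(0x1000, "cpu", "vfunc1", 41))
print(call_virtual(0x1000, "gpu", "vfunc1", 41))
```

Both devices resolve the call through the same address, while the function pointer actually invoked differs per device.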

However, the combination of Caulfield and Yan does not explicitly teach a compilation unit (CU) configured to compile shader kernels 
	and wherein the CU and the GrPU are configured to perform a compute operation implemented via a dynamically compiled shader 
	and the dynamically compiled shader is dynamically compiled and executed in response to a detected condition.  

Paltashev, in the same field of endeavor, teaches and compilation unit (CU) configured to compile shader kernels ([¶0041] "Implementing the graphics pipeline 201 as a monolithic pipeline object, allows a reduction in API overhead by enabling up-front shader optimization at compile time. Embodiments of the graphics pipeline 201 also make the CPU performance of the pipeline driver more predictable, since shader compilation is not kicked off by the driver at draw time outside of the application's control.")
	and wherein the CU and the GrPU are configured to perform a compute operation implemented via a dynamically compiled shader ([¶0066] "To support implementations of a reconfigurable GPU, the graphics processing system 600 also includes shared fixed function hardware blocks 641, 642, 643, 644, 645...For another example, the tessellator 634 can transmit a request to the dedicated fixed function hardware block 642 to perform an operation and the results of the operation can be returned to the kernel domain shader 635" Tessellator interpreted as dynamically compiled shader configured to perform a compute operation.)
	and the dynamically compiled shader is dynamically compiled and executed in response to a detected condition. ([¶0041] "Implementing the graphics pipeline 201 as a monolithic pipeline object, allows a reduction in API overhead by enabling up-front shader optimization at compile time. Embodiments of the graphics pipeline 201 also make the CPU performance of the pipeline driver more predictable, since shader compilation is not kicked off by the driver at draw time outside of the application's control." Since shader compilation is not kicked off by the driver outside of the application's control, it is interpreted as synonymous with being executed in response to a detected condition.). 

	Caulfield, Yan, and Paltashev are all directed towards heterogeneous processor systems.  Therefore, Caulfield, Yan, and Paltashev are all analogous art in the same field of endeavor.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, that runtime shader compilation is inherent in graphics processing. The combination of Caulfield, Yan, and Paltashev supports the inherency of shader compilation for GPUs.  The combination would have been obvious because a person of ordinary skill in the art would be able to determine from Paltashev the advantages of scheduling compute shaders on such a graphics processing system ([¶0068] “Some embodiments of the graphics processing system 600 have a number of advantages over conventional graphics pipelines. For example, the graphics processing system 600 utilizes fixed-function hardware within the compute domain and many compute shaders or virtual GPUs can be scheduled concurrently and load balanced by the asynchronous compute engines”).

	Regarding claim 13, the combination of Caulfield, Yan, and Paltashev teaches The apparatus of claim 1, wherein the GPU includes circuitry to accelerate application of an activation function for an operation associated with the neural network. (Caulfield [¶0085] "Neural network classifiers can run either exclusively using the hardware (HW) convolutional neural network (CNN) accelerator 207 or in a combination of processors and HW CNN accelerator 207" neural network classifier is interpreted as synonymous with deep learning function.). 

	Regarding claim 14, the combination of Caulfield, Yan, and Paltashev teaches The apparatus of claim 13, wherein the GPU includes: circuitry to provide a fetch stage to receive input values; (Caulfield [¶0082] "Continuing with the example of FIG. 2, in some implementations the synthetic voxel geometry 202 may be combined with measured geometry voxels 227 constructed using a simultaneous localization and mapping (SLAM) pipeline 217. The SLAM pipeline may use active sensors and/or passive image sensors 214 (e.g., 214.1 and 214.2)")
	circuitry to provide an execute stage to perform computation operations on the input values; and (Caulfield [¶0082] "which are first processed using an image signal processing (ISP) pipeline 215")
	and circuitry to provide a writeback stage to pack and prepare results to be outputted. (Caulfield [¶0082] "to produce an output 225" [¶0135] "FIG. 27 illustrates logic to generate a 6-bit address triplet to control the multiplexers in accordance with some embodiments, which perform voxel insertion, deletion and retrieval...In this example the 16-bit x, y and z addresses of the voxel to be inserted, retrieved, tested for, etc. in a sparse voxel tree are presented to the address formatting logic 2705 as a packed 64-bit input value 2700"). 

	Regarding claim 15, the combination of Caulfield, Yan, and Paltashev teaches The apparatus of claim 14, wherein the fetch stage is configured to analyze and identify a first operation to be implemented via first execute stage circuitry and a second operation to be implemented via second execution stage circuitry. (Caulfield [¶0102] "a neural network includes an initial convolutional processing layer 1100, followed by pooling processing 1110, and finally an activation function processing, such as rectified linear unit (ReLU) function 1120. The output of the ReLU unit 1120, which provides ReLU output vector 1131, may be connected to a following convolutional processing layer 1180 (e.g., possibly via delay 1132), which receives ReLU output vector 1131... a ReLU bitmap 1130 may also be generated in parallel with the connection of the ReLU unit 1120 to the following convolution unit 1180, the ReLU bitmap 1130 denoting which elements in the ReLU output vector 1131 are zeroes and which are non-zeroes." See also FIG. 11 1120 for identified first operation.). 
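
For illustration only, the relationship between the ReLU output vector and the ReLU bitmap described in the quoted Caulfield passage can be sketched as follows. This is a minimal sketch and not Caulfield's circuitry. The function name relu_with_bitmap and the sample input values are assumptions made for the example.

```python
def relu_with_bitmap(inputs):
    """Apply ReLU and also report which output elements are non-zero.

    Returns (outputs, bitmap), where bitmap[i] is 1 if outputs[i] is
    non-zero and 0 if it is zero.
    """
    outputs = [x if x > 0 else 0 for x in inputs]
    bitmap = [1 if y != 0 else 0 for y in outputs]
    return outputs, bitmap

outputs, bitmap = relu_with_bitmap([-2, 0, 3, -1, 5])
print(outputs)  # [0, 0, 3, 0, 5]
print(bitmap)   # [0, 0, 1, 0, 1]
```

A following stage could consult the bitmap to skip elements known to be zero, which is one plausible use of such a structure.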

	Regarding claim 16, the combination of Caulfield, Yan, and Paltashev teaches The apparatus of claim 15, wherein the execute stage comprises: first execute stage circuitry to implement a first set of activation functions; and second execute stage circuitry to implement a second set of activation functions. (Caulfield [¶0102] "a neural network includes an initial convolutional processing layer 1100, followed by pooling processing 1110, and finally an activation function processing, such as rectified linear unit (ReLU) function 1120. The output of the ReLU unit 1120, which provides ReLU output vector 1131, may be connected to a following convolutional processing layer 1180 (e.g., possibly via delay 1132), which receives ReLU output vector 1131." See also FIG. 11.  Caulfield explicitly shows a first set of ReLU activation circuits for performing activation for a first stage (layer N-1) followed by activations for a second stage (layer N) of the neural network.). 

	Regarding claim 17, the combination of Caulfield, Yan, and Paltashev teaches The apparatus of claim 16, wherein the writeback stage is configured to: receive results from the first execute stage circuitry and the second execute stage circuitry; and output the results in a format associated with an output tensor. (Caulfield [¶0102] "The output of the ReLU unit 1120, which provides ReLU output vector 1131, may be connected to a following convolutional processing layer 1180 (e.g., possibly via delay 1132), which receives ReLU output vector 1131" ReLU output vector interpreted as synonymous with output tensor.). 

	Regarding claim 25, the combination of Caulfield, Yan, and Paltashev teaches The apparatus of claim 1, wherein the circuitry of the GPU is configured to: detect a condition associated with input data of a neural network computation; (Caulfield [¶0105] "CNN ReLU layers can produce high numbers of output zeroes corresponding to negative inputs." Detecting negative inputs associated with neural network input interpreted as synonymous with detecting a condition associated with input data of a neural network computation.)
	compile a modified shader that is configures the GPU to perform a modified neural network computation; (Paltashev [¶0041] "Implementing the graphics pipeline 201 as a monolithic pipeline object, allows a reduction in API overhead by enabling up-front shader optimization at compile time. Embodiments of the graphics pipeline 201 also make the CPU performance of the pipeline driver more predictable, since shader compilation is not kicked off by the driver at draw time outside of the application's control." [¶0043] "virtual graphics pipelines can be configured to support deep learning neural networks that are implemented on a GPU platform.")
	and perform the modified neural network computation via the compiled modified shader. (Paltashev [¶0066] "To support implementations of a reconfigurable GPU, the graphics processing system 600 also includes shared fixed function hardware blocks 641, 642, 643, 644, 645...For another example, the tessellator 634 can transmit a request to the dedicated fixed function hardware block 642 to perform an operation and the results of the operation can be returned to the kernel domain shader 635"  [¶0043] "virtual graphics pipelines can be configured to support deep learning neural networks that are implemented on a GPU platform."). 

	Regarding claim 26, claim 26 is substantially similar to claim 1.  Therefore, the rejection applied to claim 1 also applies to claim 26.

	Regarding claim 31, the combination of Caulfield, Yan, and Paltashev teaches The data processing system of claim 26, wherein at least one core of the heterogenous processor includes circuitry to accelerate application of an activation function for an operation associated with the neural network. (Caulfield [¶0085] "Neural network classifiers can run either exclusively using the hardware (HW) convolutional neural network (CNN) accelerator 207 or in a combination of processors and HW CNN accelerator 207" [¶0102] "The hardware may include one or more processors, one or more microprocessors, one or more circuits, one or more computers, and the like. In this particular example, a neural network includes an initial convolutional processing layer 1100, followed by pooling processing 1110, and finally an activation function processing," neural network classifier is interpreted as synonymous with deep learning function.). 

	Regarding claim 32, the combination of Caulfield, Yan, and Paltashev teaches The data processing system of claim 31, wherein at least one core of the heterogenous processor includes: circuitry to provide a fetch stage to receive input values; (Caulfield [¶0082] "Continuing with the example of FIG. 2, in some implementations the synthetic voxel geometry 202 may be combined with measured geometry voxels 227 constructed using a simultaneous localization and mapping (SLAM) pipeline 217. The SLAM pipeline may use active sensors and/or passive image sensors 214 (e.g., 214.1 and 214.2)").
	circuitry to provide an execute stage to perform computation operations on the input values; (Caulfield [¶0082] "which are first processed using an image signal processing (ISP) pipeline 215")
	and circuitry to provide a writeback stage to pack and prepare results to be output. (Caulfield [¶0082] "to produce an output 225" [¶0135] "FIG. 27 illustrates logic to generate a 6-bit address triplet to control the multiplexers in accordance with some embodiments, which perform voxel insertion, deletion and retrieval...In this example the 16-bit x, y and z addresses of the voxel to be inserted, retrieved, tested for, etc. in a sparse voxel tree are presented to the address formatting logic 2705 as a packed 64-bit input value 2700"). 

	Regarding claim 33, the combination of Caulfield, Yan, and Paltashev teaches The data processing system of claim 32, wherein the fetch stage is configured to analyze and identify a first operation to be implemented via first execute stage circuitry and a second operation to be implemented via second execution stage circuitry. (Caulfield [¶0102] "a neural network includes an initial convolutional processing layer 1100, followed by pooling processing 1110, and finally an activation function processing, such as rectified linear unit (ReLU) function 1120. The output of the ReLU unit 1120, which provides ReLU output vector 1131, may be connected to a following convolutional processing layer 1180 (e.g., possibly via delay 1132), which receives ReLU output vector 1131... a ReLU bitmap 1130 may also be generated in parallel with the connection of the ReLU unit 1120 to the following convolution unit 1180, the ReLU bitmap 1130 denoting which elements in the ReLU output vector 1131 are zeroes and which are non-zeroes." See also FIG. 11 1120 for identified first operation.  Activation operation at layer N interpreted as second operation.). 

	Regarding claim 34, the combination of Caulfield, Yan, and Paltashev teaches The data processing system of claim 33, wherein the execute stage comprises: first execute stage circuitry to implement a first set of activation functions; and second execute stage circuitry to implement a second set of activation functions, and wherein the writeback stage is configured to: receive results from the first execute stage circuitry and the second execute stage (Caulfield [¶0102] "The output of the ReLU unit 1120, which provides ReLU output vector 1131, may be connected to a following convolutional processing layer 1180 (e.g., possibly via delay 1132), which receives ReLU output vector 1131" ReLU output vector interpreted as synonymous with output tensor.). 

	Claims 6 and 30 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Caulfield, Yan, and Paltashev, and further in view of Park (“Weighted-Entropy-based Quantization for Deep Neural Networks”, 2017).

	Regarding claim 6, the combination of Caulfield, Yan, and Paltashev teaches The apparatus of claim 1.
	However, the combination of Caulfield, Yan, and Paltashev does not explicitly teach wherein the GPU is configured to perform non-uniform quantization for the neural network.

Park, in the same field of endeavor, teaches The apparatus of claim 1, wherein the GPU is configured to perform non-uniform quantization for the neural network. ([Abstract] "Unlike recent work on binary-weight neural networks, our approach is multi-bit quantization, in which weights and activations can be quantized by any number of bits depending on the target accuracy" [p. 5459 Sec. 4.1] "Maximizing the weighted entropy optimizes the quantization result towards maximizing entropy while considering the importance of data. Thus, our method groups many near-zero values into a large cluster by considering their lower importance. Large, but infrequent values are also grouped into a cluster that covers a wide range of weight values." [p. 5461 §5.1] "For image classification tasks, we evaluate the proposed method by quantizing two widely used CNNs for ImageNet tasks [6]: AlexNet [15] GoogLeNet [21] (both from Caffe framework [13]) and ResNet3 [11]. In order to apply our quantization scheme into these networks, we perform fine tuning combined with our weight/activation quantization schemes under the batch size of 256 (for AlexNet), 64 (for GoogLeNet), or 16 (for ResNet-50/101). In the cases of GoogLeNet and ResNet, the batch size is limited due to insufficient GPU memory capacity" Park explicitly teaches that the quantization is performed on the GPU.). The motivation for combination set forth below also applies to the claims depending on this combination.

	Caulfield, Yan, Paltashev, and Park are all directed towards heterogeneous processor systems.  Therefore, Caulfield, Yan, Paltashev, and Park are all analogous art in the same field of endeavor.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the non-uniform quantization taught in Park with the processing system capable of processing graphics and neural networks taught in the combination of Caulfield, Yan, and Paltashev. The combination would have been obvious because a person of ordinary skill in the art would be able to determine from Park that the non-uniform quantization not only speeds up training and classification, but also outperforms other well-known neural network quantization methods by allowing greater control over the quantization.  See Park ([p. 5456 Col. 2 Sec. 1] "Such aggressive quantization methods are promising in that they can achieve significant reductions in the execution time, energy consumption, and memory capacity requirements of neural networks during the inference by exploiting the benefits of dedicated hardware accelerators, e.g. NVIDIA P40 and P4 [2] which support 8-bit integer arithmetic or Stripes [14] which provides execution time and energy consumption proportional to the bitwidth").
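
For illustration only, the general notion of non-uniform quantization (quantization levels that are not evenly spaced) can be sketched as follows. This sketch uses simple equal-population clustering as a stand-in; it is not Park's weighted-entropy method, and the names `fit_nonuniform_levels` and `quantize` as well as the sample weights are hypothetical.

```python
from bisect import bisect_right
from statistics import mean


def fit_nonuniform_levels(values, num_levels):
    """Split the sorted values into clusters of (nearly) equal population.

    Assumption for this sketch: equal-population clusters, NOT the
    weighted-entropy criterion of Park. Returns (boundaries, representatives).
    """
    ordered = sorted(values)
    n = len(ordered)
    # Cluster i covers ordered[starts[i]:starts[i + 1]].
    starts = [round(i * n / num_levels) for i in range(num_levels + 1)]
    clusters = [ordered[starts[i]:starts[i + 1]] for i in range(num_levels)]
    clusters = [c for c in clusters if c]  # drop empty clusters
    representatives = [mean(c) for c in clusters]
    # A value at or above boundaries[i] falls into cluster i + 1 or later.
    boundaries = [c[0] for c in clusters[1:]]
    return boundaries, representatives


def quantize(value, boundaries, representatives):
    """Map a value to the representative of the cluster it falls into."""
    return representatives[bisect_right(boundaries, value)]


weights = [-0.9, -0.02, -0.01, 0.0, 0.01, 0.02, 0.03, 0.8]
boundaries, representatives = fit_nonuniform_levels(weights, num_levels=4)
quantized = [quantize(w, boundaries, representatives) for w in weights]
```

Because most of the sample weights lie near zero, the cluster boundaries in this sketch are closely spaced near zero and widely spaced elsewhere, which is the sense in which the resulting levels are non-uniform.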

	Regarding claim 30, the combination of Caulfield, Yan, and Paltashev teaches The data processing system of claim 26.
	However, the combination of Caulfield, Yan, and Paltashev does not explicitly teach wherein at least one core of the heterogenous processor is configured to perform non-uniform quantization for the neural network.  

Park, in the same field of endeavor, teaches The data processing system of claim 26, wherein at least one core of the heterogenous processor is configured to perform non-uniform quantization for the neural network. ([Abstract] "Unlike recent work on binary-weight neural networks, our approach is multi-bit quantization, in which weights and activations can be quantized by any number of bits depending on the target accuracy" [p. 5459 Sec. 4.1] "Maximizing the weighted entropy optimizes the quantization result towards maximizing entropy while considering the importance of data. Thus, our method groups many near-zero values into a large cluster by considering their lower importance. Large, but infrequent values are also grouped into a cluster that covers a wide range of weight values."). 

	Caulfield, Yan, Paltashev, and Park are all directed towards heterogeneous processor systems.  Therefore, Caulfield, Yan, Paltashev, and Park are all analogous art in the same field of endeavor.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the non-uniform quantization taught in Park with the processing system capable of processing graphics and neural networks taught in the combination of Caulfield, Yan, and Paltashev. The combination would have been obvious because a person of ordinary skill in the art would be able to determine from Park that the non-uniform quantization not only speeds up training and classification, but also outperforms other well-known neural network quantization methods by allowing greater control over the quantization.  See Park ([p. 5456 Col. 2 Sec. 1] "Such aggressive quantization methods are promising in that they can achieve significant reductions in the execution time, energy consumption, and memory capacity requirements of neural networks during the inference by exploiting the benefits of dedicated hardware accelerators, e.g. NVIDIA P40 and P4 [2] which support 8-bit integer arithmetic or Stripes [14] which provides execution time and energy consumption proportional to the bitwidth").

	Claim 27 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Caulfield, Yan, and Paltashev, and further in view of He (US10262190B2).

	Regarding claim 27, the combination of Caulfield, Yan, and Paltashev teaches The data processing system of claim 26.
	However, the combination of Caulfield, Yan, and Paltashev does not explicitly teach wherein the [heterogenous] processor is configured to process an input image above a resolution threshold via the neural network,
	wherein to process the input image includes to divide the input image into two or more image batches and process the two or more image batches via the [heterogenous] processor.  

He, in the same field of endeavor, teaches the [heterogenous] processor is configured to process an input image above a resolution threshold via the neural network, ([Col. 8 l. 21-30] "the face recognition method detects the first feature point using the off-line trained feature point classifier in a learning way. Particularly, the face recognition method scales the face image into different scales based on the detection result of the face region image, and performs the detection using the off-line trained Convolutional Neural Network (CNN)" [Col. 1 l. 65-68] "in the method according to the embodiment of the present disclosure, a resolution of the image is larger than a preset resolution threshold." The primary reference Caulfield teaches the heterogeneous processor.)
	wherein to process the input image includes to divide the input image into two or more image batches and process the two or more image batches via the [heterogenous] processor. ([Col. 8 l. 21-37] "the face recognition method scales the face image into different scales based on the detection result of the face region image, and performs the detection using the off-line trained Convolutional Neural Network (CNN) classifier in a way such as a slide window, or the like, in each scale, so that the location and the size of the first feature point are detected." [Col. 9 l. 0-10] "L(x,y,σ) is a Gaussian image in scale σ which is obtained by a convolution of the image I(x,y) with the Gaussian core $G(x, y, \sigma) = \frac{1}{2\pi\sigma^{2}} \exp\left(-\frac{x^{2} + y^{2}}{2\sigma^{2}}\right)$, and x and y are the horizontal coordinate and the vertical coordinate of each pixel in the image to be recognized, respectively." Sliding window interpreted as synonymous with dividing the input image into two or more image batches and processing the two or more image batches. See Gaussian core Eqn. on Col. 9 for how sliding window divides input image.).

	The combination of Caulfield, Yan, and Paltashev, as well as He, are directed towards processing images.  Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the heterogeneous processor systems of Caulfield, Yan, and Paltashev with the teachings of He by requiring the input image to be above a threshold resolution, and by dividing the input image into processing batches.  The combination of Caulfield, Yan, and Paltashev explicitly teaches performing convolution calculations on images, and one of ordinary skill in the art would recognize that convolution calculations are regularly performed using a sliding window based on the kernel size.  This is reinforced by He.  He is specifically directed towards using a convolutional neural network for face recognition and explains as a motivation for combination ([Col. 13 l. 17-33] "in order to make the recognition result more accurate, the match feature point pair calculation module 2430 may screen the preliminary match result obtained by the match feature point pair detection module 2420, based on the RANSAC method, to obtain the matched feature point pairs. Of course, the above two ways are only examples. The match feature point pair calculation module 2430 may screen the result obtained by the match feature point pair detection module 2420 as described above according to other rules, to make the final recognition result more accurate.").
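
For illustration only, the interpretation that a sliding window divides an input image above a resolution threshold into two or more portions can be sketched as follows. This is a minimal Python sketch under stated assumptions and is not taken from He or any other cited reference; the function names, the nested-list image representation, and the window, stride, and threshold values are hypothetical.

```python
def exceeds_threshold(image, min_height, min_width):
    """Return True if the image is larger than the (hypothetical) threshold."""
    return len(image) > min_height and len(image[0]) > min_width


def sliding_windows(image, window, stride):
    """Collect (top, left, patch) for every window fully inside the image."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            # Slice `window` rows, then `window` columns from each row.
            patch = [row[left:left + window] for row in image[top:top + window]]
            patches.append((top, left, patch))
    return patches


# A 4x4 test image whose pixel value encodes its position (row * 4 + column).
image = [[r * 4 + c for c in range(4)] for r in range(4)]

if exceeds_threshold(image, min_height=2, min_width=2):
    patches = sliding_windows(image, window=2, stride=2)
else:
    patches = [(0, 0, image)]  # small images are processed whole
```

With a stride equal to the window size the patches do not overlap; a smaller stride would produce overlapping patches, as is typical for convolution.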

	Claims 28-29 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Caulfield, Yan, Paltashev, and He, and further in view of Fraser (US20190080223A1).

	Regarding claim 28, the combination of Caulfield, Yan, Paltashev, and He teaches The data processing system of claim 27.
	However, the combination of Caulfield, Yan, Paltashev, and He does not explicitly teach wherein the heterogenous processor is configured to process the two or more image batches in parallel via two or more cores.

Fraser, in the same field of endeavor, teaches The data processing system of claim 27, wherein the heterogenous processor is configured to process the two or more image batches in parallel via two or more cores. ([¶0040] "a backpropagation learning method may be used to calculate the error contribution of each neuron after a batch of data (e.g., in image recognition, multiple images) is processed." [¶0079] "In some embodiments, each forward path PE corresponds to a neuron of the layer 204-i. As such, the number P of forward path PEs in the forward path unit 300 may control the number of neurons of the layer 204-i that may be computed in parallel in the forward path process." Forward processing element interpreted as synonymous with compute node.  Fraser explicitly teaches that images may be batched at the input of the neural network system and processed in parallel using the processing elements.).

The combination of Caulfield, Yan, Paltashev, and He, as well as Fraser are all directed towards heterogeneous processor systems.  Therefore, the combination of Caulfield, Yan, Paltashev, and He, and Fraser are analogous art in the same field of endeavor.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the neural networks of the combination of Caulfield, Yan, Paltashev, and He with that of Fraser by implementing parallel batching. Fraser teaches that batching is a method of removing dependencies and allowing for greater parallelization ([¶0042] “It has been discovered that various techniques may be used to remove the dependencies in a training algorithm, which may prevent or reduce stalling and allow implementations using multiple accelerators...Such a delayed model adaptation allows the forward path and the backward path of neural network training to be implemented in parallel. This is achieved by introducing a delay between the output activations and the weight update and gradient calculations in each layer. This allows the calculations for each layer to be performed independently without different batches of inputs.”).
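
For illustration only, processing two or more image batches in parallel can be sketched as follows. This is a minimal Python sketch under stated assumptions and is not Fraser's implementation; worker threads stand in for processor cores, and `process_batch`, `process_batches_in_parallel`, and the sample data are hypothetical. In CPython, threads may not yield a true speedup for CPU-bound work, so the sketch shows only the structure of dispatching independent batches.

```python
from concurrent.futures import ThreadPoolExecutor


def process_batch(batch):
    """Placeholder per-batch work: sum the pixel values of every image."""
    return sum(sum(image) for image in batch)


def process_batches_in_parallel(batches, num_workers=2):
    """Dispatch each batch to a worker; results keep the input order."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(process_batch, batches))


# Three batches; each image is a flat list of pixel values.
batches = [[[1, 2], [3, 4]], [[5, 6]], [[7], [8, 9]]]
results = process_batches_in_parallel(batches)
```

Because each batch is processed independently of the others, the per-batch results match those of a sequential loop over the same batches.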

	Regarding claim 29, the combination of Caulfield, Yan, Paltashev, He, and Fraser teaches The data processing system of claim 28, wherein the heterogenous processor is configured to: process a first image batch via the CPU core; (Fraser [¶0002] "neural networks need to be trained on a sufficiently large dataset, and the training is performed on the basis of floating point arithmetic using general purpose graphics processing units (GPGPUs)." [¶0040] "a backpropagation learning method may be used to calculate the error contribution of each neuron after a batch of data (e.g., in image recognition, multiple images) is processed"  [¶0041] "a neural network typically needs to be trained on a sufficiently large training dataset. The training dataset may include a plurality of subsets (batches)")
	process a second image batch via the GPU core; (Fraser [¶0002] "neural networks need to be trained on a sufficiently large dataset, and the training is performed on the basis of floating point arithmetic using general purpose graphics processing units (GPGPUs)." [¶0040] "a backpropagation learning method may be used to calculate the error contribution of each neuron after a batch of data (e.g., in image recognition, multiple images) is processed"  [¶0041] "a neural network typically needs to be trained on a sufficiently large training dataset. The training dataset may include a plurality of subsets (batches)")
	and process a third image batch via the accelerator core. (Fraser [¶0002] "neural networks need to be trained on a sufficiently large dataset, and the training is performed on the basis of floating point arithmetic using general purpose graphics processing units (GPGPUs)." [¶0040] "a backpropagation learning method may be used to calculate the error contribution of each neuron after a batch of data (e.g., in image recognition, multiple images) is processed"  [¶0041] "a neural network typically needs to be trained on a sufficiently large training dataset. The training dataset may include a plurality of subsets (batches)"). 

Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SIDNEY VINCENT BOSTWICK whose telephone number is (571)272-4720. The examiner can normally be reached M-F 7:30am-5:00pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang, can be reached at (571)270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/SB/Examiner, Art Unit 2124
/LUIS A SITIRICHE/Primary Examiner, Art Unit 2126