Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .

Claim Objections
Claim 9 is objected to because of the following informalities: “wherein the compute logic comprises circuitry provide” contains grammatical errors. “wherein the compute logic comprising circuitry provides” is suggested.
Claim 12 is objected to because of the following informalities: the abbreviation for long short-term memory network is given as “LTSM”, but “LSTM” is expected.
Appropriate correction is required.

Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 

Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier.  Such claim limitation(s) is/are: 
“compute logic…configured to” in claim 1.
“compilation unit to compile shader kernels” in claim 4.
“OpenCL implementation is configured to” in claim 10.
“simple lane to” in claim 16.
“complex lane to” in claim 16.
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may:  (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 1-21 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention.

Regarding claim 1, “compute logic including circuitry” is indefinite.  One of ordinary skill in the art would expect that circuitry would include compute logic and not the other way around.  Since circuitry is generally understood as containing compute logic, compute logic containing circuitry that contained compute logic would be self-contradictory. Where applicant acts as his or her own lexicographer to specifically define a term of a claim contrary to its ordinary meaning, the written description must clearly redefine the claim term and set forth the uncommon definition so as to put one reasonably skilled in the art on notice that the applicant intended to so redefine that claim term. Process Control Corp. v. HydReclaim Corp., 190 F.3d 1350, 1357, 52 USPQ2d 1029, 1033 (Fed. Cir. 1999). The term is indefinite because the specification does not clearly redefine the term. 
This part of the rejection can be overcome by amending “compute logic including circuitry” to “circuitry including compute logic”.  In the interest of further examination, the instant claim language has been interpreted to cover the suggested potential revision.

Regarding claim 2, “the graph representation” lacks antecedent basis.  Claim 2 recites one or more graph representations, but does not distinguish a singular graph representation.  Examiner suggests clarifying the claim as “accelerate processing of the one or more graph representations”.

Regarding claim 4, compilation unit is not supported by the specification.  It is not clear whether a compilation unit is a hardware, software, or other logical component.  Similarly it is not clear whether or not the compilation unit is intrinsic to the invention.  One of ordinary skill in the art would not be able to determine what a compilation unit to compile shader kernels is from the instant specification.  With respect to the instant specification a compilation unit is seen as merely a black box to compile shader kernels.

Regarding claims 9 and 10, “OpenCL” is a registered trademark for a changing standard, and is therefore indefinite.  Ex parte Simpson, 218 USPQ 1020 (Bd. App. 1982).

The remaining claims are rejected with respect to their dependence on the rejected claims.

Claim Rejections - 35 USC § 101
101 Rejection
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1-21 are rejected under 35 USC § 101 because the claimed invention is directed to non-statutory subject matter.

Regarding Claim 1:  Claim 1 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 1 is directed to an apparatus, which is a machine, one of the statutory categories.
Step 2A Prong One Analysis:  Claim 1 recites a computer implemented method of processing neural networks, which, under its broadest reasonable interpretation is a series of mental processes.  For example, but for the generic computer components language, the above limitations in the context of this claim encompass neural network processing.  Therefore, claim 1 recites an abstract idea which is a judicial exception.
Step 2A Prong Two Analysis:  Claim 1 recites the additional element “processor”. However, this additional element is a computer component recited at a high level of generality, such that it amounts to no more than mere instructions to apply the judicial exception using a generic computer component.  An additional element that merely recites the words “apply it” (or an equivalent) with the judicial exception, or merely includes instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea, does not integrate the judicial exception into a practical application.  Claim 1 also recites the additional element “compute logic to accelerate neural network computations”, which amounts to generally linking the judicial exception to a particular technology or field of use.  Therefore, claim 1 is directed to a judicial exception.
Step 2B Analysis:  Claim 1 does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the lack of integration of the abstract idea into a practical application, the additional elements recited in claim 1 amount to no more than mere instructions to apply the judicial exception using a generic computer component.
For the reasons above, claim 1 is rejected as being directed to non-patentable subject matter under §101. This rejection applies equally to dependent claims 2-21. The additional limitations of the dependent claims are addressed briefly below:
Dependent claim 2 recites additional generic computer components “a local memory to store one or more graph representations” and “graph processing unit (GrPU) to accelerate computations of the graph representation”.
Dependent claim 3 recites additional elements “wherein the GrPU supports multiple function pointers and threads to accelerate traversing the one or more graph representations” which amounts to applying the judicial exception to a particular field or technology.
Dependent claim 4 recites additional elements “wherein the compute logic further comprises a compilation unit (CU) to compile shader kernels.” which amounts to applying the judicial exception to a particular field or technology.
Dependent claim 5 recites additional mathematical calculation “wherein the CU and the GrPU are implemented to compute an optimized shader operation”.
Dependent claim 6 recites additional mathematical calculation “wherein the compute logic performs non-uniform quantization for the neural network”.
Dependent claim 7 recites additional mathematical calculation “wherein performing the non-uniform quantization comprises providing a lower error percentage to weight values that have a significant impact for accuracy of the neural network”.
Dependent claim 8 recites additional mental processes “wherein discrete points are selected to have lower error percentage for large absolute value numbers, and selected to have higher error percentage for small absolute value numbers” which amounts to evaluation and judgement.
Dependent claim 9 recites additional elements “wherein the compute logic comprises circuitry provide an Open Computing Language (OpenCL) implementation to accelerate workloads on the neural network” which amounts to applying the judicial exception to a particular field or technology.
Dependent claim 10 recites additional elements “wherein the OpenCL implementation is configured to share weights across hidden layers of the neural network” which amounts to applying the judicial exception to a particular field or technology.
Dependent claim 11 recites additional elements “wherein the neural network is a Recurrent neural network (RNN)” which amounts to applying the judicial exception to a particular field or technology.
Dependent claim 12 recites additional elements “wherein the neural network is a long short-term memory network (LTSM)” which amounts to applying the judicial exception to a particular field or technology.
Dependent claim 13 recites additional generic computer components “compute architecture”.
Dependent claim 14 recites additional insignificant extra-solution activity “a fetch stage to receive input values” (gathering data), and “a writeback stage to pack and prepare results to be outputted” (outputting data). Claim 14 also recites additional mathematical calculations “an execute stage to perform computation operations on the input value”.
Dependent claim 15 recites additional mental processes “analyzing and identifying values” which amounts to evaluation and judgement.
Dependent claim 16 recites additional generic computer components “simple lanes”, and “complex lanes”. 
Dependent claim 17 recites additional insignificant extra-solution activity “wherein the writeback stage receives results from the one or more simple lanes and the one or more complex lanes and places the results in a layout format of a tensor output”, which amounts to gathering data.
Dependent claim 18 recites additional elements “wherein the compute logic processes a high-resolution input image via the neural network by cropping the input image into two or more image batches and processing the image batches at the at least one processor” which amounts to applying the judicial exception to a particular field or technology.
Dependent claim 19 recites additional generic computer components “distributed architecture” and “compute nodes”. 
Dependent claim 20 recites additional elements “wherein the two or more image batches are processed in parallel at the plurality of compute nodes” which amounts to applying the judicial exception to a particular field or technology.
Dependent claim 21 recites additional generic computer components “graphics processing unit”, “central graphics processing unit”, and “accelerator”.

Therefore, when considering the elements separately and in combination, they do not add significantly more to the inventive concept. Accordingly, claims 1-21 are rejected under 35 U.S.C. § 101.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1-2 and 13-14 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Caulfield (US 2019/0180499 A1).

Regarding claim 1, Caulfield teaches An apparatus to facilitate compute optimization, comprising: at least one processor to perform operations to implement a neural network; and ([¶0085] "a system, such as the system shown in FIG. 2, may be additionally provided with one or more hardware accelerators to implement and/or utilize convolutional neural networks (CNNs)...Neural network classifiers can run either exclusively using the hardware (HW) convolutional neural network (CNN) accelerator 207 or in a combination of processors and HW CNN accelerator 207").
compute logic including circuitry configured to accelerate neural network computations. ([¶0085] "Neural network classifiers can run either exclusively using the hardware (HW) convolutional neural network (CNN) accelerator 207 or in a combination of processors and HW CNN accelerator 207"). 

Regarding claim 2, Caulfield teaches The apparatus of claim 1, wherein the compute logic comprises: a local memory to store one or more graph representations; and ([¶0080] "The apparatus depicted in FIG. 2 may include a host system composed on host CPU 200 and associated host memory 201" [¶0145] "FIG. 37 is a diagram showing how 2D Path-Finding on a 2D 2×2 bitmap can be accelerated in accordance with some embodiments...This approach prunes branches from the graph search algorithm" [¶0148] "In some instances, the volumetric data structure 4008 may be pre-loaded onto the local memory" FIG. 37 shows a bitmap for performing pathfinding which is a graph traversal method.  Therefore the bitmap is interpreted as synonymous with a graph representation. Caulfield also teaches that the path-finding algorithm may be scaled into three dimensions, and explicitly teaches that volumetric graph representations may be stored in memory.).
graph processing unit (GrPU) to accelerate computations of the graph representation. ([¶0145] "FIG. 37 is a diagram showing how 2D Path-Finding on a 2D 2×2 bitmap can be accelerated in accordance with some embodiments...This approach prunes branches from the graph search algorithm").
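
For illustration only, the interpretation applied above (a bitmap used for path-finding is treated as a graph representation) can be sketched in Python. The grid values, the 4-neighbor connectivity, and the function name `shortest_path_length` are assumptions made for this sketch and are not taken from Caulfield or from the instant claims.

```python
from collections import deque

def shortest_path_length(bitmap, start, goal):
    """Breadth-first search over a 2D occupancy bitmap.

    Each free cell (0) is a graph vertex; edges join 4-connected free cells.
    Occupied cells (1) are never expanded, which prunes those branches.
    Returns the number of steps from start to goal, or None if unreachable.
    """
    rows, cols = len(bitmap), len(bitmap[0])
    if bitmap[start[0]][start[1]] or bitmap[goal[0]][goal[1]]:
        return None
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and not bitmap[nr][nc] and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None

# Hypothetical 2x2 bitmap: the top-right cell is occupied.
grid = [[0, 1],
        [0, 0]]
print(shortest_path_length(grid, (0, 0), (1, 1)))
```

The sketch only shows that traversing a bitmap is a graph search; it does not model any hardware acceleration.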

Regarding claim 13, Caulfield teaches The apparatus of claim 1, wherein the compute logic comprises a compute architecture to activate deep learning functions in the neural network. ([¶0085] "Neural network classifiers can run either exclusively using the hardware (HW) convolutional neural network (CNN) accelerator 207 or in a combination of processors and HW CNN accelerator 207" neural network classifier is interpreted as synonymous with deep learning function.). 

Regarding claim 14, Caulfield teaches The apparatus of claim 13, wherein the compute logic comprises: a fetch stage to receive input values; ([¶0082] "Continuing with the example of FIG. 2, in some implementations the synthetic voxel geometry 202 may be combined with measured geometry voxels 227 constructed using a simultaneous localization and mapping (SLAM) pipeline 217. The SLAM pipeline may use active sensors and/or passive image sensors 214 (e.g., 214.1 and 214.2)").
an execute stage to perform computation operations on the input values; and ([¶0082] "which are first processed using an image signal processing (ISP) pipeline 215").
a writeback stage to pack and prepare results to be outputted. ([¶0082] "to produce an output 225" [¶0135] "FIG. 27 illustrates logic to generate a 6-bit address triplet to control the multiplexers in accordance with some embodiments, which perform voxel insertion, deletion and retrieval...In this example the 16-bit x, y and z addresses of the voxel to be inserted, retrieved, tested for, etc. in a sparse voxel tree are presented to the address formatting logic 2705 as a packed 64-bit input value 2700").

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Caulfield in view of Yan (US 2013/0173894 A1).

Regarding claim 3, Caulfield teaches the apparatus of claim 2, and threads to accelerate traversing the one or more graph representations. ([¶0239] "FIG. 64 is an example illustration of a processor according to an embodiment. Processor 6400 is an example of a type of hardware device that can be used in connection with the implementations above...Processor 6400 may be a single-threaded core or, for at least one embodiment, the processor 6400 may be multi-threaded in that it may include more than one hardware thread context (or “logical processor”) per core."). However, Caulfield does not explicitly teach wherein the GrPU supports multiple function pointers.

Yan, who teaches a related art of communication between a generic CPU and a generic GPU, teaches wherein the GrPU supports multiple function pointers ([¶0059] "In one embodiment, the CPU vtable may include function pointers such as vfunc1 and vfunc2 and the GPU vtable may include function pointers such as vfunc1' and vfunc2'.").

It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the function pointer mapping taught in Yan with the multi-purpose processor in Caulfield. The combination would have been obvious because a person of ordinary skill in the art would be able to determine from Caulfield that a graphics processor can be capable of both processing graphics and neural networks, and that Yan teaches examples and advantages of shared virtual memory including storing function pointers on the GPU ([¶0060] “In one embodiment, the CPU vtable may include function pointers such as vfunc1 and vfunc2 and the GPU vtable may include function pointers such as vfunc1′ and vfunc2′. In one embodiment, the function pointers (vfunc1 and vfunc2) and (vfunc1′ and vfunc2′) may be different. In one embodiment, saving the CPU vtable and the GPU vtable in the shared non-coherent region 860 may enable the CPU 110 and the GPU 180 to, respectively, see the CPU vtable and the GPU vtable at the same address location”).  

Claims 4-5 are rejected under 35 U.S.C. 103 as being unpatentable over Caulfield in view of Paltashev (US 2018/0114290 A1).

Regarding claim 4, Caulfield teaches The apparatus of claim 2.  However, Caulfield does not explicitly teach wherein the compute logic further comprises circuitry to provide a compilation unit (CU) to compile shader kernels.

Paltashev, who teaches a related art of using a GPU capable of processing neural networks, teaches wherein the compute logic further comprises circuitry to provide a compilation unit (CU) to compile shader kernels. ([¶0041] "Implementing the graphics pipeline 201 as a monolithic pipeline object, allows a reduction in API overhead by enabling up-front shader optimization at compile time. Embodiments of the graphics pipeline 201 also make the CPU performance of the pipeline driver more predictable, since shader compilation is not kicked off by the driver at draw time outside of the application's control.").

It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, that runtime shader compilation is inherent in graphics processing. The combination of Paltashev and Caulfield supports the inherency of shader compilation for GPUs.  The combination would have been obvious because a person of ordinary skill in the art would be able to determine from Paltashev that its graphics processing system has advantages over conventional graphics pipelines ([¶0068] “Some embodiments of the graphics processing system 600 have a number of advantages over conventional graphics pipelines. For example, the graphics processing system 600 utilizes fixed-function hardware within the compute domain and many compute shaders or virtual GPUs can be scheduled concurrently and load balanced by the asynchronous compute engines”).

Regarding claim 5, the combination of Caulfield and Paltashev teaches The apparatus of claim 4, wherein the CU and the GrPU are implemented to compute an optimized shader operation (Paltashev [¶0066] "To support implementations of a reconfigurable GPU, the graphics processing system 600 also includes shared fixed function hardware blocks 641, 642, 643, 644, 645...For another example, the tessellator 634 can transmit a request to the dedicated fixed function hardware block 642 to perform an operation and the results of the operation can be returned to the kernel domain shader 635").
wherein the optimized shader operation is a run-time adapted operation provided by a dynamically compiled shader and the dynamically compiled shader is dynamically compiled and executed in response to a detected condition. (Paltashev [¶0041] "Implementing the graphics pipeline 201 as a monolithic pipeline object, allows a reduction in API overhead by enabling up-front shader optimization at compile time. Embodiments of the graphics pipeline 201 also make the CPU performance of the pipeline driver more predictable, since shader compilation is not kicked off by the driver at draw time outside of the application's control." Shader compilation that is not kicked off by the driver outside of the application's control is interpreted as synonymous with being executed in response to a detected condition.).

Claims 6-8 are rejected under 35 U.S.C. 103 as being unpatentable over Caulfield in view of Park (“Weighted-Entropy-based Quantization for Deep Neural Networks”, 2017).

Regarding claim 6, Caulfield teaches The apparatus of claim 1.  However, Caulfield does not explicitly teach wherein the compute logic performs non-uniform quantization for the neural network.  

Park, who teaches a related art of quantizing and training a neural network on the GPU, teaches wherein the compute logic performs non-uniform quantization for the neural network. ([Abstract] "Unlike recent work on binary-weight neural networks, our approach is multi-bit quantization, in which weights and activations can be quantized by any number of bits depending on the target accuracy" [p. 5459 Sec. 4.1] "Maximizing the weighted entropy optimizes the quantization result towards maximizing entropy while considering the importance of data. Thus, our method groups many near-zero values into a large cluster by considering their lower importance. Large, but infrequent values are also grouped into a cluster that covers a wide range of weight values.").

It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the non-uniform quantization taught in Park with the processing system capable of processing graphics and neural networks taught in Caulfield. The combination would have been obvious because a person of ordinary skill in the art would be able to determine from Park that the non-uniform quantization not only speeds up training and classification, but also outperforms other well-known neural network quantization methods by allowing greater control over the quantizing.  From Park ([p. 5456 Col. 2 Sec. 1] "Such aggressive quantization methods are promising in that they can achieve significant reductions in the execution time, energy consumption, and memory capacity requirements of neural networks during the inference by exploiting the benefits of dedicated hardware accelerators, e.g. NVIDIA P40 and P4 [2] which support 8-bit integer arithmetic or Stripes [14] which provides execution time and energy consumption proportional to the bitwidth").

Regarding claim 7, the combination of Caulfield and Park teaches The apparatus of claim 6, wherein performing the non-uniform quantization comprises providing a lower error percentage to weight values that have a significant impact for accuracy of the neural network (Park [p. 5457 Sec. 2] "Near-zero values dominate the total frequency of values in both weight and activation distribution; however, their impact on the output is small (e.g., errors in a very small weight may not affect much to the result of convolution)" [p. 5458 Sec. 4.1] "Since larger weights have a higher impact on the output quality, we empirically define the importance i(n,m) of m-th weight in n-th cluster, i.e., w(n,m) to be quadratically proportional to the magnitude of the weight, i.e., i(n,m) = w(n,m)^2. Based on this importance value of each weight, we derive a metric for evaluating the quality of a clustering result (i.e., quantization result) based on weighted entropy").

Regarding claim 8, the combination of Caulfield and Park teaches The apparatus of claim 7, wherein discrete points are selected to have lower error percentage for large absolute value numbers, and selected to have higher error percentage for small absolute value numbers. (Park [p. 5457 Sec. 2] "Near-zero values dominate the total frequency of values in both weight and activation distribution; however, their impact on the output is small (e.g., errors in a very small weight may not affect much to the result of convolution)" [p. 5458 Sec. 4.1] "Since larger weights have a higher impact on the output quality, we empirically define the importance i(n,m) of m-th weight in n-th cluster, i.e., w(n,m) to be quadratically proportional to the magnitude of the weight, i.e., i(n,m) = w(n,m)^2. Based on this importance value of each weight, we derive a metric for evaluating the quality of a clustering result (i.e., quantization result) based on weighted entropy").
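
The cited passages of Park can be made concrete with a short Python sketch. The importance i(n,m) = w(n,m)^2 comes from the quoted text. The weighted-entropy score below (cluster frequency multiplied by mean cluster importance) is an assumed form, since this Office action does not reproduce Park's formula; the function names and the sample weights are likewise hypothetical.

```python
import math

def importance(w):
    # Per the passage quoted above: importance is quadratic in the weight magnitude.
    return w * w

def importance_share(cluster, all_weights):
    # Fraction of the total importance carried by one cluster of weights.
    total = sum(importance(w) for w in all_weights)
    return sum(importance(w) for w in cluster) / total

def weighted_entropy(clusters):
    """Score a clustering given as a list of lists of weights.

    Assumed form: S = -sum_n I_n * P_n * log(P_n), where P_n is the fraction
    of weights that fall in cluster n and I_n is the mean importance of
    cluster n. This is an illustration, not a quotation of Park.
    """
    count = sum(len(c) for c in clusters)
    score = 0.0
    for c in clusters:
        if not c:
            continue
        p = len(c) / count
        mean_importance = sum(importance(w) for w in c) / len(c)
        score -= mean_importance * p * math.log(p)
    return score

# Hypothetical weights: many near-zero values and two large ones.
small = [0.01, -0.02, 0.03, 0.02]
large = [0.9, -1.1]
print(importance_share(large, small + large))
print(weighted_entropy([small, large]))
```

Under these assumptions the two large weights carry nearly all of the importance, which is consistent with the quoted statement that errors in near-zero weights have little effect on the output.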

Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Caulfield in view of Guan (“FP-DNN: An Automated Framework for Mapping Deep Neural Networks onto FPGAs with RTL-HLS Hybrid Templates”, 2017).

Regarding claim 9, Caulfield teaches The apparatus of claim 1.  However, Caulfield does not explicitly teach wherein the compute logic comprises circuitry provide an Open Computing Language (OpenCL) implementation to accelerate workloads on the neural network.  

Guan, who teaches a related art of accelerating a neural network, teaches wherein the compute logic comprises circuitry provide an Open Computing Language (OpenCL) implementation to accelerate workloads on the neural network. ([p. 155 Sec. III C] "we use RTL for designing a high-performance computation engine, and we use the OpenCL-based HLS framework to implement the control logics for the RTL part").

It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to implement the neural network acceleration through OpenCL as suggested by Guan. The combination would have been obvious because a person of ordinary skill in the art would be able to determine from Guan that OpenCL is well-known in the art for performing parallelized operations, and that similarly using OpenCL to accelerate neural network workloads is well-known in the art.  Guan further supports this at ([p. 156 Col. 1] “This operation is known as Im2col (image to column), which is widely applied in prior CPU and GPU studies [7]. With the input features being in a matrix form, we can do similar conversions for convolution kernels by partitioning the corresponding 3-D cubes into a single column of kernel matrix”).
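
As a hedged illustration of the Im2col operation named in the quoted passage of Guan, the following pure-Python sketch unrolls image patches into matrix columns so that convolution becomes a matrix product. Stride 1, no padding, a single channel, and the function names are assumptions of this sketch; Guan's actual data layout is not reproduced here.

```python
def im2col(image, k):
    """Unroll every k x k patch of a 2D image (list of rows) into one column.

    Assumes stride 1 and no padding. Returns a matrix with k*k rows and one
    column per patch position, scanned left to right, top to bottom.
    """
    h, w = len(image), len(image[0])
    patches = []
    for r in range(h - k + 1):
        for c in range(w - k + 1):
            patches.append([image[r + i][c + j] for i in range(k) for j in range(k)])
    # Transpose so that each patch becomes a column.
    return [list(col) for col in zip(*patches)]

def conv2d_via_im2col(image, kernel):
    """Cross-correlation (no kernel flip, as is usual in CNNs) as a matrix product."""
    k = len(kernel)
    flat = [v for row in kernel for v in row]  # kernel flattened to a vector
    cols = im2col(image, k)
    n = len(cols[0])
    out = [sum(flat[i] * cols[i][p] for i in range(k * k)) for p in range(n)]
    width = len(image[0]) - k + 1
    return [out[i:i + width] for i in range(0, n, width)]

# Hypothetical 3x3 input and 2x2 kernel.
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]
print(conv2d_via_im2col(image, kernel))
```

Once the input is in matrix form, the same dense multiply routine can serve every layer, which is why the conversion suits parallel hardware.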

Claims 10-12 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Caulfield and Guan, and further in view of Li (“Acceleration of Deep Learning on FPGA”, 2017).

Regarding claim 10, the combination of Caulfield and Guan teaches The apparatus of claim 9.  However, the combination of Caulfield and Guan does not explicitly teach wherein the OpenCL  implementation is configured to share weights across hidden layers of the neural network.  

Li teaches wherein the OpenCL implementation is configured to share weights across hidden layers of the neural network. ([p. 17-18] "At each layer, the output of convolutional layer is: [Eqn. 4] Here, σ is the activation function mentioned in the common neural network and b is the bias value. The shared weight array f is normally smaller than 5x5. These shared parameters along with local connectivity make ConvNet more computationally efficient than Feedforward neural network."  [p. 23 Sec. 2.7] "In this chapter, we describe preliminary background on FPGA Architecture, High level Synthesis, OpenCL framework, Intel FPGA SDK for OpenCL tool, and supervised machine learning. Then we reviewed state-of-art implementations of ConvNet on various hardware including GPUs, ASIC, and FPGAs" Li shows that shared weights are a common feature of convolutional neural networks, and that said convolutional neural networks have been shown to be readily produced using OpenCL on FPGA devices.). 

It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, that sharing weights across layers is inherent in convolutional neural networks. The combination would have been obvious because a person of ordinary skill in the art would be able to determine that the combination of Guan, Caulfield, and Li teaches accelerating a convolutional neural network with a variety of processors using OpenCL and Li further teaches that weight sharing is well known in the art ([p. 7-8 “Convolution Layer”] “The shared weight array f is normally smaller than 5x5. These shared parameters along with local connectivity make ConvNet more computationally efficient than Feedforward neural network.”).  
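
For illustration only, the efficiency attributed to shared weights in the quoted Li passage can be pictured by counting trainable parameters. This is a sketch under stated assumptions: the layer sizes and function names are hypothetical and are not taken from Li or from the claims.

```python
def conv_param_count(kernel_h, kernel_w, in_channels, out_channels):
    """Parameters of a convolutional layer: one shared kernel per
    (input channel, output channel) pair, plus one bias per output channel."""
    return kernel_h * kernel_w * in_channels * out_channels + out_channels


def dense_param_count(n_inputs, n_outputs):
    """Parameters of a fully connected layer: one weight per
    (input, output) pair, plus one bias per output."""
    return n_inputs * n_outputs + n_outputs


# Hypothetical sizes: 32x32 single-channel input, 5x5 kernel, 28x28 output.
conv_params = conv_param_count(5, 5, 1, 1)
dense_params = dense_param_count(32 * 32, 28 * 28)
```

Because the same small kernel is reused at every output position, the convolutional layer in this example needs a tiny fraction of the parameters of a fully connected layer producing the same output size.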

Regarding claim 11, the combination of Caulfield, Guan, and Li teaches The apparatus of claim 10, wherein the neural network is a Recurrent neural network (RNN). (Guan [Abstract] "We implement CNNs, LSTM-RNNs, and Residual Nets with FP DNN"). 

Regarding claim 12, the combination of Caulfield, Guan, and Li teaches The apparatus of claim 10, wherein the neural network is a long short-term memory network (LTSM). (Guan [Abstract] "We implement CNNs, LSTM-RNNs, and Residual Nets with FP DNN"). 

Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Caulfield in view of Donthi (“A Survey of Dynamically Reconfigurable FPGA Devices”, 2003). 

Regarding claim 15, Caulfield teaches However, Caulfield does not explicitly teach wherein the fetch stage analyzes and identifies values that are to be computed by fast operations and values to be computed by complex operations.  

Donthi, in the same field of endeavor, discloses an FPGA with logic blocks of differing complexity.  Donthi teaches wherein the fetch stage analyzes and identifies values that are to be computed by fast operations and values to be computed by complex operations. ([p. 423 §B] "The PFU is the basic logic element of the PLC, containing elements for both combinational and sequential logic. The PFU uses two sets of four LUTs and FFs that can be controlled independently. The twin-quad architecture of an LUT provides a facility to implement from one to eight independent combinational logic functions and a large number of complex logic functions using multiple LUTs. The flexibility of the LUT to handle wide input functions, as well as multiple smaller input functions, maximizes the gate count per PFU while increasing the speed" LUT facility to implement complex or combinatorial logic interpreted as synonymous with fetch stage to identify values to be computed by fast or complex operations.). 

It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the neural network system taught in Caulfield with the heterogeneous FPGA in Donthi by implementing the ORCA FPGA mentioned in Donthi with the neural network architecture disclosed in Caulfield. Caulfield explicitly mentions the use of an FPGA [¶0240] in the processor architecture, and Donthi’s teaching facilitates determination of routing of differing complexity in a heterogeneous FPGA ([p. 422] “out of aforementioned considerations, selecting a target FPGA device is important in all dynamically reconfigurable system applications. The chosen target FPGA device should provide large amounts of hardware resources and it should be highly flexible so as to yield optimal performance”). 

Regarding claim 16, the combination of Caulfield and Donthi teaches The apparatus of claim 15, wherein the execute stage comprises: one or more simple lanes to perform computation operations on the fast operations; and one or more complex lanes to perform computation operations on the complex operations. (Donthi [p. 423 §II.B] "The PFU is the basic logic element of the PLC, containing elements for both combinational and sequential logic. The PFU uses two sets of four LUTs and FFs that can be controlled independently. The twin-quad architecture of an LUT provides a facility to implement from one to eight independent combinational logic functions and a large number of complex logic functions using multiple LUTs. The flexibility of the LUT to handle wide input functions, as well as multiple smaller input functions, maximizes the gate count per PFU while increasing the speed" Elements for combinatorial logic interpreted as synonymous with complex lanes, elements for sequential logic interpreted as synonymous with simple lanes.). 

Claim 17 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Caulfield and Donthi, and further in view of Fowers (US 2018/0341484 A1).

Regarding claim 17, the combination of Caulfield and Donthi teaches The apparatus of claim 16.  However, the combination of Caulfield and Donthi does not explicitly teach wherein the writeback stage receives results from the one or more simple lanes and the one or more complex lanes and places the results in a layout format of a tensor output.  

Fowers, who teaches the related art of accelerating a neural network, teaches wherein the writeback stage receives results from the one or more simple lanes and the one or more complex lanes and places the results in a layout format of a tensor output. (Fowers [¶0003] "an apparatus comprises logic configured to: access a first and a second machine instruction in a set of machine instructions. The second machine instruction is missing a tensor operand needed to execute the second machine instruction. The logic is further configured to execute the first machine instruction, resulting in a tensor. The logic is further configured to execute the second machine instruction using the resultant tensor as the missing tensor operand."). 

It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of Caulfield with the teachings of Fowers by using a combinatorial instruction set to create tensors. Fowers teaches that a single system could contain multiple instruction sets of differing complexity relative to particular operations.  Fowers further discloses the advantage this presents to a neural network accelerator ([¶0025] "One embodiment of a hardware accelerator has an instruction set in which some machine instructions are missing a tensor operand needed to execute the machine instruction...the hardware accelerator may take advantage of an observation that a tensor that results from execution of one machine instruction may frequently be used as an input tensor to another machine instruction"). 

Claims 18-21 are rejected under 35 U.S.C. 103 as being unpatentable over Caulfield in view of Fraser (US 2019/0080223 A1). 

Regarding claim 18, Caulfield teaches The apparatus of claim 1.  However, Caulfield does not explicitly teach wherein the compute logic processes a high-resolution input image via the neural network by cropping the input image into two more image batches and processing the image batches at the at least one processor.  

Fraser, who teaches the related art of a neural network accelerator, teaches wherein the compute logic processes a high-resolution input image via the neural network by cropping the input image into two more image batches and processing the image batches at the at least one processor. ([¶0072] "In some embodiments, a preprocessing unit 202 may receive an input training set 220, artificially augment batches in the input training set 220 (e.g., by performing distorting, shading, rotating, scaling, cropping, and other applicable processes)" Neural network system interpreted as synonymous with processor.). 

It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the neural networks of Caulfield and Fraser by implementing batching. Fraser teaches that batching is a method of removing dependencies and allowing for greater parallelization ([¶0042] “It has been discovered that various techniques may be used to remove the dependencies in a training algorithm, which may prevent or reduce stalling and allow implementations using multiple accelerators...Such a delayed model adaptation allows the forward path and the backward path of neural network training to be implemented in parallel. This is achieved by introducing a delay between the output activations and the weight update and gradient calculations in each layer. This allows the calculations for each layer to be performed independently without different batches of inputs.”).
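
For illustration only, the cropping of an input image into image batches recited in claim 18 can be pictured with a sketch like the following. The tile size, the function name, and the sample image are hypothetical; the code is not taken from Caulfield or Fraser and is not asserted to reflect either reference's implementation.

```python
def crop_into_tiles(image, tile_h, tile_w):
    """Split a 2-D image (a list of rows) into non-overlapping tiles.

    Tiles are returned in row-major order. Edge tiles may be smaller
    when the image size is not a multiple of the tile size.
    """
    tiles = []
    for top in range(0, len(image), tile_h):
        for left in range(0, len(image[0]), tile_w):
            tiles.append(
                [row[left:left + tile_w] for row in image[top:top + tile_h]]
            )
    return tiles


# Hypothetical 4x4 image whose pixel values are 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tiles = crop_into_tiles(image, 2, 2)
```

Each resulting tile is independent of the others, so the tiles could in principle be handed to separate processing elements.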

Regarding claim 19, the combination of Caulfield and Fraser teaches The apparatus of claim 18, wherein the at least one processor comprises a distributed architecture having a plurality of compute nodes. (Fraser [¶0078] "Referring to FIGS. 3, 4, and 5, an exemplary layer 204-i of the neural network system 200 is illustrated. FIG. 3 illustrates a layer 204-i including a forward path unit 300 for performing the forward path process of the backpropagation algorithm at the layer 204-i" Forward and backward path processing elements of neural network system 200 are interpreted as synonymous with compute nodes.). 

Regarding claim 20, the combination of Caulfield and Fraser teaches The apparatus of claim 19, wherein the two or more image batches are processed in parallel at the plurality of compute nodes. (Fraser [¶0079] "In some embodiments, each forward path PE corresponds to a neuron of the layer 204-i. As such, the number P of forward path PEs in the forward path unit 300 may control the number of neurons of the layer 204-i that may be computed in parallel in the forward path process." Forward processing element interpreted as synonymous with compute node.  Fraser explicitly teaches that images may be batched at the input of the neural network system and processed using the processing elements.). 

Regarding claim 21, the combination of Caulfield and Fraser teaches The apparatus of claim 20, wherein the plurality of compute nodes comprises: one or more graphics processing units to process a first image batch; (Fraser [¶0089] "Referring to FIG. 6, in some embodiments, a neural network system 600 may use a delayed model adaptation scheme to remove the dependencies between the layers, thereby enabling the efficient usage of multiple accelerators (e.g., multiple GPUs, multiple FPGAs, a single FPGA including multiple systolic arrays)").
one or more central graphics processing units to process a second image batch; and (Fraser [¶0089] "Referring to FIG. 6, in some embodiments, a neural network system 600 may use a delayed model adaptation scheme to remove the dependencies between the layers, thereby enabling the efficient usage of multiple accelerators (e.g., multiple GPUs, multiple FPGAs, a single FPGA including multiple systolic arrays)").
one or more accelerators to process a third batch. (Fraser [¶0089] "Referring to FIG. 6, in some embodiments, a neural network system 600 may use a delayed model adaptation scheme to remove the dependencies between the layers, thereby enabling the efficient usage of multiple accelerators (e.g., multiple GPUs, multiple FPGAs, a single FPGA including multiple systolic arrays)"). 

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Hapala (“Efficient Stack-less BVH Traversal for Ray Tracing”, 2011).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SIDNEY VINCENT BOSTWICK whose telephone number is (571)272-4720.  The examiner can normally be reached on M-F 7:30am-5:00pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang can be reached on (571)270-7092.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/SB/Examiner, Art Unit 2124
/MIRANDA M HUANG/Supervisory Patent Examiner, Art Unit 2124