DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1, 8-10, 14, 19, and 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Garbin et al. (IDS: US 20180144240 A1) in view of Nurvitadhi et al. (“Accelerating Binarized Neural Networks: Comparison of FPGA, CPU, GPU, and ASIC”, 2016).

Regarding claim 1, Garbin et al. disclose a neural network accelerator comprising: a binary convolutional neural network (binarized neural network (BNN) software implementation, [0092]) comprising: a plurality of execution units (dot-product engine and post-processing units, [0111], array of semiconductor cells on which different sizes of dot product layers, reconfigurable control units 801 that are placed in-between the memory cell arrays, [0115]); and a near memory latch array comprising a plurality of sets of latches finely interleaved with the plurality of execution units to reduce energy consumption and enable a high bandwidth memory access to the plurality of execution units (neural network GPU kernels, multiply-accumulate operation i.e., dot-product that is typically encountered in NNs boils down to a popcount of element-wise XNOR or XOR operations, [0006], “In other embodiments, as illustrated in FIG. 5a, FIG. 5b and FIG. 5c, a semiconductor cell 400 comprises a memory unit 401 of the volatile type, e.g., an SRAM cell, a latch and a flip-flop, respectively, for storing a first operand, an input port unit 402 for receiving a second operand, a switch unit configured for implementing a logic XNOR or XOR operation on the stored first operand and the received second operand, for instance an XNOR gate 403, and a readout port 404 for providing an output of the logic operation”, [0084], semiconductor cells 100, 400 are organized in an array, in which they are logically organized in rows and columns, [0085], “It is an advantage of an array of semiconductor cells according to embodiments of the disclosed technology that it reduces energy consumption of classification operations, by letting input-dependent values (NN activations) flow through arrays of pre-trained binary weights, with arithmetic operations performed as close to their operands as possible”, [0086], XNOR (or multiplication), [0093], organize the layers of the dot-product arrays and the interleaving logic, [0112]), a set of latches communicatively coupled to each of the plurality of execution units to store a result of a matrix multiplication in the execution unit (the output can be latched by any suitable latch element 211 to a final output node 212, [0081], latch, readout port 404 for providing an output of the logic operation, [0084], receiving the outputs of the XNOR or XOR operations from the output ports, [0085], in-place operations for the dot-product stages of a classifier and post-processing units, such as for instance simple logic, to interconnect between classifier layers with simple math operations, XNOR/XOR readout can also be multi-level, hence allowing to encode scalar weight/output values, [0100], dot-product engine and post-processing units, [0111], reconfigurable control units 801 that are placed in-between the memory cell arrays, [0115]) [interpretation for execution units as encompassing units performing XNOR and dot product operations supported by applicant’s specification].

Garbin et al. do not explicitly disclose matrix multiplication.

Nurvitadhi et al. teach store a result of a matrix multiplication in the execution unit (“In binarized neural networks, the matrix x vector operation to compute each network layer can be replaced by xnor and bit counting because weights and neurons are constrained to either +1 or -1, each representable in 1-bit”, IIA, “each network weight and neuron value is constrained to be of only two possible values, +1 or -1. As such, it can be represented using a single bit. Therefore, BNNs require significantly less storage than standard DNNs”, “Binarized Matrix x Binarized Vector. Since activation function in BNN [1][2] produces a +1 or -1 value, neurons (vi and vo) after the first BNN layer would be representable as 1-bit values. As such, the computation multiplies a binarized vector of input neurons (vib) against a binarized weight matrix Wb. Such operation can be done using xnor and a variant of a population count (pcnt), thereby eliminating the need for full precision operations”, part IIB).
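
For illustration only, the xnor and population count substitution quoted above can be sketched as follows. This is a minimal Python sketch and not code from either reference; the function names and the packing convention (bit 1 encodes +1, bit 0 encodes -1) are assumptions made for the example.

```python
def pack(values):
    """Pack a list of +1/-1 values into an int; element i goes to bit i.

    Assumed encoding for this sketch: bit 1 means +1, bit 0 means -1.
    """
    bits = 0
    for i, v in enumerate(values):
        if v == 1:
            bits |= 1 << i
    return bits


def binarized_dot(a_bits, w_bits, n):
    """Dot product of two packed length-n vectors over {+1, -1}."""
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ w_bits) & mask      # bit is 1 where the two signs agree
    matches = bin(xnor).count("1")        # population count
    return 2 * matches - n                # agreements minus disagreements


def binarized_matvec(weight_rows, activations):
    """Binarized matrix x binarized vector, one xnor/popcount per row."""
    n = len(activations)
    a_bits = pack(activations)
    return [binarized_dot(a_bits, pack(row), n) for row in weight_rows]
```

Under these assumptions, each output element equals the ordinary full-precision dot product of the corresponding +1/-1 row and vector.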

Garbin et al. and Nurvitadhi et al. are in the same art of binarized neural networks (Garbin et al., [0092]; Nurvitadhi et al., IIA). The combination of Nurvitadhi et al. with Garbin et al. would enable the use of matrix multiplication. It would have been obvious at the time of filing to one of ordinary skill in the art to combine the multiplication of Nurvitadhi et al. with the invention of Garbin et al. because this was known at the time of filing, the combination would have predictable results, and Nurvitadhi et al. indicate that this eliminates the need for full precision operations (part IIB): “This leads to dramatic algorithm efficiency improvement, due to reduction in the memory and computational demands. This paper evaluates the opportunity to further improve the execution efficiency of BNNs through hardware acceleration” (abstract). Improving efficiency is also a goal stated in Garbin et al., so the combination would further that stated purpose.

Regarding claims 8 and 19, Garbin et al. and Nurvitadhi et al. disclose the neural network accelerator and system of Claims 1 and 14. Nurvitadhi et al. further teach the binary convolutional neural network to use lightweight pipelining to optimize for energy efficiency (“In Figure 4(a), we also report peak throughput as terra operations per second (TOP/sec). This represents 1-bit multiply and accumulation operations on network weights and neurons. It is calculated as follows. As an example, the FPGA1024 design contains 1024 PEs and each PE does 32-bit packed weights calculation in parallel in a pipelined fashion to retire 32 new results each cycle. So, at 150MHz frequency, the peak throughput is 1024 PEs x (32 bits packed x (1 multiply + 1 accumulate)/PE) x 150M operations per second. This results in 9.8 TOP/sec. Such a high peak throughput is feasible due to the significant efficiency benefit of binarization”, IIIB).
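
The peak throughput arithmetic in the quoted passage can be checked with a short calculation. The sketch below is illustrative only; the function and parameter names are assumptions, and the input figures are those stated in the quotation from Nurvitadhi et al.

```python
def peak_throughput_ops_per_sec(num_pes, bits_packed, ops_per_bit, freq_hz):
    """Peak 1-bit operations per second for a pipelined array of PEs.

    ops_per_bit counts one multiply plus one accumulate as two operations.
    """
    return num_pes * bits_packed * ops_per_bit * freq_hz


# Figures taken from the quoted FPGA1024 example.
peak = peak_throughput_ops_per_sec(
    num_pes=1024, bits_packed=32, ops_per_bit=2, freq_hz=150_000_000
)
peak_tera = peak / 1e12  # roughly 9.8, consistent with the quoted 9.8 TOP/sec
```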

Regarding claims 9 and 20, Garbin et al. and Nurvitadhi et al. disclose the neural network accelerator and system of Claims 1 and 14. Garbin et al. and Nurvitadhi et al. further indicate the binary convolutional neural network further comprising: a memory to store an input image, a number of operations performed per bit read from the memory is greater than or equal to 128 to optimize for energy efficiency by amortizing cost of memory access and data movement across many binary neural network operations (Garbin et al., “The signal, e.g., current or voltage, at the readout port 104 can be sensed using a sense amplifier 201, such as for instance, but not limited thereto, the one disclosed in S. Cosemans, W. Dehaene and F. Catthoor, “A 3.6 pJ/access 480 MHz, 128 Kbit on-Chip SRAM with 850 MHz boost mode in 90 nm CMOS with tunable sense amplifiers to cope with variability,” in Solid-State Circuits Conference, 2008. ESSCIRC 2008. 34th European, 2008. The relevant disclosure associated with the sense amplifier in Cosemans et al. is incorporated herein in its entirety. A representative schematic is illustrated in FIG. 3 for the implementation of the sense amplifier with a non-volatile memory unit, according to embodiments. Similarly, a sensing unit as illustrated in FIG. 3 may be implemented in case of a semiconductor cell with a volatile memory unit”, [0088]; “On our evaluated GTX Titan X platform, 32 32-bit population count operations can be issued every cycle per Streaming Multiprocessor (SM) – yielding 1024 “binary ops” per cycle. As GTX Titan X can issue up to 128 32-bit floatingpoint operations every cycle per SM, the performance roofline of “binary ops” over FP32 operations is 4x”, part IV C).

Regarding claim 10, Garbin et al. disclose a method comprising: interleaving a plurality of sets of latches with a plurality of execution units to reduce energy consumption and enable a high bandwidth memory access to the plurality of execution units (neural network GPU kernels, multiply-accumulate operation i.e., dot-product that is typically encountered in NNs boils down to a popcount of element-wise XNOR or XOR operations, [0006], “In other embodiments, as illustrated in FIG. 5a, FIG. 5b and FIG. 5c, a semiconductor cell 400 comprises a memory unit 401 of the volatile type, e.g., an SRAM cell, a latch and a flip-flop, respectively, for storing a first operand, an input port unit 402 for receiving a second operand, a switch unit configured for implementing a logic XNOR or XOR operation on the stored first operand and the received second operand, for instance an XNOR gate 403, and a readout port 404 for providing an output of the logic operation”, [0084], semiconductor cells 100, 400 are organized in an array, in which they are logically organized in rows and columns, [0085], “It is an advantage of an array of semiconductor cells according to embodiments of the disclosed technology that it reduces energy consumption of classification operations, by letting input-dependent values (NN activations) flow through arrays of pre-trained binary weights, with arithmetic operations performed as close to their operands as possible”, [0086], XNOR (or multiplication), [0093], organize the layers of the dot-product arrays and the interleaving logic, [0112]); and communicatively coupling a set of latches to each of the plurality of execution units to store a result of a matrix multiplication in the execution unit (the output can be latched by any suitable latch element 211 to a final output node 212, [0081], latch, readout port 404 for providing an output of the logic operation, [0084], receiving the outputs of the XNOR or XOR operations from the output ports, [0085], in-place operations for the dot-product stages of a classifier and post-processing units, such as for instance simple logic, to interconnect between classifier layers with simple math operations, XNOR/XOR readout can also be multi-level, hence allowing to encode scalar weight/output values, [0100], dot-product engine and post-processing units, [0111], reconfigurable control units 801 that are placed in-between the memory cell arrays, [0115]) [interpretation for execution units as encompassing units performing XNOR and dot product operations supported by applicant’s specification].

Garbin et al. do not explicitly disclose matrix multiplication.

Nurvitadhi et al. teach store a result of a matrix multiplication in the execution unit (“In binarized neural networks, the matrix x vector operation to compute each network layer can be replaced by xnor and bit counting because weights and neurons are constrained to either +1 or -1, each representable in 1-bit”, IIA, “each network weight and neuron value is constrained to be of only two possible values, +1 or -1. As such, it can be represented using a single bit. Therefore, BNNs require significantly less storage than standard DNNs”, “Binarized Matrix x Binarized Vector. Since activation function in BNN [1][2] produces a +1 or -1 value, neurons (vi and vo) after the first BNN layer would be representable as 1-bit values. As such, the computation multiplies a binarized vector of input neurons (vib) against a binarized weight matrix Wb. Such operation can be done using xnor and a variant of a population count (pcnt), thereby eliminating the need for full precision operations”, part IIB).

Garbin et al. and Nurvitadhi et al. are in the same art of binarized neural networks (Garbin et al., [0092]; Nurvitadhi et al., IIA). The combination of Nurvitadhi et al. with Garbin et al. would enable the use of matrix multiplication. It would have been obvious at the time of filing to one of ordinary skill in the art to combine the multiplication of Nurvitadhi et al. with the invention of Garbin et al. because this was known at the time of filing, the combination would have predictable results, and Nurvitadhi et al. indicate that this eliminates the need for full precision operations (part IIB): “This leads to dramatic algorithm efficiency improvement, due to reduction in the memory and computational demands. This paper evaluates the opportunity to further improve the execution efficiency of BNNs through hardware acceleration” (abstract). Improving efficiency is also a goal stated in Garbin et al., so the combination would further that stated purpose.


Regarding claim 14, Garbin et al. disclose a system comprising: a neural network accelerator (binarized neural network (BNN) software implementation, [0092]) comprising: a binary convolutional neural network (binarized neural network (BNN) software implementation, [0092]) comprising: a plurality of execution units (dot-product engine and post-processing units, [0111], array of semiconductor cells on which different sizes of dot product layers, reconfigurable control units 801 that are placed in-between the memory cell arrays, [0115]); and a near memory latch array comprising a plurality of sets of latches finely interleaved with the plurality of execution units to reduce energy consumption and enable a high bandwidth memory access to the plurality of execution units (neural network GPU kernels, multiply-accumulate operation i.e., dot-product that is typically encountered in NNs boils down to a popcount of element-wise XNOR or XOR operations, [0006], “In other embodiments, as illustrated in FIG. 5a, FIG. 5b and FIG. 5c, a semiconductor cell 400 comprises a memory unit 401 of the volatile type, e.g., an SRAM cell, a latch and a flip-flop, respectively, for storing a first operand, an input port unit 402 for receiving a second operand, a switch unit configured for implementing a logic XNOR or XOR operation on the stored first operand and the received second operand, for instance an XNOR gate 403, and a readout port 404 for providing an output of the logic operation”, [0084], semiconductor cells 100, 400 are organized in an array, in which they are logically organized in rows and columns, [0085], “It is an advantage of an array of semiconductor cells according to embodiments of the disclosed technology that it reduces energy consumption of classification operations, by letting input-dependent values (NN activations) flow through arrays of pre-trained binary weights, with arithmetic operations performed as close to their operands as possible”, [0086], XNOR (or multiplication), [0093], organize the layers of the dot-product arrays and the interleaving logic, [0112]), a set of latches communicatively coupled to each of the plurality of execution units to store a result of a matrix multiplication in the execution unit (the output can be latched by any suitable latch element 211 to a final output node 212, [0081], latch, readout port 404 for providing an output of the logic operation, [0084], receiving the outputs of the XNOR or XOR operations from the output ports, [0085], in-place operations for the dot-product stages of a classifier and post-processing units, such as for instance simple logic, to interconnect between classifier layers with simple math operations, XNOR/XOR readout can also be multi-level, hence allowing to encode scalar weight/output values, [0100], dot-product engine and post-processing units, [0111], reconfigurable control units 801 that are placed in-between the memory cell arrays, [0115]) [interpretation for execution units as encompassing units performing XNOR and dot product operations supported by applicant’s specification] [higher bandwidth implied by the closer operations and lower power consumption indicated in paragraph 86, Nurvitadhi et al. below further addresses bandwidth].

Garbin et al. do not disclose a display communicatively coupled to a processor to display an input image. However, as Garbin et al. indicate “The second operand is a value fed to the semiconductor cell 100, which may be variable, and which may depend on the current input to the semiconductor cell 100, for instance a frame such as an image frame to be classified” ([0071]), displaying this image would be obvious to try, a design choice, and displaying images is commonly performed and routine in the art.

Garbin et al. do not explicitly disclose matrix multiplication.

Nurvitadhi et al. teach enable a high bandwidth memory access to the plurality of execution units (DRAM memory, which is power consuming and has much lower bandwidth than on-chip RAMs, thereby imposing performance constraints, Binarized neural networks (BNNs) have the potential to address this issue, part IIB, many onchip RAMs deliver sufficient bandwidth to the PEs to achieve high throughput at extreme efficiency, part III, The compact binarized weights for interesting problem sizes can fit in many distributed on-chip FPGA RAMs that deliver abundance of onchip bandwidth to the reconfigurable fabric and DSPs to perform high-throughput computation on packed binarized neuron and weight values, part IIIB) and store a result of a matrix multiplication in the execution unit (“In binarized neural networks, the matrix x vector operation to compute each network layer can be replaced by xnor and bit counting because weights and neurons are constrained to either +1 or -1, each representable in 1-bit”, IIA, “each network weight and neuron value is constrained to be of only two possible values, +1 or -1. As such, it can be represented using a single bit. Therefore, BNNs require significantly less storage than standard DNNs”, “Binarized Matrix x Binarized Vector. Since activation function in BNN [1][2] produces a +1 or -1 value, neurons (vi and vo) after the first BNN layer would be representable as 1-bit values. As such, the computation multiplies a binarized vector of input neurons (vib) against a binarized weight matrix Wb. Such operation can be done using xnor and a variant of a population count (pcnt), thereby eliminating the need for full precision operations”, part IIB).

Garbin et al. and Nurvitadhi et al. are in the same art of binarized neural networks (Garbin et al., [0092]; Nurvitadhi et al., IIA). The combination of Nurvitadhi et al. with Garbin et al. would enable the use of matrix multiplication. It would have been obvious at the time of filing to one of ordinary skill in the art to combine the multiplication of Nurvitadhi et al. with the invention of Garbin et al. because this was known at the time of filing, the combination would have predictable results, and Nurvitadhi et al. indicate that this eliminates the need for full precision operations (part IIB): “This leads to dramatic algorithm efficiency improvement, due to reduction in the memory and computational demands. This paper evaluates the opportunity to further improve the execution efficiency of BNNs through hardware acceleration” (abstract). Improving efficiency is also a goal stated in Garbin et al., so the combination would further that stated purpose.

Claim(s) 2, 11, and 15 is/are rejected under 35 U.S.C. 103 as being unpatentable over Garbin et al. (IDS: US 20180144240 A1) and Nurvitadhi et al. (“Accelerating Binarized Neural Networks: Comparison of FPGA, CPU, GPU, and ASIC”, 2016) as applied to claims 1, 10, and 14 above, further in view of Cohen et al. (US 20190102671 A1).

Regarding claims 2, 11, and 15, Garbin et al. and Nurvitadhi et al. disclose the neural network accelerator and method and system of claims 1, 10, and 14. While Garbin et al. and Nurvitadhi et al. disclose finding a dot product, Garbin et al. and Nurvitadhi et al. do not explicitly disclose each of the plurality of execution units to perform the matrix multiplication using an inner product.

Cohen et al. teach each of the plurality of execution units to perform the matrix multiplication using an inner product (system and method of providing an inner product convolutional neural network accelerator, [0001], feature matrix may be convoluted with the weight matrix in either an inner product fashion or an outer product fashion, matrix multiplications, [0028], “embodiments of the present specification provide for a system wherein a CNN may perform an inner product operation without the need of a lowering operation to reformat the data.  Thus, an inner product operation may be applied between the IFM vector and the weight vector. Each vector may keep many elements”, [0029],  “A system and method for an inner product convolutional neural network accelerator will now be described with more particular reference to the attached FIGURE”, [0044], computer vision logic may be implemented in software and/or hardware logic, applicable to an inner-product convolutional neural network accelerator, [0060], “Note that CNN 500 could be a high precision neural network, or could be a lower precision neural network such as an INT1 or INT2 network. Because a 1-bit neural network represents two values, it may be referred to as a binary neural network (BNN), while a ternary neural network may have three values, namely −1, 0, or +1”, [0095], “The execution cluster(s) 1260 includes a set of one or more execution units 1262 and a set of one or more memory access units 1264. The execution units 1262 may perform various operations (e.g., shifts, addition, subtraction, multiplication) and on various types of data (e.g., scalar floating point, packed integer, packed floating point, vector integer, vector floating point). While some embodiments may include a number of execution units dedicated to specific functions or sets of functions, other embodiments may include only one execution unit or multiple execution units that all perform all functions”, [0232]).
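
As a general illustration of the inner product versus outer product distinction referenced in the quoted passage, the same matrix product can be formed either way. This is a minimal Python sketch of the two textbook formulations, not the accelerator of Cohen et al.; the function names are assumptions.

```python
def matmul_inner(a, b):
    """Each output element is the inner product of a row of a and a column of b."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def matmul_outer(a, b):
    """Same product, accumulated as a sum of outer products.

    Step k adds (column k of a) times (row k of b) to every output element,
    so every accumulator is updated on every step.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for k in range(m):
        for i in range(n):
            for j in range(p):
                c[i][j] += a[i][k] * b[k][j]
    return c
```

In the inner product form each output element is completed by a single accumulation chain, whereas the outer product form keeps all n x p partial sums live at once.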

Garbin et al. and Nurvitadhi et al. and Cohen et al. are in the same art of binarized neural networks (Garbin et al., [0092]; Nurvitadhi et al., IIA; Cohen et al., [0095]). The combination of Cohen et al. with Garbin et al. and Nurvitadhi et al. will enable the use of an inner product. It would have been obvious at the time of filing to one of ordinary skill in the art to combine the inner product calculation of Cohen et al. with the invention of Garbin et al. and Nurvitadhi et al. as this was known at the time of filing, the combination would have predictable results, and as Cohen et al. indicate “Using an inner product method, the ratio between operations and accumulators can be highly improved, and therefore less area and power may be allocated for accumulation and more area and power consumed for actual execution” ([0030]), thereby further achieving the stated efficiency and power saving goals indicated by Garbin et al. and Nurvitadhi et al. when combined.

Claim(s) 3, 4, 12, 13, 16 and 17 is/are rejected under 35 U.S.C. 103 as being unpatentable over Garbin et al. (IDS: US 20180144240 A1) and Nurvitadhi et al. (“Accelerating Binarized Neural Networks: Comparison of FPGA, CPU, GPU, and ASIC”, 2016) as applied to claims 1, 10, and 14 above, further in view of Fiandrotti et al. (US 20210142175 A1).

Regarding claims 3, 12 and 16, Garbin et al. and Nurvitadhi et al. disclose the neural network accelerator, method, and system of Claims 1, 10 and 14. Garbin et al. and Nurvitadhi et al. do not explicitly disclose a convolutional operation to shift a weight filter by a stride over an input image in a snake like pattern.

Fiandrotti et al. teach a convolutional operation to shift a weight filter by a stride over an input image in a snake like pattern (“a kernel comprising k² weights w arranged in a matrix having k×k elements, the convolution between the input image and the kernel provides for processing the image for generating a so-called featuremap comprising a plurality of features, each feature being associated to a corresponding area of k×k pixels (source pixels) of the input image, by carrying out the following operations: the k×k kernel is “overlapped” over a corresponding k×k portion of the input image in order to have each source pixel of said portion of the input image that is associated with a corresponding element of the kernel matrix, with the center element of the kernel matrix which is associated with a central source pixel of said portion; the pixel value of each pixel included in said portion of the input image is weighted by multiplying it with the weight w corresponding to the associated element of the kernel matrix; the weighted pixel values are summed to each other, and a corresponding bias is added; an activation function is applied, obtaining this way a feature associated to the examined portion of the input image; such feature is saved in a position of the featuremap corresponding to the central source pixel; the filter is shifted, horizontally and vertically, by a stride corresponding equal to an integer value (e.g., 1 pixel); the above steps are repeated to cover all the pixels of the input image, in order to obtain a complete featuremap”, [0006]-[0012]) [pattern of horizontal and vertical shift interpreted as a “snake like” pattern].
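
The convolution steps enumerated in the quoted passage (overlap, weight, sum, add bias, apply activation, shift by a stride) can be sketched as follows. This is a simplified Python illustration, not code from Fiandrotti et al.; the function name, the absence of padding, the indexing of outputs by window position, and the default ReLU activation are all assumptions made for the example.

```python
def conv2d(image, kernel, bias=0.0, stride=1, activation=lambda x: max(x, 0.0)):
    """Slide a k x k kernel over a 2-D image and build a featuremap.

    Assumptions of this sketch: no padding, so only fully overlapped
    windows are visited, and ReLU as the default activation.
    """
    k = len(kernel)
    height, width = len(image), len(image[0])
    featuremap = []
    for top in range(0, height - k + 1, stride):        # vertical shift
        row = []
        for left in range(0, width - k + 1, stride):    # horizontal shift
            acc = bias
            for i in range(k):
                for j in range(k):
                    # weight each source pixel by the matching kernel element
                    acc += image[top + i][left + j] * kernel[i][j]
            row.append(activation(acc))
        featuremap.append(row)
    return featuremap
```

For example, a 3×3 image convolved with a 2×2 kernel at stride 1 yields a 2×2 featuremap under these assumptions.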

Garbin et al. and Nurvitadhi et al. and Fiandrotti et al. are in the same art of neural networks (Garbin et al., [0092]; Nurvitadhi et al., IIA; Fiandrotti et al., abstract). The combination of Fiandrotti et al. with Garbin et al. and Nurvitadhi et al. will enable the shifting of a weight filter by a stride. It would have been obvious at the time of filing to one of ordinary skill in the art to combine the shifting a weight filter by a stride of Fiandrotti et al. with the invention of Garbin et al. and Nurvitadhi et al. as this was known at the time of filing, the combination would have predictable results, and as Fiandrotti et al. indicate this is a simple solution existing in the art to reduce the memory requirements for deploying a trained CNN, so as to allow the CNN to be deployed also in devices having low memory capabilities ([0049]), thus expanding possible applications of the invention of Garbin et al. and Nurvitadhi et al.

Regarding claims 4, 13 and 17, Garbin et al. and Nurvitadhi et al. and Fiandrotti et al. disclose the neural network accelerator, method, and system of Claims 3, 12 and 16. Fiandrotti et al. further teach a max-pooling operation to be performed in-between convolutional operations (based on CNN usually comprises several convolutional layers, typically interleaved with subsampling layers (e.g., the so-called max-pooling layers), followed by a sequence of final, fully-connected (i.e., non-convolutional) layers acting as final classifier, [0019]).

Claim(s) 5 is/are rejected under 35 U.S.C. 103 as being unpatentable over Garbin et al. (IDS: US 20180144240 A1) and Nurvitadhi et al. (“Accelerating Binarized Neural Networks: Comparison of FPGA, CPU, GPU, and ASIC”, 2016) and Fiandrotti et al. (US 20210142175 A1) as applied to claim 4 above, further in view of Takamaeda et al. (US 20210232899 A1).

Regarding claim 5, Garbin et al. and Nurvitadhi et al. and Fiandrotti et al. disclose the neural network accelerator of Claim 4. Nurvitadhi et al. partly teach the max-pooling operation uses a sign-bit to select a maximum number for a subregion (This operation can efficiently be done by adjusting the sign bit of vi against the 1-bit weight of Wb. I.e., if they are of the same sign, the output should maintain the sign bit. Otherwise, the output should have the opposite sign, IIB).

Takamaeda et al. teach max-pooling operation uses a sign-bit to select a maximum number for a subregion (“binary neural network circuit” has been proposed in which each of the input data and the weighting coefficient is one bit, [0003], XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks, [0004], storage unit that stores a logarithmic weighting coefficient, in which a value obtained by logarithmizing a weighting coefficient corresponding to input data that is input is expressed in multiple bits, and outputs the logarithmic weighting coefficient bit by bit; a first electronic circuit unit that outputs a multiplication result of the input data and the weighting coefficient; and a second electronic circuit unit that realizes addition and application functions for adding up the multiplication results from the first electronic circuit units, applying an activation function to the addition result, and outputting output data, [0008], process element unit Pe corresponds only to the sign bit, flip-flop pe12 determines the sign of the adder pe11 and the sign of the flip-flop pe8, and the XOR element pe20 determines whether the adder pe11 serving as a counter increments the count by +1 or −1, [0132],  The maximum pooler ac7 has a function of receiving a plurality of output results and selecting only one piece of data. The maximum pooler ac7 has a register (for example, four bits), and compares the previous value with the current input value and outputs the larger one. The maximum pooler ac7 transmits information of the neuron with the strongest reaction, thereby enabling robust inference with a small amount of calculation. In addition, when this function is not used, the addition activation unit Act may be constructed so as to spool the maximum pooling function, [0142]).
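
The compare-and-hold behavior attributed to the maximum pooler in the cited paragraph [0142] (a register that keeps the larger of the previous value and the current input) can be sketched as follows. This is an illustrative Python sketch only; the class and function names are assumptions and are not taken from Takamaeda et al.

```python
class MaxPooler:
    """Holds the largest value seen so far, like a compare-and-update register."""

    def __init__(self):
        self.register = None  # empty until the first input arrives

    def push(self, value):
        """Compare the stored value with the current input and keep the larger."""
        if self.register is None or value > self.register:
            self.register = value
        return self.register


def max_pool(values):
    """Feed a subregion's values through the pooler and return the maximum."""
    pooler = MaxPooler()
    result = None
    for v in values:
        result = pooler.push(v)
    return result
```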

Garbin et al. and Nurvitadhi et al. and Takamaeda et al. are in the same art of binarized neural networks (Garbin et al., [0092]; Nurvitadhi et al., IIA; Takamaeda et al., [0003]). The combination of Takamaeda et al. with Garbin et al. and Nurvitadhi et al. will enable the use of max pooling. It would have been obvious at the time of filing to one of ordinary skill in the art to combine the max pooling of Takamaeda et al. with the invention of Garbin et al. and Nurvitadhi et al. as this was known at the time of filing, the combination would have predictable results, and as Takamaeda et al. indicate this enables robust inference with a small amount of calculation ([0142]) thereby further improving efficiency, the stated goal of Garbin et al. and Nurvitadhi et al., thereby useful when combined with said system.


Claim(s) 6 and 18 is/are rejected under 35 U.S.C. 103 as being unpatentable over Garbin et al. (IDS: US 20180144240 A1) and Nurvitadhi et al. (“Accelerating Binarized Neural Networks: Comparison of FPGA, CPU, GPU, and ASIC”, 2016) as applied to claims 1 and 14 above, further in view of Idgunji et al. (US 20200050920 A1).

Regarding claims 6 and 18, Garbin et al. and Nurvitadhi et al. disclose the neural network accelerator and system of Claims 1 and 14. Garbin et al. and Nurvitadhi et al. do not explicitly disclose the binary convolutional neural network further comprising: a memory to store an input image, a first power rail from a power source to provide a first power to the memory, the first power rail separate from a second power rail to provide a second power to the binary convolutional neural network.

Idgunji et al. teach a memory to store an input image, a first power rail from a power source to provide a first power to the memory, the first power rail separate from a second power rail to provide a second power to the binary convolutional neural network (Many GPU's are massively parallel—meaning they contain many computing elements (e.g., programmable streaming multi-processors (“SM”s)) operating in parallel. This massively parallel architecture allows developers to break down complex computation into smaller parallel pieces that, because they are being performed concurrently, will complete much faster. While exceedingly fast, such an array of parallel computing elements can consume lots of power and generate lots of heat. Therefore, power management has become an important aspect of GPU (and other) complex integrated circuit design and operation, [0004], As shown in FIG. 1, the power management 140 monitors a signal 114 of current and/or voltage sample information received from a signal conditioner and multiplexer (MUX) 112, which samples one or more power rails of the power distribution network 110. The signal 114 indicates the current power provided to the GPU 102. In some embodiments, the GPU 102 may, via a signal 115, select particular power rails to monitor, [0053],  self-learning may be performed in part by a warp of one the parallel processors 116 using one or more deep neural networks that are accelerated by a hardware-based deep learning accelerator (DLA) 141 included as part of GPU 102, [0054], system 100 may be a board comprising one or more GPUs, one or more control processors such as CPUs, and associated memory and/or memory management circuitry, [0056]).

Garbin et al., Nurvitadhi et al., and Idgunji et al. are in the same art of neural networks (Garbin et al., [0092]; Nurvitadhi et al., IIA; Idgunji et al., [0054]). The combination of Idgunji et al. with Garbin et al. and Nurvitadhi et al. will enable the use of several power rails. It would have been obvious at the time of filing to one of ordinary skill in the art to combine the power rails of Idgunji et al. with the invention of Garbin et al. and Nurvitadhi et al. as this was known at the time of filing, the combination would have predictable results, and as Idgunji et al. indicate “While much work has been done in the past, there is a need for further improved solutions that provide an adaptive but tunable system that can handle sudden load step and load releases, without impacting overall performance” ([0011]), which the invention of Idgunji et al. uses to improve parallel computations such as those described by Garbin et al. and Nurvitadhi et al.

Claim(s) 7 is/are rejected under 35 U.S.C. 103 as being unpatentable over Garbin et al. (IDS: US 20180144240 A1) and Nurvitadhi et al. (“Accelerating Binarized Neural Networks: Comparison of FPGA, CPU, GPU, and ASIC”, 2016) and Idgunji et al. (US 20200050920 A1) as applied to claim 6 above, further in view of Sanghvi et al. (US 20190005656 A1).

Regarding claim 7, Garbin et al. and Nurvitadhi et al. and Idgunji et al. disclose the neural network accelerator of Claim 6. Garbin et al. and Nurvitadhi et al. and Idgunji et al. do not explicitly disclose the memory has a plurality of banks to store the input image with each input in a window mapped to a different memory bank to enable a single cycle access to the window.

Sanghvi et al. teach a memory has a plurality of banks to store the input image with each input in a window mapped to a different memory bank to enable a single cycle access to the window (The shared memory 212 stores input and output data for the dense optical flow engine 202. The shared memory 212 includes four banks of static random access memory. The shared memory interconnect 210 is a crossbar with pipelined command and response handling. The DMA 108 is connected to the shared memory interconnect 210 and is used to move data for processing by the DOFE 202 into the shared memory and to move the optical flow data produced by the DOFE 202 out of the optical flow accelerator 112 for consumption by other components on the SOC 100, [0026], For a pixel in the current image, the search for the best matching pixel in the reference frame is restricted to a search window in the current frame, [0033], In this hierarchy, the L3 memory stores the reference and current images, the L2 memory stores a subset of concurrent pixel rows of each of the images, and the L1 memory stores a search window extracted from the reference image rows in the L2 memory, [0042], Further, the tiles are “striped” across multiple memory banks in the L1 memory. Any suitable number of memory banks may be used. Each memory bank is sized to store multiples of whole tiles and a tile is stored in a memory bank such that it can be accessed in a single cycle. The particular arrangement of the tiles across the memory banks may depend, for example, on the number of memory banks available and the size of the search window, [0045]).
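For illustration only, one way in which each input of a window can be mapped to a different memory bank is sketched below. The modulo mapping, the use of k*k banks for a k-by-k window, and all function names are assumptions made for the example. They are not taken from Sanghvi et al. or from the claims, and [0045] itself states that the particular arrangement may vary.

```python
def bank_index(row: int, col: int, k: int) -> int:
    """Assumed mapping: stripe pixel (row, col) across k*k banks by position modulo k."""
    return (row % k) * k + (col % k)


def window_banks(top: int, left: int, k: int) -> list:
    """Return the bank touched by each input of the k-by-k window at (top, left)."""
    return [
        bank_index(top + dr, left + dc, k)
        for dr in range(k)
        for dc in range(k)
    ]


def is_conflict_free(top: int, left: int, k: int) -> bool:
    """True when no two inputs of the window fall in the same bank."""
    banks = window_banks(top, left, k)
    return len(set(banks)) == len(banks)
```

Under this assumed mapping, any k consecutive rows cover every residue modulo k exactly once, and likewise for columns, so each window input lands in a distinct bank and all inputs could be read in the same cycle if each bank services one access per cycle.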


Garbin et al., Nurvitadhi et al., and Sanghvi et al. are in the same art of computer vision and classification (Garbin et al., [0003]; Nurvitadhi et al., part I; Sanghvi et al., abstract, [0021]). The combination of Sanghvi et al. with Garbin et al., Nurvitadhi et al., and Idgunji et al. will enable the use of a plurality of memory banks. It would have been obvious at the time of filing to one of ordinary skill in the art to combine the memory banks of Sanghvi et al. with the invention of Garbin et al., Nurvitadhi et al., and Idgunji et al. as this was known at the time of filing, the combination would have predictable results, and as Sanghvi et al. indicate “Embodiments of the disclosure provide for dense optical flow processing in an embedded computer vision system that meets real time performance requirements. In some embodiments, a hardware accelerator for dense optical flow map calculation is provided. The hardware accelerator includes novel features that improve the performance of dense optical flow computation such as a paxel based search for matching pixels that reduces search time, a hierarchical data organization with tiling to manage data bandwidth, and/or advanced predictor evaluation that avoids refetching of data” ([0020]), thereby indicating the improvement to the data heavy processing tasks described by Garbin et al., Nurvitadhi et al., and Idgunji et al.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: “Interleaved Logic-in-Memory Architecture for Energy-Efficient Fine-Grained Data Processing” 2017: (“Rather than implementing logic and memory as two physically separate modules, it is feasible to design a single, monolithic logic-in-memory (LIM) framework. Such a framework can drastically reduce the need for data transfer between the CPU and memory, improving system power and performance”, p409, “6T SRAM array in a “memory-logic-memory-latch” configuration. Specifically, one logic row is inserted between every two SRAM memory rows, and latches are used to hold either intermediate or final results”, “consists of three basic components: the SRAM cell, lookup table implemented with transmission gates, and modified RS-latch”, have implemented both LUT based and XOR based LIM units into the architecture and designed a complete 16x128 MISK fabric, applications with mixed operations could be implemented with low overhead, p410).

Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHELLE M ENTEZARI HAUSMANN whose telephone number is (571)270-5084. The examiner can normally be reached 10-7 M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, VINCENT M RUDOLPH can be reached on (571)272-8243. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/MICHELLE M ENTEZARI/Primary Examiner, Art Unit 2661