DETAILED ACTION
Response to Arguments
Applicant’s arguments with respect to claim 1 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-2, 7, 11-13, 17 and 19-20 are rejected under 35 U.S.C. 103 as being obvious over Martin et al. EP 3480745 A1 (“Martin”) in view of Kuo et al. US 2019/0220742 A1 (“Kuo”).
Regarding claim 1, Martin teaches a neural processor, comprising: a plurality of neural engine circuits, at least one of the neural engine circuits configured to perform a convolution operation of first input data with one or more kernels to generate a first output(Martin, paras. 0023-0027, see also fig. 3,“ The hardware implementation 300 comprises a plurality of convolution engines 302[a plurality of neural engine circuits], a plurality of accumulators 304, an accumulation buffer 306, a coefficient buffer 308, and an input buffer 310. Each convolution engine 302 [at least one of the neural engine circuits] comprises hardware logic configured[configured to perform a convolution operation] to receive a set of weights {w1 ... w8}[with one or more kernels] that represent all or a portion of a filter, and a set of input data values {d1 .. d8}[ of first input data] that represent all or a portion of a window, and perform a multiply-accumulate calculation on the received weights and input data values… [e]ach accumulator 304 receives the output of one convolution engine 302[to generate a first output] and adds the output to the previous convolution engine output that relates to the same filter. Since the convolution engine may not generate or produce outputs that relate to the same filter in consecutive cycles the partial results of one or more filters may be stored in an accumulation buffer 306 and then the appropriate partial result may be provided to the accumulator each cycle by the accumulation buffer 306. In some examples, the accumulation buffer 306 may be able to store partial results related to 128 different filters.” ); wherein: in the pooling mode, the planar engine circuit is configured to reduce a spatial size of a version of second input data received by the planar engine, the second input data corresponding to the first output or a version of input data of the neural processor(Martin, para. 0052-0065, fig. 
8, “The normalisation module 810 is configured to receive one of the following as input data:…the accumulation output (via the element-wise operations module 806) (e.g. when a convolution layer is processed in the current hardware pass and neither an element-wise layer nor an activation layer is processed in the current hardware pass)[the second input data corresponding to the first output or a version of input data of the neural processor]… [t]he pooling module 812 may receive the normalised data from the normalisation module 810…[t]he pooling module 812 is configured to perform a pooling function, such as, but not limited to, a max or mean function, on the received data to produce pooled data. The purpose of a pooling layer is to reduce the spatial size of the representation [in the pooling mode, the planar engine circuit is configured to reduce a spatial size of a version of second input data received by the planar engine] to reduce the number of parameters and computation in the network, and hence to also control overfitting. In some examples, the pooling operation is performed over a sliding window that is defined per pooling layer.”), and in the elementwise mode, the planar engine circuit configured to perform an elementwise operation on the second input data, the second input data corresponding to the first output or a version of input data of the neural processor(Martin, para. 0052-0065, fig. 8, “The element-wise operations module 806 is configured to receive…the input data…[from] the accumulated result from the accumulation buffer 306 (e.g. 
when a convolution layer is processed in the current hardware pass)[the second input data corresponding to the first output or a version of input data of the neural processor]…[w]hen the element-wise operations module 806 is configured to process the received input data the element-wise operations module 806 performs an element-wise operation on the received data (optionally with another data set (which may be obtained from external memory))[ in the elementwise mode, the planar engine circuit configured to perform an elementwise operation on the second input data]. The element-wise operations module 806 may be configured to perform any suitable element-wise operation such as, but not limited to add, multiply, maximum, and minimum.”).  
Martin does not teach: a planar engine circuit coupled to the plurality of neural engine circuits and configured to operate in parallel with the plurality of neural engine circuits, the planar engine circuit operable in one of two or more modes that include a pooling mode and an elementwise mode configured to generate a second output.
However, Kuo teaches: a planar engine circuit coupled to the plurality of neural engine circuits and configured to operate in parallel with the plurality of neural engine circuits(Kuo, para. 0031, see also fig. 2, “A block (e.g., block 211) is a basic unit of computation. For example, an engine (e.g., the convolution engine 111) may include an array of multiply-and-accumulate (MAC) circuits, and the size of a block may be equal to the size of the MAC array. Thus, operations on a block can be performed in parallel within an engine[a planar engine circuit coupled to the plurality of neural engine circuits and configured to operate in parallel with the plurality of neural engine circuits]. The size of an input tile may be determined by the size of the buffer (e.g., the convolution buffer 151). For example, an entire input tile should fit into the convolution buffer 151.”), the planar engine circuit operable in one of two or more modes that include a pooling mode and an elementwise mode configured to generate a second output(Kuo, para. 0024, see also fig. 1, “The DLA 100 includes multiple engines, each of which performs one type of neural network operations. Each engine includes hardware circuits (e.g., multipliers, adders, accumulators, etc.) for performing mathematical computations. In this example, the DLA 100 includes a convolution engine 111 for performing convolution operations, an activation engine 112 for performing element-wise mathematical operations (e.g., rectification (ReLU), batch normalization (BN), etc.), a pooling engine 113 for performing downsampling operations[the planar engine circuit operable in one of two or more modes that include a pooling mode and an elementwise mode configured to generate a second output], and a mathematical function engine 114 (e.g., for computing trigonometry functions, max/min functions, absolute values, etc.).”).
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Martin with the teachings of Kuo. The motivation to do so would be to have a hardware accelerator that is able to tile the input to fit in buffers for fast operational access (Kuo, paras. 0018-0021, “[I]nput data to the DLA[deep learning accelerator] is retrieved from a system memory external to the DLA, and stored in a buffer memory internal to the DLA. Due to the limited buffer size, only a fraction of the input data can be stored in the buffer memory at any given point of time. Thus, the input data may be partitioned into multiple tiles, and the buffer memory may store one or more tiles at the same time… [t]he DLA includes multiple different engines performing different types of neural network computations. Each engine processes the input feature map on a tile-by-tile basis… [t]hus, the engines may process the tiles in parallel, passing data from one engine to another via the buffer memory to reduce system memory access.”).
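For illustration only, the multiply-accumulate calculation and per-filter accumulation described in the passages of Martin quoted above for claim 1 can be sketched in Python as follows. The function names, the dictionary standing in for the accumulation buffer, and all numeric values are hypothetical; they are not taken from Martin or Kuo, and the sketch does not describe either reference's hardware.

```python
def multiply_accumulate(weights, data):
    """Return the sum of products of one set of weights and one set of input values."""
    if len(weights) != len(data):
        raise ValueError("weights and data must have the same length")
    return sum(w * d for w, d in zip(weights, data))


def accumulate(partial_results, filter_id, engine_output):
    """Add an engine output to the stored partial result for the same filter."""
    partial_results[filter_id] = partial_results.get(filter_id, 0) + engine_output
    return partial_results[filter_id]


# Hypothetical example: one filter-window calculation split into two sets of 8 values.
accumulation_buffer = {}  # stands in for a buffer of per-filter partial results
first_half = multiply_accumulate([1, 2, 3, 4, 5, 6, 7, 8], [1, 1, 1, 1, 1, 1, 1, 1])
second_half = multiply_accumulate([1, 0, 1, 0, 1, 0, 1, 0], [2, 2, 2, 2, 2, 2, 2, 2])
accumulate(accumulation_buffer, filter_id=0, engine_output=first_half)
total = accumulate(accumulation_buffer, filter_id=0, engine_output=second_half)
```

Keying the partial results by a filter identifier mirrors the quoted statement that partial results of a filter may be stored and supplied back to the accumulator when a later output for the same filter arrives.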
Regarding claim 2, Martin in view of Kuo teaches the neural processor of claim 1, further comprising a data processor circuit coupled to the plurality of neural engine circuits and to the planar engine circuit, the data processor circuit configured to buffer the first output for sending the planar engine circuit or the second output for sending to the plurality of neural engines(Martin, para. 0027, see also fig. 3(306),“Since the convolution engine may not generate or produce outputs that relate to the same filter in consecutive cycles the partial results of one or more filters may be stored in an accumulation buffer 306[a data processor circuit coupled to the plurality of neural engine circuits and to the planar engine circuit] and then the appropriate partial result may be provided to the accumulator each cycle by the accumulation buffer 306. In some examples, the accumulation buffer 306 may be able to store partial results related to 128 different filters.” & see also Martin, para. 0053, “The accumulation buffer 306 also outputs the accumulated result to the element-wise operations module 806[the data processing circuit configured to buffer the first output for sending the planar engine circuit]….”).1  
Regarding claim 7, Martin in view of Kuo teaches the neural processor of claim 1, wherein the convolution operation is one of a plurality of operations for implementing a machine learning model(Martin, para. 0020 see also fig. 2, “A DNN may comprise one or more convolution layers each of which is associated with a plurality of filters each of which comprises a plurality of weights. Each filter has a dimension m x n x P (i.e. each filter comprises a set of m x n x P weights w) and is applied to the input data according to a convolution operation across several steps in direction s and t, (which are referred to as windows)[ wherein the convolution operation is one of a plurality of operations] as illustrated in FIG 2. Each filter produces one output plane. The number of filters and the number of weights per filter may vary between convolution layers. A convolutional neural network (CNN), which is a specific type of DNN that is effective for image recognition and classification, generally comprises a plurality of convolution layers[for implementing a machine learning model].”).  
Regarding claim 11, Martin in view of Kuo teaches the neural processor of claim 1, wherein the elementwise operation includes one or more of tensor addition, elementwise maximum, elementwise minimum, or elementwise multiplication(Martin, para. 0054, “The element-wise operations module 806 may be configured to perform any suitable element-wise operation such as, but not limited to add, multiply, maximum, and minimum.”).2  
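For illustration only, the four element-wise operations named in the quoted passage of Martin (add, multiply, maximum, and minimum) can be sketched in Python as operations applied position by position to two equally sized data sets. The names `ELEMENTWISE_OPS` and `elementwise` and the sample values are hypothetical and do not appear in Martin.

```python
import operator

# Hypothetical lookup table mapping an operation name to a two-input function.
ELEMENTWISE_OPS = {
    "add": operator.add,
    "multiply": operator.mul,
    "maximum": max,
    "minimum": min,
}


def elementwise(op_name, a, b):
    """Apply the named operation to corresponding elements of a and b."""
    if len(a) != len(b):
        raise ValueError("both data sets must have the same number of elements")
    op = ELEMENTWISE_OPS[op_name]
    return [op(x, y) for x, y in zip(a, b)]


received = [1, 5, 3]  # hypothetical received data
other = [4, 2, 6]     # hypothetical second data set
results = {name: elementwise(name, received, other) for name in ELEMENTWISE_OPS}
```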
Regarding claim 12, Martin in view of Kuo teaches the neural processor of claim 1, wherein circuitry of the planar engine circuit is reconfigured when switched from the pooling mode to the elementwise mode(Martin, para. 0054, “The element-wise operations module 806 is configured to receive either the input data for the current hardware pass (e.g. when a convolution layer is not processed in the current hardware pass)[ circuitry of the planar engine circuit is reconfigured when switched from the pooling mode to the elementwise mode]…[t]he element-wise operations module 806 may be configured to perform any suitable element-wise operation such as, but not limited to add, multiply, maximum, and minimum.”).
Referring to independent claim 13, it is rejected on the same basis as independent claim 1 since they are analogous claims.
Referring to dependent claim 17, it is rejected on the same basis as dependent claim 7 since they are analogous claims.


Regarding claim 19, Martin teaches an electronic device, comprising: a memory storing a machine learning model(Martin, para. 0021, “A hardware implementation of a convolution layer may comprise a hardware module or block (which may be referred to herein as a convolution engine) that is configured to calculate the sum of the products between the weights
forming all or portion of a filter and input data values forming all or portion of a window (which may be referred to as a filter-window calculation)[a machine learning model]…[p]reparing each convolution engine to perform a filter-window calculation involves reading the appropriate input data and weights for each filter-window calculation from memory[a memory storing] and providing it to one of the convolution engines.”); and a neural processor, comprising: a plurality of neural engine circuits, at least one of the neural engine circuits configured to perform a convolution operation of first input data with one or more kernels to generate a first output(Martin, paras. 0023-0027, see also fig. 3,“ The hardware implementation 300 comprises a plurality of convolution engines 302[a plurality of neural engine circuits], a plurality of accumulators 304, an accumulation buffer 306, a coefficient buffer 308, and an input buffer 310. Each convolution engine 302 [at least one of the neural engine circuits] comprises hardware logic configured[configured to perform a convolution operation] to receive a set of weights {w1 ... w8}[with one or more kernels] that represent all or a portion of a filter, and a set of input data values {d1 .. d8}[ of first input data] that represent all or a portion of a window, and perform a multiply-accumulate calculation on the received weights and input data values… [e]ach accumulator 304 receives the output of one convolution engine 302[to generate a first output] and adds the output to the previous convolution engine output that relates to the same filter. Since the convolution engine may not generate or produce outputs that relate to the same filter in consecutive cycles the partial results of one or more filters may be stored in an accumulation buffer 306 and then the appropriate partial result may be provided to the accumulator each cycle by the accumulation buffer 306. 
In some examples, the accumulation buffer 306 may be able to store partial results related to 128 different filters.”); wherein: in the pooling mode, the planar engine circuit is configured to reduce a spatial size of a version of second input data received by the planar engine, the second input data corresponding to the first output or a version of input data of the neural processor(Martin, para. 0052-0065, fig. 8, “The normalisation module 810 is configured to receive one of the following as input data:…the accumulation output (via the element-wise operations module 806) (e.g. when a convolution layer is processed in the current hardware pass and neither an element-wise layer nor an activation layer is processed in the current hardware pass)[the second input data corresponding to the first output or a version of input data of the neural processor]… [t]he pooling module 812 may receive the normalised data from the normalisation module 810…[t]he pooling module 812 is configured to perform a pooling function, such as, but not limited to, a max or mean function, on the received data to produce pooled data. The purpose of a pooling layer is to reduce the spatial size of the representation [in the pooling mode, the planar engine circuit is configured to reduce a spatial size of a version of second input data received by the planar engine] to reduce the number of parameters and computation in the network, and hence to also control overfitting. In some examples, the pooling operation is performed over a sliding window that is defined per pooling layer.”), and in the elementwise mode, the planar engine circuit is configured to perform an elementwise operation on the second input data, the second input data corresponding to the first output or a version of input data of the neural processor(Martin, para. 0052-0065, fig. 8, “The element-wise operations module 806 is configured to receive…the input data…[from] the accumulated result from the accumulation buffer 306 (e.g. 
when a convolution layer is processed in the current hardware pass)[the second input data corresponding to the first output or a version of input data of the neural processor]…[w]hen the element-wise operations module 806 is configured to process the received input data the element-wise operations module 806 performs an element-wise operation on the received data (optionally with another data set (which may be obtained from external memory))[in the elementwise mode, the planar engine circuit is configured to perform an elementwise operation on the second input data]. The element-wise operations module 806 may be configured to perform any suitable element-wise operation such as, but not limited to add, multiply, maximum, and minimum.”).  
Martin does not teach: a planar engine circuit coupled to the plurality of neural engine circuits and configured to operate in parallel with the plurality of neural engine circuits, the planar engine circuit operable in one of two or more modes that include a pooling mode and an elementwise mode configured to generate a second output.
However, Kuo teaches: a planar engine circuit coupled to the plurality of neural engine circuits and configured to operate in parallel with the plurality of neural engine circuits(Kuo, para. 0031, see also fig. 2, “A block (e.g., block 211) is a basic unit of computation. For example, an engine (e.g., the convolution engine 111) may include an array of multiply-and-accumulate (MAC) circuits, and the size of a block may be equal to the size of the MAC array. Thus, operations on a block can be performed in parallel within an engine[a planar engine circuit coupled to the plurality of neural engine circuits and configured to operate in parallel with the plurality of neural engine circuits]. The size of an input tile may be determined by the size of the buffer (e.g., the convolution buffer 151). For example, an entire input tile should fit into the convolution buffer 151.”), the planar engine circuit operable in one of two or more modes that include a pooling mode and an elementwise mode configured to generate a second output(Kuo, para. 0024, see also fig. 1, “The DLA 100 includes multiple engines, each of which performs one type of neural network operations. Each engine includes hardware circuits (e.g., multipliers, adders, accumulators, etc.) for performing mathematical computations. In this example, the DLA 100 includes a convolution engine 111 for performing convolution operations, an activation engine 112 for performing element-wise mathematical operations (e.g., rectification (ReLU), batch normalization (BN), etc.), a pooling engine 113 for performing downsampling operations[the planar engine circuit operable in one of two or more modes that include a pooling mode and an elementwise mode configured to generate a second output], and a mathematical function engine 114 (e.g., for computing trigonometry functions, max/min functions, absolute values, etc.).”).
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Martin with the teachings of Kuo. The motivation to do so would be to have a hardware accelerator that is able to tile the input to fit in buffers for fast operational access (Kuo, paras. 0018-0021, “[I]nput data to the DLA[deep learning accelerator] is retrieved from a system memory external to the DLA, and stored in a buffer memory internal to the DLA. Due to the limited buffer size, only a fraction of the input data can be stored in the buffer memory at any given point of time. Thus, the input data may be partitioned into multiple tiles, and the buffer memory may store one or more tiles at the same time… [t]he DLA includes multiple different engines performing different types of neural network computations. Each engine processes the input feature map on a tile-by-tile basis… [t]hus, the engines may process the tiles in parallel, passing data from one engine to another via the buffer memory to reduce system memory access.”).
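For illustration only, the partitioning of input data into tiles that fit a limited buffer, as described in the quoted passages of Kuo, can be sketched in Python as follows. The function name, the tile representation, and the dimensions are hypothetical; Kuo's actual tile sizing and traversal order are not reproduced here.

```python
def partition_into_tiles(height, width, tile_h, tile_w):
    """Split a height x width feature map into tiles of at most tile_h x tile_w.

    Each tile is returned as (top, left, tile_height, tile_width). Tiles on the
    bottom and right edges are smaller when the map size is not a multiple of
    the tile size.
    """
    if tile_h <= 0 or tile_w <= 0:
        raise ValueError("tile dimensions must be positive")
    tiles = []
    for top in range(0, height, tile_h):
        for left in range(0, width, tile_w):
            tiles.append((top, left,
                          min(tile_h, height - top),
                          min(tile_w, width - left)))
    return tiles


# Hypothetical 5 x 4 feature map and a buffer that holds a 2 x 3 tile.
tiles = partition_into_tiles(5, 4, 2, 3)
covered = sum(h * w for _, _, h, w in tiles)
```

Because the tiles do not overlap and together cover every element of the feature map, each tile can be loaded into the buffer and processed independently of the others.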
Regarding claim 20, Martin in view of Kuo teaches the electronic device of claim 19, wherein the convolution operation is one of a plurality of operations for implementing a machine learning model(Martin, para. 0020 see also fig. 2, “A DNN may comprise one or more convolution layers each of which is associated with a plurality of filters each of which comprises a plurality of weights. Each filter has a dimension m x n x P (i.e. each filter comprises a set of m x n x P weights w) and is applied to the input data according to a convolution operation across several steps in direction s and t, (which are referred to as windows)[ wherein the convolution operation is one of a plurality of operations] as illustrated in FIG 2. Each filter produces one output plane. The number of filters and the number of weights per filter may vary between convolution layers. A convolutional neural network (CNN), which is a specific type of DNN that is effective for image recognition and classification, generally comprises a plurality of convolution layers[for implementing a machine learning model].”). 

Claims 3-6 and 14-16 are rejected under 35 U.S.C. 103 as being unpatentable over Martin et al. EP 3480745 A1 (“Martin”) in view of Kuo et al. US 2019/0220742 A1 (“Kuo”) and further in view of Zhang et al. "Improving the performance of OpenCL-based FPGA accelerator for convolutional neural network." Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. (2017) (“Zhang”).
Regarding claim 3, Martin in view of Kuo teaches the neural processor of claim 1, but does not teach: wherein the planar engine circuit comprises: a first filter circuit configured to, in the pooling mode, reduce a first size of a first dimension of the version of the second input data to generate an intermediate data, and a second filter circuit configured to, in the pooling mode, reduce a second size of a second dimension of the intermediate data to generate a version of the second output.  
However, Zhang teaches: a first filter circuit configured to, in the pooling mode, reduce a first size of a first dimension of the version of the second input data to generate an intermediate data, and a second filter circuit configured to, in the pooling mode, reduce a second size of a second dimension of the intermediate data to generate a version of the second output(Zhang, pg. 31, see also fig. 8, “The pooling layer outputs the average or the maximum value of a local area of the input feature map. Pooling layers can be expressed
as Equation (12), $out(f_o, x, y) = \max_{0 < (k_x, k_y) < k} in(f_i, x + k_x, y + k_y)$, where k  We implement a similar line buffer as in Section 5.2.1, which uses the connections between different register stages to accomplish the window selection. In our design, we use a 4-input comparator to get the maximum value of a 2x2 window.” Zhang teaches: $x + k_x$ (a first filter circuit configured to, in the pooling mode, reduce a first size of a first dimension of the version of the second input data) line buffer and fig. 8 (to generate an intermediate data) $y + k_y$ (and a second filter circuit configured to, in the pooling mode, reduce a second size of a second dimension) line buffer and fig. 8 (of the intermediate data) $out(f_o, x, y)$ (to generate a version of the second output)).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Martin in view of Kuo with the teachings of Zhang. The motivation to do so would be to construct an FPGA accelerator for CNN classifiers (Zhang, pg. 25, “Convolutional Neural Networks (CNNs) are widely used in computer vision, speech recognition, natural language processing and text classification. Over the past decade, the accuracy and the performance of CNN has improved significantly, mainly due to the enhanced neural network structures enabled by massive datasets and increased computational resources benefits from the CMOS scaling to train the models in reasonable time. In recent years, FPGA has become an attractive solution to accelerate CNN classification…[i]n this work, to achieve a high performance CNN accelerator, we first propose an analytic model to guide our kernel design to achieve a better mapping from OpenCL kernels to FPGA hardware.”).
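For illustration only, the max pooling expression quoted from Zhang for claim 3 can be sketched in Python for a single feature map, alongside a two-stage variant that first reduces along x and then along y. The sketch assumes window offsets k_x and k_y running from 0 to k - 1 and a window that slides one position at a time; those bounds, the function names, and the sample values are assumptions made for the example. The two-stage variant is not asserted to be Zhang's implementation; it only shows that reducing one dimension to an intermediate result and then the other yields the same output as the direct window maximum.

```python
def pool_direct(feature_map, k):
    """Maximum over each k x k window: out[y][x] = max of in[y + ky][x + kx]."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [[max(feature_map[y + ky][x + kx]
                 for ky in range(k) for kx in range(k))
             for x in range(cols - k + 1)]
            for y in range(rows - k + 1)]


def pool_along_x(feature_map, k):
    """First stage: reduce the x dimension with a window of k values per row."""
    cols = len(feature_map[0])
    return [[max(row[x:x + k]) for x in range(cols - k + 1)]
            for row in feature_map]


def pool_along_y(feature_map, k):
    """Second stage: reduce the y dimension with a window of k values per column."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [[max(feature_map[y + ky][x] for ky in range(k))
             for x in range(cols)]
            for y in range(rows - k + 1)]


# Hypothetical 3 x 3 input and a 2 x 2 window.
feature_map = [[1, 3, 2],
               [4, 0, 5],
               [7, 6, 8]]
intermediate = pool_along_x(feature_map, 2)
two_stage = pool_along_y(intermediate, 2)
direct = pool_direct(feature_map, 2)
```

The equivalence holds because the maximum over a rectangular window equals the maximum of the per-row maxima within that window.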
Regarding claim 4, Martin in view of Kuo and in view of Zhang teaches the neural processor of claim 3, wherein the planar engine circuit further comprises a line buffer circuit between first filter circuit and the second filter circuit, the line buffer circuit configured to store the intermediate data for sending to the second filter circuit(Zhang, pg. 31, see also figs. 7 and 8, “As shown in Figure 8, we implement a line buffer...between local memory and external memory to flatten and rearrange data. The goal is to minimize the random data access penalty from external memory and to improve on-chip data reuse. The line buffer streams data from external memory which has a continuous address and converts it into the data order for 2D convolution… we fill the line buffer[a line buffer circuit between] using a ping-pong mechanism to pipeline the data access[first filter circuit] and computation [and the second filter circuit]. More specifically, we choose to fill the 256 memory locations at one time, as 256 is not only half of the Altera M20K memory depth but also the maximum data burst size of the DDR4 interface[the line buffer circuit configured to store the intermediate data for sending to the second filter circuit].”).  
Regarding claim 5, Martin in view of Kuo and in view of Zhang teaches the neural processor of claim 3, wherein at least one of the first filter circuit or the second filter circuit is configured to perform, in the elementwise mode, the elementwise operation on a version of the second input data (Martin, para. 0052-0065, fig. 8, “The element-wise operations module 806 is configured to receive either the input data for the current hardware pass (e.g. when a convolution layer is not processed in the current hardware pass) or the accumulated result from the accumulation buffer 306 (e.g. when a convolution layer is processed in the current hardware pass). The element-wise operations module 806 may either process the received input data or pass the received input data to another module…[w]hen the element-wise operations module 806 is configured to process the received input data the element-wise operations module 806 performs an element-wise operation on the received data (optionally with another data set (which may be obtained from external memory)). The element-wise operations module 806 may be configured to perform any suitable element-wise operation such as, but not limited to add, multiply, maximum, and minimum.”).3
Regarding claim 6, Martin in view of Kuo and in view of Zhang teaches the neural processor of claim 3, wherein the planar engine circuit further comprises a format converter coupled to the first filter circuit, the format converter configured to perform one or more format conversions on the second input data to generate the version of the second input data (Zhang, pg. 31, see also figs. 7(a) and 8, “As shown in Figure 8, we implement a line buffer…between local memory and external memory to flatten and rearrange data. The goal is to minimize the random data access penalty from external memory and to improve on-chip data reuse. The line buffer streams data from external memory which has a continuous address and converts it into the data order for 2D convolution [the format converter configured to perform one or more format conversions on the second input data to generate the version of the second input data]…we fill the line buffer using a ping-pong mechanism to pipeline the data access… [a format converter coupled to the first filter circuit].”).
Referring to dependent claims 14-16, they are rejected on the same basis as dependent claims 3-5, since they are analogous claims.

Claims 8-10 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Martin et al. EP 3480745 A1 (“Martin”) in view of Kuo et al. US 2019/0220742 A1 (“Kuo”) and further in view of Bai et al. "A CNN accelerator on FPGA using depthwise separable convolution." IEEE Transactions on Circuits and Systems II: Express Briefs 65.10 (2018) (“Bai”).
Regarding claim 8, Martin in view of Kuo teaches the neural processor of claim 1, but does not teach: wherein the planar engine circuit is further configured to, in a reduction mode, reduce the rank of a tensor.
However, Bai teaches: wherein the planar engine circuit is further configured to, in a reduction mode, reduce the rank of a tensor (Bai, pgs. 1417-1418, see also figs. 3, 5, 9 and 10, “Pointwise convolution is actually standard convolution with kernel size 1 × 1 (Fig. 9). To fully take advantage of all the multipliers in MME, the input feature map is divided into several M × M × 32 sub-matrices, and these sub-matrices are shifted into line buffers one after another. This idea comes from divide and conquer algorithm in large matrix multiplication illustrated in Fig. 10, which consists in dividing large matrix into several small matrices [in a reduction mode, reduce the rank of a tensor].”).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Martin in view of Kuo with the teachings of Bai. The motivation to do so would be to implement an accelerator that speeds up matrix multiplications and decreases memory latency (Bai, pg. 1415, “The key contributions of this brief are: (1) A high performance CNN hardware accelerator framework is proposed where all layers are processed in a computing unit named matrix multiplication engine. (2) The utilization of hierarchical memory structure and ping-pong on-chip buffer reduces the bandwidth limitation of off-chip memory. (3) A methodology for scalable design is proposed, so that this framework can be implemented in various FPGAs, through balancing the on-chip resources and performance.”).
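For illustration only, the divide-and-conquer multiplication quoted from Bai in the rejection of claim 8 may be sketched as follows. The sketch treats a pointwise (1 × 1) convolution as the product of a pixels-by-channels matrix and a channels-by-outputs weight matrix, splits the first matrix into row blocks, and checks that the blockwise result equals the full product. The function names, the row-block split, and the matrix sizes are assumptions chosen for the example; they are not taken from Bai and do not model the MME hardware.

```python
def matmul(a, b):
    """Plain matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def blocked_matmul(a, b, block_rows):
    """Multiply a by b after splitting a into blocks of at most block_rows rows.

    Each small block is multiplied on its own and the partial results are
    concatenated, which mirrors dividing a large matrix into small matrices.
    """
    out = []
    for start in range(0, len(a), block_rows):
        out.extend(matmul(a[start:start + block_rows], b))
    return out


# Hypothetical sizes: 6 pixels, 4 input channels, 3 output channels.
pixels = [[(p + c) % 5 for c in range(4)] for p in range(6)]
weights = [[(c * k + 1) % 3 for k in range(3)] for c in range(4)]

# The blockwise product matches the product computed in one pass.
assert blocked_matmul(pixels, weights, block_rows=2) == matmul(pixels, weights)
```

The row-block split is only one possible partition; it was chosen here because each block can be processed independently, one after another.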
Regarding claim 9, Martin in view of Kuo and in view of Bai teaches the neural processor of claim 8, wherein the planar engine circuit comprises a filter circuit configured to: reduce the spatial size of the second data received in the pooling mode (Martin, para. 0052-0065, see also fig. 8, “The pooling module 812 may receive the normalised data from the normalisation module 810…[t]he pooling module 812 is configured to perform a pooling function, such as, but not limited to, a max or mean function, on the received data to produce pooled data. The purpose of a pooling layer is to reduce the spatial size of the representation to reduce the number of parameters and computation in the network, and hence to also control overfitting. In some examples, the pooling operation is performed over a sliding window that is defined per pooling layer.”), perform the elementwise operation of versions of one or more tensors in the elementwise mode (Martin, para. 0052-0065, see also fig. 8, “The element-wise operations module 806 may either process the received input data or pass the received input data to another module…[w]hen the element-wise operations module 806 is configured to process the received input data the element-wise operations module 806 performs an element-wise operation on the received data (optionally with another data set (which may be obtained from external memory)). The element-wise operations module 806 may be configured to perform any suitable element-wise operation such as, but not limited to add, multiply, maximum, and minimum.”), and generate a scalar value in the reduction mode (Bai, pgs. 1417-1418, see also figs. 3, 5, 9 and 10, “For one MME, it is able to do M² × 32 and 32 × 9 multiplication at once. The adder tree sums up the 32 products in each cell as revealed by Fig. 9. Thus the output channel number is 9 [and generate a scalar value in the reduction mode].”).
Regarding claim 10, Martin in view of Kuo teaches the neural processor of claim 1, but does not teach: wherein the first input data represent data across a plurality of channels and the second input data represents data in one of the channels.
However, Bai teaches: wherein the first input data represent data across a plurality of channels (Bai, pg. 1417, “To avoid losing too much information, standard convolution is adopted to do the first layer convolution. Therefore, this accelerator is adapted to be able to do the standard convolution with input feature map channel is 3 [the first input data represent data across a plurality of channels].”) and the second input data represents data in one of the channels (Bai, pg. 1418, “Average pooling and max pooling are treated differently. As pixels of a feature map channel are output one by one [and the second input data represents data in one of the channels], average pooling could be easily calculated by adding one more multiply-accumulate stage by a factor of 1/S, where S is average pooling size. On the other hand, max pooling needs one more comparison stage.”).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Martin in view of Kuo with the teachings of Bai. The motivation to do so would be to implement an accelerator that speeds up matrix multiplications and decreases memory latency (Bai, pg. 1415, “The key contributions of this brief are: (1) A high performance CNN hardware accelerator framework is proposed where all layers are processed in a computing unit named matrix multiplication engine. (2) The utilization of hierarchical memory structure and ping-pong on-chip buffer reduces the bandwidth limitation of off-chip memory. (3) A methodology for scalable design is proposed, so that this framework can be implemented in various FPGAs, through balancing the on-chip resources and performance.”).
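For illustration only, the pooling behavior quoted from Bai at pg. 1418 in the rejection of claim 10 may be sketched as follows. Pixels of one channel arrive one by one; average pooling adds a multiply-accumulate step with the factor 1/S, and max pooling adds a comparison step. The function names, the use of non-overlapping windows, and the dropping of an incomplete trailing window are assumptions made for the example and are not taken from Bai.

```python
def average_pool(stream, size):
    """Average non-overlapping windows of `size` pixels from a pixel stream."""
    scale = 1.0 / size              # the factor 1/S
    out, acc = [], 0.0
    for count, pixel in enumerate(stream, start=1):
        acc += pixel * scale        # one multiply-accumulate per pixel
        if count % size == 0:       # window complete: emit and reset
            out.append(acc)
            acc = 0.0
    return out                      # an incomplete trailing window is dropped


def max_pool(stream, size):
    """Maximum of non-overlapping windows of `size` pixels from a pixel stream."""
    out, best = [], None
    for count, pixel in enumerate(stream, start=1):
        if best is None or pixel > best:   # one comparison per pixel
            best = pixel
        if count % size == 0:              # window complete: emit and reset
            out.append(best)
            best = None
    return out                             # an incomplete trailing window is dropped


# Hypothetical single-channel pixel stream with pooling size S = 2.
channel = [2, 4, 6, 8]
assert average_pool(channel, 2) == [3.0, 7.0]
assert max_pool(channel, 2) == [4, 8]
```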
Referring to dependent claim 18, it is rejected on the same basis as dependent claim 8, since they are analogous claims.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Adam Clark Standke whose telephone number is (571)270-1806. The examiner can normally be reached 10AM-7PM M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael J Huntley, can be reached on (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



Adam Clark Standke
Assistant Examiner
Art Unit 2129
/MICHAEL J HUNTLEY/
Supervisory Patent Examiner, Art Unit 2129


        1 According to the broadest reasonable interpretation (BRI), the use of alternative language amounts to the claim requiring one or more elements but not all.
        2 According to the broadest reasonable interpretation (BRI), the use of alternative language amounts to the claim requiring one or more elements but not all.
        3 According to the broadest reasonable interpretation (BRI), the use of alternative language amounts to the claim requiring one or more elements but not all.