Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION

Reasons for Allowance

Claims 1-20 are allowed over the prior art.

The following is an examiner’s statement of reasons for allowance: 
The prior art made of record fails to teach the underlined limitations within the independent claims:

Regarding Claim 1,
A method for mapping an input map of a convolutional neural network layer to an output map using a set of convolution kernels, comprising: providing an array of locally interconnected processing elements, the processing elements being arranged on a regular two-dimensional grid, the grid defining at least three different spatial directions of the array along which unidirectional dataflows on the array are supported, each processing element being adapted to: store, during at least one computation cycle, a value carried by each unidirectional dataflow traversing or originating at the processing element, and combine values carried by unidirectional dataflows traversing the processing element in different spatial directions into a new value carried by at least one of the supported unidirectional dataflows; 
providing, for each data entry in the output map to be computed, a plurality of products from pairs of weights of a selected convolution kernel and data entries in the input map addressed by the weights, a data entry position in the output map determining the selected convolution kernel and its position relative to the input map; arranging, for each data entry in the output map, the plurality of products into a plurality of partial sums to be computed, each partial sum including at least the products associated with a first dimension of the selected convolution kernel, different partial sums being associated with at least a second dimension of the selected convolution kernel; computing the data entries in the output map by performing at least once the steps of:  determining at least one parallelogram set of processing elements, corresponding to at least one data entry in the output map, each parallelogram set comprising a first side pair, being parallel to a first spatial direction of the array, and a second side pair, being parallel to second spatial direction of the array, one side of the second side pair defining a parallelogram base, storing each product of the plurality of products associated with at least one data entry in the output map in a processing element of the at least one corresponding parallelogram set, stored products associated with a same partial sum being distributed along a first spatial direction of the array, different partial sums being distributed along a second spatial direction of the array, accumulating, in a first accumulation phase, products associated with a same partial sum on the array by performing the steps of: starting first unidirectional dataflows by moving, once per computation cycle, the values stored in the processing elements associated with each parallelogram base to the next connected processing element along a pre-determined first flow direction; starting second unidirectional dataflows by moving, once per computation 
cycle, the values stored in the remaining processing elements of each parallelogram set to the next connected processing element along a pre-determined second flow direction; and combining, once per computation cycle, an incomplete partial sum, corresponding to a value of a first unidirectional dataflow passing through a processing element of the array, with a product, corresponding to a value of a second unidirectional dataflow also passing through the same processing element, into a new value for the first unidirectional dataflow if the product belongs to and further completes the partial sum, wherein partial sums are completed if the first and second unidirectional dataflows have collapsed each parallelogram set to its base, and wherein the first and second flow direction are selected from the third spatial direction and one of the first or second spatial direction; accumulating, in a second accumulation phase, partial sums on the array into at least one data entry in the output map by performing the steps of: starting third unidirectional dataflows by moving, once per computation cycle, the values stored in the processing elements associated with one of the parallelogram base vertices of each collapsed parallelogram to the next connected processing element along a pre-determined third flow direction; starting fourth unidirectional dataflows by moving, once per computation cycle, the values stored in the remaining processing elements of each collapsed parallelogram to the next connected processing element along a pre-determined fourth flow direction; combining, once per computation cycle, an incomplete data entry in the output map, corresponding to a value of a third unidirectional dataflow passing through a processing element of the array, with a partial sum, corresponding to a value of a fourth unidirectional dataflow also passing through the same processing element, into a new value for the third unidirectional dataflow if the partial sum belongs to and 
further completes the data entry in the output map, wherein each data entry in the output map is completed if the third and fourth unidirectional dataflows have reduced each collapsed parallelogram to one of its vertices, and wherein the third and fourth flow direction are selected from the third spatial direction and the other one of the first or second spatial direction, not selected for the first and second flow direction. 
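For illustration only, the arithmetic recited in Claim 1 (products arranged into partial sums, followed by two accumulation phases) can be sketched as below. This sketch assumes a single two-dimensional input map, a single two-dimensional kernel, unit stride and no padding; the function names are hypothetical, and the parallelogram sets and unidirectional dataflows of the claimed array are not modeled.

```python
def output_entry(input_map, kernel, row, col):
    """Compute one output-map entry in two accumulation phases."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    # Products of each weight with the input-map entry it addresses.
    products = [
        [kernel[i][j] * input_map[row + i][col + j] for j in range(k_cols)]
        for i in range(k_rows)
    ]
    # First accumulation phase: one partial sum per kernel row, each
    # collecting the products along the first kernel dimension.
    partial_sums = [sum(row_products) for row_products in products]
    # Second accumulation phase: the partial sums, which differ along the
    # second kernel dimension, are accumulated into the output entry.
    return sum(partial_sums)


def map_input_to_output(input_map, kernel):
    """Apply output_entry at every valid kernel position (no padding)."""
    out_rows = len(input_map) - len(kernel) + 1
    out_cols = len(input_map[0]) - len(kernel[0]) + 1
    return [
        [output_entry(input_map, kernel, r, c) for c in range(out_cols)]
        for r in range(out_rows)
    ]


if __name__ == "__main__":
    example_input = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    example_kernel = [[1, 0], [0, 1]]
    print(map_input_to_output(example_input, example_kernel))
```

The position of the output entry (row, col) fixes where the kernel sits on the input map, which mirrors the claim language that the data entry position determines the kernel position relative to the input map.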

Regarding Claim 11,
A hardware system for performing mappings in convolutional neural network layers  , comprising: a synchronized, two-dimensional array of locally interconnected processing elements regularly organized on a grid  , the grid defining three different flow directions of unidirectional dataflows between connected neighboring processing elements on the array   , each processing element comprising: a first logical level comprising:  three inputs for receiving partial results of incoming unidirectional dataflows, an addition unit adapted for accumulating received partial results of two different unidirectional dataflows, thereby providing updated partial results, at least three synchronized output registers for temporarily storing partial results during a computation cycle, stored partial results of three output registers corresponding to values of outgoing unidirectional dataflows, and output selection means for selecting, for each output register, a partial result to be stored from one of a received partial result, an updated partial result or a generated partial result, a second logical level comprising a storage element for selectively storing a received weight and selectively storing a data entry in the input map, the storage element being adapted to propagate a stored weight and/or a data entry in the input map to the storage element of a neighboring connected processing element, and a multiplication unit for generating a partial result based on the stored weight and the stored data entry in the input map, a global control logic for generating synchronization signals and control signals for each processing element, global input means for receiving, at most once per neural network layer, data entries in the input map and weights of selected convolution kernels and for applying them to a subset of processing elements at a boundary of the array, the applied weights being stored in the storage element of the processing elements for reuse for as long as new data entries of 
the input map are applied, received data entries in the input map being reused for a plurality of data entries in the output depending thereon, global output means for reading out, from a subset of processing elements at a boundary of the array, fully accumulated results as data entries in the output map of a convolutional neural network layer.


3.	Regarding Claim 1: VENKATARAMANI et al. (USPUB 20190303743) teaches a method for mapping an input map of a convolutional neural network layer to an output map using a set of convolution kernels (Paragraph [0078]- “…Coarse-grained Data Instructions 1004: e.g., compute dominant instructions such as convolutions (nD-convolutions), etc. They may be executed on the PE arrays of a compute intensive tile….” and Paragraphs [0076] and [0080]), comprising: providing an array of locally interconnected processing elements (Paragraph [0041]- “…A processor (e.g., processing system) to process a neural network may be a computing system made up of a number of (e.g., highly) interconnected processing elements,…” AND Paragraph [0101]- “…multiple convolutional layer and fully connected layer chips are interconnected in a two-tiered hierarchy to form a compute node. This subsection described the interconnectivity at the node-level….”), the processing elements being arranged on a regular two-dimensional grid (FIG. 9 and Paragraph [0074]- “…The processing tiles in FIG. 9 are arranged as a multiple dimensional (e.g., 2D) grid, with alternating columns (e.g., or rows) of compute intensive tiles and memory intensive tiles. Chip 900 includes a plurality (e.g., 3) compute intensive tiles per memory intensive tile,…” AND Paragraph [0053]- “…The neurons in a convolutional layer may be arranged as multiple dimensional grids (e.g., two, three, four dimensional, etc.) called features. In one embodiment, the layer takes multiple input features and produces multiple feature outputs….”),
Within analogous art, Yu Hsin Chen (NPL Doc.: “Eyeriss: A Spatial Architecture for Energy Efficient Data flow for Convolutional Neural Networks,” 18th June 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture, ACM SIGARCH Computer Architecture News, Volume 44, Issue 3, June 2016, Pages 367-376) teaches the grid defining at least three different spatial directions of the array along which unidirectional dataflows on the array are supported (Page 373- Col. 1- “Weight Stationary …PE at each cycle is either passed to its neighbor PE…” AND Page 375- Col. 1 & 2- “D. Dataflow Modeling Side Note … as communication with a neighbor PE…” AND Figures 8 and 9), each processing element being adapted to: store, during at least one computation cycle, a value carried by each unidirectional dataflow traversing or originating at the processing element (Page 373- Col. 1- “A. Dataflow Implementation … Weight Stationary: Each PE holds a single weight in the RF at a time. The psum generated in a PE at each cycle is either passed to its neighbor PE or stored back to the global buffer, and the PE array operates as a systolic array with little local control. This also leaves a large area for the global buffer, which is crucial to the operation of WS dataflow….”), and combine values carried by unidirectional dataflows traversing the processing element in different spatial directions into a new value carried by at least one of the supported unidirectional dataflows (Page 370- Col. 2- “…Each pixel in an ifmap plane from the same channel is broadcast to the same R×R PEs sequentially, and the psums generated by each PE are further accumulated spatially across these PEs. Multiple planes of R×R weights from different filters and/or channels can be deployed either across multiple R×R PEs in the array or onto the same R×R PEs….” AND Page 371- Col. 1- “V. ENERGY-EFFICIENT DATAFLOW: ROW STATIONARY…”);

Within analogous art, Fengbin Tu (NPL Doc.: “Deep Convolutional Neural Network Architecture With Reconfigurable Computation Patterns,” 12th April 2017, IEEE Transactions on Very Large Scale Integration (VLSI) Systems (Volume: 25, Issue: 8, Aug. 2017), Pages 2220-2228) teaches providing, for each data entry in the output map to be computed, a plurality of products from pairs of weights of a selected convolution kernel and data entries in the input map addressed by the weights (Fig. 3 AND Page 2221- Col. 1- “DCNN model … Fig. 2(a) shows a CONV layer in DCNNs. It takes N×H×L feature maps as the inputs, and has M 3-D convolutional kernels (K×K×N). Each kernel performs a 3-D convolution on the input maps with a sliding stride of S, which generates an R×C output map. Therefore, the output map number equals to the kernel number M. The computation of a CONV layer can be expressed as the multilayer loop in Fig. 2(b), where matrices I, O, and W stand for the input maps, output maps, and kernel weights….”), a data entry position in the output map determining the selected convolution kernel and its position relative to the input map (Page 2221- Fig. 3 AND Page 2223- Col. 1- lines 1-6 and 12-22); arranging, for each data entry in the output map, the plurality of products into a plurality of partial sums to be computed (Page 2227- Col. 1- lines 1-14), each partial sum including at least the products associated with a first dimension of the selected convolution kernel, different partial sums being associated with at least a second dimension of the selected convolution kernel (Page 2223- Col. 2- “…In IR and OR, repeated weight access to DRAM would occur if TW exceeds the weight buffer size. So we propose a reuse pattern named weight reuse (WR), to minimize DRAM access for weights. Shown in Fig. 5(e), WR has four stages: 1) the core loads Tn tiled input maps to the Input REGs; 2) those input maps are used to update Tm corresponding partial sums; 3) the Tn × Tm kernel weights in the weight buffer are fully reused to compute Tm maps of R × C; and 4) the partial sums are fetched…”);
Within analogous art, Elmegreen et al. (USPUB 20110119215) teaches computing the data entries in the output map by performing at least once the steps of: determining at least one parallelogram set of processing elements (Paragraph [0097]), corresponding to at least one data entry in the output map, each parallelogram set comprising a first side pair (Fig. 6 and Paragraphs [0097-0098]), being parallel to a first spatial direction of the array, and a second side pair, being parallel to second spatial direction of the array, one side of the second side pair defining a parallelogram base (Paragraph [0103]),
within claim 1, but does not teach, nor render obvious, the following limitations: “storing each product of the plurality of products associated with at least one data entry in the output map in a processing element of the at least one corresponding parallelogram set, stored products associated with a same partial sum being distributed along a first spatial direction of the array, different partial sums being distributed along a second spatial direction of the array, accumulating, in a first accumulation phase, products associated with a same partial sum on the array by performing the steps of: starting first unidirectional dataflows by moving, once per computation cycle, the values stored in the processing elements associated with each parallelogram base to the next connected processing element along a pre-determined first flow direction; starting second unidirectional dataflows by moving, once per computation cycle, the values stored in the remaining processing elements of each parallelogram set to the next connected processing element along a pre-determined second flow direction; and combining, once per computation cycle, an incomplete partial sum, corresponding to a value of a first unidirectional dataflow passing through a processing element of the array, with a product, corresponding to a value of a second unidirectional dataflow also passing through the same processing element, into a new value for the first unidirectional dataflow if the product belongs to and further completes the partial sum, wherein partial sums are completed if the first and second unidirectional dataflows have collapsed each parallelogram set to its base, and wherein the first and second flow direction are selected from the third spatial direction and one of the first or second spatial direction; accumulating, in a second accumulation phase, partial sums on the array into at least one data entry in the output map by performing the steps of: starting third unidirectional
dataflows by moving, once per computation cycle, the values stored in the processing elements associated with one of the parallelogram base vertices of each collapsed parallelogram to the next connected processing element along a pre-determined third flow direction; starting fourth unidirectional dataflows by moving, once per computation cycle, the values stored in the remaining processing elements of each collapsed parallelogram to the next connected processing element along a pre-determined fourth flow direction; combining, once per computation cycle, an incomplete data entry in the output map, corresponding to a value of a third unidirectional dataflow passing through a processing element of the array, with a partial sum, corresponding to a value of a fourth unidirectional dataflow also passing through the same processing element, into a new value for the third unidirectional dataflow if the partial sum belongs to and further completes the data entry in the output map, wherein each data entry in the output map is completed if the third and fourth unidirectional dataflows have reduced each collapsed parallelogram to one of its vertices, and wherein the third and fourth flow direction are selected from the third spatial direction and the other one of the first or second spatial direction, not selected for the first and second flow direction.”
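For illustration only, the CONV layer computation described in the Tu passage quoted above (N input maps of size H×L, M kernels of size K×K×N, stride S, output maps of size R×C) can be sketched as a loop nest. This sketch is not reproduced from the reference: the function name and the nested-list layout are hypothetical, and R = (H − K)/S + 1 and C = (L − K)/S + 1 are assumed, i.e. no padding.

```python
def conv_layer(I, W, S):
    """Loop-nest sketch of a CONV layer.

    I: input maps, indexed I[n][h][l], with N maps of size H x L.
    W: kernel weights, indexed W[m][n][i][j], with M kernels of size K x K x N.
    S: sliding stride.
    Returns O, indexed O[m][r][c], with M output maps of size R x C.
    """
    N, H, L = len(I), len(I[0]), len(I[0][0])
    M, K = len(W), len(W[0][0])
    # Assumed output size for a convolution without padding.
    R = (H - K) // S + 1
    C = (L - K) // S + 1
    O = [[[0] * C for _ in range(R)] for _ in range(M)]
    for m in range(M):                  # one output map per kernel
        for r in range(R):              # output row
            for c in range(C):          # output column
                for n in range(N):      # input map (kernel depth)
                    for i in range(K):  # kernel row
                        for j in range(K):  # kernel column
                            O[m][r][c] += W[m][n][i][j] * I[n][r * S + i][c * S + j]
    return O


if __name__ == "__main__":
    inputs = [[[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]]
    weights = [[[[1, 1], [1, 1]]]]
    print(conv_layer(inputs, weights, 2))
```

Each innermost multiplication is one of the weight and input-entry products, and the position (r, c) in output map m selects kernel m and its placement on the input maps.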

Regarding Claim 11: VENKATARAMANI et al. (USPUB 20190303743) teaches a hardware system for performing mappings in convolutional neural network layers (Paragraph [0078]- “…Coarse-grained Data Instructions 1004: e.g., compute dominant instructions such as convolutions (nD-convolutions), etc. They may be executed on the PE arrays of a compute intensive tile….” and Paragraphs [0076] and [0080]), comprising: a synchronized, two-dimensional array of locally interconnected processing elements regularly organized on a grid (FIG. 9 and Paragraph [0074]- “…The processing tiles in FIG. 9 are arranged as a multiple dimensional (e.g., 2D) grid, with alternating columns (e.g., or rows) of compute intensive tiles and memory intensive tiles. Chip 900 includes a plurality (e.g., 3) compute intensive tiles per memory intensive tile,…” AND Paragraph [0053]- “…The neurons in a convolutional layer may be arranged as multiple dimensional grids (e.g., two, three, four dimensional, etc.) called features. In one embodiment, the layer takes multiple input features and produces multiple feature outputs….”),
Within analogous art, Yu Hsin Chen (NPL Doc.: “Eyeriss: A Spatial Architecture for Energy Efficient Data flow for Convolutional Neural Networks,” 18th June 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture, ACM SIGARCH Computer Architecture News, Volume 44, Issue 3, June 2016, Pages 367-376) teaches the grid defining three different flow directions of unidirectional dataflows between connected neighboring processing elements on the array (Page 373- Col. 1- “Weight Stationary …PE at each cycle is either passed to its neighbor PE…” AND Page 375- Col. 1 & 2- “D. Dataflow Modeling Side Note … as communication with a neighbor PE…” AND Figures 8 and 9), each processing element comprising: a first logical level comprising: three inputs for receiving partial results of incoming unidirectional dataflows (logical array and dataflow taught within Page 372- Col. 1- “…B. Two-Step Primitive Mapping … Logical Mapping: …Physical Mapping …”), an addition unit adapted for accumulating received partial results of two different unidirectional dataflows (Page 367- Col. 1- “Abstract -minimizing data movement of partial sum accumulations. Unlike dataflows used in existing designs, which only reduce certain types of data movement, the proposed RS dataflow can adapt to different CNN shape configurations and reduces all types of data movement through maximally utilizing the processing engine (PE) local storage…” AND Page 378- Col. 1- “…a novel dataflow, called row stationary (RS), is presented that minimizes energy consumption by maximizing input data reuse (filters and feature maps) and minimizing partial sum accumulation cost simultaneously, and by accounting for the energy cost of different storage levels. Compared with existing dataflows such as the output stationary (OS), weight stationary (WS), and no local reuse (NLR) dataflows…”), thereby providing updated partial results (Page 376- Col. 2- “…The results for other PE array sizes show a similar trend. While the WS and OS dataflows are most energy efficient at weight and psum accesses, respectively, they sacrifice the reuse of other data types: WS is inefficient at ifmap reuse, and the OS dataflows cannot reuse ifmaps and weights as efficiently as RS since they focus on generating psums that are reducible…”),
within claim 11, but does not teach, nor render obvious, the following limitations: “at least three synchronized output registers for temporarily storing partial results during a computation cycle, stored partial results of three output registers corresponding to values of outgoing unidirectional dataflows, and output selection means for selecting, for each output register, a partial result to be stored from one of a received partial result, an updated partial result or a generated partial result, a second logical level comprising a storage element for selectively storing a received weight and selectively storing a data entry in the input map, the storage element being adapted to propagate a stored weight and/or a data entry in the input map to the storage element of a neighboring connected processing element, and a multiplication unit for generating a partial result based on the stored weight and the stored data entry in the input map, a global control logic for generating synchronization signals and control signals for each processing element, global input means for receiving, at most once per neural network layer, data entries in the input map and weights of selected convolution kernels and for applying them to a subset of processing elements at a boundary of the array, the applied weights being stored in the storage element of the processing elements for reuse for as long as new data entries of the input map are applied, received data entries in the input map being reused for a plurality of data entries in the output depending thereon, global output means for reading out, from a subset of processing elements at a boundary of the array, fully accumulated results as data entries in the output map of a convolutional neural network layer.”

4.	The examiner found no suggestion or motivation to combine similar teachings from the prior art made of record to arrive at the limitations discussed above.

5.	Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee.  Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance.”

Conclusion

6.	The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Refer to PTO-892, Notice of References Cited, for a listing of analogous art.
7.	Any inquiry concerning this communication or earlier communications from the examiner should be directed to OMAR S ISMAIL, whose telephone number is (571) 272-9799 and whose fax number is (571) 273-9799.  The examiner can normally be reached M-F, 9:00am-6:00pm.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, David C. Payne, can be reached at (571) 272-3024.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-3024.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/OMAR S ISMAIL/
Primary Examiner, Art Unit 2637