DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.


Response to Amendment
This Office Action is in response to applicant’s communication filed 23 August 2021, in response to the Office Action mailed 7 June 2021.  The applicant’s remarks and any amendments to the claims or specification have been considered, with the results that follow.

The rejection of claims 14-15 under 35 U.S.C. 112 has been withdrawn due to the amendments filed.

The objection to the title has been withdrawn due to the amendment filed.


Information Disclosure Statement
As required by M.P.E.P. 609(c), the applicant's submission of the Information Disclosure Statement, dated 17 August 2021, is acknowledged by the examiner and the cited references have been considered in the examination of the claims now pending. As required by M.P.E.P. 609 C(2), a copy of the PTOL-1449 initialed and dated by the examiner is attached to the instant Office Action.


Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1-3, 6, 9-13, 16, 19, 22, and 23 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Dally (US 2018/0046900).

As per claim 1, Dally teaches a neural network apparatus, the apparatus comprising: a plurality of node buffers connected to a node lane and configured to store input node data by a predetermined bit size [a sparse convolutional neural network (SCNN) accelerator includes an array of processing elements (PEs) in multiple dimensions (fig. 2A, etc.) where each PE in each row and column includes at least one input activations buffer and weight buffer, storing a set size of weights and activations (figs. 2C and 3A, paras. 0063-65, etc.)]; a plurality of weight buffers connected to a weight lane and configured to store weights [a sparse convolutional neural network (SCNN) accelerator includes an array of processing elements (PEs) in multiple dimensions (fig. 2A, etc.) where each PE in each row and column includes at least one input activations buffer and weight buffer, storing a set size of weights and activations (figs. 2C and 3A, paras. 0063-65, etc.)]; and one or more processors [a sparse convolutional neural network (SCNN) accelerator includes an array of processing elements (PEs) in multiple dimensions (fig. 2A, etc.)] configured to: generate first and second split data by splitting the input node data by the predetermined bit size [a tiling strategy is used to partition the weights and inputs into smaller tiles that are distributed from memory to the buffers in the PEs (paras. 0069-73, see also 0045-48 for the memory interface, etc.)], store the first and second split data in the node buffers [a tiling strategy is used to partition the weights and inputs into smaller tiles that are distributed to the buffers in the PEs (paras. 0069-73, etc.)], output the first split data to an operation circuit for a neural network operation on an index-by-index basis [the data is output from the input and weight buffers to a multiplier and then accumulator arrays using index information from position buffers (fig. 3A, paras. 0071-74, etc.)], shift the second split data [outputting the zero-compressed data from weight and input buffers includes shift and mask operations to index the data and account for the compression (paras. 0125-127, etc.)], output the second split data to the operation circuit on the index-by-index basis [outputting the zero-compressed data from weight and input buffers includes shift and mask operations to index the data and account for the compression (paras. 0125-127, etc.)], not split and store either one or both of the first and second split data, in response to the either one or both of the first and second split data comprising all zero values [only the non-zero elements of weights and input activations are provided as operands to the multipliers (para. 0037, etc.)], and perform a replacement on the either one or both of the first and second split data using partial data of next input node data having a same index as an index of the input node data [inputs from the same index may be reused to reduce data accesses (para. 0050, see also paras. 0062, 0126, etc.)].

As per claim 2, Dally teaches wherein the operation circuit is configured to respectively convolute the first split data and the shifted second split data based on and the weights [the PEs are configured to perform convolution operation on the data from the buffers (para. 0050, etc.)].

As per claim 3, Dally teaches wherein the one or more processors further comprise: a multiplexer configured to perform the outputting of the first split data to the operation circuit for the neural network operation on the index-by-index basis [the crossbar for distributing data to the operations units (including accumulators) includes multiplexers (figs. 3C, 3E, etc.)]; and a shifter configured to perform the shifting of the second split data, and perform the outputting of the second split data to the operation circuit on the index-by-index basis [outputting the zero-compressed data from weight and input buffers includes shift and mask operations to index the data and account for the compression, which requires a unit capable of shifting (paras. 0125-127, etc.)].

As per claim 6, Dally teaches wherein: the input node data has an N-bit size, N is a natural number, N/D is an integer; the one or more processors are configured to split the input node data in units of N/D bits; and N/D is greater than 1 [a tiling strategy is used to partition the weights and inputs into smaller tiles that are distributed from memory to the buffers in the PEs (paras. 0069-73, see also 0045-48 for the memory interface, etc.) where each holds a set number of bits (paras. 0063-65, etc.); the claim does not describe how D is determined, and the tiles are divided among an integer number of buffers, each holding a set number of bits].

As per claim 9, Dally teaches wherein the one or more processors are configured to not split and store the input node data, in response to the input node data having a total value of 0 [only the non-zero elements of weights and input activations are provided as operands to the multipliers (para. 0037, etc.)].

As per claim 10, Dally teaches wherein the one or more processors are configured to not fetch a region of the input node data including a zero value from memory, in response to the zero value being included in units of size by which a bit size [the values and associated positions of non-zero values are received from memory (para. 0039, etc.)].

As per claim 11, Dally teaches a neural network apparatus, the apparatus comprising: a plurality of node buffers connected to a node lane and configured to store input node data by a predetermined bit size [a sparse convolutional neural network (SCNN) accelerator includes an array of processing elements (PEs) in multiple dimensions (fig. 2A, etc.) where each PE in each row and column includes at least one input activations buffer and weight buffer, storing a set size of weights and activations (figs. 2C and 3A, paras. 0063-65, etc.)]; a plurality of weight buffers connected to a weight lane and configured to store weights [a sparse convolutional neural network (SCNN) accelerator includes an array of processing elements (PEs) in multiple dimensions (fig. 2A, etc.) where each PE in each row and column includes at least one input activations buffer and weight buffer, storing a set size of weights and activations (figs. 2C and 3A, paras. 0063-65, etc.)]; and one or more processors [a sparse convolutional neural network (SCNN) accelerator includes an array of processing elements (PEs) in multiple dimensions (fig. 2A, etc.)] configured to: generate first and second split data by splitting the input node data by the predetermined bit size [a tiling strategy is used to partition the weights and inputs into smaller tiles that are distributed from memory to the buffers in the PEs (paras. 0069-73, see also 0045-48 for the memory interface, etc.) which store a set size of data (paras. 0063-65, etc.)], store the first and second split data in the node buffers [a tiling strategy is used to partition the weights and inputs into smaller tiles that are distributed from memory to the buffers in the PEs (paras. 0069-73, see also 0045-48 for the memory interface, etc.)], output the weights from the weight buffers [the data is output from the input and weight buffers to a multiplier and then accumulator arrays using index information from position buffers (fig. 3A, paras. 0071-74, etc.)], output the first split data to an operation circuit for a neural network operation on an index-by-index basis [the data is output from the input and weight buffers to a multiplier and then accumulator arrays using index information from position buffers (fig. 3A, paras. 0071-74, etc.)], shift and output the second split data from the node buffers on an index-by-index basis [outputting the zero-compressed data from weight and input buffers includes shift and mask operations to index the data and account for the compression (paras. 0125-127, etc.)]; and an operation circuit configured to convolute the first split data and the shifted second split data output from the node buffers and the weights output from the weight buffers [the PEs are configured to perform convolution operation on the data from the buffers using multiplier and accumulator arrays (para. 0050, figs. 2C-3A, etc.)], not split and store either one or both of the first and second split data, in response to the either one or both of the first and second split data comprising all zero values [only the non-zero elements of weights and input activations are provided as operands to the multipliers (para. 0037, etc.)], and perform a replacement on the either one or both of the first and second split data using partial data of next input node data having a same index as an index of the input node data [inputs from the same index may be reused to reduce data accesses (para. 0050, see also paras. 0062, 0126, etc.)].

As per claim 12, see the rejection of claim 3, above.

As per claim 13, Dally teaches a neural network apparatus, the apparatus comprising: a preprocessing apparatus configured to split input node data and weights by a predetermined size of at least two or more, store the split input node data and the split weights by the predetermined size [a tiling strategy is used to partition the weights and inputs into smaller tiles that are distributed from memory to the buffers in the PEs (paras. 0069-73, see also 0045-48 for the memory interface, etc.) which store a set size of data (paras. 0063-65, etc.)], and output the split input node data and the split weights based on symbol data [the data is output from the input and weight buffers to a multiplier and then accumulator arrays using index information from position buffers (fig. 3A, paras. 0071-74, etc.) based on vector instructions (para. 0006, etc.); where the indices and/or the instruction are symbol data]; an operation circuit configured to generate output data by performing a convolution operation on the split input node data and the split weights, and output the generated output data [the PEs are configured to perform convolution operation on the data from the buffers using multiplier and accumulator arrays (para. 0050, figs. 2C-3A, etc.)]; and a shifter configured to shift the generated output data output from the operation circuit [outputting the zero-compressed data from weight and input buffers includes shift and mask operations to index the data and account for the compression (paras. 0125-127, etc.)], wherein the preprocessing apparatus is configured to not split and store the split input node data, in response to the split input node data comprising all zero values [only the non-zero elements of weights and input activations are provided as operands to the multipliers (para. 0037, etc.)], and wherein the preprocessing apparatus is configured to perform a replacement on the split input node data using partial data of next input node data having a same index as an index of the input node data [inputs from the same index may be reused to reduce data accesses (para. 0050, see also paras. 0062, 0126, etc.)].

As per claim 16, see the rejection of claim 13, above.

As per claim 19, Dally teaches splitting the input node data in units of N/D bits, wherein the input node data has an N-bit size, N is a natural number, N/D is an integer, and D increases until a value of N/D becomes 1 [a tiling strategy is used to partition the weights and inputs into smaller tiles that are distributed from memory to the buffers in the PEs (paras. 0069-73, see also 0045-48 for the memory interface, etc.) which will complete all of the convolutions for each layer, which may include remaining sizes (paras. 0065-66 for completing the calculations, para. 0135 for remaining counts, etc.)].
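
For illustration only (this sketch is not drawn from Dally or from the application), one possible reading of the N/D splitting recited in claims 6 and 19 can be worked through in Python. The function names and the choice of N = 8 are assumptions made for the example, and D is assumed to step through the divisors of N so that N/D remains an integer until it reaches 1.

```python
def split_widths(n_bits):
    """Return (D, N/D) pairs for every D that divides N, in increasing D.

    Assumption for this sketch: D only takes values that keep N/D an integer.
    """
    return [(d, n_bits // d) for d in range(1, n_bits + 1) if n_bits % d == 0]


def split_value(value, n_bits, d):
    """Split an unsigned n_bits value into d regions of n_bits // d bits each.

    The most significant region is returned first.
    """
    if n_bits % d != 0:
        raise ValueError("N/D must be an integer")
    width = n_bits // d
    mask = (1 << width) - 1
    return [(value >> (width * i)) & mask for i in reversed(range(d))]


if __name__ == "__main__":
    # For N = 8 the region width N/D shrinks 8, 4, 2, 1 as D grows 1, 2, 4, 8.
    for d, width in split_widths(8):
        print(d, width, split_value(0xB4, 8, d))
```

Under these assumptions, an 8-bit value splits into two 4-bit regions when D = 2, and into single bits once N/D reaches 1.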

As per claim 22, see the rejection of claim 9, above.

As per claim 23, see the rejection of claim 10, above.
 

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 7, 8, 14, 15, 20, and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Dally (US 2018/0046900).

As per claim 7, Dally teaches wherein the one or more processors are configured to not split and store the input node data except for a least significant bit (LSB) region of N/D bits, in response to N/D being 4 and the input node data having a value of 0 to 15 [only the non-zero elements of weights and input activations are provided as operands to the multipliers (para. 0037, etc.); where when N/D is 4, and the value is less than 16, only the LSBs will be non-zero].
While Dally does not explicitly describe an N/D of 4 associated with the zeroes, it has been held that where the general conditions of a claim are disclosed in the prior art, discovering the optimum or working ranges involves only routine skill in the art. In re Aller, 105 USPQ 233.

As per claim 8, Dally teaches wherein the one or more processors are configured to not split and store a least significant bit (LSB) region of N/D bits of the input node data, in response to the input node data having a value of a multiple of 16 [only the non-zero elements of weights and input activations are provided as operands to the multipliers (para. 0037, etc.); where when the value is a multiple of 16 the LSBs will be zero(es)].
While Dally does not explicitly describe an N/D large enough that with multiples of 16 the zeroes will not be in the LSB, it has been held that where the general conditions of a claim are disclosed in the prior art, discovering the optimum or working ranges (in this case the size of the tiles being partitioned, which then indicates what sizes the zeroes would fall into for certain values/multiples) involves only routine skill in the art. In re Aller, 105 USPQ 233.
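
The arithmetic relied on in the bracketed remarks for claims 7 and 8 can be checked with a short Python sketch. It is offered only as an illustration and is not taken from Dally or the application; the function names and the assumption of an 8-bit unsigned value split into 4-bit regions are hypothetical choices for the example.

```python
def split_regions(value, n_bits=8, region_bits=4):
    """Split an unsigned value into region_bits-wide regions, LSB region first."""
    if n_bits % region_bits != 0:
        raise ValueError("n_bits must be a multiple of region_bits")
    mask = (1 << region_bits) - 1
    return [(value >> shift) & mask for shift in range(0, n_bits, region_bits)]


def nonzero_regions(value, n_bits=8, region_bits=4):
    """Return (region_index, region_value) for each region that is not all zeros.

    Region index 0 is the least significant region. All-zero regions are
    dropped, mirroring the idea of skipping zero operands.
    """
    regions = split_regions(value, n_bits, region_bits)
    return [(i, r) for i, r in enumerate(regions) if r != 0]


if __name__ == "__main__":
    print(nonzero_regions(9))   # value in 0..15: only the LSB region survives
    print(nonzero_regions(48))  # multiple of 16: the LSB region is all zeros
```

In this sketch a value from 1 to 15 leaves only the least significant 4-bit region non-zero, while a non-zero multiple of 16 leaves that region all zeros.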

As per claim 14, Dally teaches wherein the preprocessing apparatus is configured to: split the input node data into at least first and second input node data, and split the weights into at least a first and second weight [a tiling strategy is used to partition the weights and inputs into smaller tiles that are distributed from memory to the buffers in the PEs (paras. 0069-73, see also 0045-48 for the memory interface, etc.)]; output the first input node data and the first weight based on the symbol data in a first cycle operation [the data is output from the input and weight buffers to a multiplier and then accumulator arrays using index information from position buffers (fig. 3A, paras. 0071-74, etc.) based on vector instructions (para. 0006, etc.) where different partial results may be created over a series of clock cycles before being combined (paras. 0050-51, etc.)]; output the second input node data and the first weight based on the symbol data in a second cycle operation [the data is output from the input and weight buffers to a multiplier and then accumulator arrays using index information from position buffers (fig. 3A, paras. 0071-74, etc.) based on vector instructions (para. 0006, etc.) where different partial results may be created over a series of clock cycles before being combined (paras. 0050-51, etc.)]; output the first input node data and the second weight based on the symbol data in a third cycle operation [the data is output from the input and weight buffers to a multiplier and then accumulator arrays using index information from position buffers (fig. 3A, paras. 0071-74, etc.) based on vector instructions (para. 0006, etc.) where different partial results may be created over a series of clock cycles before being combined (paras. 0050-51, etc.)]; output the second input node data and the second weight based on the symbol data in a fourth cycle operation [the data is output from the input and weight buffers to a multiplier and then accumulator arrays using index information from position buffers (fig. 3A, paras. 0071-74, etc.) based on vector instructions (para. 0006, etc.) where different partial results may be created over a series of clock cycles before being combined (paras. 0050-51, etc.)]; and wherein the shifter is configured to: shift the first input node data and the first weight output in the first cycle operation by twice the predetermined size [outputting the zero-compressed data from weight and input buffers includes shift and mask operations to index the data and account for the compression (paras. 0125-127, etc.) where different partial results may be created over a series of clock cycles before being combined (paras. 0050-51, etc.)]; shift the second input node data and the first weight output in the second cycle operation by the predetermined size [outputting the zero-compressed data from weight and input buffers includes shift and mask operations to index the data and account for the compression (paras. 0125-127, etc.) where different partial results may be created over a series of clock cycles before being combined (paras. 0050-51, etc.)]; and shift the first input node data and the second weight output in the third cycle operation by the predetermined size [outputting the zero-compressed data from weight and input buffers includes shift and mask operations to index the data and account for the compression (paras. 0125-127, etc.) where different partial results may be created over a series of clock cycles before being combined (paras. 0050-51, etc.)].
While Dally does not explicitly describe the shift amounts used, it has been held that where the general conditions of a claim are disclosed in the prior art, discovering the optimum or working ranges (in this case the size of shifts required to input all of the necessary data) involves only routine skill in the art. In re Aller, 105 USPQ 233.
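
As a hedged illustration of why the four cycle operations of claim 14 pair with shifts of twice the split size, the split size, the split size, and zero, the following Python sketch multiplies two values from their split halves. It is not taken from Dally or the application. The function names are hypothetical, the "first" data is assumed to be the most significant half (as claim 15 suggests), and the shift unit is assumed to equal the width of one half.

```python
def split_halves(x, n_bits=8):
    """Split an unsigned n_bits value into (MSB half, LSB half)."""
    half = n_bits // 2
    mask = (1 << half) - 1
    return (x >> half) & mask, x & mask


def four_cycle_partials(a, w, n_bits=8):
    """Return the four shifted partial products, one per cycle operation."""
    half = n_bits // 2  # assumed shift unit: width of one split half
    a_first, a_second = split_halves(a, n_bits)
    w_first, w_second = split_halves(w, n_bits)
    return [
        (a_first * w_first) << (2 * half),  # cycle 1: shift by twice the size
        (a_second * w_first) << half,       # cycle 2: shift by the size
        (a_first * w_second) << half,       # cycle 3: shift by the size
        a_second * w_second,                # cycle 4: no shift
    ]


def four_cycle_product(a, w, n_bits=8):
    """Sum the shifted partial products to recover the full product a * w."""
    return sum(four_cycle_partials(a, w, n_bits))


if __name__ == "__main__":
    print(four_cycle_partials(0xB4, 0x2C))
    print(four_cycle_product(0xB4, 0x2C), 0xB4 * 0x2C)
```

Under these assumptions the sum of the four shifted partial products equals the ordinary product for every pair of 8-bit inputs, which is the sense in which the shift amounts follow from the split size.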

As per claim 15, Dally teaches wherein, the predetermined size of input node data is N bits (N is a natural number greater than or equal to 2), the first input node data and the first weight are most significant bit (MSB) N/2 bits and the second input node [a tiling strategy is used to partition the weights and inputs into smaller tiles that are distributed from memory to the buffers in the PEs (paras. 0069-73, see also 0045-48 for the memory interface, etc.) which store a set size of data (paras. 0063-65, etc.)].
While Dally does not explicitly describe a D of 2 (for N/2 bits) associated with the zeroes, it has been held that where the general conditions of a claim are disclosed in the prior art, discovering the optimum or working ranges (in this case the size of the tiles being partitioned) involves only routine skill in the art. In re Aller, 105 USPQ 233.

As per claim 20, see the rejection of claim 7, above.

As per claim 21, see the rejection of claim 8, above.


Response to Arguments
Applicant's arguments filed 23 August 2021 have been fully considered but they are not persuasive.

Applicant argues that Dally does not teach a plurality of node buffers connected to a node lane and configured to store input node data by a predetermined bit size.
However, Dally teaches a sparse convolutional neural network (SCNN) accelerator that includes an array of processing elements (PEs) in multiple dimensions (fig. 2A, etc.) where each PE in each row and column includes at least one input activations buffer and weight buffer, storing a set size of weights and activations (figs. 2C and 3A, paras. 0063-65, etc.).

Applicant also argues that Dally does not teach one or more processors configured to generate first and second split data by splitting the input node data by the predetermined bit size.
However, Dally teaches that a tiling strategy is used to partition the weights and inputs into smaller tiles that are distributed from memory to the buffers in the PEs (paras. 0069-73, see also 0045-48 for the memory interface, etc.) which store a set size of data (paras. 0063-65, etc.).

Applicant further argues that Dally does not teach not split and store either one or both of the first and second split data, in response to the either one or both of the first and second split data comprising all zero values, and perform a replacement on the either one or both of the first and second split data using partial data of next input node data having a same index as an index of the input node data.
However, Dally teaches that only the non-zero elements of weights and input activations are provided as operands to the multipliers (para. 0037, etc.), and inputs from the same index may be reused to reduce data accesses (para. 0050, see also paras. 0062, 0126, etc.).

In response to applicant's argument that the references fail to show certain features of applicant’s invention, it is noted that the features upon which applicant relies In re Van Geuns, 988 F.2d 1181, 26 USPQ2d 1057 (Fed. Cir. 1993).  In this case the claim does recite that it does “not split and store either one or both of the first and second split data, in response to the either one or both of the first and second split data comprising all zero values”.

Applicant also argues that Dally does not teach store the first and second split data in the node buffers.
However, Dally teaches that a tiling strategy is used to partition (split) the weights and inputs into smaller tiles that are distributed from memory to the buffers in the PEs (paras. 0069-73, see also 0045-48 for the memory interface, etc.) where each PE in each row and column includes at least one input activations buffer and weight buffer, storing a set size of weights and activations (figs. 2C and 3A, paras. 0063-65, etc.); which is within the broadest reasonable interpretation of the generated first and second split data by splitting the input node data by a predetermined bit size, and storing the first and second split data in the node buffers.


Conclusion
The following is a summary of the treatment and status of all claims in the application as recommended by M.P.E.P. 707.07(i): claims 4, 5, 17, and 18 are cancelled; claims 1-3, 6-16, and 19-23 are rejected.

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Kim et al. (ZeNA: Zero-Aware Neural Network Accelerator, Aug 2017, pgs. 39-46) – discloses zero-skipping for a CNN.
Dally (US 2018/0046906) – discloses an SCNN accelerator similar to Dally, above.
Yan (US 2018/0218518) – discloses a CNN accelerator including compressing zeroes in memory.
Gibson (US 2017/0323197) – discloses a CNN accelerator with multiple buffers and lanes.

The examiner requests, in response to this Office action, that support be shown for language added to any original claims on amendment and any new claims. That is, indicate support for newly added claim language by specifically pointing to page(s) and line number(s) in the specification and/or drawing figure(s). This will assist the examiner in prosecuting the application.

When responding to this office action, Applicant is advised to clearly point out the patentable novelty which he or she thinks the claims present, in view of the state of the art disclosed by the references cited or the objections made. He or she must also show how the amendments avoid such references or objections.  See 37 CFR 1.111(c).

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to GEORGE GIROUX whose telephone number is (571)272-9769. The examiner can normally be reached M-F 10am-6pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Omar Fernandez Rivas, can be reached on 571-272-2589. The fax phone 
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/GEORGE GIROUX/Primary Examiner, Art Unit 2128