DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.


Response to Amendment
This Office Action is in response to applicant’s communication filed 27 October 2021, in response to the Office Action mailed 27 May 2021.  The applicant’s remarks and any amendments to the claims or specification have been considered, with the results that follow.


Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 6 and 8 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.

As per claim 6, the scope of the claim is not clear because it is not clear what series of operations are being described in response to the number of inputs.  It is not clear whether the cooperative mode is the response, or the control core being in control is the response, or something else.  The examiner assumes, for the purposes of examination, that the control core controls the operation of the units in response to a number of inputs. 

As per claim 8, the scope of the claim is not clear because it is not clear what series of operations are being described in response to the number of inputs.  It is not clear whether the independent mode is the response, or the control core being in control is the response, or something else.  The examiner assumes, for the purposes of examination, that the control core controls the operation of the units in response to a number of inputs. 


Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1 and 4-20 are rejected under 35 U.S.C. 103 as being unpatentable over Mody (US 2018/0197067 – provisional app. No. 62/445493 cited) in view of Ambrose (US 2017/0344882).

As per claim 1, Mody teaches an accelerator for processing of a convolutional neural network (CNN) comprising: a compute core comprising: a compute unit [a multi-mode engine (MME – the compute unit) including a vector multiply unit (VMU) including a plurality of multiply/accumulate (MAC) nodes (para. 0004, fig. 3, etc.)] each compute unit comprising: a first memory cache configured to store at least one vector in a map trace [the VMU is connected to a data feeder storing vectors of input data from an L2 cache which are provided to the VMU to process the set of inputs for a layer (paras. 0004, 0022-27; figs. 3 and 16; etc.)], the map trace corresponding to a contiguous set of input data in an input map for one layer of the CNN [the VMU is connected to a data feeder storing vectors of input data from an L2 cache which are provided to the VMU to process the set of inputs for a layer (paras. 0004, 0022-27; figs. 3 and 16; etc.)]; a second memory cache configured to store at least one vector in a kernel trace [the VMU is connected to a weight feeder storing vectors of weight data from an L2 cache to L1 weight caches, to output weights to the VMU to process the set of inputs for a layer (paras. 0004, 0022-27; figs. 3 and 15; etc.)], the kernel trace corresponding to a contiguous set of weights in a kernel of a convolutional layer or a fully-connected layer of the CNN [the VMU is connected to a weight feeder storing vectors of weight data from an L2 cache to L1 weight caches, to output weights to the VMU to process the set of inputs for a layer (paras. 0004, 0022-27; figs. 3 and 15; etc.)]; a vector multiply-accumulate unit (vMACs) connected to the first memory cache and the second memory cache [a multi-mode engine (MME – the compute core) including a vector multiply unit (VMU – the vMAC) including a plurality of multiply/accumulate (MAC) nodes (para. 0004, fig. 3, etc.)], each vMAC comprising: a plurality of multiply-accumulate units (MACs) [a multi-mode engine (MME – the compute core) including a vector multiply unit (VMU – the vMAC) including a plurality of multiply/accumulate (MAC) nodes (para. 0004, fig. 3, etc.)], each MAC including a multiplier unit configured to multiply a first word that forms a portion of the at least one vector in the map trace in the first memory cache by a second word that forms a portion of the at least one vector in the kernel trace in the second memory cache to produce an intermediate product [the MAC nodes each include multiplier and accumulator units to operate on bytes of weight and input data and output the result (paras. 0024-27, fig. 6, etc.)], and an adder unit that adds the intermediate product to a third word to generate a sum of the intermediate product and the third word as an output [the MAC nodes each include multiplier and accumulator units to operate on bytes of weight and input data and output the result (paras. 0024-27, fig. 6, etc.)].
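For illustration only, the multiply-and-add behavior recited for each MAC in claim 1 can be sketched in Python as follows.  The function names, the list-based vectors, and the use of unbounded Python integers are assumptions made for this sketch; they are not drawn from the claims, from Mody, or from Ambrose, and the sketch does not model word widths or hardware timing.

```python
from typing import List


def mac(first_word: int, second_word: int, third_word: int) -> int:
    """One MAC (hypothetical model): multiply a map-trace word by a
    kernel-trace word, then add a third word to the product."""
    intermediate_product = first_word * second_word  # multiplier unit
    return intermediate_product + third_word         # adder unit


def vmac(map_vector: List[int],
         kernel_vector: List[int],
         third_words: List[int]) -> List[int]:
    """A vMAC modeled as one MAC per word position (an assumption)."""
    if not (len(map_vector) == len(kernel_vector) == len(third_words)):
        raise ValueError("each MAC needs exactly one word from each input")
    return [mac(m, k, t)
            for m, k, t in zip(map_vector, kernel_vector, third_words)]
```

Under these assumptions, vmac([1, 2], [3, 4], [10, 20]) yields one sum per MAC, each being the intermediate product plus the corresponding third word.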
While Mody teaches an array of MAC units (see above) it does not explicitly teach a plurality of compute units with a plurality of vMAC units.
Ambrose teaches a plurality of compute units with a plurality of vMAC units [a system on chip (SoC) with multiple accelerators for executing a Convolutional Neural Network, which can include multiple groups of processing units (paras. 0021, 0055-60, fig. 5, etc.)].
Mody and Ambrose are analogous art, as they are within the same field of endeavor, namely accelerators for CNNs.
It would have been obvious to one of ordinary skill in the art to include multiple processing units in the system for processing the CNN layers, as taught by Ambrose, for multiple copies of the compute unit MME taught by Mody.
Ambrose provides motivation as [having multiple processing units allows better exploitation of available parallelism in CNN algorithms (para. 0005, etc.)].  Furthermore, it has been held that the mere duplication of the essential working parts of a device involves only routine skill in the art.  St. Regis Paper Co. v. Bemis Co., 193 USPQ 8.

As per claim 4, Mody/Ambrose teaches a memory interface [the MME is connected via an L2 cache to main memory via a master port (Mody: para. 0021, fig. 3, etc.) using an external memory interface to connect to external memory (Ambrose: fig. 5, etc.)]; and a compute cluster connected to the memory interface, the compute cluster comprising: the compute core [the multi-accelerator SoC is connected to the external memory interface (Ambrose: fig. 5, etc.); where the MME is connected via an L2 cache to main memory via a master port (Mody: para. 0021, fig. 3, etc.)]; a data distribution network [the controller is connected for data distribution over busses in the system (Ambrose: fig. 5, Mody: fig. 3, etc.)]; and a control core operatively connected to the data distribution network, and the plurality of [the controller is connected for data distribution over busses in the system (Mody: fig. 3, Ambrose: fig. 5, etc.)].

As per claim 5, Mody/Ambrose teaches the control core being further configured to: control the operation of the plurality of compute units in a cooperative mode in which each MAC in the plurality of MACs in at least one of the plurality of vMACs in each compute unit receives the first word from a different portion of the at least one vector in the map trace [the MAC nodes each include multiplier and accumulator units to operate on selected bytes of weight and input data and output the result (Mody: paras. 0024-27, fig. 6, etc.) where, based upon the number of inputs and the available number of processing units, a scheduling scheme may be selected to determine scheduling of the input and weight data for processing (Ambrose: paras. 0179-197, etc.)].

As per claim 6, Mody/Ambrose teaches wherein, in response to a number of input maps to the convolution layer of the CNN being a multiple of 16, the control core controls the operation of the plurality of compute units in the cooperative mode [the MAC nodes each include multiplier and accumulator units to operate on selected bytes of weight and input data and output the result (Mody: paras. 0024-27, fig. 6, etc.) where, based upon the number of inputs and the available number of processing units, a scheduling scheme may be selected to determine scheduling of the input and weight data for processing (Ambrose: paras. 0179-197, etc.)].
Mody teaches that the vectors are 16 bytes and the units handle each byte (see, e.g., Mody: para. 0022) but does not explicitly teach the determination of a multiple of 16.  However, it has been held that where the general conditions of the claim are disclosed in the prior art, discovering the optimum or working value/ranges involves only routine skill in the art.  In re Aller, 105 USPQ 233 and In re Boesch, 617 F.2d 272, 205 USPQ 215 (CCPA 1980).

As per claim 7, Mody/Ambrose teaches the control core being further configured to: control the operation of the plurality of compute units in an independent mode in which each MAC in the plurality of MACs in at least one of the plurality of vMACs in each compute unit receives the first word from a single portion of the at least one vector in the map trace [the MAC nodes each include multiplier and accumulator units to operate on selected bytes of weight and input data and output the result (Mody: paras. 0024-27, fig. 6, etc.) where, based upon the number of inputs and the available number of processing units, a scheduling scheme may be selected to determine scheduling of the input and weight data for processing (Ambrose: paras. 0179-197, etc.)].

As per claim 8, Mody/Ambrose teaches wherein, in response to a number of input maps to the convolutional layer of the CNN not being a multiple of 16, the control core controls the operation of the plurality of compute units in the independent mode [the MAC nodes each include multiplier and accumulator units to operate on selected bytes of weight and input data and output the result (Mody: paras. 0024-27, fig. 6, etc.) where, based upon the number of inputs and the available number of processing units, a scheduling scheme may be selected to determine scheduling of the input and weight data for processing (Ambrose: paras. 0179-197, etc.)].
Mody teaches that the vectors are 16 bytes and the units handle each byte (see, e.g., Mody: para. 0022) but does not explicitly teach the determination of a multiple of 16.  However, it has been held that where the general conditions of the claim are disclosed in the prior art, discovering the optimum or working value/ranges involves only routine skill in the art.  In re Aller, 105 USPQ 233 and In re Boesch, 617 F.2d 272, 205 USPQ 215 (CCPA 1980).
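For illustration only, one possible reading of the mode selection recited in claims 5-8 can be sketched in Python as follows.  This is not a characterization of the application, Mody, or Ambrose.  The names select_mode and words_for_macs, the constant LANES, and the interpretation of "a different portion" as one distinct word per MAC and "a single portion" as the same word for every MAC are all assumptions made for this sketch.

```python
from typing import List

LANES = 16  # assumed number of MACs per vMAC; the claims recite only "a multiple of 16"


def select_mode(num_input_maps: int) -> str:
    """Pick a mode from the number of input maps (hypothetical helper)."""
    return "cooperative" if num_input_maps % LANES == 0 else "independent"


def words_for_macs(map_vector: List[int], mode: str, portion: int = 0) -> List[int]:
    """Return the first word delivered to each MAC under the assumed reading.

    cooperative: MAC i receives word i, so each MAC sees a different portion.
    independent: every MAC receives the word at index `portion`.
    """
    if len(map_vector) != LANES:
        raise ValueError("map_vector must hold exactly LANES words")
    if mode == "cooperative":
        return list(map_vector)
    if mode == "independent":
        return [map_vector[portion]] * LANES
    raise ValueError(f"unknown mode: {mode}")
```

Under these assumptions, 32 input maps would select the cooperative mode and 3 input maps would select the independent mode.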

As per claim 9, Mody/Ambrose teaches wherein the first memory cache in each compute unit is implemented as a first scratchpad memory device in the compute core and the second memory cache in each compute unit is implemented as a second scratchpad memory device [the VMU is connected to an L2 cache storing weight and input data, which is passed to weight and data feeders which temporarily store the weight/input data and then passed to the VMU (Mody: paras. 0004, 0022-27; figs. 3 and 15; etc.)].

As per claim 10, Mody/Ambrose teaches wherein each compute unit in the plurality of compute units comprises four vMACs and each vMAC further comprises [the vectors are 16 bytes and the units handle each byte (see, e.g., Mody: para. 0022)].  
While Mody/Ambrose does not explicitly teach the number of necessary vMACs and MACs, it has been held that where the general conditions of the claim are disclosed in the prior art, discovering the optimum or working value/ranges involves only routine skill in the art.  In re Aller, 105 USPQ 233 and In re Boesch, 617 F.2d 272, 205 USPQ 215 (CCPA 1980).

As per claim 11, Mody/Ambrose teaches wherein a first vMAC in the plurality of vMACs reads 256-bits of data as the at least one vector in the map trace from the first memory cache and each MAC in the plurality of MACs in the first vMAC receives a 16-bit first word from the 256-bits of data [the vectors may be stored in a 4x256 register for 256 bit inputs (Mody: para. 0029, fig. 3, etc.)]. 
While Mody/Ambrose does not explicitly teach the size of necessary vectors/portions used, it has been held that where the general conditions of the claim are disclosed in the prior art, discovering the optimum or working value/ranges involves only routine skill in the art.  In re Aller, 105 USPQ 233 and In re Boesch, 617 F.2d 272, 205 USPQ 215 (CCPA 1980).
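For illustration only, the arithmetic behind the sizes recited in claim 11 (256 bits of data divided into 16-bit words, giving 16 words) can be sketched in Python as follows.  The function name split_words and the least-significant-word-first ordering are assumptions made for this sketch and are not drawn from the claims or the cited references.

```python
from typing import List


def split_words(vector: int, word_bits: int = 16, total_bits: int = 256) -> List[int]:
    """Split a total_bits-wide value into words of word_bits each.

    Word 0 is taken from the least significant bits (an assumed ordering).
    """
    if not 0 <= vector < (1 << total_bits):
        raise ValueError("vector does not fit in total_bits")
    mask = (1 << word_bits) - 1
    return [(vector >> (i * word_bits)) & mask
            for i in range(total_bits // word_bits)]
```

With the default sizes, the returned list always holds 256 / 16 = 16 words, one for each of 16 MACs.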

As per claim 12, Mody/Ambrose teaches wherein the first memory cache stores 256 kB of map trace data and the second memory cache stores 32 kB of kernel trace data [the L1 memory at least is 4KB (Mody: fig. 15, etc.)].
While Mody/Ambrose does not explicitly teach the size of all the memories used, it has been held that where the general conditions of the claim are disclosed in the prior art, discovering the optimum or working value/ranges involves only routine skill in the art.  In re Aller, 105 USPQ 233 and In re Boesch, 617 F.2d 272, 205 USPQ 215 (CCPA 1980).

As per claim 13, Mody/Ambrose teaches a single memory cache that is the first memory cache in each compute unit in the plurality of compute units in the compute core, the single memory cache being shared by the plurality of compute units [the weights may come from a shared L2 cache to be processed by the processing units (Mody: fig. 3, etc.)].

As per claim 14, Mody/Ambrose teaches wherein the compute core further comprises four compute units [the system includes at least 4 compute units (Ambrose: fig. 5, etc.)].

As per claim 15, Mody/Ambrose teaches wherein the single memory cache stores 512 kB of the map trace data [the L1 memory at least is 4KB (Mody: fig. 15, etc.)].
While Mody/Ambrose does not explicitly teach the size of all the memories used, it has been held that where the general conditions of the claim are disclosed in the prior art, discovering the optimum or working value/ranges involves only routine skill in the art.  In re Aller, 105 USPQ 233 and In re Boesch, 617 F.2d 272, 205 USPQ 215 (CCPA 1980).

As per claim 16, see the rejection of claim 1, above, wherein Mody/Ambrose also teaches loading, with a control core in the accelerator, the map trace and kernel trace data [the VMU is connected to an L2 cache storing weight and input data, which is passed to weight and data feeders which temporarily store the weight/input data and then passed to the VMU (Mody: paras. 0004, 0022-27; figs. 3 and 15; etc.) and the controller is connected for data distribution over busses in the system (Mody: fig. 3, Ambrose: fig. 5, etc.)].

As per claim 17, see the rejection of claim 5, above.

As per claim 18, see the rejection of claim 7, above.

As per claim 19, Mody/Ambrose teaches operating, with the control core, the plurality of MACs in the cooperative mode in response to a number of input maps to the convolutional layer of the CNN being a multiple of 16; and operating, with the control core, the plurality of MACs in the independent mode in response to the number of input maps to the convolutional layer of the CNN not being a multiple of 16 [the MAC nodes each include multiplier and accumulator units to operate on selected bytes of weight and input data and output the result (Mody: paras. 0024-27, fig. 6, etc.) where, based upon the number of inputs and the available number of processing units, a scheduling scheme may be selected to determine scheduling of the input and weight data for processing (Ambrose: paras. 0179-197, etc.)].
Mody teaches that the vectors are 16 bytes and the units handle each byte (see, e.g., Mody: para. 0022) but does not explicitly teach the determination of a multiple of 16.  However, it has been held that where the general conditions of the claim are disclosed in the prior art, discovering the optimum or working value/ranges involves only routine skill in the art.  In re Aller, 105 USPQ 233 and In re Boesch, 617 F.2d 272, 205 USPQ 215 (CCPA 1980).

As per claim 20, see the rejection of claim 3, above.


Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over Mody and Ambrose as applied to claim 1 above, and further in view of Bittner (US 2018/0157465).

As per claim 2, Mody/Ambrose teach the accelerator of claim 1, as described above.
While Mody/Ambrose teaches shifting the results (see, e.g., Mody: para. 0025, etc.) it does not teach each vMAC further comprising: a shift register connected to outputs of the plurality of MACs, the shift register being configured to generate a series of outputs, each output in the series of outputs corresponding to an output of one MAC in the plurality of MACs; and a gather adder configured to generate one of: a single sum of the series of outputs of the shift register and a bias value as an output; or a plurality of sums, each sum in the plurality of sums corresponding to a sum of an output of one MAC in the plurality of MACs received from the shift register and another bias value.
Bittner teaches each vMAC further comprising: a shift register connected to outputs of the plurality of MACs [one or more shifters may be used on the MAC outputs (paras. 0040-45, fig. 1, etc.)], the shift register being configured to generate a [one or more shifters may be used on the MAC outputs (paras. 0040-45, fig. 1, etc.)]; and a gather adder configured to generate one of: a single sum of the series of outputs of the shift register and a bias value as an output; or a plurality of sums, each sum in the plurality of sums corresponding to a sum of an output of one MAC in the plurality of MACs received from the shift register and [the shifted data may be added to bias values to produce an output (paras. 0040-45, fig. 1, etc.)].
Mody/Ambrose and Bittner are analogous art, as they are within the same field of endeavor, namely neural network accelerators.
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to include the shifter(s) and bias values taught by Bittner, for the shifting output in the CNN in the system taught by Mody/Ambrose.
Bittner provides motivation as [the MAC outputs must be appropriately shifted, while typically bias values are added in NN calculations (paras. 0002, 0042-44, etc.)].
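For illustration only, the two alternative outputs recited for the gather adder in claim 2 (a single sum of the series of outputs plus a bias value, or a plurality of sums each adding a bias value to one MAC output) can be sketched in Python as follows.  The names shift_out, gather_single, and gather_per_mac, and the modeling of the shift register as a generator that emits one MAC output at a time, are assumptions made for this sketch and are not drawn from the claims or from Bittner.

```python
from typing import Iterable, Iterator, List


def shift_out(mac_outputs: List[int]) -> Iterator[int]:
    """Model a shift register as emitting one MAC output per step (assumed)."""
    for value in mac_outputs:
        yield value


def gather_single(series: Iterable[int], bias: int) -> int:
    """First alternative: one sum of the whole series plus one bias value."""
    return sum(series) + bias


def gather_per_mac(series: Iterable[int], biases: List[int]) -> List[int]:
    """Second alternative: one sum per MAC output, each with its own bias."""
    outputs = list(series)
    if len(outputs) != len(biases):
        raise ValueError("need exactly one bias value per MAC output")
    return [o + b for o, b in zip(outputs, biases)]
```

Under these assumptions, MAC outputs of 1, 2, and 3 with a bias of 10 give a single sum of 16, while per-output biases of 10, 20, and 30 give the sums 11, 22, and 33.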


Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Mody and Ambrose as applied to claim 1 above, and further in view of Henry (US 2018/0157961).

As per claim 3, Mody/Ambrose teaches the accelerator of claim 1, as described above.
While Mody/Ambrose also teaches utilizing a max pooling layer (see, e.g., Mody: para. 0003) it does not explicitly teach each compute unit in the plurality of compute units further comprising: a vector maxpool unit (vMAX) connected to the first memory cache of the compute unit, the vMAX comprising a plurality of comparators that implement a max pooling layer of the CNN based on outputs from the plurality of vMACs that are stored in the first memory cache of the compute unit.
Henry teaches each compute unit in the plurality of compute units further comprising: a vector maxpool unit (vMAX) connected to the first memory cache of the compute unit, the vMAX comprising a plurality of comparators that implement a max pooling layer of the CNN based on outputs from the plurality of vMACs that are stored in the first memory cache of the compute unit [an ALU including a combination of comparators and muxes may be used to select a max value for the pooling layer (para. 0184, etc.)].
Mody/Ambrose and Henry are analogous art, as they are within the same field of endeavor, namely convolutional neural network accelerators.
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to utilize the maxpool hardware for the pooling layer, taught by Henry, for performing the max pooling layer taught by Mody.
Henry and Mody provide motivation as [the max operation is often necessary in pooling layers in neural network applications (Henry: para. 0184, etc.); which may be used to normalize results across various classes of interest (Mody: para. 0003, etc.)].


Response to Arguments
Applicant’s arguments, see the remarks, filed 27 October 2021, with respect to the rejection of claim 20 under 35 U.S.C. 112 have been fully considered and are persuasive.

Applicant's further arguments, filed 27 October 2021, have been fully considered but they are not persuasive.

Regarding claim 6 (and similarly for claim 8), applicant argues that “the control core controls the operation of the plurality of compute units in the cooperative mode” is the response to the claimed condition.
However, it is not clear what this response entails.  The claim does not recite that the cooperative mode is entered/enabled/etc. in response to the condition, and it is not clear what “the control core controls the operation” entails as a response.

In response to applicant's argument that Mody only teaches a single vMAC in a compute unit, and Ambrose does not teach multiple vMACs, the test for obviousness is not whether the features of a secondary reference may be bodily incorporated into the structure of the primary reference; nor is it that the claimed invention must be expressly suggested in any one or all of the references.  Rather, the test is what the combined teachings of the references would have suggested to those of ordinary skill in the art.  See In re Keller, 642 F.2d 413, 208 USPQ 871 (CCPA 1981).  Therefore, the combination teaches including multiple processing units in the system for processing the CNN layers, as taught by Ambrose, for multiple copies of the compute unit MME taught by Mody; which provides multiple compute units with multiple vMACs, each of 

Applicant further argues that the cited art does not teach each MAC in the plurality of MACs in at least one of the plurality of vMACs in each compute unit receives a first word from a different portion of the at least one vector.
However, Mody teaches the MAC nodes each include multiplier and accumulator units to operate on selected bytes of weight and input data and output the result (Mody: paras. 0024-27, fig. 6, etc.), while Ambrose teaches where, based upon the number of inputs and the available number of processing units, a scheduling scheme may be selected to determine scheduling of the input and weight data for processing (Ambrose: paras. 0179-197, etc.); therefore selecting which words from which portions of the vectors are input to the MACs.

Applicant further argues that the cited art does not teach each MAC in the plurality of MACs in at least one of the plurality of vMACs in each compute unit receives the first word from a single portion of the at least one vector in the map trace.
However, Mody teaches the MAC nodes each include multiplier and accumulator units to operate on selected bytes of weight and input data and output the result (Mody: paras. 0024-27, fig. 6, etc.) while Ambrose teaches where, based upon the number of inputs and the available number of processing units, a scheduling scheme may be selected to determine scheduling of the input and weight data for processing (Ambrose: 


Conclusion
The following is a summary of the treatment and status of all claims in the application as recommended by M.P.E.P. 707.07(i): claims 1-20 are rejected.

The examiner requests, in response to this Office action, that support be shown for language added to any original claims on amendment and any new claims. That is, indicate support for newly added claim language by specifically pointing to page(s) and line number(s) in the specification and/or drawing figure(s). This will assist the examiner in prosecuting the application.

When responding to this office action, Applicant is advised to clearly point out the patentable novelty which he or she thinks the claims present, in view of the state of the art disclosed by the references cited or the objections made. He or she must also show how the amendments avoid such references or objections.  See 37 CFR 1.111(c).

THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to GEORGE GIROUX whose telephone number is (571)272-9769. The examiner can normally be reached M-F 10am-6pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Omar Fernandez Rivas, can be reached at 571-272-2589. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and 





/GEORGE GIROUX/Primary Examiner, Art Unit 2128