Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

 (a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1-13 and 15-17 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Phelps (US 2020/0226202).
Regarding claim 1, Phelps discloses a configurable stacked architecture for a fixed function datapath (Fig. 1B) for use with an accelerator to accelerate an operation of a deep neural network (DNN) (Paragraph 40, special-purpose hardware chip for training a neural network using an accelerator), comprising: a plurality of configurable micro-scalar processing units (SPUs) that perform at least one scalar operation on vector values from a received vector (Paragraphs 44-45, several compute units, each compute core (101) contains a scalar processing unit (107); example scalar processor performs VLIW instruction fetch/execute loop and controls the compute core. After fetching and decoding an instruction bundle, the scalar processor itself only executes the instructions found in the scalar slots. The scalar instruction set includes normal arithmetic operations, e.g., as used in address calculations, load/store instructions, and branch instructions. The remaining instruction slots encode instructions for the vector processing unit or other extended vector units (113, 114, 116). The decoded vector instructions are forwarded to the vector processing unit); and a plurality of configurable micro-multi-functional units (MFUs) that perform vector operations on the vector values (Paragraphs 44, 56, several compute units, each compute core (101) contains extended vector units (113); each processor has three extended vector units: a matrix multiply unit (113) which performs matrix multiplication operations; a cross-lane unit (XLU) that includes a transpose unit (XU) (114) which performs a transposition operation of a matrix, i.e., 128 by 128 matrix, and a reduction and permutation unit, illustrated as separate units in FIG. 1C, reduction unit 115 and permutation unit 116), wherein the plurality of configurable micro-SPUs and the plurality of configurable micro-MFUs are placed in an order to perform the operation of the DNN where an output of one micro-SPU of the plurality of configurable micro-SPUs is provided as an input to one micro-MFU of the plurality of configurable micro-MFUs (Figure 1B, Paragraph 50, special-purpose integrated circuit 100 for performing neural network computations. Scalar processing unit 107 provides an output as an input to vector processing unit 106).
Regarding claim 2, Phelps discloses the stacked architecture of claim 1, wherein each micro-SPU of the plurality of configurable micro-SPUs further performs a reduction operation on the vector values, the at least one scalar operation on the vector values, and a broadcast operation to broadcast the vector values to a vector (Paragraph 45, the scalar instruction set includes normal arithmetic operations, e.g., as used in address calculations, load/store instructions, and branch instructions. The remaining instruction slots encode instructions for the vector processing unit or other extended vector units (113, 114, 116). The decoded vector instructions are forwarded to the vector processing unit).
Regarding claim 3, Phelps discloses the stacked architecture of claim 1, wherein the operation is a softmax operation or a layer normalization operation (Paragraph 58, the vector processing unit 106 generates normalized values. The vector of processed outputs can be used as left-hand side data inputs to the matrix multiply unit 113, e.g., for use in a subsequent layer in the neural network).
Regarding claim 4, Phelps discloses the stacked architecture of claim 1, wherein the operation is a non-linear operation that involves at least one of vector operations, scalar operations, or reduction operations (Paragraph 58, the vector processing unit can apply a non-linear function to outputs of the matrix multiply unit to generate vector data values).
Regarding claim 5, Phelps discloses the stacked architecture of claim 1, wherein a number of micro-SPUs for the plurality of configurable micro-SPUs and a number of micro-MFUs for the plurality of configurable micro-MFUs is selected based on the operation (Paragraph 40, special-purpose hardware chip for training a neural network using an accelerator).
Regarding claim 6, Phelps discloses the stacked architecture of claim 1, wherein the plurality of configurable micro-SPUs and the plurality of configurable micro-MFUs is selected based on the operation (Paragraph 40, special-purpose hardware chip for training a neural network using an accelerator).
Regarding claim 7, Phelps discloses the stacked architecture of claim 1, wherein the order of the plurality of configurable micro-SPUs and the plurality of configurable micro-MFUs is selected based on the operation (Paragraph 40, special-purpose hardware chip for training a neural network using an accelerator).
Regarding claim 8, Phelps discloses the stacked architecture of claim 1, wherein the plurality of configurable micro-SPUs and the plurality of configurable micro-MFUs perform the operation without intermediate accesses to memory of the accelerator (Figure 1B, no intermediary memory between SPU and VPU).
Regarding claim 9, Phelps discloses the stacked architecture of claim 1, further comprising: a plurality of first in, first out (FIFO) structures that provide the vector values output from a previous micro-MFU to a next micro-MFU in the order, wherein one FIFO structure of the plurality of FIFO structures is parallel to each micro-SPU of the plurality of micro-SPUs (Paragraph 55, there is a first-in-first-out data storage to store results. When an operation is finished, the results are stored in the FIFO. The compute core can use a separate instruction at a later time to pull the data out of the FIFO and put it in the vector register).
Regarding claim 10, Phelps discloses the stacked architecture of claim 1, further comprising: at least one programmable MFU in communication with one or more micro-SPUs of the plurality of configurable micro-SPUs, wherein the output of the at least one programmable MFU is provided as an input to the one or more micro-SPUs (Figure 1B, output of VPU 106 is an input to SPU 107).
Regarding claim 11, Phelps discloses the stacked architecture of claim 1, further comprising: a first programmable MFU in communication with a first micro-SPU of the plurality of configurable micro-SPUs; a second programmable MFU in communication with a last micro-SPU of the plurality of configurable micro-SPUs; and a third programmable MFU in communication with the second programmable MFU (Paragraph 44, FIG. 1B shows a high-level example of compute core (101). The compute core can be a machine, i.e., a VLIW machine, that controls several compute units in parallel. Each compute core (101) contains: a scalar memory (104), a vector memory (108), a scalar processing unit (107), vector registers (106), and extended vector units).
Regarding claim 12, Phelps discloses an accelerator, comprising: a plurality of vector register files (VRFs) that provide one or more vectors with data for the accelerator (Paragraph 54, the computational unit includes vector registers, i.e., 32 vector registers, in a vector processing unit (106) that can be used for both floating point operations and integer operations); a plurality of programmable multi-functional units (MFUs) in communication with the VRFs to perform vector operations on vector values from the one or more vectors (Paragraph 54, using these registers as operands, each of the vector units can simultaneously execute two ALU instructions, one load and one store instruction, every clock cycle); at least one programmable scalar processing unit (SPU) (Paragraph 54, base address for a load or a store instruction can be computed in the scalar processor and forwarded to the vector processor); and a configurable stacked architecture with a fixed function datapath (Figure 1B) in communication with the plurality of programmable MFUs, wherein the stacked architecture performs a non-linear operation on the vector values to accelerate a layer of a DNN (Paragraph 58, the vector processing unit can apply a non-linear function to outputs of the matrix multiply unit to generate vector data values. In some implementations, the vector processing unit 106 generates normalized values, pooled values, or both. The vector of processed outputs can be used as left-hand side data inputs to the matrix multiply unit 113, e.g., for use in a subsequent layer in the neural network), and the stacked architecture includes a plurality of configurable limited SPUs that perform a scalar operation on the vector values as part of the non-linear operation and a plurality of configurable limited vector processing units that perform a vector operation on the vector values as part of the non-linear operation (Paragraph 44, FIG. 1B shows a high-level example of compute core (101). The compute core can be a machine, i.e., a VLIW machine, that controls several compute units in parallel. Each compute core (101) contains: a scalar memory (104), a vector memory (108), a scalar processing unit (107), vector registers (106), and extended vector units).
Regarding claim 13, Phelps discloses the accelerator of claim 12, wherein the plurality of configurable limited SPUs and the plurality of configurable limited vector processing units are stacked in an order to perform the non-linear operation (Paragraph 58, the vector processing unit can apply a non-linear function to outputs of the matrix multiply unit to generate vector data values. In some implementations, the vector processing unit 106 generates normalized values, pooled values, or both. The vector of processed outputs can be used as left-hand side data inputs to the matrix multiply unit 113, e.g., for use in a subsequent layer in the neural network).
Regarding claim 15, Phelps discloses the accelerator of claim 13, wherein each limited SPU of the plurality of configurable limited SPUs performs a reduction operation, the at least one scalar operation, and a broadcast operation to broadcast the vector values to a vector (Paragraph 45, the scalar instruction set includes normal arithmetic operations, e.g., as used in address calculations, load/store instructions, and branch instructions. The remaining instruction slots encode instructions for the vector processing unit or other extended vector units (113, 114, 116). The decoded vector instructions are forwarded to the vector processing unit).
Regarding claim 16, Phelps discloses the accelerator of claim 13, wherein the plurality of configurable limited SPUs and the plurality of configurable limited vector processing units are selected based on the non-linear operation to accelerate (Paragraph 58, the vector processing unit can apply a non-linear function to outputs of the matrix multiply unit to generate vector data values. In some implementations, the vector processing unit 106 generates normalized values, pooled values, or both. The vector of processed outputs can be used as left-hand side data inputs to the matrix multiply unit 113, e.g., for use in a subsequent layer in the neural network).
Regarding claim 17, Phelps discloses the accelerator of claim 12, wherein the accelerator uses a fixed function instruction to identify the non-linear operation to accelerate (Paragraph 58, the vector processing unit can apply a non-linear function to outputs of the matrix multiply unit to generate vector data values. In some implementations, the vector processing unit 106 generates normalized values, pooled values, or both. The vector of processed outputs can be used as left-hand side data inputs to the matrix multiply unit 113, e.g., for use in a subsequent layer in the neural network).

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim 18 is rejected under 35 U.S.C. 103 as being unpatentable over Phelps and Youn (US 2021/0312266).
Regarding claim 18, Phelps discloses the accelerator of claim 12, wherein the non-linear operation is one of a softmax operation or a layer normalization operation (Paragraph 58), but does not specifically disclose wherein the DNN is a bidirectional encoder representations from transformers (BERT) model. However, Youn discloses a BERT model DNN (Paragraph 19). It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the teachings of Phelps and Youn to have a DNN that is a BERT model. The motivation to do so is that a BERT model is a well-known type of DNN that would yield predictable results of handling multiple classes of operations.

Allowable Subject Matter
Claims 19-20 are allowed.
Claim 14 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Response to Arguments
Applicant’s arguments with respect to claims 1-20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to NIMESH G PATEL, whose telephone number is (571) 272-3640. The examiner can normally be reached Monday-Friday, 8:15-4:15.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jaweed Abbaszadeh, can be reached at 571-270-1640. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/NIMESH G PATEL/Primary Examiner, Art Unit 2187