DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claims 1-3, 9-11, and 17-19 have been amended.
Claims 1-24 have been examined.
The specification objections in the previous Office Action have been addressed and are withdrawn.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on May 10, 2021 has been entered.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-5, 7-13, 15-21, 23, and 24 are rejected under 35 U.S.C. 103 as being unpatentable over US Publication No. 2004/0111587 by Nair et al. (hereinafter referred to as “Nair”) in view of US Publication No. 2017/0293659 by Huang (hereinafter referred to as “Huang”) in view of US Patent No. 8,924,455 by Barman et al. (hereinafter referred to as “Barman”). 
Regarding claims 1, 9, and 17, taking claim 1 as representative, Nair discloses:
(Nair discloses, at ¶ [0021], a processor that fetches instructions, which discloses fetch circuitry. Nair also discloses, at ¶ [0062], an instruction that specifies three matrix operands and an opcode. As disclosed at ¶ [0090] (Table 1) the opcode can indicate multiply and accumulate of, as disclosed at ¶ [0089], corresponding elements, which includes non-zero elements.); 
decode circuitry to decode the fetched instruction (Nair discloses, at ¶ [0024], decoding instructions, which discloses decode circuitry.); and 
the execution circuitry to execute the decoded instruction as per the opcode… to multiply and accumulate matching NZ elements of the first matrix and the second matrix with corresponding elements of the third matrix (Nair discloses, at ¶ [0090] (Table 1), executing a multiply and accumulate instruction, which discloses execution circuitry, that multiplies a first and second matrix and accumulates the result with a third (destination) matrix. Nair also discloses, at ¶ [0089], the matrix instructions operate on corresponding (matching) elements, which includes non-zero elements.).
Nair does not explicitly disclose that the execution circuitry to execute the instruction is to generate NZ bitmasks for the first matrix and the second matrix, broadcast NZ elements from each row of the first matrix to a corresponding row of a two-dimensional grid of processing engines and from each column of the second matrix to a corresponding column of the two-dimensional grid of processing engines, that the multiplying and accumulating of the NZ elements is based on the NZ bitmasks, and wherein each processing engine comprises a buffer, and is to store a broadcast NZ element in its buffer for use in a subsequent cycle in response to the NZ bitmasks indicating a matching NZ element will arrive in the subsequent cycle, and not store a broadcast NZ element in its buffer in response to the NZ bitmasks indicating a matching NZ element will not arrive. 
However, in the same field of endeavor (e.g., matrix operations) Huang discloses:
generating NZ bitmasks (Huang discloses, at ¶ [0109], generating bitmasks that identify non-zero elements.)
 (Huang discloses, at ¶¶ [0134]-[0135] and Figure 12, broadcasting rows and vectors into fifos of an array of processing units. The stored values are the compressed values, which means they are stored because the non-zero bit masks indicate that matching elements will arrive, and the multiplication and addition are performed based on this indication provided by the NZ bitmasks.); 
wherein each processing engine comprises a buffer, and is to store a broadcast NZ element in its buffer for use in a subsequent cycle in response to the NZ bitmasks indicating a matching NZ element will arrive in the subsequent cycle, and not store a broadcast NZ element in its buffer in response to the NZ bitmasks indicating a matching NZ element will not arrive (Huang discloses, at ¶¶ [0134]-[0135] and Figure 12, storing rows and vectors into fifos (buffers) of an array of processing units. The stored values are the compressed values, which means they are stored because the non-zero bit masks indicate that matching elements will arrive; when the non-zero bit masks do not indicate that matching elements will arrive, values are not stored. Each value in the fifo, except the first, will be used in subsequent cycles.); and
non-transitory computer-readable storage media (Huang discloses, at ¶ [0103], non-transitory computer-readable storage media.).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to modify Nair’s matrix multiply and accumulate instruction to include Huang’s generation of non-zero bit masks and storing of non-zero elements in order to improve performance by reducing the number of computations needed to calculate a result. See Huang, ¶ [0121].
Also, in the same field of endeavor (e.g., matrix operations) Barman discloses:
elements are broadcast to a corresponding row of a two-dimensional grid of processing engines and to a corresponding column of the two-dimensional grid of processing engines (Barman discloses, at col. 3, lines 59-62, each processing cell in an MxL (two-dimensional) array of processing cells receives elements from corresponding rows and columns of input matrices.).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to modify Nair’s matrix multiply and accumulate instruction to utilize a two-

Regarding claims 2, 10, and 18, taking claim 2 as representative, Nair, as modified, discloses the elements of claim 1, as discussed above. Nair also discloses:
wherein each of the first matrix, second matrix, and third matrix are located in a corresponding single two-dimensional tile register in a matrix operations accelerator (Nair discloses, at ¶ [0037], the matrix data processor, i.e., matrix operations accelerator, utilizes packed data contained in the span of a register set, i.e., single two-dimensional tile register.).

Regarding claims 3, 11, and 19, taking claim 3 as representative, Nair, as modified, discloses the elements of claim 1, as discussed above. Nair does not explicitly disclose the execution circuitry is to execute the decoded instruction to broadcast a first set of NZ elements from a first row of the first matrix to both a first processing engine and a second processing engine in a first row of the two-dimensional grid of processing engines, and broadcast a second set of NZ elements from a first column of the second matrix to both the first processing engine and a third processing engine in a first column of the two-dimensional grid of processing engines.
However, in the same field of endeavor (e.g., matrix operations) Barman discloses:
broadcasting a first set of NZ elements from a first row of the first matrix to both a first processing engine and a second processing engine in a first row of the two-dimensional grid of processing engines, and broadcasting a second set of NZ elements from a first column of the second matrix to both the first processing engine and a third processing engine in a first column of the two-dimensional grid of processing engines (Barman discloses, at col. 6, lines 42-45 and Figure 9, broadcasting a row of an input matrix into a row of the systolic array, which discloses first and second processing engines, and broadcasting a column of a second input matrix into a column of the systolic array, which discloses first and third processing engines.). 
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to modify Nair’s matrix multiply and accumulate instruction to utilize a two-

Regarding claims 4, 12, and 20, taking claim 4 as representative, Nair, as modified, discloses the elements of claim 1, as discussed above. Nair also discloses:
wherein the first matrix has M rows by K columns, the second matrix has K rows by N columns, the third matrix has M rows by N columns…and wherein the instruction is further to specify K, M, and N (Nair discloses, at ¶ [0050], the instruction specifies the number of rows and columns for each of the matrices.).
Nair does not explicitly disclose wherein the two dimensional grid of processing engines has M rows by N columns.
However, in the same field of endeavor (e.g., matrix operations) Barman discloses:
a two dimensional grid of processing engines having M rows by N columns (Barman discloses, at col. 3, lines 10-42, a systolic array having a number of MAC units (processing elements) that depends on the size of the matrices being multiplied.).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to modify Nair’s matrix multiply and accumulate instruction to utilize a two-dimensional array of processing cells, one for each element of the output matrix, as disclosed by Barman, in order to improve performance by increasing parallelism to achieve high throughput. See Barman, col. 3, lines 2-4.
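For illustration only, the combination discussed above (non-zero bitmasks as in Huang, an M-by-N grid of processing cells as in Barman, and a multiply and accumulate over three matrices as in Nair) can be sketched in Python. This is a minimal sketch under stated assumptions: the names `nz_bitmask` and `sparse_mac` are hypothetical, each grid position is modeled as one loop iteration rather than hardware, and nothing here is taken from the cited references or the claims.

```python
def nz_bitmask(values):
    """Return an integer bitmask with bit k set when values[k] is non-zero."""
    mask = 0
    for k, v in enumerate(values):
        if v != 0:
            mask |= 1 << k
    return mask


def sparse_mac(a, b, c):
    """Return c + a @ b for list-of-lists matrices, skipping zero operands.

    a is M x K, b is K x N, c is M x N. Position (i, j) stands in for one
    processing engine of a hypothetical M x N grid (an assumption of this sketch).
    """
    m, k_dim, n = len(a), len(b), len(b[0])
    # One bitmask per row of a and per column of b.
    row_masks = [nz_bitmask(a[i]) for i in range(m)]
    col_masks = [nz_bitmask([b[k][j] for k in range(k_dim)]) for j in range(n)]
    out = [row[:] for row in c]  # copy so the accumulator input is not mutated
    for i in range(m):
        for j in range(n):
            # A product is needed only where both operands are non-zero.
            match = row_masks[i] & col_masks[j]
            for k in range(k_dim):
                if (match >> k) & 1:
                    out[i][j] += a[i][k] * b[k][j]
    return out
```

In this sketch, a bit position that is clear in `match` corresponds to a product that is never computed, which is the source of the reduction in computations referred to in the rationale above.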

Regarding claims 5, 13, and 21, taking claim 5 as representative, Nair, as modified, discloses the elements of claim 1, as discussed above. Nair also discloses:
wherein the first matrix has M rows by K columns, the second matrix has K rows by N columns, the third matrix has M rows by N columns…and wherein K, M, and N are configured in a configuration register in the processor before the instruction is fetched (Nair discloses, at ¶ [0063], specifying the matrix parameters in a configuration register. As these parameters are used when the instruction is executed, the parameters are stored prior to fetching the instruction.).

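Nair does not explicitly disclose wherein the two dimensional grid of processing engines has M rows by N columns.
However, in the same field of endeavor (e.g., matrix operations) Barman discloses: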
a two dimensional grid of processing engines having M rows by N columns (Barman discloses, at col. 3, lines 10-42, a systolic array having a number of MAC units (processing elements) that depends on the size of the matrices being multiplied.).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to modify Nair’s matrix multiply and accumulate instruction to utilize a two-dimensional array of processing cells, one for each element of the output matrix, as disclosed by Barman, in order to improve performance by increasing parallelism to achieve high throughput. See Barman, col. 3, lines 2-4.

Regarding claims 7, 15, and 23, taking claim 7 as representative, Nair, as modified, discloses the elements of claim 1, as discussed above. Nair does not explicitly disclose wherein at least one of the first matrix and the second matrix is a sparse matrix containing a plurality of zero-valued elements.
However, in the same field of endeavor (e.g., matrix operations) Huang discloses:
wherein at least one of the first matrix and the second matrix is a sparse matrix containing a plurality of zero-valued elements (Huang discloses, at ¶ [0105], sparse matrices.).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to modify Nair’s matrix multiply and accumulate instruction to include Huang’s sparse matrices because efficiency can be improved when operating on sparse matrices by reducing the number of computations needed to calculate a result. See Huang, ¶ [0121].

Regarding claims 8, 16, and 24, taking claim 8 as representative, Nair, as modified, discloses the elements of claim 7, as discussed above. Nair does not explicitly disclose wherein the sparse matrix has been stored in memory in compressed format before fetching the instruction, the compressed format to pack NZ elements together and indicate a logical matrix position of each NZ element in a header.
However, in the same field of endeavor (e.g., matrix operations) Huang discloses:
(Huang discloses, at ¶ [0108], storing non-zero elements in compressed format and, at ¶ [0111], storing the compression information in a header.).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to modify Nair’s matrix multiply and accumulate instruction to include Huang’s compression because doing so improves efficiency by saving storage space. See Huang, ¶ [0105].
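For illustration only, a compressed format that packs non-zero elements together and records their logical positions in a header can be sketched in Python. The layout below (one bitmask per row as the header) and the names `compress` and `decompress` are assumptions of this sketch; it does not reproduce Huang's format or the format recited in the claims.

```python
def compress(matrix):
    """Pack non-zero elements; the header holds one bitmask per row marking their columns."""
    header, packed = [], []
    for row in matrix:
        mask = 0
        for col, v in enumerate(row):
            if v != 0:
                mask |= 1 << col
                packed.append(v)
        header.append(mask)
    return {"cols": len(matrix[0]), "header": header, "packed": packed}


def decompress(blob):
    """Rebuild the dense matrix by consuming packed values where header bits are set."""
    values = iter(blob["packed"])
    return [
        [next(values) if (mask >> col) & 1 else 0 for col in range(blob["cols"])]
        for mask in blob["header"]
    ]
```

Only the non-zero values and the header are kept, so a mostly-zero matrix occupies less space than its dense form, consistent with the storage-saving rationale above.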

Claims 6, 14, and 22 are rejected under 35 U.S.C. 103 as being unpatentable over Nair in view of Huang in view of Barman in view of US Publication No. 2014/0059322 by Ould-Ahmed-Vall et al. (hereinafter referred to as “Ould”).
Regarding claims 6, 14, and 22, taking claim 6 as representative, Nair, as modified, discloses the elements of claim 1, as discussed above. Nair does not explicitly disclose wherein the instruction is further to specify a writemask to indicate, for each element of the third matrix, whether the element is to be updated or is to be masked, the instruction further to specify whether masked elements are to be zeroed, setting their values to zero, or merged, leaving their values unchanged.
However, in the same field of endeavor (e.g., vector operations) Ould discloses:
wherein the instruction is further to specify a writemask to indicate, for each element of the third matrix, whether the element is to be updated or is to be masked, the instruction further to specify whether masked elements are to be zeroed, setting their values to zero, or merged, leaving their values unchanged (Ould discloses, at ¶ [0066], a mask (writemask) that specifies whether a destination will be updated and whether the destination will be zeroed, merged, or retain its old value.).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to modify Nair’s matrix multiply and accumulate instruction to include Ould’s writemasking because doing so provides “significant benefits over current techniques which have the disadvantage of increased instruction count resulting from memory access operations.” See Ould, ¶ [0075].
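For illustration only, the difference between zeroing and merging under a writemask can be sketched in Python. The function name `apply_writemask` and its parameters are hypothetical and chosen for this sketch; it is not Ould's implementation or the claimed instruction.

```python
def apply_writemask(dest, result, mask, zeroing):
    """Return the updated destination elements under a per-element writemask.

    mask[i] truthy: element i takes the new result.
    mask[i] falsy:  element i is set to 0 when zeroing is True,
                    or keeps its old value (merging) when zeroing is False.
    """
    out = []
    for old, new, bit in zip(dest, result, mask):
        if bit:
            out.append(new)
        elif zeroing:
            out.append(0)
        else:
            out.append(old)
    return out
```

With destination `[1, 2, 3, 4]`, result `[10, 20, 30, 40]`, and mask `[1, 0, 1, 0]`, zeroing yields `[10, 0, 30, 0]` and merging yields `[10, 2, 30, 4]`.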

Response to Arguments
On pages 10-12 of the response filed May 10, 2021 (“response”), the Applicant argues that the cited references do not disclose the amended claims and emphasizes the newly amended portions of the independent claims. 
Though fully considered, the Examiner respectfully disagrees. The emphasized portions of the amended claims, taking claim 1 as representative, include language largely brought up from previously rejected claim 3. The previous rejection of claim 3 applies to the amended claim 1. As discussed above, Huang discloses, e.g., at ¶¶ [0134]-[0135] and Figure 12, storing matrix values into fifos of an array of processing units. These values are the non-zero values from a sparse matrix, as indicated by mask values and, as indicated at ¶ [0142], when a corresponding mask value is zero, the corresponding operation, i.e., storing and performing the multiply-accumulate operation, is skipped. Accordingly, the Applicant’s arguments are deemed unpersuasive.

On page 11 of the response the Applicant argues “Because the Applicant has demonstrated the patentability of all pending independent claims, the Applicant respectfully submits that all pending claims are allowable. The Applicant's silence with respect to the dependent claims should not be construed as an admission by the Applicant that the Applicant is complicit with the Examiner's rejection of these claims. Because the Applicant has demonstrated the patentability of the independent claims, the Applicant need not substantively address the theories of rejection applied to the dependent claims.”
Though fully considered, the Examiner respectfully disagrees. The reasons set forth in the remarks and rejections presented above, including those regarding the independent claims, are applicable to these claims.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHAWN DOMAN whose telephone number is (571)270-5677.  The examiner can normally be reached on Monday through Friday 8:30am-6pm Eastern Time.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/SHAWN DOMAN/Primary Examiner, Art Unit 2183