DETAILED ACTION
Status of Claims 
Claims 1-24 have been considered. It is hereby acknowledged that the following papers have been received and placed of record in the file:
Applicant Remarks 						-Receipt Date 04/18/2022
Amended Claims 						-Receipt Date 04/18/2022

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 04/18/2022 has been entered.
 
Response to Amendment
This office action is in response to the amendment filed on 04/18/2022. Claims 1-24 are pending. Claims 1, 5-6, 9, 13-14, 17, and 21-22 are amended. 

Response to Arguments
Applicant's arguments filed 10/26/2021 have been fully considered but they are not persuasive. 
Applicant submits:
“Applicant again notes that Nam (U.S. Pat. No. 10,949,380) has a filing date of November 14, 2019 and a prior publication date of September 10, 2020 (i.e., both after the September 27, 2019 filing date of the Applicant's application). The Office action on page 3 alleges that "the US Patent cited to in the rejection is itself prior art as of the date of the foreign priority (March 7, 2019) and may be cited to in the rejection." Applicant's representative submits this is incorrect. As noted in MPEP § 2151, "Thus, a U.S. patent document is effective as prior art as of the filing date of the earliest application to which benefit or priority is claimed and which describes the subject matter relied upon, regardless of whether the earliest such application is a U.S. provisional or nonprovisional application, an international (PCT) application, or a foreign patent application." (emphasis added). The Office action allegation "that the foreign document supports the US Patent as evidenced by, for example, the same figures being used in both documents" does not prove that the subject matter from the US Patent (referred to herein as "Nam") relied upon by the Examiner in the Office action is described in the non-English KR patent application. The Applicant again requests the Examiner reference the disclosure within that KR patent application, e.g., referencing the page(s), line(s), and figure(s) of that foreign application. If not, the Applicant again requests the withdrawal of the rejections based on Nam.” (Remarks, pages 11-12)
	However, this argument is not persuasive because the MPEP does not require the Examiner to reference the disclosure of the foreign application in the rejection; it requires only that the foreign patent application describe the subject matter of the claimed invention, which is evidenced by the same figures being used in the KR application of Nam. 

Applicant submits:
	“As another example, the Office action on page 6 alleges that "It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the systolic array modes of Nam to include a mode for performing matrix multiply by transmitting partial results downstream as taught by Barman such that the output values of the processing elements are also stored to the registers of Nam." However, in reference to PE12, column 3, lines 41-44 of Nam recite "The fourth register R12_4 may store an output value of the multiplexer MUX12 and transmit the output value to the PE PE11 of the previous column." (emphasis added). 
Applicant is unclear how Nam would somehow be modified for "transmitting partial results downstream" when Nam discloses that PE12 is to "transmit the output value to the PE PE11 of the previous column".
MPEP §2143.01(VI) states that "[i]f the proposed modification or combination of the prior art would change the principle of operation of the prior art invention being modified, then the teachings of the references are not sufficient to render the claims prima facie obvious." To somehow modify Nam (e.g., in the alleged combination with Barman and Grochowski) such that its processing element (e.g., PE12) that transmits the output value to the processing element (e.g., PE11) of the previous column to somehow "transmit partial results downstream" would impermissibly change the principle of operation of Nam (e.g., in the alleged combination with Barman and Grochowski). Further, the alleged modification would require a substantial reconstruction and redesign of the elements shown in the reference as well as a change in the basic principle under which the reference(s) construction was designed to operate.” (Remarks, pages 14-15)
However, this argument is not persuasive because it would be well within the ability of one of ordinary skill in the art to modify Nam to support a mode which transmits the results downstream, for example by adding a multiplexer that sends the result either to the previous processing element or downstream, depending on the mode. Further, this modification would not change the principle of operation of Nam since the combination would retain all the functionality of Nam and would include a further mode which sends partial results downstream as taught by Barman. Further, the modification would require only minor changes that one of ordinary skill in the art would be able to make. 
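For illustration only, the mode-dependent selection described above can be sketched in a few lines. The names Mode, route_output, "previous_pe", and "next_pe" are hypothetical and do not appear in Nam or Barman; the sketch is a behavioral model under those assumptions, not the circuitry of either reference. It shows only that adding a downstream path leaves the existing path available.

```python
from enum import Enum


class Mode(Enum):
    # Hypothetical mode labels, introduced here for illustration only.
    PREVIOUS_COLUMN = 0  # existing path: output goes to the PE of the previous column
    DOWNSTREAM = 1       # added path: partial result goes to the next PE


def route_output(mode, value):
    """Model the select logic: return (destination, value) for the chosen mode."""
    if mode is Mode.DOWNSTREAM:
        return ("next_pe", value)
    # Default keeps the original behavior unchanged.
    return ("previous_pe", value)
```

In this model the original behavior is the default branch, so selecting the added mode does not remove any existing functionality.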

Applicant submits:
“As another example, the Office action alleges, in reference to claim 7 (and similarly for claims 15 and 23), that the cited portions of the references teach "The apparatus of claim 1, wherein the resultant storage is a third plurality of registers that represents a plurality of output two-dimensional matrices formed by execution of the decoded single instruction". However, Nam of the alleged combination does not teach or suggest "a plurality of output two-dimensional matrices formed by execution of the decoded single instruction" (emphasis added).” (Remarks, page 15)
	However, this argument is not persuasive because Nam teaches that a matrix multiply may involve multiple boundary operations. For example, Fig. 3 shows a matrix multiply being performed by B1 and B2, and the Office Action maps the two output matrices formed by B1 and B2 as a plurality of output two-dimensional matrices, which would be formed by the execution of the matrix multiply instruction taught by Grochowski since both B1 and B2 are performed to multiply the matrices shown in Fig. 2.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-24 are rejected under 35 U.S.C. 103 as being unpatentable over Nam US 10,949,380 in view of Barman et al. US 8,924,455 (hereinafter, Barman) and Grochowski et al. US 2018/0004510 (hereinafter, Grochowski).
	Regarding claim 1, Nam teaches:
1. An apparatus (Fig. 8, 800) comprising: 
a matrix operations accelerator circuit (Fig. 8, 850) comprising a two-dimensional grid of fused multiply accumulate circuits (col 6 lines 32-35: the systolic array is a two dimensional grid of processing elements; col 3 lines 10-17: the processing elements are fused multiply accumulate circuits, see also Fig. 1); 
a first plurality of registers that represents at least one first input two-dimensional matrix coupled to the matrix operations accelerator circuit (col 3 lines 36-38: registers Rxx_2 represents column inputs from a weight matrix, i.e. a first input two-dimensional matrix, coupled to the systolic array); 
a second plurality of registers that represents at least one second input two-dimensional matrix coupled to the matrix operations accelerator circuit (col 3 lines 33-36 and col 4 lines 1-5: registers Rxx_3 represents row inputs from an input matrix, i.e. a second input two-dimensional matrix, coupled to the systolic array); 
a single instruction (col 5 lines 22-27: the processing system 800 is controlled in response to an instruction);
an execution circuit of the core (col 5 lines 22-25: controller 803) to: 
switch the matrix operations accelerator circuit from a first mode to a second mode (col 5 lines 33-37: the processing system 800 may be switched between a first and second mode) where the respective output of each of the first proper subset of fused multiply accumulate circuits (Fig. 1, PE12, PE22, and PE32) of the two-dimensional grid form first output values from a first input matrix of the at least one first input two-dimensional matrix and a first input matrix of the at least one second input two-dimensional matrix (col 3 lines 11-21: the output values of processing elements PE12, PE22, and PE32 are formed from column inputs IN_C_2, i.e. a first input matrix of the at least one first input matrix, and from row inputs IN_R_1 to IN_R_3 in Rx1_3, i.e. a first input matrix of the at least one second input matrix), and store the first output values in the resultant storage (col 3 lines 22-32: the outputs of the processing elements from PE12, PE22, and PE32 are stored in output registers Rx2_1, i.e. the resultant storage), and a respective output of each of the second proper subset of fused multiply accumulate circuits (Fig. 1, PE13, PE23, and PE33) of the two- dimensional grid form second output values from a second input matrix of the at least one first input two-dimensional matrix and a second input matrix of the at least one second input two-dimensional matrix (col 3 lines 11-21: the output values of PE13, PE23, and PE33 are formed from column inputs IN_C_3, i.e. a second input matrix of the at least one first input matrix, and from row inputs IN_R_1 to IN_R_3 in Rx2_3, i.e. the second input matrix), and store the second output values in the resultant storage (col 3 lines 22-32: the outputs of the processing elements from PE13, PE23, and PE33 are stored in output registers Rx3_1, i.e. the resultant storage).
	Nam does not teach:
	a decoder, of a core coupled to the matrix operations accelerator circuit, to decode a single instruction into a decoded single instruction, the single instruction including a field that identifies a resultant storage; and 
an execution circuit of the core to execute the decoded single instruction to: 
switch the matrix operations accelerator circuit from a first mode where a respective output of each of a first proper subset of fused multiply accumulate circuits of the two-dimensional grid is transmitted downstream to a respective input of each of a second proper subset of fused multiply accumulate circuits of the two-dimensional grid to form output values from a single input matrix of the at least one first input two-dimensional matrix and a single input matrix of the at least one second input two-dimensional matrix, and store the output values in the resultant storage, 
	However, in the analogous art of matrix multiply using systolic arrays, Barman teaches:
a first mode where a respective output of each of a first proper subset of fused multiply accumulate circuits (Fig. 2: a12, a22, and a32) of the two-dimensional grid is transmitted downstream to a respective input of each of a second proper subset of fused multiply accumulate circuits (Fig. 2: a13, a23, and a33) of the two-dimensional grid to form output values from a single input matrix of the at least one first input two-dimensional matrix and a single input matrix of the at least one second input two-dimensional matrix (col 2 line 58-col 3 line 3: the processing cells a12, a22, and a32 perform MAC operations and transmit their result downstream to inputs of a13, a23, and a33 respectively, to form output values from a single first input matrix A and a single second input matrix B)
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the systolic array modes of Nam to include a mode for performing matrix multiply by transmitting partial results downstream as taught by Barman such that the output values of the processing elements are also stored to the registers of Nam. One of ordinary skill in the art would have been motivated to include the mode taught by Barman to capture the advantages of transmitting partial results downstream, such as enabling faster retrieval of results, see Barman col 6 lines 51-58, while also preserving the advantages of the modes of Nam, which allow for selecting a mode to increase utilization, see Nam col lines 43-66.
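As a simplified behavioral illustration of the two modes discussed in the combination, the following sketch contrasts a mode in which each processing element forwards its partial sum downstream with a mode in which each processing element accumulates in place. The function names matmul_downstream and matmul_in_place are hypothetical, and the loops model only the arithmetic under that assumption; they do not reproduce the timing, registers, or wiring of Nam or Barman.

```python
def matmul_downstream(a, b):
    """First mode (sketch): each PE adds its product to the partial sum
    received from upstream and forwards the result downstream."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            partial = 0  # value entering the first PE of the chain
            for p in range(k):  # PE p in the chain
                partial = partial + a[i][p] * b[p][j]  # forwarded downstream
            out[i][j] = partial  # last PE in the chain emits the result
    return out


def matmul_in_place(a, b):
    """Second mode (sketch): each PE keeps its own accumulator and adds
    one product per time step."""
    n, k, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]  # one accumulator per PE
    for step in range(k):  # one time step per inner-dimension index
        for i in range(n):
            for j in range(m):
                acc[i][j] += a[i][step] * b[step][j]
    return acc
```

Both functions return the same product for the same inputs; they differ only in where the running sum is held, which is the distinction the two modes turn on.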
	Further, in the analogous art of matrix multiply, Grochowski teaches:
a decoder (Fig. 1, 108), of a core (Fig. 1, 102), to decode a single instruction into a decoded single instruction ([0033]: the decode unit decodes the matrix multiplication instruction into a decoded instruction), the single instruction including a field that identifies a resultant storage ([0030]: the instruction includes a field for a destination operand that identifies a result storage);
to execute a decoded single instruction to: indicate a first mode or a second mode ([0037]: the instruction has a field to indicate whether the instruction is to be performed with or without matrix accumulation, i.e. a first mode or a second mode)
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the system of Nam in view of Barman to include a core with a decoder and to modify the instruction of Nam to be a matrix multiplication instruction as taught by Grochowski such that the matrix multiplication instruction destination operand field identifies the result registers of Nam and such that the field that indicates the mode the instruction is to be performed in would cause the processing circuitry to switch to the indicated mode when executing the instruction. One of ordinary skill in the art would have been motivated to make this modification because using a core with a decoder to decode an instruction is a known technique on the known device of a computer processor for executing instructions and would yield the predictable result of enabling a system to execute a program of instructions. Further, one of ordinary skill in the art would have been motivated to make this modification because supporting an instruction that identifies a result storage is a known technique on the known device of a computer processor for enabling operations and the location of the results to be specified which would yield the predictable result of increasing control over processing resources. 

	Regarding claim 2, Nam in view of Barman and Grochowski teaches: 
2. The apparatus of claim 1, wherein an instruction comprises a second field to indicate the matrix operations accelerator circuit is to execute in the first mode when the second field is a first value and in the second mode when the second field is a second value (Nam col 5 lines 33-46: an instruction from a host includes a field that sets a value in the mode switch register to indicate the system is to operate in a first mode when the field is a first value and a second mode when the field is a second value). 
	Nam in view of Barman and Grochowski, as currently mapped, does not teach the matrix multiplication instruction including the second field indicating the first or second mode. That is, Nam in view of Barman and Grochowski, as currently mapped, does not teach:
the single instruction comprises a second field to indicate the matrix operations accelerator circuit is to execute in the first mode when the second field is a first value and in the second mode when the second field is a second value
	However, Grochowski further teaches:
the single instruction comprises a second field to indicate the matrix operations accelerator circuit is to execute in the first mode when the second field is a first value and in the second mode when the second field is a second value ([0037]: the matrix multiplication instruction includes a field to indicate whether the instruction should be performed with or without accumulation, i.e. in a first or second mode)
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the matrix multiplication instruction of Nam in view of Barman and Grochowski to include a field that indicates the mode the instruction should be performed in as further taught by Grochowski. Furthermore, one of ordinary skill in the art would have been motivated to make this modification because including an instruction field to indicate a mode for an instruction is a known technique on the known device of a computer processor for specifying a mode for the instruction and would yield the predictable result of efficiently controlling the mode in which the instruction is executed. 

	Regarding claim 3, Nam in view of Barman and Grochowski teaches:
3. The apparatus of claim 2, wherein the second field is an immediate of the single instruction (Grochowski [0037]: the instruction field indicates the mode using bits in the instruction, i.e. the field is an immediate of the instruction).

	Regarding claim 4, Nam in view of Barman and Grochowski teaches:
4. The apparatus of claim 1, wherein the resultant storage is a third plurality of registers (Nam Fig. 1: registers Rxx_1) that represents at least one output two-dimensional matrix formed by execution of the decoded single instruction (Nam col 3 lines 11-32: registers Rxx_1 store the results of the matrix multiply formed by executing the matrix multiply instruction which represents a result matrix).

	Regarding claim 5, Nam in view of Barman and Grochowski teaches: 
5. The apparatus of claim 4, 
	Nam in view of Barman and Grochowski, as currently mapped, does not teach:
wherein the execution of the decoded single instruction is to: in the first mode, add values from a single input matrix of the third plurality of registers that represents at least one third input two-dimensional matrix initially stored in the third plurality of registers to the output values to form updated output values and store the updated output values, instead of the output values, into the resultant storage, and 
in the second mode, add values from a first input matrix of the at least one third input two-dimensional matrix initially stored in the third plurality of registers to the first output values to form updated first output values and add values from a second input matrix of the at least one third input two-dimensional matrix initially stored in the third plurality of registers to the second output values to form updated second output values and store the updated first output values and updated second output values, instead of the first output values and the second output values, into the resultant storage.
	However, Grochowski further teaches:
wherein the execution of the decoded single instruction is to: add values that represents at least one third input two-dimensional matrix initially to the output values to form updated output values and store the updated output values, instead of the output values, into the resultant storage ([0035]-[0036]: an accumulation matrix is added to the output of the multiplication of A and B to form an updated output which is then stored to the storage location that the accumulation matrix was first stored in).
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the systolic array of Nam to store an accumulation matrix in its result storage, to add it to the matrix multiply outputs, and to store the updated output back to the same storage that the accumulation matrix was in, as taught by Grochowski. This combination would teach:
wherein the execution of the decoded single instruction (matrix multiplication instruction) is to: in the first mode (the mode in which partial results are transmitted to a next processing element in the systolic array), add values from a single input matrix of the third plurality of registers that represents at least one third input two-dimensional matrix initially stored in the third plurality of registers to the output values to form updated output values and store the updated output values, instead of the output values, into the resultant storage (values of an accumulation matrix, i.e. a single input matrix, are initially stored into the result registers Rxx_1, which is then added to the outputs of the multiplication to form updated outputs that are stored into the result registers instead of the outputs of the multiplication, see Nam Fig. 1 for reference), and 
in the second mode (the mode in which the processing elements accumulate results in place), add values from a first input matrix of the at least one third input two-dimensional matrix initially stored in the third plurality of registers to the first output values to form updated first output values and add values from a second input matrix of the at least one third input two-dimensional matrix initially stored in the third plurality of registers to the second output values to form updated second output values and store the updated first output values and updated second output values, instead of the first output values and the second output values, into the resultant storage (values of an accumulation matrix are initially stored into the result registers Rxx_1, which is then added to the outputs of the multiplication, in the second mode these are first and second output values, i.e. first and second input matrices of the third input matrix, from the first and second sets of processing elements respectively, to form updated outputs that are stored into the result registers instead of the outputs of the multiplication, see Nam Fig. 1 for reference).
One of ordinary skill in the art would have been motivated to make this modification because using an initial accumulation matrix to add to a matrix multiply output is a known technique on the known device of a computer processor for accumulating previous results with current results and would yield the predictable result of supporting larger matrix multiplies and would also increase the usability of the systolic array. 
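The accumulation relied upon above, in which an accumulation matrix C is initially held in the result storage and is overwritten with C + A x B, can be illustrated by the following minimal sketch. The function name multiply_accumulate and the list-of-lists representation are assumptions made for illustration and are not taken from Grochowski or Nam.

```python
def multiply_accumulate(c, a, b):
    """Add A x B to C and write the updated values back into the storage
    that initially held C (the list c is modified in place)."""
    for i in range(len(a)):
        for j in range(len(b[0])):
            for p in range(len(b)):
                c[i][j] += a[i][p] * b[p][j]
    return c
```

Because the updated values replace the initial contents of the same storage, the result of one such operation can serve as the accumulation input to the next, which is how previous results are accumulated with current results.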

	Regarding claim 6, Nam in view of Barman and Grochowski teaches:
6. The apparatus of claim 1, 
	Nam in view of Barman and Grochowski, as currently mapped, does not teach:
wherein the execution of the decoded single instruction is to: in the first mode, add values from a single input matrix of at least one third input two-dimensional matrix initially stored in a third plurality of registers to the output values to form updated output values and store the updated output values, instead of the output values, into the resultant storage, and 
in the second mode, add values from a first input matrix of the at least one third input two-dimensional matrix initially stored in the third plurality of registers to the first output values to form updated first output values and add values from a second input matrix of the at least one third input two-dimensional matrix initially stored in the third plurality of registers to the second output values to form updated second output values and store the updated first output values and updated second output values, instead of the first output values and the second output values, into the resultant storage.
	However, Grochowski further teaches:
add values from at least one third input two-dimensional matrix to the output values to form updated output values and store the updated output values, instead of the output values, into the resultant storage ([0035]-[0036]: an accumulation matrix is added to the output of the multiplication of A and B to form an updated output which is then stored to the storage location that the accumulation matrix was first stored in).
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the systolic array of Nam to store an accumulation matrix in its result storage, to add it to the matrix multiply outputs, and to store the updated output back to the same storage that the accumulation matrix was in, as taught by Grochowski. This combination would teach:
wherein the execution of the decoded single instruction (matrix multiplication instruction) is to: in the first mode (the mode in which partial results are transmitted to a next processing element in the systolic array), add values from a single input matrix of at least one third input two-dimensional matrix initially stored in a third plurality of registers to the output values to form updated output values and store the updated output values, instead of the output values, into the resultant storage (values of an accumulation matrix, i.e. a single input matrix, are initially stored into the result registers Rxx_1, which is then added to the outputs of the multiplication to form updated outputs that are stored into the result registers instead of the outputs of the multiplication, see Nam Fig. 1 for reference), and 
in the second mode (the mode in which the processing elements accumulate results in place), add values from a first input matrix of the at least one third input two-dimensional matrix initially stored in the third plurality of registers to the first output values to form updated first output values and add values from a second input matrix of the at least one third input two-dimensional matrix initially stored in the third plurality of registers to the second output values to form updated second output values and store the updated first output values and updated second output values, instead of the first output values and the second output values, into the resultant storage (values of an accumulation matrix are initially stored into the result registers Rxx_1, which is then added to the outputs of the multiplication, in the second mode these are first and second output values, i.e. first and second input matrices of the third input matrix, from the first and second sets of processing elements respectively, to form updated outputs that are stored into the result registers instead of the outputs of the multiplication, see Nam Fig. 1 for reference).
One of ordinary skill in the art would have been motivated to make this modification because using an initial accumulation matrix to add to a matrix multiply output is a known technique on the known device of a computer processor for accumulating previous results with current results and would yield the predictable result of supporting larger matrix multiplies and would also increase the usability of the systolic array. 

	Regarding claim 7, Nam in view of Barman and Grochowski teaches:
7. The apparatus of claim 1, wherein the resultant storage is a third plurality of registers (Nam Fig. 1: registers Rxx_1) that represents a plurality of output two-dimensional matrices formed by execution of the decoded single instruction (Nam col 3 lines 11-32 and col 4 lines 26-40: registers Rxx_1 store the results of the matrix multiply formed by executing the matrix multiply instruction, the result registers represent the output matrix formed from executing B1 and the output matrix formed by executing B2, which are a plurality of output matrices which are formed by the execution of the matrix multiply instruction in the combination, see also Nam Fig. 3 for reference).

	Regarding claim 8, Nam in view of Barman and Grochowski teaches:
8. The apparatus of claim 1, wherein the first proper subset of fused multiply accumulate circuits is one of a row or a column of the two-dimensional grid of fused multiply accumulate circuits and the second proper subset of fused multiply accumulate circuits is another of the one of the row or the column of the two-dimensional grid of fused multiply accumulate circuits (Nam Fig. 1: the subset PE12, PE22, and PE32 is a column of the grid of MACs and the subset PE13, PE23, and PE33 is another column of the grid of MACs).

	Regarding claim 9, Nam teaches:
9. A method comprising: 
a processor core (Fig. 8, 800) is coupled to a matrix operations accelerator circuit (Fig. 8, 850) comprising a two-dimensional grid of fused multiply accumulate circuits (col 6 lines 32-35: the systolic array is a two dimensional grid of processing elements; col 3 lines 10-17: the processing elements are fused multiply accumulate circuits, see also Fig. 1), the matrix operations accelerator circuit is coupled to a first plurality of registers that represents at least one first input two-dimensional matrix (col 3 lines 36-38: registers Rxx_2 represents column inputs from a weight matrix, i.e. a first input two-dimensional matrix, coupled to the systolic array) and a second plurality of registers that represents at least one second input two-dimensional matrix (col 3 lines 33-36 and col 4 lines 1-5: registers Rxx_3 represents row inputs from an input matrix, i.e. a second input two-dimensional matrix, coupled to the systolic array); and 
executing with an execution circuit (col 5 lines 22-25: controller 803) of the processor core to: switch the matrix operations accelerator circuit from a first mode to a second mode (col 5 lines 33-37: the processing system 800 may be switched between a first and second mode) where the respective output of each of the first proper subset of fused multiply accumulate circuits (Fig. 1, PE12, PE22, and PE32) of the two-dimensional grid form first output values from a first input matrix of the at least one first input two-dimensional matrix and a first input matrix of the at least one second input two-dimensional matrix (col 3 lines 11-21: the output values of processing elements PE12, PE22, and PE32 are formed from column inputs IN_C_2, i.e. a first input matrix of the at least one first input matrix, and from row inputs IN_R_1 to IN_R_3 in Rx1_3, i.e. a first input matrix of the at least one second input matrix), and store the first output values in the resultant storage (col 3 lines 22-32: the outputs of the processing elements from PE12, PE22, and PE32 are stored in output registers Rx2_1, i.e. the resultant storage), and a respective output of each of the second proper subset of fused multiply accumulate circuits (Fig. 1, PE13, PE23, and PE33) of the two- dimensional grid form second output values from a second input matrix of the at least one first input two-dimensional matrix and a second input matrix of the at least one second input two-dimensional matrix (col 3 lines 11-21: the output values of PE13, PE23, and PE33 are formed from column inputs IN_C_3, i.e. a second input matrix of the at least one first input matrix, and from row inputs IN_R_1 to IN_R_3 in Rx2_3, i.e. the second input matrix), and store the second output values in the resultant storage (col 3 lines 22-32: the outputs of the processing elements from PE13, PE23, and PE33 are stored in output registers Rx3_1, i.e. the resultant storage).
	Nam does not teach:
decoding, with a decoder, a single instruction into a decoded single instruction,
the single instruction includes a field that identifies a resultant storage
executing the decoded single instruction to: switch the matrix operations accelerator circuit from a first mode where a respective output of each of a first proper subset of fused multiply accumulate circuits of the two-dimensional grid is transmitted downstream to a respective input of each of a second proper subset of fused multiply accumulate circuits of the two-dimensional grid to form output values from a single input matrix of the at least one first input two-dimensional matrix and a single input matrix of the at least one second input two-dimensional matrix, and store the output values in the resultant storage,
	However, in the analogous art of matrix multiply using systolic arrays, Barman teaches:
a first mode where a respective output of each of a first proper subset of fused multiply accumulate circuits (Fig. 2: a12, a22, and a32) of the two-dimensional grid is transmitted downstream to a respective input of each of a second proper subset of fused multiply accumulate circuits (Fig. 2: a13, a23, and a33) of the two-dimensional grid to form output values from a single input matrix of the at least one first input two-dimensional matrix and a single input matrix of the at least one second input two-dimensional matrix (col 2 line 58-col 3 line 3: the processing cells a12, a22, and a32 perform MAC operations and transmit their result downstream to inputs of a13, a23, and a33 respectively, to form output values from a single first input matrix A and a single second input matrix B)
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the systolic array modes of Nam to include a mode for performing matrix multiply by transmitting partial results downstream as taught by Barman such that the output values of the processing elements are also stored to the registers of Nam. One of ordinary skill in the art would have been motivated to make this modification to reduce overhead associated with processing boundaries to obtain the results (Nam col 4 lines 26-37) when utilization of the processing elements is the same or better than accumulating the results in place while also supporting modes for increasing utilization of the processing elements by accumulating the results in place when the technique of transmitting the partial results downstream decreases utilization, see also Barman col 6 lines 51-58. 
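	For illustration only, the two dataflows contrasted in the combination above can be sketched in software. This is a minimal sketch under stated assumptions and does not reproduce the circuitry of Nam or Barman: the function names `matmul_downstream` and `matmul_in_place` are hypothetical, and plain Python lists stand in for the registers and processing elements. The sketch shows only that forwarding partial sums to the next element and accumulating in place arrive at the same product.

```python
def matmul_downstream(a, b):
    """Partial sums travel through a chain of cells (hypothetical model).

    Each cell adds one product to the sum it receives from upstream and
    forwards the new sum to the next cell; the last cell emits the output.
    """
    n, k, m = len(a), len(b), len(b[0])
    out = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            partial = 0  # value entering the first cell of the chain
            for cell in range(k):
                # This cell contributes a[i][cell] * b[cell][j] and forwards the sum.
                partial = partial + a[i][cell] * b[cell][j]
            out[i][j] = partial  # value leaving the last cell
    return out


def matmul_in_place(a, b):
    """Every cell keeps its own accumulator (hypothetical model).

    At each step, cell (i, j) adds one product to its local accumulator;
    nothing is forwarded to a neighbouring cell.
    """
    n, k, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]  # one accumulator per cell
    for step in range(k):
        for i in range(n):
            for j in range(m):
                acc[i][j] += a[i][step] * b[step][j]
    return acc
```

	Both functions return the same matrix for the same inputs; they differ only in where the running sum is held, which is the distinction the two modes turn on.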
	Further, in the analogous art of matrix multiply, Grochowski teaches:
decoding, with a decoder (Fig. 1, 108), a single instruction into a decoded single instruction ([0033]: the decode unit decodes the matrix multiplication instruction into a decoded instruction),
the single instruction includes a field that identifies a resultant storage ([0030]: the instruction includes a field for a destination operand that identifies a result storage);
executing a decoded single instruction to: indicate a first mode or a second mode ([0037]: the instruction has a field to indicate whether the instruction is to be performed with or without matrix accumulation, i.e. a first mode or a second mode)
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the system of Nam in view of Barman to include a core with a decoder and to modify the instruction of Nam to be a matrix multiplication instruction as taught by Grochowski such that the matrix multiplication instruction destination operand field identifies the result registers of Nam and such that the field that indicates the mode the instruction is to be performed in would cause the processing circuitry to switch to the indicated mode when executing the instruction. One of ordinary skill in the art would have been motivated to make this modification because using a core with a decoder to decode an instruction is a known technique on the known device of a computer processor for executing instructions and would yield the predictable result of enabling a system to execute a program of instructions. Further, one of ordinary skill in the art would have been motivated to make this modification because supporting an instruction that identifies a result storage is a known technique on the known device of a computer processor for enabling operations and the location of the results to be specified which would yield the predictable result of increasing control over processing resources. 


	Regarding claim 10, Nam in view of Barman and Grochowski teaches:
10. The method of claim 9, wherein an instruction comprises a second field indicating that the matrix operations accelerator circuit is to execute in the first mode when the second field is a first value and in the second mode when the second field is a second value (Nam col 5 lines 33-46: an instruction from a host includes a field that sets a value in the mode switch register to indicate the system is to operate in a first mode when the field is a first value and a second mode when the field is a second value).
	Nam in view of Barman and Grochowski, as currently mapped, does not teach the matrix multiplication instruction including the second field indicating the first or second mode. That is, Nam in view of Barman and Grochowski, as currently mapped, does not teach:
the single instruction comprises a second field indicating the matrix operations accelerator circuit is to execute in the first mode when the second field is a first value and in the second mode when the second field is a second value
	However, Grochowski further teaches:
the single instruction comprises a second field indicating the matrix operations accelerator circuit is to execute in the first mode when the second field is a first value and in the second mode when the second field is a second value ([0037]: the matrix multiplication instruction includes a field to indicate whether the instruction should be performed with or without accumulation, i.e. in a first or second mode)
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the matrix multiplication instruction of Nam in view of Barman and Grochowski to include a field that indicates the mode the instruction should be performed in as further taught by Grochowski. One of ordinary skill in the art would have been motivated to make this modification because including an instruction field to indicate a mode for an instruction is a known technique on the known device of a computer processor for specifying a mode for the instruction and would yield the predictable result of efficiently controlling the mode in which the instruction is executed. 

	Regarding claim 11, Nam in view of Barman and Grochowski teaches:
11. The method of claim 10, wherein the second field is an immediate of the single instruction (Grochowski [0037]: the instruction field indicates the mode using bits in the instruction, i.e. the field is an immediate of the instruction).

	Regarding claim 12, Nam in view of Barman and Grochowski teaches:
12. The method of claim 9, wherein the resultant storage is a third plurality of registers (Nam Fig. 1: registers Rxx_1) that represents at least one output two-dimensional matrix formed by execution of the decoded single instruction (Nam col 3 lines 11-32: registers Rxx_1 store the results of the matrix multiply formed by executing the matrix multiply instruction which represents a result matrix).

	Regarding claim 13, Nam in view of Barman and Grochowski teaches:
13. The method of claim 12, 
	Nam in view of Barman and Grochowski, as currently mapped, does not teach:
wherein the executing the decoded single instruction is to: in the first mode, add values from a single input matrix of the third plurality of registers that represents at least one third input two-dimensional matrix initially stored in the third plurality of registers to the output values to form updated output values and store the updated output values, instead of the output values, into the resultant storage, and 
in the second mode, add values from a first input matrix of the at least one third input two-dimensional matrix initially stored in the third plurality of registers to the first output values to form updated first output values and add values from a second input matrix of the at least one third input two-dimensional matrix initially stored in the third plurality of registers to the second output values to form updated second output values and store the updated first output values and updated second output values, instead of the first output values and the second output values, into the resultant storage.
	However, Grochowski further teaches:
wherein the execution of the decoded single instruction is to: add values that represents at least one third input two-dimensional matrix initially to the output values to form updated output values and store the updated output values, instead of the output values, into the resultant storage ([0035]-[0036]: an accumulation matrix is added to the output of the multiplication of A and B to form an updated output which is then stored to the storage location that the accumulation matrix was first stored in).
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the systolic array of Nam to store an accumulation matrix in its result storage, to add it to the matrix multiply outputs, and to store the updated output back to the same storage that the accumulation matrix was in, as taught by Grochowski. This combination would teach:
wherein the execution of the decoded single instruction (matrix multiplication instruction) is to: in the first mode (the mode in which partial results are transmitted to a next processing element in the systolic array), add values from a single input matrix of the third plurality of registers that represents at least one third input two-dimensional matrix initially stored in the third plurality of registers to the output values to form updated output values and store the updated output values, instead of the output values, into the resultant storage (values of an accumulation matrix, i.e. a single input matrix, are initially stored into the result registers Rxx_1, which is then added to the outputs of the multiplication to form updated outputs that are stored into the result registers instead of the outputs of the multiplication, see Nam Fig. 1 for reference), and 
in the second mode (the mode in which the processing elements accumulate results in place), add values from a first input matrix of the at least one third input two-dimensional matrix initially stored in the third plurality of registers to the first output values to form updated first output values and add values from a second input matrix of the at least one third input two-dimensional matrix initially stored in the third plurality of registers to the second output values to form updated second output values and store the updated first output values and updated second output values, instead of the first output values and the second output values, into the resultant storage (values of an accumulation matrix are initially stored into the result registers Rxx_1, which is then added to the outputs of the multiplication, in the second mode these are first and second output values, i.e. first and second input matrices of the third input matrix, from the first and second sets of processing elements respectively, to form updated outputs that are stored into the result registers instead of the outputs of the multiplication, see Nam Fig. 1 for reference).
One of ordinary skill in the art would have been motivated to make this modification because using an initial accumulation matrix to add to a matrix multiply output is a known technique on the known device of a computer processor for accumulating previous results with current results and would yield the predictable result of supporting larger matrix multiplications and would also increase the usability of the systolic array.
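	For illustration only, the accumulation described above (an accumulation matrix initially held in the result storage, added to the product of the two input matrices, with the sum written back to that same storage) can be sketched as follows. This is a minimal sketch and not the implementation of Grochowski or Nam: the function name `multiply_accumulate` is hypothetical, and the list `c` stands in for the result registers.

```python
def multiply_accumulate(a, b, c):
    """Overwrite c with c + a*b (hypothetical model of the result storage).

    c initially holds the accumulation matrix; after the call it holds the
    updated outputs in place of the plain product a*b.
    """
    k = len(b)
    for i in range(len(a)):
        for j in range(len(b[0])):
            product = sum(a[i][t] * b[t][j] for t in range(k))
            c[i][j] = c[i][j] + product  # written back to the same location
    return c
```

	Because the sum is written back into `c`, repeated calls on successive tiles of larger matrices accumulate into the same storage, which is how the technique supports larger matrix multiplications.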

	Regarding claim 14, Nam in view of Barman and Grochowski teaches:
14. The method of claim 9, 
	Nam in view of Barman and Grochowski, as currently mapped, does not teach:
wherein the executing the decoded single instruction is to: in the first mode, add values from a single input matrix of at least one third input two-dimensional matrix initially stored in a third plurality of registers to the output values to form updated output values and store the updated output values, instead of the output values, into the resultant storage, and 
in the second mode, add values from a first input matrix of the at least one third input two-dimensional matrix initially stored in the third plurality of registers to the first output values to form updated first output values and add values from a second input matrix of the at least one third input two-dimensional matrix initially stored in the third plurality of registers to the second output values to form updated second output values and store the updated first output values and updated second output values, instead of the first output values and the second output values, into the resultant storage.
	However, Grochowski further teaches:
add values from at least one third input two-dimensional matrix to the output values to form updated output values and store the updated output values, instead of the output values, into the resultant storage ([0035]-[0036]: an accumulation matrix is added to the output of the multiplication of A and B to form an updated output which is then stored to the storage location that the accumulation matrix was first stored in).
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the systolic array of Nam to store an accumulation matrix in its result storage, to add it to the matrix multiply outputs, and to store the updated output back to the same storage that the accumulation matrix was in, as taught by Grochowski. This combination would teach:
wherein the execution of the decoded single instruction (matrix multiplication instruction) is to: in the first mode (the mode in which partial results are transmitted to a next processing element in the systolic array), add values from a single input matrix of at least one third input two-dimensional matrix initially stored in a third plurality of registers to the output values to form updated output values and store the updated output values, instead of the output values, into the resultant storage (values of an accumulation matrix, i.e. a single input matrix, are initially stored into the result registers Rxx_1, which is then added to the outputs of the multiplication to form updated outputs that are stored into the result registers instead of the outputs of the multiplication, see Nam Fig. 1 for reference), and 
in the second mode (the mode in which the processing elements accumulate results in place), add values from a first input matrix of the at least one third input two-dimensional matrix initially stored in the third plurality of registers to the first output values to form updated first output values and add values from a second input matrix of the at least one third input two-dimensional matrix initially stored in the third plurality of registers to the second output values to form updated second output values and store the updated first output values and updated second output values, instead of the first output values and the second output values, into the resultant storage (values of an accumulation matrix are initially stored into the result registers Rxx_1, which is then added to the outputs of the multiplication, in the second mode these are first and second output values, i.e. first and second input matrices of the third input matrix, from the first and second sets of processing elements respectively, to form updated outputs that are stored into the result registers instead of the outputs of the multiplication, see Nam Fig. 1 for reference).
One of ordinary skill in the art would have been motivated to make this modification because using an initial accumulation matrix to add to a matrix multiply output is a known technique on the known device of a computer processor for accumulating previous results with current results and would yield the predictable result of supporting larger matrix multiplications and would also increase the usability of the systolic array. 

	Regarding claim 15, Nam in view of Barman and Grochowski teaches:
15. The method of claim 9, wherein the resultant storage is a third plurality of registers (Nam Fig. 1: registers Rxx_1) that represents a plurality of output two-dimensional matrices formed by execution of the decoded single instruction (Nam col 3 lines 11-32 and col 4 lines 26-40: registers Rxx_1 store the results of the matrix multiply formed by executing the matrix multiply instruction, the result registers represent the output matrix formed from executing B1 and the output matrix formed by executing B2, which are a plurality of output matrices which are formed by the execution of the matrix multiply instruction in the combination, see also Nam Fig. 3 for reference).

	Regarding claim 16, Nam in view of Barman and Grochowski teaches:
16. The method of claim 9, wherein the first proper subset of fused multiply accumulate circuits is one of a row or a column of the two-dimensional grid of fused multiply accumulate circuits and the second proper subset of fused multiply accumulate circuits is another of the one of the row or the column of the two-dimensional grid of fused multiply accumulate circuits (Nam Fig. 1: the subset PE12, PE22, and PE32 is a column of the grid of MACs and the subset PE13, PE23, and PE33 is another column of the grid of MACs).

	Regarding claim 17, Nam in view of Barman and Grochowski teaches:
17. A non-transitory machine readable medium that stores code that when executed by a machine causes the machine to perform a method comprising: 
a processor core (Fig. 8, 800) is coupled to a matrix operations accelerator circuit (Fig. 8, 850) comprising a two-dimensional grid of fused multiply accumulate circuits (col 6 lines 32-35: the systolic array is a two dimensional grid of processing elements; col 3 lines 10-17: the processing elements are fused multiply accumulate circuits, see also Fig. 1), the matrix operations accelerator circuit is coupled to a first plurality of registers that represents at least one first input two-dimensional matrix (col 3 lines 36-38: registers Rxx_2 represents column inputs from a weight matrix, i.e. a first input two-dimensional matrix, coupled to the systolic array) and a second plurality of registers that represents at least one second input two-dimensional matrix (col 3 lines 33-36 and col 4 lines 1-5: registers Rxx_3 represents row inputs from an input matrix, i.e. a second input two-dimensional matrix, coupled to the systolic array); and 
executing with an execution circuit (col 5 lines 22-25: controller 803) of the processor core to: switch the matrix operations accelerator circuit from a first mode to a second mode (col 5 lines 33-37: the processing system 800 may be switched between a first and second mode) where the respective output of each of the first proper subset of fused multiply accumulate circuits (Fig. 1, PE12, PE22, and PE32) of the two-dimensional grid form first output values from a first input matrix of the at least one first input two-dimensional matrix and a first input matrix of the at least one second input two-dimensional matrix (col 3 lines 11-21: the output values of processing elements PE12, PE22, and PE32 are formed from column inputs IN_C_2, i.e. a first input matrix of the at least one first input matrix, and from row inputs IN_R_1 to IN_R_3 in Rx1_3, i.e. a first input matrix of the at least one second input matrix), and store the first output values in the resultant storage (col 3 lines 22-32: the outputs of the processing elements from PE12, PE22, and PE32 are stored in output registers Rx2_1, i.e. the resultant storage), and a respective output of each of the second proper subset of fused multiply accumulate circuits (Fig. 1, PE13, PE23, and PE33) of the two-dimensional grid form second output values from a second input matrix of the at least one first input two-dimensional matrix and a second input matrix of the at least one second input two-dimensional matrix (col 3 lines 11-21: the output values of PE13, PE23, and PE33 are formed from column inputs IN_C_3, i.e. a second input matrix of the at least one first input matrix, and from row inputs IN_R_1 to IN_R_3 in Rx2_3, i.e. the second input matrix), and store the second output values in the resultant storage (col 3 lines 22-32: the outputs of the processing elements from PE13, PE23, and PE33 are stored in output registers Rx3_1, i.e. the resultant storage).
	Nam does not teach:
decoding, with a decoder, a single instruction into a decoded single instruction,
the single instruction includes a field that identifies a resultant storage
executing the decoded single instruction to: switch the matrix operations accelerator circuit from a first mode where a respective output of each of a first proper subset of fused multiply accumulate circuits of the two-dimensional grid is transmitted downstream to a respective input of each of a second proper subset of fused multiply accumulate circuits of the two-dimensional grid to form output values from a single input matrix of the at least one first input two-dimensional matrix and a single input matrix of the at least one second input two-dimensional matrix, and store the output values in the resultant storage,
	However, in the analogous art of matrix multiply using systolic arrays, Barman teaches:
a first mode where a respective output of each of a first proper subset of fused multiply accumulate circuits (Fig. 2: a12, a22, and a32) of the two-dimensional grid is transmitted downstream to a respective input of each of a second proper subset of fused multiply accumulate circuits (Fig. 2: a13, a23, and a33) of the two-dimensional grid to form output values from a single input matrix of the at least one first input two-dimensional matrix and a single input matrix of the at least one second input two-dimensional matrix (col 2 line 58-col 3 line 3: the processing cells a12, a22, and a32 perform MAC operations and transmit their result downstream to inputs of a13, a23, and a33 respectively, to form output values from a single first input matrix A and a single second input matrix B)
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the systolic array modes of Nam to include a mode for performing matrix multiply by transmitting partial results downstream as taught by Barman such that the output values of the processing elements are also stored to the registers of Nam. One of ordinary skill in the art would have been motivated to make this modification to reduce overhead associated with processing boundaries to obtain the results (Nam col 4 lines 26-37) when utilization of the processing elements is the same or better than accumulating the results in place while also supporting modes for increasing utilization of the processing elements by accumulating the results in place when the technique of transmitting the partial results downstream decreases utilization, see also Barman col 6 lines 51-58. 
	Further, in the analogous art of matrix multiply, Grochowski teaches:
decoding, with a decoder (Fig. 1, 108), a single instruction into a decoded single instruction ([0033]: the decode unit decodes the matrix multiplication instruction into a decoded instruction),
the single instruction includes a field that identifies a resultant storage ([0030]: the instruction includes a field for a destination operand that identifies a result storage);
executing a decoded single instruction to: indicate a first mode or a second mode ([0037]: the instruction has a field to indicate whether the instruction is to be performed with or without matrix accumulation, i.e. a first mode or a second mode)
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the system of Nam in view of Barman to include a core with a decoder and to modify the instruction of Nam to be a matrix multiplication instruction as taught by Grochowski such that the matrix multiplication instruction destination operand field identifies the result registers of Nam and such that the field that indicates the mode the instruction is to be performed in would cause the processing circuitry to switch to the indicated mode when executing the instruction. One of ordinary skill in the art would have been motivated to make this modification because using a core with a decoder to decode an instruction is a known technique on the known device of a computer processor for executing instructions and would yield the predictable result of enabling a system to execute a program of instructions. Further, one of ordinary skill in the art would have been motivated to make this modification because supporting an instruction that identifies a result storage is a known technique on the known device of a computer processor for enabling operations and the location of the results to be specified which would yield the predictable result of increasing control over processing resources. 

	Regarding claim 18, Nam in view of Barman and Grochowski teaches:
18. The non-transitory machine readable medium of claim 17, wherein an instruction comprises a second field indicating that the matrix operations accelerator circuit is to execute in the first mode when the second field is a first value and in the second mode when the second field is a second value (Nam col 5 lines 33-46: an instruction from a host includes a field that sets a value in the mode switch register to indicate the system is to operate in a first mode when the field is a first value and a second mode when the field is a second value).
	Nam in view of Barman and Grochowski, as currently mapped, does not teach the matrix multiplication instruction including the second field indicating the first or second mode. That is, Nam in view of Barman and Grochowski, as currently mapped, does not teach:
the single instruction comprises a second field indicating the matrix operations accelerator circuit is to execute in the first mode when the second field is a first value and in the second mode when the second field is a second value
	However, Grochowski further teaches:
the single instruction comprises a second field indicating the matrix operations accelerator circuit is to execute in the first mode when the second field is a first value and in the second mode when the second field is a second value ([0037]: the matrix multiplication instruction includes a field to indicate whether the instruction should be performed with or without accumulation, i.e. in a first or second mode)
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the matrix multiplication instruction of Nam in view of Barman and Grochowski to include a field that indicates the mode the instruction should be performed in as further taught by Grochowski. One of ordinary skill in the art would have been motivated to make this modification because including an instruction field to indicate a mode for an instruction is a known technique on the known device of a computer processor for specifying a mode for the instruction and would yield the predictable result of efficiently controlling the mode in which the instruction is executed. 

	Regarding claim 19, Nam in view of Barman and Grochowski teaches:
19. The non-transitory machine readable medium of claim 18, wherein the second field is an immediate of the single instruction (Grochowski [0037]: the instruction field indicates the mode using bits in the instruction, i.e. the field is an immediate of the instruction).

	Regarding claim 20, Nam in view of Barman and Grochowski teaches:
20. The non-transitory machine readable medium of claim 17, wherein the resultant storage is a third plurality of registers (Nam Fig. 1: registers Rxx_1) that represents at least one output two-dimensional matrix formed by execution of the decoded single instruction (Nam col 3 lines 11-32: registers Rxx_1 store the results of the matrix multiply formed by executing the matrix multiply instruction which represents a result matrix).

	Regarding claim 21, Nam in view of Barman and Grochowski teaches:
21. The non-transitory machine readable medium of claim 20, 
	Nam in view of Barman and Grochowski, as currently mapped, does not teach:
wherein the executing the decoded single instruction is to: in the first mode, add values from a single input matrix of the third plurality of registers that represents at least one third input two-dimensional matrix initially stored in the third plurality of registers to the output values to form updated output values and store the updated output values, instead of the output values, into the resultant storage, and 
in the second mode, add values from a first input matrix of the at least one third input two-dimensional matrix initially stored in the third plurality of registers to the first output values to form updated first output values and add values from a second input matrix of the at least one third input two-dimensional matrix initially stored in the third plurality of registers to the second output values to form updated second output values and store the updated first output values and updated second output values, instead of the first output values and the second output values, into the resultant storage.
	However, Grochowski further teaches:
wherein the execution of the decoded single instruction is to: add values that represents at least one third input two-dimensional matrix initially to the output values to form updated output values and store the updated output values, instead of the output values, into the resultant storage ([0035]-[0036]: an accumulation matrix is added to the output of the multiplication of A and B to form an updated output which is then stored to the storage location that the accumulation matrix was first stored in).
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the systolic array of Nam to store an accumulation matrix in its result storage, to add it to the matrix multiply outputs, and to store the updated output back to the same storage that the accumulation matrix was in, as taught by Grochowski. This combination would teach:
wherein the execution of the decoded single instruction (matrix multiplication instruction) is to: in the first mode (the mode in which partial results are transmitted to a next processing element in the systolic array), add values from a single input matrix of the third plurality of registers that represents at least one third input two-dimensional matrix initially stored in the third plurality of registers to the output values to form updated output values and store the updated output values, instead of the output values, into the resultant storage (values of an accumulation matrix, i.e. a single input matrix, are initially stored into the result registers Rxx_1, which is then added to the outputs of the multiplication to form updated outputs that are stored into the result registers instead of the outputs of the multiplication, see Nam Fig. 1 for reference), and 
in the second mode (the mode in which the processing elements accumulate results in place), add values from a first input matrix of the at least one third input two-dimensional matrix initially stored in the third plurality of registers to the first output values to form updated first output values and add values from a second input matrix of the at least one third input two-dimensional matrix initially stored in the third plurality of registers to the second output values to form updated second output values and store the updated first output values and updated second output values, instead of the first output values and the second output values, into the resultant storage (values of an accumulation matrix are initially stored into the result registers Rxx_1, which is then added to the outputs of the multiplication, in the second mode these are first and second output values, i.e. first and second input matrices of the third input matrix, from the first and second sets of processing elements respectively, to form updated outputs that are stored into the result registers instead of the outputs of the multiplication, see Nam Fig. 1 for reference).
One of ordinary skill in the art would have been motivated to make this modification because using an initial accumulation matrix to add to a matrix multiply output is a known technique on the known device of a computer processor for accumulating previous results with current results and would yield the predictable result of supporting larger matrix multiplications and would also increase the usability of the systolic array.

Regarding claim 22, Nam in view of Barman and Grochowski teaches:
22. The non-transitory machine readable medium of claim 17, 
Nam in view of Barman and Grochowski, as currently mapped, does not teach:
wherein the executing the decoded single instruction is to: in the first mode, add values from a single input matrix of at least one third input two-dimensional matrix initially stored in a third plurality of registers to the output values to form updated output values and store the updated output values, instead of the output values, into the resultant storage, and 
in the second mode, add values from a first input matrix of the at least one third input two-dimensional matrix initially stored in the third plurality of registers to the first output values to form updated first output values and add values from a second input matrix of the at least one third input two-dimensional matrix initially stored in the third plurality of registers to the second output values to form updated second output values and store the updated first output values and updated second output values, instead of the first output values and the second output values, into the resultant storage.
	However, Grochowski further teaches:
add values from at least one third input two-dimensional matrix to the output values to form updated output values and store the updated output values, instead of the output values, into the resultant storage ([0035]-[0036]: an accumulation matrix is added to the output of the multiplication of A and B to form an updated output which is then stored to the storage location that the accumulation matrix was first stored in).
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the systolic array of Nam to store an accumulation matrix in its result storage, to add it to the matrix multiply outputs, and to store the updated output back to the same storage that the accumulation matrix was in, as taught by Grochowski. This combination would teach:
wherein the execution of the decoded single instruction (matrix multiplication instruction) is to: in the first mode (the mode in which partial results are transmitted to a next processing element in the systolic array), add values from a single input matrix of at least one third input two-dimensional matrix initially stored in a third plurality of registers to the output values to form updated output values and store the updated output values, instead of the output values, into the resultant storage (values of an accumulation matrix, i.e. a single input matrix, are initially stored into the result registers Rxx_1, which is then added to the outputs of the multiplication to form updated outputs that are stored into the result registers instead of the outputs of the multiplication, see Nam Fig. 1 for reference), and 
in the second mode (the mode in which the processing elements accumulate results in place), add values from a first input matrix of the at least one third input two-dimensional matrix initially stored in the third plurality of registers to the first output values to form updated first output values and add values from a second input matrix of the at least one third input two-dimensional matrix initially stored in the third plurality of registers to the second output values to form updated second output values and store the updated first output values and updated second output values, instead of the first output values and the second output values, into the resultant storage (values of an accumulation matrix are initially stored into the result registers Rxx_1, which is then added to the outputs of the multiplication, in the second mode these are first and second output values, i.e. first and second input matrices of the third input matrix, from the first and second sets of processing elements respectively, to form updated outputs that are stored into the result registers instead of the outputs of the multiplication, see Nam Fig. 1 for reference).
One of ordinary skill in the art would have been motivated to make this modification because using an initial accumulation matrix to add to a matrix multiply output is a known technique on the known device of a computer processor for accumulating previous results with current results and would yield the predictable result of supporting larger matrix multiplications and would also increase the usability of the systolic array.
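For illustration only, the accumulation relied upon above can be sketched in software as follows. The function name `matmul_accumulate` and the list-of-lists representation are hypothetical and are not taken from Nam, Barman, or Grochowski. The sketch models only the arithmetic (an accumulation matrix C initially held in the result storage is added to the product of A and B, and the updated values replace C in that same storage); it does not model the systolic array hardware or its two modes.

```python
def matmul_accumulate(a, b, c):
    """Add the product a x b to c, writing the updated values back into c.

    a is an M x K matrix, b is a K x N matrix, and c is an M x N matrix,
    each given as a list of row lists. c plays the role of the result
    storage that initially holds the accumulation matrix (an assumption
    made for this sketch only).
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    for i in range(rows):
        for j in range(cols):
            acc = c[i][j]  # value initially stored in the result storage
            for k in range(inner):
                acc += a[i][k] * b[k][j]  # multiply-accumulate step
            c[i][j] = acc  # updated output replaces the stored value
    return c


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
result = [[1, 1], [1, 1]]  # accumulation matrix initially in result storage
matmul_accumulate(a, b, result)
print(result)  # [[20, 23], [44, 51]]
```

Because the updated values overwrite the accumulation matrix, calling the function repeatedly with successive blocks of A and B accumulates partial products, which is the sense in which accumulation supports larger matrix multiplications.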

	Regarding claim 23, Nam in view of Barman and Grochowski teaches:
23. The non-transitory machine readable medium of claim 17, wherein the resultant storage is a third plurality of registers (Nam Fig. 1: registers Rxx_1) that represents a plurality of output two-dimensional matrices formed by execution of the decoded single instruction (Nam col 3 lines 11-32 and col 4 lines 26-40: registers Rxx_1 store the results of the matrix multiply formed by executing the matrix multiply instruction, the result registers represent the output matrix formed from executing B1 and the output matrix formed by executing B2, which are a plurality of output matrices which are formed by the execution of the matrix multiply instruction in the combination, see also Nam Fig. 3 for reference).

	Regarding claim 24, Nam in view of Barman and Grochowski teaches:
24. The non-transitory machine readable medium of claim 17, wherein the first proper subset of fused multiply accumulate circuits is one of a row or a column of the two-dimensional grid of fused multiply accumulate circuits and the second proper subset of fused multiply accumulate circuits is another of the one of the row or the column of the two-dimensional grid of fused multiply accumulate circuits (Nam Fig. 1: the subset PE12, PE22, and PE32 is a column of the grid of MACs and the subset PE13, PE23, and PE33 is another column of the grid of MACs).

Conclusion
	
Any inquiry concerning this communication or earlier communications from the examiner should be directed to KASIM ALLI whose telephone number is (571)270-1476. The examiner can normally be reached Monday - Friday, 9am - 5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jyoti Mehta, can be reached on (571) 270-3995. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/KASIM ALLI/Examiner, Art Unit 2183
/JYOTI MEHTA/Supervisory Patent Examiner, Art Unit 2182