DETAILED ACTION
Status of Claims 
Claims 1-20 have been considered. It is hereby acknowledged that the following papers have been received and placed of record in the file:
Applicant Remarks 						-Receipt Date 05/19/2022
Amended Claims 						-Receipt Date 05/19/2022

Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 05/19/2022 has been entered.
 
Response to Amendment
This office action is in response to the amendment filed on 05/19/2022. Claims 1-20 are pending. Claims 1, 4, 8, 11, 15, and 18 are amended.

Response to Arguments
Applicant’s arguments, see Remarks page 11, filed 05/19/2022 have been fully considered and are persuasive.  Therefore, the rejection has been withdrawn.  However, upon further consideration, a new ground(s) of rejection is made over Hansen et al. US 10,120,649 in view of Boswell et al. US 2018/0321938.

Claim Objections
Claims 2, 5, 9, 12, 16, and 19 are objected to because of the following informalities:  
Claims 2, 5, 9, 12, 16, and 19: “a given column” and “a given row” should be “the given column” and “the given row” since both terms are already introduced in the independent claims.
Appropriate correction is required.

Claim Rejections - 35 USC § 112
Claims 1-20 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA ), first paragraph, as failing to comply with the written description requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA  35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention. 
Claim 1 recites: 
an outer product operation that generates a matrix output comprising two or more dimensions based on the given column and the given row; and
perform, using the plurality of dot product units, the outer product operation using values of the first matrix and the second matrix fetched only once from the first vector register file.
	While the specification at [0019] describes that each dot product unit performs an outer product operation, the specification does not describe using a plurality of dot product units to perform an outer product operation that generates a matrix output based on a given column and a given row, as recited in claim 1. 
	Claims 8 and 15 recite similar limitations and are rejected for similar reasons. 
	The dependent claims are rejected based on their dependence from rejected base claims.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2, 5, 8-9, and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Hansen et al. US 10,120,649 in view of Boswell et al. US 2018/0321938.
Regarding claim 1, Hansen teaches:
1. A system comprising: 
a first vector register file (col 5 lines 15-16 and col 4 lines 29-36: the registers of the register file hold vectors, indicating that the register file is a vector register file); and 
a first execution pipeline coupled to the first vector register file (col 4 lines 25-36: the plurality of multipliers and accumulators together form a first pipeline which is coupled to registers 11 and 12 of the register file, see also Fig. 1), wherein the first execution pipeline comprises a plurality of dot product units (col 4 lines 25-30 and lines 44-52: the plurality of multiplier accumulators are a plurality of dot product units since they each perform a sum of products, see also col 4 lines 18-19); and 
wherein to perform a matrix multiplication operation on a first matrix comprising two or more dimensions and a second matrix comprising two or more dimensions (col 16 lines 23-27: Fig. 10 shows the array of Fig. 1 used for a matrix multiplication on two 4x4 matrices), the system is configured to: 
fetch a given column of the first matrix and a given row of the second matrix until each element of a resulting matrix is updated by an outer product operation that generates a matrix output comprising two or more dimensions based on the given column and the given row (col 4 lines 26-39 and col 16 lines 23-40: a column of matrix A and a row of matrix B, see Fig. 10, are fetched into the registers 11 and 12 until each element of the result matrix, see Fig. 10 r0,0-r3,3 is updated by an outer product operation that generates the two dimensional result matrix as output based on the column of A and row of B); and 
perform, using the plurality of dot product units, the outer product operation using values of the first matrix and the second matrix (col 4 lines 26-49 and col 16 lines 23-40: the outer product operation is performed using the multipliers and accumulators, i.e. using the dot product units, using the values of matrix A and matrix B as input).
	Hansen does not teach:
fetching a column and row from the first vector register file only once until each element of a result matrix is updated; and
performing the outer product operation using values fetched only once from the first vector register file.
	However, Boswell teaches fetching row and column operands from a vector register file into an operand collector only once ([0026]: the vector operands are read from the register file into operand collectors once before executing a matrix multiply operation, see also [0109]-[0110] describing fetching into the operand collectors only once until the result matrix is calculated).
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the multiplier array of Hansen to include a plurality of operand collectors and to fetch operands from the register file into the operand collectors only once as taught by Boswell. One of ordinary skill in the art would have been motivated to make this modification to reduce the bandwidth between the register file and the inputs to the data path/multiplier array (Boswell [0089]).
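	For illustration only, the outer product formulation of matrix multiplication relied upon in the mapping of claim 1 may be sketched as follows. The sketch is not taken from Hansen, Boswell, or the instant application; the function name `matmul_outer` and the `fetches` counter are hypothetical and are introduced solely to show that each column of the first matrix and each row of the second matrix need be read only once while every element of the result matrix is updated.

```python
def matmul_outer(a, b):
    """Multiply a (m x k) by b (k x n) as a sum of k outer products.

    Hypothetical illustration: 'fetches' counts how many times a
    column/row pair is read from the operand storage.
    """
    m, k, n = len(a), len(b), len(b[0])
    result = [[0] * n for _ in range(m)]
    fetches = 0
    for p in range(k):
        # Read column p of the first matrix and row p of the second
        # matrix a single time and hold them locally.
        column = [a[i][p] for i in range(m)]
        row = b[p]
        fetches += 1
        # The outer product of the column and the row updates every
        # element of the result matrix.
        for i in range(m):
            for j in range(n):
                result[i][j] += column[i] * row[j]
    return result, fetches


product, fetches = matmul_outer([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

	In this sketch the number of column/row reads equals the shared dimension of the two matrices, regardless of the size of the result matrix.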

	Regarding claim 2, Hansen in view of Boswell teaches:
2. The system as recited in claim 1, 
	Hansen in view of Boswell, as currently mapped, does not teach:
wherein the first execution pipeline is further configured to: 
calculate, in a first cycle, a first portion of an intermediate matrix that is an outer product of a given column of the first matrix and a given row of the second matrix by using a first portion of the given column and an entirety of the given row; and 
calculate, in a second cycle after the first cycle, a second portion different from the first portion of the intermediate matrix by using a second portion different from the first portion of the given column and the entirety of the given row.
	However, Hansen further teaches:
calculate, in a first cycle, a first portion of an intermediate matrix that is an outer product of a given column of the first matrix and a given row of the second matrix by using a first portion of the given column and the given row (Hansen col 10 lines 10-15: the multiplier and multiplicand values may each be stored in multiple registers and a series of outer product multiplications may be performed each using one of the multiplier and multiplicand registers, indicating that there is a first cycle in which first outer product multiplication is calculated using a first portion of the multiplier and multiplicand stored in first registers, i.e. a first portion of a column and row, where the result is a first portion of an intermediate matrix); and 
calculate, in a second cycle after the first cycle, a second portion different from the first portion of the intermediate matrix by using a second portion different from the first portion of the given column and the given row (Hansen col 10 lines 10-15: since a series of outer product multiplications are performed, there is a second cycle after the first cycle in which a second outer product multiplication is calculated using a second portion of the multiplier and multiplicand stored in second registers, where the result is another portion of the intermediate matrix).
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Hansen in view of Boswell to perform an outer product multiplication over multiple cycles when the operands are too large for the registers as further taught by Hansen. In this combination, when a multiplicand/column is too large for a register and a multiplier/row fits into a register, a first portion of an intermediate matrix of an outer product multiplication is calculated in a first cycle using a first portion of the multiplicand/column and the entire multiplier/row, and a second portion of the intermediate matrix is calculated in a second cycle using a second portion of the multiplicand/column and the entire multiplier/row. One of ordinary skill in the art would have been motivated to make this modification to increase flexibility of the multiplier array by supporting larger inputs and by allowing the inputs to differ in size.
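	For illustration only, the multi-cycle calculation described in the combination above may be sketched as follows. The sketch is an assumption-laden model and is not taken from Hansen or the instant application; the function name `outer_in_portions` and the parameter `portion_size` are hypothetical, and each loop iteration merely stands in for one cycle.

```python
def outer_in_portions(column, row, portion_size):
    """Build the outer product of column and row one portion at a time.

    Hypothetical illustration: each loop iteration models one cycle that
    uses one portion of the column together with the entire row.
    """
    intermediate = []
    cycles = 0
    for start in range(0, len(column), portion_size):
        portion = column[start:start + portion_size]
        # Each element of the column portion yields one full row of the
        # intermediate matrix.
        intermediate.extend([[c * r for r in row] for c in portion])
        cycles += 1
    return intermediate, cycles


matrix, cycles = outer_in_portions([1, 2, 3, 4], [10, 20], portion_size=2)
```

	Here the first cycle produces the first two rows of the intermediate matrix and the second cycle produces the remaining two rows, each cycle using the entire row operand.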

	Regarding claim 5, Hansen in view of Boswell teaches:
5. The system as recited in claim 1, wherein the first execution pipeline is further configured to: 
fetch, in a first cycle, a given column and a given row (Hansen col 4 lines 29-33: a multiplier and multiplicand, i.e. a given row and column, are fetched) from the first vector register file to a plurality of storage elements (Boswell [0026]: the operands are fetched from the register file into operand collectors, i.e. a plurality of storage elements); and 
store, in the plurality of storage elements, the given column and the given row for reuse by the plurality of dot product units for a plurality of cycles after the first cycle until each element is calculated of an intermediate matrix of the outer product operation (Hansen col 10 lines 10-15: a multiplicand/column and multiplier/row may each be stored in multiple registers and operated on over a plurality of cycles to perform an outer product; in the combination with Boswell [0026] the row and column are loaded into the operand collectors before the execution of the matrix multiply operation such that the row and column are reused to perform the outer product multiplications over multiple cycles).

	Regarding claim 8, Hansen teaches:
8. A method comprising: 
performing, by a first execution pipeline (col 4 lines 25-36: the plurality of multipliers and accumulators together form a first pipeline which is coupled to registers 11 and 12 of the register file, see also Fig. 1), a matrix multiplication operation on a first matrix comprising two or more dimensions and a second matrix comprising two or more dimensions (col 16 lines 23-27: Fig. 10 shows the array of Fig. 1 used for a matrix multiplication on two 4x4 matrices) by: 
fetching a given column of the first matrix and a given row of the second matrix until each element of a resulting matrix is updated by an outer product operation that generates a matrix output comprising two or more dimensions based on the given column and the given row (col 4 lines 26-39 and col 16 lines 23-40: a column of matrix A and a row of matrix B, see Fig. 10, are fetched into the registers 11 and 12 until each element of the result matrix, see Fig. 10 r0,0-r3,3 is updated by an outer product operation that generates the two dimensional result matrix as output based on the column of A and row of B); and 
performing, using a plurality of dot product units, the outer product operation using values of the first matrix and the second matrix (col 4 lines 26-49 and col 16 lines 23-40: the outer product operation is performed using the multipliers and accumulators, i.e. using the dot product units, using the values of matrix A and matrix B as input).
	Hansen does not teach:
		a plurality of pipelines;
fetching a column and row from the first vector register file only once until each element of a result matrix is updated; and
performing the outer product operation using values fetched only once from the first vector register file.
	However, Boswell teaches:
		a plurality of pipelines (Fig. 9 HMMA data path and FP64 data path);
fetching row and column operands from a vector register file into an operand collector only once ([0026]: the vector operands are read from the register file into operand collectors once before executing a matrix multiply operation, see also [0109]-[0110] describing fetching into the operand collectors only once until the result matrix is calculated).
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the multiplier array of Hansen to include a plurality of operand collectors and to fetch operands from the register file into the operand collectors only once as taught by Boswell and to modify the processor of Hansen to include an FP64 data path as taught by Boswell. One of ordinary skill in the art would have been motivated to include the operand collectors to reduce the bandwidth between the register file and the inputs to the data path/multiplier array (Boswell [0089]). Further, one of ordinary skill in the art would have been motivated to include the FP64 data path because adding execution pipelines to a processor is a known technique on the known device of a computer processor for increasing processing resources and would yield the predictable result of increasing processing capabilities, such as adding the ability to perform FMA operations (Boswell [0100]).

	Regarding claim 9, Hansen in view of Boswell teaches:
9. The method as recited in claim 8, further comprising: 
	Hansen in view of Boswell, as currently mapped, does not teach:
calculating, in a first cycle, a first portion of an intermediate matrix that is an outer product of a given column of the first matrix and a given row of the second matrix by using a first portion of the given column and an entirety of the given row; and 
calculating, in a second cycle after the first cycle, a second portion different from the first portion of the intermediate matrix by using a second portion different from the first portion of the given column and the entirety of the given row.
	However, Hansen further teaches:
calculating, in a first cycle, a first portion of an intermediate matrix that is an outer product of a given column of the first matrix and a given row of the second matrix by using a first portion of the given column and the given row (Hansen col 10 lines 10-15: the multiplier and multiplicand values may each be stored in multiple registers and a series of outer product multiplications may be performed each using one of the multiplier and multiplicand registers, indicating that there is a first cycle in which first outer product multiplication is calculated using a first portion of the multiplier and multiplicand stored in first registers, i.e. a first portion of a column and row, where the result is a first portion of an intermediate matrix); and 
calculating, in a second cycle after the first cycle, a second portion different from the first portion of the intermediate matrix by using a second portion different from the first portion of the given column and the given row (Hansen col 10 lines 10-15: since a series of outer product multiplications are performed, there is a second cycle after the first cycle in which a second outer product multiplication is calculated using a second portion of the multiplier and multiplicand stored in second registers, where the result is another portion of the intermediate matrix).
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Hansen in view of Boswell to perform an outer product multiplication over multiple cycles when the operands are too large for the registers as further taught by Hansen. In this combination, when a multiplicand/column is too large for a register and a multiplier/row fits into a register, a first portion of an intermediate matrix of an outer product multiplication is calculated in a first cycle using a first portion of the multiplicand/column and the entire multiplier/row, and a second portion of the intermediate matrix is calculated in a second cycle using a second portion of the multiplicand/column and the entire multiplier/row. One of ordinary skill in the art would have been motivated to make this modification to increase flexibility of the multiplier array by supporting larger inputs and by allowing the inputs to differ in size.

	Regarding claim 12, Hansen in view of Boswell teaches:
12. The method as recited in claim 8, further comprising: 
fetching, in a first cycle, a given column and a given row (Hansen col 4 lines 29-33: a multiplier and multiplicand, i.e. a given row and column, are fetched) from the first vector register file to a plurality of storage elements (Boswell [0026]: the operands are fetched from the register file into operand collectors, i.e. a plurality of storage elements); and 
storing, in the plurality of storage elements, the given column and the given row for reuse by the plurality of dot product units for a plurality of cycles after the first cycle until each element is calculated of an intermediate matrix of the outer product operation (Hansen col 10 lines 10-15: a multiplicand/column and multiplier/row may each be stored in multiple registers and operated on over a plurality of cycles to perform an outer product; in the combination with Boswell [0026] the row and column are loaded into the operand collectors before the execution of the matrix multiply operation such that the row and column are reused to perform the outer product multiplications over multiple cycles).


Claims 3-4, 10-11, and 15-19 are rejected under 35 U.S.C. 103 as being unpatentable over Hansen et al. US 10,120,649 in view of Boswell et al. US 2018/0321938 and Schulte et al. US 2005/0071413.
	Regarding claim 3, Hansen in view of Boswell teaches:
3. The system as recited in claim 1, wherein the system is configured to read a plurality of accumulation inputs and provide the plurality of accumulation inputs to the first execution pipeline (Hansen col 4 lines 50-52: the multiplier-accumulators read accumulation inputs and provide them to the multiplier-accumulators of the array).
	Hansen in view of Boswell does not teach:
wherein the system further comprises a second vector register file, wherein the system is configured to read a plurality of accumulation inputs from the second vector register file 
	However, Schulte teaches: 
wherein the system further comprises a second register file, wherein the system is configured to read a plurality of accumulation inputs from the second register file ([0033]-[0034]: an accumulator register file is used in addition to a vector register file to store intermediate accumulator values for input into a next iteration)
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the processor of Hansen in view of Boswell to include an accumulator register file as taught by Schulte for storing intermediate results such that the combination would include a second vector register file for storing intermediate vector results. One of ordinary skill in the art would have been motivated to make this modification because storing intermediate accumulator values in a register file is useful when several dot products are computed simultaneously (Schulte [0034]), see also Hansen col 4 lines 18-19 describing calculating sum of products/dot products.

	Regarding claim 4, Hansen in view of Boswell and Schulte teaches:
4. The system as recited in claim 3, wherein each dot product unit is further configured to write an output to the second vector register file, wherein an output of a previous dot product operation is an accumulation input which is added to a sum for a current dot product operation (Hansen col 4 lines 46-52 and Schulte [0068]: the output of the multiplier, i.e. a previous dot product operation, is an accumulator input that is added to the sum in the accumulator).
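	For illustration only, the accumulation path addressed in the mappings of claims 3 and 4 may be sketched as follows. The sketch is not taken from Hansen, Boswell, or Schulte; the function name `accumulate_dot_products` and the list `acc_file`, which stands in for a second (accumulator) register file, are hypothetical.

```python
def accumulate_dot_products(vector_pairs, acc_file, acc_index):
    """Accumulate a series of dot products through an accumulator file.

    Hypothetical illustration: 'acc_file' models a second register file
    that supplies the accumulation input and receives the output.
    """
    for x, y in vector_pairs:
        # Read the accumulation input left by the previous operation.
        acc_in = acc_file[acc_index]
        # Add the current dot product to the accumulation input and
        # write the output back for the next operation.
        acc_file[acc_index] = acc_in + sum(a * b for a, b in zip(x, y))
    return acc_file[acc_index]


acc_file = [0, 0]
total = accumulate_dot_products(
    [([1, 2], [3, 4]), ([5, 6], [7, 8])], acc_file, acc_index=0
)
```

	In this sketch the output of each dot product operation becomes the accumulation input of the next, and only the addressed entry of the accumulator storage is modified.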

	Regarding claim 10, Hansen in view of Boswell teaches:
10. The method as recited in claim 8, further comprising reading a plurality of accumulation inputs and provide the plurality of accumulation inputs to the first execution pipeline (Hansen col 4 lines 50-52: the multiplier-accumulators read accumulation inputs and provide them to the multiplier-accumulators of the array).
Hansen in view of Boswell does not teach:
reading the plurality of accumulation inputs from the second vector register file 
	However, Schulte teaches: 
reading a plurality of accumulation inputs from the second vector register file ([0033]-[0034]: an accumulator register file is used in addition to a vector register file to store intermediate accumulator values for input into a next iteration)
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the processor of Hansen in view of Boswell to include an accumulator register file as taught by Schulte for storing intermediate results. One of ordinary skill in the art would have been motivated to make this modification because storing intermediate accumulator values in a register file is useful when several dot products are computed simultaneously (Schulte [0034]), see also Hansen col 4 lines 18-19 describing calculating sum of products/dot products.

	Regarding claim 11, Hansen in view of Boswell and Schulte teaches: 
11. The method as recited in claim 10, further comprising writing an output to the second vector register file, wherein an output of a previous dot product operation is an accumulation input which is added to a sum for a current dot product operation (Hansen col 4 lines 46-52 and Schulte [0068]: the output of the multiplier, i.e. a previous dot product operation, is an accumulator input that is added to the sum in the accumulator).

Regarding claim 15, Hansen teaches:
15. An apparatus comprising: 
a vector register file (col 5 lines 15-16 and col 4 lines 29-36: the registers of the register file hold vectors, indicating that the register file is a vector register file);
an execution pipeline coupled to the vector register file (col 4 lines 25-36: the plurality of multipliers and accumulators together form a first pipeline which is coupled to registers 11 and 12 of the register file, see also Fig. 1); and 
a plurality of dot product units in a first execution pipeline of the plurality of execution pipelines (col 4 lines 25-30 and lines 44-52: the plurality of multiplier accumulators are a plurality of dot product units since they each perform a sum of products, see also col 4 lines 18-19); and 
wherein to perform a matrix multiplication operation on a first matrix comprising two or more dimensions and a second matrix comprising two or more dimensions (col 16 lines 23-27: Fig. 10 shows the array of Fig. 1 used for a matrix multiplication on two 4x4 matrices), the apparatus is configured to: 
fetch a given column of the first matrix and a given row of the second matrix only once until each element of a resulting matrix is updated by of an outer product that generates a matrix output comprising two or more dimensions based on the given column and the given row (col 4 lines 26-39 and col 16 lines 23-40: a column of matrix A and a row of matrix B, see Fig. 10, are fetched into the registers 11 and 12 until each element of the result matrix, see Fig. 10 r0,0-r3,3 is updated by an outer product operation that generates the two dimensional result matrix as output based on the column of A and row of B); and 
perform, using the plurality of dot product units, the outer product operation using values of the first matrix and the second matrix fetched only once from the first vector register file (col 4 lines 26-49 and col 16 lines 23-40: the outer product operation is performed using the multipliers and accumulators, i.e. using the dot product units, using the values of matrix A and matrix B as input).
	Hansen does not teach:
a plurality of vector register files; 
a plurality of execution pipelines coupled to the plurality of vector register files; and 
fetching a column and row from the first vector register file only once until each element of a result matrix is updated; and
performing the outer product operation using values fetched only once from the first vector register file.
	However, Boswell teaches:
a plurality of execution pipelines coupled to a vector register file ([0100] and Fig. 9: HMMA data path and FP64 data path are coupled to the register file 910, see also [0026] describing loading vectors from the register);
fetching row and column operands from a vector register file into an operand collector only once ([0026]: the vector operands are read from the register file into operand collectors once before executing a matrix multiply operation, see also [0109]-[0110] describing fetching into the operand collectors only once until the result matrix is calculated).
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the multiplier array of Hansen to include a plurality of operand collectors and to fetch operands from the register file into the operand collectors only once as taught by Boswell and to modify the processor of Hansen to include an FP64 data path as taught by Boswell. One of ordinary skill in the art would have been motivated to include the operand collectors to reduce the bandwidth between the register file and the inputs to the data path/multiplier array (Boswell [0089]). Further, one of ordinary skill in the art would have been motivated to include the FP64 data path because adding execution pipelines to a processor is a known technique on the known device of a computer processor for increasing processing resources and would yield the predictable result of increasing processing capabilities, such as adding the ability to perform FMA operations (Boswell [0100]).
	Further, Schulte teaches an accumulator register file ([0033]-[0034]: an accumulator register file is used in addition to a vector register file to store intermediate accumulator values for input into a next iteration).
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the processor of Hansen in view of Boswell to include an accumulator register file as taught by Schulte for storing intermediate results such that the combination would include a plurality of vector register files including a vector register file for storing intermediate vector results. One of ordinary skill in the art would have been motivated to make this modification because storing intermediate accumulator values in a register file is useful when several dot products are computed simultaneously (Schulte [0034]), see also Hansen col 4 lines 18-19 describing calculating sum of products/dot products.

	Regarding claim 16, Hansen in view of Boswell and Schulte teaches:
16. The apparatus as recited in claim 15, 
	Hansen in view of Boswell and Schulte, as currently mapped, does not teach:
wherein the apparatus is further configured to: 
calculate, in a first cycle, a first portion of an intermediate matrix that is an outer product of a given column of the first matrix and a given row of the second matrix by using a first portion of the given column and an entirety of the given row; and 
calculate, in a second cycle after the first cycle, a second portion different from the first portion of the intermediate matrix by using a second portion different from the first portion of the given column and the entirety of the given row.
However, Hansen further teaches:
calculate, in a first cycle, a first portion of an intermediate matrix that is an outer product of a given column of the first matrix and a given row of the second matrix by using a first portion of the given column and the given row (Hansen col 10 lines 10-15: the multiplier and multiplicand values may each be stored in multiple registers and a series of outer product multiplications may be performed each using one of the multiplier and multiplicand registers, indicating that there is a first cycle in which first outer product multiplication is calculated using a first portion of the multiplier and multiplicand stored in first registers, i.e. a first portion of a column and row, where the result is a first portion of an intermediate matrix); and 
calculate, in a second cycle after the first cycle, a second portion different from the first portion of the intermediate matrix by using a second portion different from the first portion of the given column and the given row (Hansen col 10 lines 10-15: since a series of outer product multiplications are performed, there is a second cycle after the first cycle in which a second outer product multiplication is calculated using a second portion of the multiplier and multiplicand stored in second registers, where the result is another portion of the intermediate matrix).
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Hansen in view of Boswell and Schulte to perform an outer product multiplication over multiple cycles when the operands are too large for the registers as further taught by Hansen. In this combination, when a multiplicand/column is too large for a register and a multiplier/row fits into a register, a first portion of an intermediate matrix of an outer product multiplication is calculated in a first cycle using a first portion of the multiplicand/column and the entire multiplier/row, and a second portion of the intermediate matrix is calculated in a second cycle using a second portion of the multiplicand/column and the entire multiplier/row. One of ordinary skill in the art would have been motivated to make this modification to increase flexibility of the multiplier array by supporting larger inputs and by allowing the inputs to differ in size.

	Regarding claim 17, Hansen in view of Boswell and Schulte teaches: 
17. The apparatus as recited in claim 16, wherein the apparatus is configured to read a plurality of accumulation inputs from a second vector register file and provide the plurality of accumulation inputs to the first execution pipeline (Hansen col 4 lines 50-52 and Schulte [0033]-[0034]: the multiplier-accumulators read accumulation inputs from the accumulator vector register file in the combination and provide them to the multiplier-accumulators of the array).

	Regarding claim 18, Hansen in view of Boswell and Schulte teaches:
18. The apparatus as recited in claim 17, wherein the apparatus is configured to write an output to the second vector register file, wherein an output of a previous dot product operation is an accumulation input which is added to a sum for a current dot product operation (Hansen col 4 lines 46-52 and Schulte [0068]: the output of the multiplier, i.e. a previous dot product operation, is an accumulator input that is added to the sum in the accumulator).

	Regarding claim 19, Hansen in view of Boswell and Schulte teaches:
19. The apparatus as recited in claim 15, wherein the apparatus is further configured to: 
fetch, in a first cycle, a given column and a given row (Hansen col 4 lines 29-33: a multiplier and multiplicand, i.e. a given row and column, are fetched) from the first vector register file to a plurality of storage elements (Boswell [0026]: the operands are fetched from the register file into operand collectors, i.e. a plurality of storage elements); and 
store, in the plurality of storage elements, the given column and the given row for reuse by the plurality of dot product units for a plurality of cycles after the first cycle until each element is calculated of an intermediate matrix of the outer product operation (Hansen col 10 lines 10-15: a multiplicand/column and multiplier/row may each be stored in multiple registers and operated on over a plurality of cycles to perform an outer product; in the combination with Boswell [0026] the row and column are loaded into the operand collectors before the execution of the matrix multiply operation such that the row and column are reused to perform the outer product multiplications over multiple cycles).

Claims 6 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Hansen et al. US 10,120,649 in view of Boswell et al. US 2018/0321938 and Wang et al. US 5,204,830.
	Regarding claim 6, Hansen in view of Boswell teaches: 
6. The system as recited in claim 5, 
	Hansen in view of Boswell does not teach:
wherein the system further comprises a second execution pipeline, wherein the system is further configured to read elements of a third matrix from the first vector register file in any cycle of the plurality of cycles after the first cycle and provide the elements of the third matrix to the second execution pipeline.
	However, Wang teaches:
a second execution pipeline (col 5 lines 30-38: Fig. 1B 14 is a second pipeline and Fig. 1A 12 is a first execution pipeline), and to read elements of a third matrix in any cycle after performing a first matrix multiply and provide the elements of the third matrix to the second execution pipeline (col 5 lines 30-38 and col 8 lines 9-19: a third matrix C is read and provided to the second stage after the first stage performs a matrix multiply).
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the processor of Hansen in view of Boswell to include a second pipeline that multiplies a third matrix with the result of a first matrix multiply as taught by Wang such that the third matrix is read from the register file of Hansen in a cycle after a first cycle in which Hansen performs a first matrix multiply. One of ordinary skill in the art would have been motivated to add a second pipeline to the processor of Hansen because adding execution pipelines to a processor is a known technique on the known device of a computer processor for increasing processing resources and would yield the predictable result of increasing processing capabilities. Further, one of ordinary skill in the art would have been motivated to perform a matrix multiply using a third matrix to support discrete cosine transform for compressing video signals (Wang col 1 lines 11-15).
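	For illustration only, the two-stage arrangement relied upon above (a second pipeline multiplying a third matrix with the result of a first matrix multiply) can be sketched as follows. The sketch is not taken from Wang or Hansen; the function names `matmul` and `two_stage_multiply` are hypothetical, and each function call is assumed to stand in for one pipeline.

```python
def matmul(x, y):
    """Plain matrix multiply of two list-of-lists matrices."""
    inner = len(y)
    return [
        [sum(x[i][k] * y[k][j] for k in range(inner)) for j in range(len(y[0]))]
        for i in range(len(x))
    ]


def two_stage_multiply(a, b, c):
    """Assumed model: stage one computes A x B, stage two multiplies by C."""
    first_result = matmul(a, b)     # first execution pipeline
    # The third matrix C is only needed once the first stage has produced
    # its result, so it can be read in a later cycle.
    return matmul(first_result, c)  # second execution pipeline


a = [[1, 2], [3, 4]]
b = [[0, 1], [1, 0]]
c = [[2, 0], [0, 2]]
final = two_stage_multiply(a, b, c)
print(final)
```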
	
	Regarding claim 13, Hansen in view of Boswell teaches: 
13. The method as recited in claim 12, 
	Hansen in view of Boswell does not teach:
reading elements of a third matrix from the first vector register file in any cycle of the plurality of cycles after the first cycle and providing the elements of the third matrix to a second execution pipeline.
	However, Wang teaches:
a second execution pipeline (col 5 lines 30-38: Fig. 1B 14 is a second pipeline and Fig. 1A 12 is a first execution pipeline), and reading elements of a third matrix in any cycle after performing a first matrix multiply and providing the elements of the third matrix to the second execution pipeline (col 5 lines 30-38 and col 8 lines 9-19: a third matrix C is read and provided to the second stage after the first stage performs a matrix multiply).
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the processor of Hansen in view of Boswell to include a second pipeline that multiplies a third matrix with the result of a first matrix multiply as taught by Wang such that the third matrix is read from the register file of Hansen in a cycle after a first cycle in which Hansen performs a first matrix multiply. One of ordinary skill in the art would have been motivated to add a second pipeline to the processor of Hansen because adding execution pipelines to a processor is a known technique on the known device of a computer processor for increasing processing resources and would yield the predictable result of increasing processing capabilities. Further, one of ordinary skill in the art would have been motivated to perform a matrix multiply using a third matrix to support discrete cosine transform for compressing video signals (Wang col 1 lines 11-15).

Claims 7 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Hansen et al. US 10,120,649 in view of Boswell et al. US 2018/0321938 and Kazakov et al. US 11,093,580.
	Regarding claim 7, Hansen in view of Boswell teaches:
7. The system as recited in claim 1, 
	Hansen in view of Boswell, as currently mapped, does not teach:
wherein the first execution pipeline further comprises a crossbar configured to rotate elements of the first matrix before sending the elements of the first matrix to the plurality of dot product units while elements of the second matrix remain unchanged for the plurality of dot product units.
	However, Kazakov teaches:
sending elements of a first matrix to a matrix multiplier while elements of a second matrix remain unchanged for the matrix multiplier (col 4 lines 25-39: elements of matrix A are sent to the matrix multiplier while elements of matrix B remain unchanged).
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the processor of Hansen in view of Boswell to send elements of the first matrix to the matrix multiply circuit/the multiplier-accumulators while keeping the elements of the second matrix unchanged as taught by Kazakov such that the combination will perform outer product operations by keeping the elements of matrix B unchanged in the operand collectors while loading in elements of matrix A. One of ordinary skill in the art would have been motivated to make this modification to conserve power (Kazakov col 3 lines 26-45).
	Further, Boswell teaches: 
a crossbar configured to rotate elements of a first matrix before sending the elements of the first matrix to a plurality of dot product units ([0099]: crossbar 915 rotates elements of a matrix from the register file before sending the elements to the multiply-accumulate/dot product units in the HMMA data path, see also [0025]-[0026] describing that the elements loaded into the operand collectors are matrix elements).
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the processor of Hansen in view of Boswell and Kazakov to use a crossbar between the register file and operand collectors for rotating elements of the first matrix as further taught by Boswell such that the combination would teach the elements of matrix A being rotated by the crossbar before being sent to the multiplier-accumulators via the operand collectors while the elements of matrix B remain unchanged. One of ordinary skill in the art would have been motivated to make this modification because using crossbars to rotate elements is a known technique on the known device of a computer processor for rearranging elements and would yield the predictable result of efficiently rotating elements compared to approaches which may use additional instructions to rotate elements.
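	For illustration only, the combination described above (elements of matrix A rotated before reaching the multiplier-accumulators while elements of matrix B remain unchanged) can be sketched as follows. The sketch is not taken from Hansen, Boswell, or Kazakov; the function names `rotate` and `outer_product_by_rotation` are hypothetical, each loop iteration is assumed to stand in for one cycle, and the two inputs are assumed to be non-empty and of equal length.

```python
def rotate(elements, shift):
    """Rotate a non-empty list left by `shift` positions (models the crossbar)."""
    shift %= len(elements)
    return elements[shift:] + elements[:shift]


def outer_product_by_rotation(a, b):
    """Assumed model: lane i always holds b[i]; a is rotated once per cycle."""
    n = len(a)
    products = [[None] * n for _ in range(n)]
    for cycle in range(n):
        rotated = rotate(a, cycle)      # elements of A change position
        for lane in range(n):           # elements of B stay where they are
            source = (lane + cycle) % n  # index of the A element in this lane
            products[source][lane] = rotated[lane] * b[lane]
    return products


a = [1, 2, 3]
b = [10, 20, 30]
products = outer_product_by_rotation(a, b)
print(products)
```

	After as many cycles as there are lanes, every element of A has been paired with every element of B, so the full intermediate matrix of the outer product is available without B having been reloaded.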

	Regarding claim 14, Hansen in view of Boswell teaches:
14. The method as recited in claim 8, 
	Hansen in view of Boswell, as currently mapped, does not teach:
rotating, by a crossbar of the first execution pipeline, elements of the first matrix before sending the elements of the first matrix to the plurality of dot product units while elements of the second matrix remain unchanged for the plurality of dot product units.
	However, Kazakov teaches:
sending elements of a first matrix to a matrix multiplier while elements of a second matrix remain unchanged for the matrix multiplier (col 4 lines 25-39: elements of matrix A are sent to the matrix multiplier while elements of matrix B remain unchanged).
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the processor of Hansen in view of Boswell to send elements of the first matrix to the matrix multiply circuit/the multiplier-accumulators while keeping the elements of the second matrix unchanged as taught by Kazakov such that the combination will perform outer product operations by keeping the elements of matrix B unchanged in the operand collectors while loading in elements of matrix A. One of ordinary skill in the art would have been motivated to make this modification to conserve power (Kazakov col 3 lines 26-45).
	Further, Boswell teaches: 
rotating, by a crossbar, elements of a first matrix before sending the elements of the first matrix to a plurality of dot product units ([0099]: crossbar 915 rotates elements of a matrix from the register file before sending the elements to the multiply-accumulate/dot product units in the HMMA data path, see also [0025]-[0026] describing that the elements loaded into the operand collectors are matrix elements).
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the processor of Hansen in view of Boswell and Kazakov to use a crossbar between the register file and operand collectors for rotating elements of the first matrix as further taught by Boswell such that the combination would teach the elements of matrix A being rotated by the crossbar before being sent to the multiplier-accumulators via the operand collectors while the elements of matrix B remain unchanged. One of ordinary skill in the art would have been motivated to make this modification because using crossbars to rotate elements is a known technique on the known device of a computer processor for rearranging elements and would yield the predictable result of efficiently rotating elements compared to approaches which may use additional instructions to rotate elements.

Claim 20 is rejected under 35 U.S.C. 103 as being unpatentable over Hansen et al. US 10,120,649 in view of Boswell et al. US 2018/0321938, Schulte et al. US 2005/0071413, and Wang et al. US 5,204,830.
	Regarding claim 20, Hansen in view of Boswell and Schulte teaches:
20. The apparatus as recited in claim 19, 
	Hansen in view of Boswell and Schulte does not teach:
wherein the apparatus is further configured to read elements of a third matrix from the first vector register file in any cycle of the plurality of cycles after the first cycle and provide the elements of the third matrix to a second execution pipeline.
	However, Wang teaches:
a second execution pipeline (col 5 lines 30-38: Fig. 1B 14 is a second pipeline and Fig. 1A 12 is a first execution pipeline), and to read elements of a third matrix in any cycle after performing a first matrix multiply and provide the elements of the third matrix to a second execution pipeline (col 5 lines 30-38: a third matrix C is read and provided to the second stage after the first stage performs a matrix multiply).
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the processor of Hansen in view of Boswell and Schulte to include a second pipeline that multiplies a third matrix with the result of a first matrix multiply as taught by Wang such that the third matrix is read from the register file of Hansen in a cycle after a first cycle in which Hansen performs a first matrix multiply. One of ordinary skill in the art would have been motivated to add a second pipeline to the processor of Hansen because adding execution pipelines to a processor is a known technique on the known device of a computer processor for increasing processing resources and would yield the predictable result of increasing processing capabilities. Further, one of ordinary skill in the art would have been motivated to perform a matrix multiply using a third matrix to support discrete cosine transform for compressing video signals (Wang col 1 lines 11-15).

Conclusion
	
Any inquiry concerning this communication or earlier communications from the examiner should be directed to KASIM ALLI whose telephone number is (571)270-1476. The examiner can normally be reached Monday - Friday, 9am - 5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jyoti Mehta, can be reached at (571) 270-3995. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/KASIM ALLI/Examiner, Art Unit 2183
/JYOTI MEHTA/Supervisory Patent Examiner, Art Unit 2182