DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment
This action is responsive to the applicant’s amendment filed on 04/12/2021. Claims 1-20 remain pending in the application. Applicant’s amendments to the specification and the drawings have overcome the objections previously set forth in the non-final Office action.

Response to Arguments
Applicant’s arguments, see Remarks pages 7-9, filed on 04/12/2021, with respect to the rejection of claim 1 under 35 U.S.C. 103 have been fully considered and are persuasive. Therefore, the rejection has been withdrawn. However, upon further consideration, a new ground of rejection is made in view of a newly found prior art reference. See the rejection below.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
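A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
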
Claims 11-12, 15-16, and 18 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Boswell.

Regarding claim 11, Boswell teaches a method to be performed by a processor (Boswell, [0006] a method and processor are disclosed for performing matrix multiply and accumulate operations), the method comprising: decoding, using decode circuitry, a single instruction (Boswell, [0025] an instruction for a matrix multiply and accumulate (MMA) operation is received that includes multiple matrix operands) specifying locations of a M by K first source matrix (Boswell, [0081] matrix A of size NxK), a K by N second source matrix (Boswell, [0081] matrix B of size KxM), a M by N destination matrix (Boswell, [0081] a collector matrix C of size NxM that is used to accumulate the result of the first two input matrices), and an opcode indicating execution circuitry, for each floating-point (FP) element (M, N) of the destination matrix (Boswell, [0025] an instruction for a matrix multiply and accumulate (MMA) operation is received; [0083] each element of the matrix operands can be a value encoded in a particular format, such as a single-precision floating-point value), is to launch K instances of a pipeline over K cycles (Boswell, [0026] the plurality of flip-flops temporarily store data for the operands of the MMA instruction at the inputs of the datapath such that multiple operands can be loaded from the register file to the inputs of the datapath in a given clock cycle; [0027] the datapath may be designed to generate multiple elements of the result matrix in multiple passes of the datapath. Thus, because the datapath is pipelined, the multiple passes are performed over multiple cycles); and
executing the decoded single instruction as per the opcode (Boswell, [0006] the processor includes a datapath to execute the MMA operation); and wherein each instance of the pipeline comprises:
in a first, MULTIPLY stage, generating a product of FP element (M, K) of the first source matrix and a corresponding FP element (K, N) of the second source matrix (Boswell, figure 13, [0136, 0139] in pipeline stage 1302, each multiplier 1310 receives a pair of corresponding elements from the two input vectors A and B to produce a partial product);
concurrently, in an EXPDIFF stage, determining an exponent difference between the product and a previous FP value of element (M, N) of the destination matrix (Boswell, figure 13 [0136, 0141] shift logic 1330 blocks shift the partial products and the addend to align all values based on the exponents of the partial products. Although not shown explicitly, an adder is used to calculate the exponent difference of the inputs. Figure 13 shows pipeline stage 1303, where C is fed to shift logic 1330 for alignment. Note that in a pipelined datapath, all stages compute concurrently);
in a second, ADD-BYPASS stage, accumulating the product with the previous FP value and storing the accumulated sum to the element (M, N) of the destination matrix (Boswell, figure 13 [0136, 0142] in pipeline stage 1304, the carry value and sum value are passed to a completion adder 1350, which generates a mantissa value that is later output to C), wherein the product, before performing the accumulation, is to be brought into alignment by shifting its mantissa by the exponent difference (Boswell, figure 13 [0136] shift logic 1330 blocks shift the partial products and the addend to align all values based on the exponents of the partial products, and the alignment is done before the completion adder); and concurrently, in the ADD-BYPASS stage, bypassing the accumulated sum for use by a subsequent instance of the pipeline (Boswell, C is depicted twice in figure 13 to show that a given result is fed back into the computation pipeline).
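
For illustration only, the per-element behavior mapped above (a MULTIPLY stage, an EXPDIFF stage, and an ADD-BYPASS stage that feeds the accumulated sum to the next pipeline instance) can be sketched in software as follows. This is a minimal sketch under stated assumptions, not Boswell's datapath and not the claimed circuitry: the function names `mma_element` and `mma` are hypothetical, Python floats stand in for the FP elements, and `math.frexp` and `math.ldexp` stand in for the exponent comparison and the mantissa shift.

```python
import math


def mma_element(a_row, b_col, c_prev):
    """Accumulate sum_k a_row[k] * b_col[k] into c_prev, one pipeline instance per k.

    Hypothetical software model for illustration; not a hardware description.
    """
    acc = c_prev  # the first instance reads C from its specified location
    for a, b in zip(a_row, b_col):  # K pipeline instances
        # MULTIPLY stage: form the product of the two source elements.
        product = a * b

        # EXPDIFF stage: exponent difference between product and previous sum.
        m_prod, e_prod = math.frexp(product)
        m_acc, e_acc = math.frexp(acc)
        exp_diff = e_prod - e_acc

        # ADD-BYPASS stage: shift the product mantissa by the exponent
        # difference so both mantissas share the accumulator's exponent,
        # then add. The new sum is "bypassed" to the next loop iteration.
        aligned = math.ldexp(m_prod, exp_diff)
        acc = math.ldexp(m_acc + aligned, e_acc)
    return acc


def mma(a, b, c):
    """Compute C + A x B for an M-by-K matrix a and a K-by-N matrix b."""
    m, k, n = len(a), len(b), len(b[0])
    return [
        [mma_element(a[i], [b[p][j] for p in range(k)], c[i][j]) for j in range(n)]
        for i in range(m)
    ]
```

In this model the loop variable `acc` plays the role of the bypassed sum: only the first iteration uses the stored destination value, and every later iteration consumes the sum produced by the iteration before it.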

Regarding claim 12, Boswell discloses the claimed invention as in the parent claim above, including the execution circuitry is to complete execution of the K instances of the pipeline over K-plus-one cycles (Boswell, [0026] the plurality of flip-flops temporarily store data for the operands of the MMA instruction at the inputs of the datapath such that multiple operands can be loaded from the register file to the inputs of the datapath in a given clock cycle; [0027] the datapath may be designed to generate multiple elements of the result matrix in multiple passes of the datapath. Thus, because the datapath is pipelined, the multiple passes are performed over multiple cycles. Brooks, as shown by the bypassing of the unrounded result in figure 2, if the result does not require rounding, then it would take K cycles to perform K pipeline instances, but if rounding is required, then bypassing is required, which would take another cycle to complete).
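
As a rough illustration of the K-plus-one-cycle timing, the sketch below assumes a simple two-stage model in which one pipeline instance is launched per cycle, its first stage runs in the launch cycle, and its second stage runs in the following cycle. The model and the names `schedule` and `total_cycles` are assumptions made for this example; they are not taken from Boswell or Brooks.

```python
def schedule(k):
    """Return (instance, first_stage_cycle, second_stage_cycle) for K instances.

    Cycles are 1-indexed. Assumed model: one launch per cycle, two stages.
    """
    return [(i, i, i + 1) for i in range(1, k + 1)]


def total_cycles(k):
    """Cycle in which the last instance finishes its second stage."""
    return max((second for _, _, second in schedule(k)), default=0)
```

Under this assumed model, K launches over K cycles finish in K + 1 cycles because the second stage of the last instance occupies one additional cycle.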

Regarding claim 15, Boswell discloses the claimed invention as in the parent claim above, including M is one of 1, 2, 3, 4, 8, and 16, N is one of 1, 2, 3, 4, 8, and 16, and K is one of 1, 2, 3, 4, 8, and 16 (Boswell, [0081] figure 7 shows at least an 8x4 matrix multiplied by a 4x8 matrix to generate an 8x8 matrix).

Regarding claim 16, Boswell discloses the claimed invention as in the parent claim above, including the first source, second source, and destination matrices are each located in one of a collection of vector registers of a register file, a collection of tile registers, and a plurality of memory locations representing a matrix (Boswell, [0084, 0090-0091, 0097] the register file is configured to store operands specified in an instruction for the MMA operation, where each operand specified in the instruction is a matrix having a plurality of elements in a two-dimensional array of rows and columns, and each register may store one or more elements of a particular operand).

Regarding claim 18, Boswell discloses the claimed invention as in the parent claim above, including the EXPDIFF and ADD-BYPASS pipeline stages of the first executed instance of the pipeline receive the previous FP value of the element (M, N) of the destination matrix from its location as specified by the single instruction, and the EXPDIFF and ADD-BYPASS pipeline stages of subsequent executed instances of the pipeline receive the previous FP value of the element (M, N) of the destination matrix as a bypass from the ADD-BYPASS stage of an immediately preceding instance of the pipeline (Boswell, figure 13, in pipeline stage 1303, the shift block 1330 and adders 1342 receive the input C, and in pipeline stage 1306, C is depicted twice to show that a given result is fed back into the computation pipeline).

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2, 5-6, and 8 are rejected under 35 U.S.C. 103 as being unpatentable over Boswell (US - 20180321938) in view of Brooks (US - 8671129).

Regarding claim 1, Boswell teaches a processor (Boswell, [0006] the application discloses a processor for performing a matrix multiply and accumulate instruction) comprising:
decode circuitry to decode a single instruction (Boswell, [0025] an instruction for a matrix multiply and accumulate (MMA) operation is received that includes multiple matrix operands) specifying locations of a M by K first source matrix (Boswell, [0081] matrix A of size NxK), a K by N second source matrix (Boswell, [0081] matrix B of size KxM), a M by N destination matrix (Boswell, [0081] a collector matrix C of size NxM that is used to accumulate the result of the first two input matrices), and an opcode indicating execution circuitry, for each floating-point (FP) element (M, N) of the destination matrix (Boswell, [0025] an instruction for a matrix multiply and accumulate (MMA) operation is received; [0083] each element of the matrix operands can be a value encoded in a particular format, such as a single-precision floating-point value), is to launch K pipeline instances over K cycles (Boswell, [0026] the plurality of flip-flops temporarily store data for the operands of the MMA instruction at the inputs of the datapath such that multiple operands can be loaded from the register file to the inputs of the datapath in a given clock cycle; [0027] the datapath may be designed to generate multiple elements of the result matrix in multiple passes of the datapath. Thus, because the datapath is pipelined, the multiple passes are performed over multiple cycles), each pipeline instance comprising:
in a first, MULTIPLY stage, generating a product of FP element (M, K) of the first source matrix and element (K, N) of the second source matrix (Boswell, figure 13, [0136, 0139] in pipeline stage 1302, each multiplier 1310 receives a pair of corresponding elements from the two input vectors A and B to produce a partial product);
concurrently, in an EXPDIFF stage, determining an exponent difference between the product and a previous FP value of element (M, N) of the destination matrix (Boswell, figure 13 [0136, 0141] shift logic 1330 blocks shift the partial products and the addend to align all values based on the exponents of the partial products. Although not shown explicitly, an adder is used to calculate the exponent difference of the inputs. Figure 13 shows pipeline stage 1303, where C is fed to shift logic 1330 for alignment. Note that in a pipelined datapath, all stages compute concurrently);
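in a second, ADD-BYPASS stage, accumulating the product with the previous FP value and storing the accumulated sum to the element (M, N) of the destination matrix (Boswell, figure 13 [0136, 0142] in pipeline stage 1304, the carry value and sum value are passed to a completion adder 1350, which generates a mantissa value that is later output to C),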
wherein the product, before the accumulation, is to be brought into alignment by shifting its mantissa by the exponent difference (Boswell, figure 13 [0136] shift logic 1330 blocks for shifting the partial product and addend to align all values based on the exponents for the partial products, and the alignment is done before the completion adder); and 
concurrently, in the ADD-BYPASS stage, bypassing the accumulated sum to a subsequent instance of the pipeline (Boswell, C is depicted twice in figure 13 to show that a given result is fed back into the computation pipeline); and
execution circuitry to execute the decoded single instruction as per the opcode (Boswell, [0006] the processor includes a datapath to execute the MMA operation).

Boswell also teaches that, after normalizing the accumulated result, the result is to be rounded, but Boswell does not explicitly teach, if rounding is determined to be required, causing a next pipeline instance to add a one.
However, Brooks discloses a fused multiply-add pipeline that determines whether rounding is required and, if so, causes a next pipeline instance to add a one (Brooks, figure 2, column 5, lines 5-31, the unrounded result output from the normalizer 222 may be coupled back to input multiplexers 202-206. The rounding control 224 may determine whether rounding is required during the pipeline stage, and the incrementer 226 may increment the output from normalizer 222, which amounts to rounding up or adding 1. A rounding correction signal may be asserted when the unrounded result is used as an input to the next operation and it is determined that rounding is needed on the result of the prior operation).
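
The rounding-correction idea described above can be sketched with integer mantissas, for illustration only. The sketch assumes a simple model in which the normalizer truncates the low-order bits and rounding up is needed whenever the most significant dropped bit is 1; the function names `normalize` and `next_add` and this rounding rule are assumptions for the example, not details taken from Brooks.

```python
def normalize(wide_value, drop_bits):
    """Truncate wide_value by drop_bits and report whether rounding up is needed.

    Assumed rule: round up when the most significant dropped bit is 1.
    Returns (unrounded_result, round_up_needed).
    """
    unrounded = wide_value >> drop_bits
    round_up = drop_bits > 0 and bool((wide_value >> (drop_bits - 1)) & 1)
    return unrounded, round_up


def next_add(bypassed_unrounded, round_correction, addend):
    """Next operation consumes the unrounded bypass and adds one if signalled."""
    return bypassed_unrounded + addend + (1 if round_correction else 0)
```

For example, `normalize(0b10111, 2)` yields the unrounded value 5 with the correction flag set, and `next_add(5, True, 10)` gives 16, the same value as first rounding 5 up to 6 and then adding 10. The bypass therefore does not have to wait for the incrementer.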


Regarding claim 2, the combined system of Boswell in view of Brooks discloses the invention as in the parent claim above, including the execution circuitry is to complete execution of the K instances of the pipeline over K-plus-one cycles (Boswell, [0026] the plurality of flip-flops temporarily store data for the operands of the MMA instruction at the inputs of the datapath such that multiple operands can be loaded from the register file to the inputs of the datapath in a given clock cycle; [0027] the datapath may be designed to generate multiple elements of the result matrix in multiple passes of the datapath. Thus, because the datapath is pipelined, the multiple passes are performed over multiple cycles. Brooks, as shown by the bypassing of the unrounded result in figure 2, if the result does not require rounding, then it would take K cycles to perform K pipeline instances, but if rounding is required, then bypassing is required, which would take another cycle to complete).

Regarding claim 5, the combined system of Boswell in view of Brooks discloses the invention as in the parent claim above, including M is one of 1, 2, 3, 4, 8, and 16, N is one of 1, 2, 3, 4, 8, and 16, and K is one of 1, 2, 3, 4, 8, and 16 (Boswell, [0081] figure 7 shows at least an 8x4 matrix multiplied by a 4x8 matrix to generate an 8x8 matrix).

Regarding claim 6, the combined system of Boswell in view of Brooks discloses the invention as in the parent claim above, including the first source, second source, and destination matrices are each located in one of a collection of vector registers of a register file, a collection of tile registers, and a plurality of memory locations representing a matrix (Boswell, [0084, 0090-0091, 0097] the register file is configured to store operands specified in an instruction for the MMA operation, where each operand specified in the instruction is a matrix having a plurality of elements in a two-dimensional array of rows and columns, and each register may store one or more elements of a particular operand).

Regarding claim 8, the combined system of Boswell in view of Brooks discloses the invention as in the parent claim above, including the EXPDIFF and ADD-BYPASS pipeline stages of the first executed instance of the pipeline receive the previous FP value of the element (M, N) of the destination matrix from its location as specified by the single instruction, and the EXPDIFF and ADD-BYPASS pipeline stages of subsequent executed instances of the pipeline receive the previous FP value of the element (M, N) of the destination matrix as a bypass from the ADD-BYPASS stage of an immediately preceding instance of the pipeline (Boswell, figure 13, in pipeline stage 1303, the shift block 1330 and adders 1342 receive the input C, and in pipeline stage 1306, C is depicted twice to show that a given result is fed back into the computation pipeline).

Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Boswell in view of Brooks as applied to claim 1 above, and further in view of Inaganti (US - 8990283).

Regarding claim 3, the combined system of Boswell in view of Brooks discloses the invention as in the parent claim above, but does not teach the execution circuitry, during the MULTIPLY stage, is to perform rounding of the generated product, as necessary. However, Inaganti teaches a system that can perform both fused and unfused multiply accumulate, wherein the execution circuitry, during the MULTIPLY stage, is to perform rounding of the generated product, as necessary (Inaganti, column 3, lines 13-16, an unfused multiply-add operation can be carried out as a multiply operation followed by an add operation, and each operation applies a rounding operation).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combined system of Boswell in view of Brooks to have two modes of operation to perform fused and unfused floating-point multiply add. This modification would have been obvious because the fused multiply-add rounding implementation is augmented with additional hardware, yielding a system that can operate in two different modes, as recognized by Inaganti, column 2, lines 46-54. Additionally, performing rounding after the multiply stage can adjust the data size and speed up the floating-point add stage, since the product is already rounded.

Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Boswell in view of Brooks as applied to claim 1 above, and further in view of Zheng (US - 20050187999).

Regarding claim 4, the combined system of Boswell in view of Brooks discloses the invention as in the parent claim above. However, the combined system of Boswell in view of Brooks does not teach execution circuitry that, during the ADD-BYPASS stage, is to perform saturation, as necessary, on the accumulated sum. Zheng teaches a multiply and accumulate circuit that performs saturation during the multiply stage as necessary (Zheng, [0047] saturation generates an overflow bit that can be located in the sum; [0050] when saturation is activated, the accumulator value is set to a predetermined maximum or minimum value).
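
For illustration, saturation of an accumulated sum (clamping to a predetermined maximum or minimum and raising an overflow indication) can be sketched as follows. The 16-bit signed limits and the name `saturating_accumulate` are assumptions chosen for the example; they are not values or names taken from Zheng.

```python
# Assumed limits for illustration: signed 16-bit range.
INT16_MAX = 2**15 - 1
INT16_MIN = -(2**15)


def saturating_accumulate(acc, product, lo=INT16_MIN, hi=INT16_MAX):
    """Add product to acc, clamping to [lo, hi].

    Returns (new_accumulator, overflow_flag).
    """
    total = acc + product
    if total > hi:
        return hi, True   # clamp to the predetermined maximum
    if total < lo:
        return lo, True   # clamp to the predetermined minimum
    return total, False
```

With these assumed limits, `saturating_accumulate(32000, 1000)` returns `(32767, True)` rather than wrapping around, while `saturating_accumulate(100, 23)` returns `(123, False)`.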

Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Boswell in view of Brooks as applied to claim 1 above, and further in view of Le (NPL – IBM power6 microarchitecture).
Regarding claim 7, the combined system of Boswell in view of Brooks discloses the invention as in the parent claim above. However, the combined system of Boswell in view of Brooks does not teach the execution circuitry saves a state after performing the K pipeline instances on each element (M, N) of the destination matrix, and, in the case of a fault, uses the saved state after recovering from the fault to continue execution. Le teaches the execution circuitry saves a state after performing the K pipeline instances on each element (M, N) of the destination matrix, and, in the case of a fault, uses the saved state after recovering from the fault to continue execution (Le, page 653, section error recovery, the POWER6 core RU contains a copy of the designed state of the processor, checkpoints of the data are constantly saved in the RU, and the processor resumes at the restored checkpoint if there is a core failure).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Boswell in view of Brooks to have an error recovery checkpoint as disclosed in Le. This modification would have been obvious because it would prevent data loss in case of a failure or a fault: data are constantly saved, and the processor can then resume from the last successfully saved checkpoint.
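
The checkpoint-and-resume behavior discussed above can be illustrated with a small software model. This is a sketch under stated assumptions, not the POWER6 recovery unit: the names `SimulatedFault`, `new_state`, and `run_mma`, the dictionary used as saved state, and the way a fault is injected are all hypothetical.

```python
class SimulatedFault(Exception):
    """Stands in for a hardware fault; hypothetical, for illustration only."""


def new_state(c):
    """Initial checkpoint: no elements done yet, plus a copy of matrix C."""
    return {"next": 0, "c": [row[:] for row in c]}


def run_mma(a, b, state, fault_at=None):
    """Compute C + A x B element by element, checkpointing after each element.

    If fault_at equals the index of the element about to be processed, a
    SimulatedFault is raised; calling run_mma again with the same state
    resumes from the last saved checkpoint.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [row[:] for row in state["c"]]        # restore from the checkpoint
    for idx in range(state["next"], m * n):   # row-major element order
        if idx == fault_at:
            raise SimulatedFault(f"fault before element {idx}")
        i, j = divmod(idx, n)
        for p in range(k):                    # the K accumulation steps
            c[i][j] += a[i][p] * b[p][j]
        state["c"] = [row[:] for row in c]    # save state after this element
        state["next"] = idx + 1
    return c
```

A caller would create the state once with `new_state`, catch `SimulatedFault`, and call `run_mma` again with the same state object; elements completed before the fault are not recomputed.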

Claims 9-10 are rejected under 35 U.S.C. 103 as being unpatentable over Boswell in view of Brooks as applied to claim 1 above, and further in view of Ould-Ahmed-Vall (US 20130339664).
Regarding claim 9, the combined system of Boswell in view of Brooks discloses the invention as in the parent claim above. However, the combined system of Boswell in view of Brooks does not teach the instruction further specifies a multibit writemask, each bit of which is to mask or otherwise to allow writing of a corresponding element (M, N) of the destination matrix. Ould-Ahmed-Vall teaches the instruction further specifies a multibit writemask, each bit of which is to mask or otherwise to allow writing of a corresponding element (M, N) of the destination matrix (Ould-Ahmed-Vall, [0033] figure 4A shows multibit masking, where each bit of the masking layer 403_A applies a mask; where the writemask bit is 1, the data is written to the resultant data 404_A).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the instruction of the combined system of Boswell in view of Brooks to have a bit mask to determine whether data should be written into memory or zeroed. This modification would have been obvious because it would allow the system to control whether the data element in the destination vector reflects the result of the base operation and augmentation operation, as recognized by Ould-Ahmed-Vall.
	
	Regarding claim 10, the combined system of Boswell in view of Brooks and further in view of Ould-Ahmed-Vall discloses the invention as in the parent claim above, including each of the masked elements is to be either zeroed or merged (Ould-Ahmed-Vall [0037] the type of masking may be merged or zeroed).
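
For illustration, per-element writemasking with merging or zeroing of the masked elements can be sketched as follows. The function name `apply_writemask`, the use of a flat list for the destination elements, and the convention that bit i of the mask controls element i are assumptions made for this example rather than details taken from Ould-Ahmed-Vall.

```python
def apply_writemask(dest, result, mask, zeroing=False):
    """Write result[i] into the destination only where bit i of mask is 1.

    Masked-off elements keep their old value (merging) or become 0 (zeroing).
    """
    out = []
    for idx, (old, new) in enumerate(zip(dest, result)):
        if (mask >> idx) & 1:
            out.append(new)    # mask bit set: write the new result
        elif zeroing:
            out.append(0)      # zeroing-masking: masked element is cleared
        else:
            out.append(old)    # merging-masking: masked element is preserved
    return out
```

With `dest = [9, 9, 9, 9]`, `result = [1, 2, 3, 4]`, and `mask = 0b0101`, merging gives `[1, 9, 3, 9]` and zeroing gives `[1, 0, 3, 0]`.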
 
Claim 13 is rejected under 35 U.S.C. 103 as being unpatentable over Boswell in view of Inaganti.
Regarding claim 13, Boswell discloses the invention as in the parent claim, but does not teach the execution circuitry, during the MULTIPLY stage, is to perform rounding of the generated product, as necessary. However, Inaganti teaches a system that can perform both fused and unfused multiply accumulate, wherein the execution circuitry, during the MULTIPLY stage, is to perform rounding of the generated product, as necessary (Inaganti, column 3, lines 13-16, an unfused multiply-add operation can be carried out as a multiply operation followed by an add operation, and each operation applies a rounding operation).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Boswell to have two modes of operation to perform fused and unfused floating-point multiply add. This modification would have been obvious because the fused multiply-add rounding implementation is augmented with additional hardware, yielding a system that can operate in two different modes, as recognized by Inaganti, column 2, lines 46-54. Additionally, performing rounding after the multiply stage can adjust the data size and speed up the floating-point add stage, since the product is already rounded.

Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Boswell in view of Zheng.

Regarding claim 14, Boswell discloses the invention as in the parent claim, but Boswell does not teach execution circuitry that, during the ADD-BYPASS stage, is to perform saturation, as necessary, on the accumulated sum. Zheng teaches a multiply and accumulate circuit that performs saturation during the multiply stage as necessary (Zheng, [0047] saturation generates an overflow bit that can be located in the sum; [0050] when saturation is activated, the accumulator value is set to a predetermined maximum or minimum value).

Claim 17 is rejected under 35 U.S.C. 103 as being unpatentable over Boswell in view of Le.
Regarding claim 17, Boswell discloses the invention as in the parent claim. However, Boswell does not teach the execution circuitry saves a state after performing the K pipeline instances on each element (M, N) of the destination matrix, and, in the case of a fault, uses the saved state after recovering from the fault to continue execution. Le teaches the execution circuitry saves a state after performing the K pipeline instances on each element (M, N) of the destination matrix, and, in the case of a fault, uses the saved state after recovering from the fault to continue execution (Le, page 653, section error recovery, the POWER6 core RU contains a copy of the designed state of the processor, checkpoints of the data are constantly saved in the RU, and the processor resumes at the restored checkpoint if there is a core failure).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Boswell to have an error recovery checkpoint as disclosed in Le. This modification would have been obvious because it would prevent data loss in case of a failure or a fault: data are constantly saved, and the processor can then resume from the last successfully saved checkpoint.

Claims 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Boswell in view of Ould-Ahmed-Vall.

Regarding claim 19, Boswell discloses the invention as in the parent claim. However, Boswell does not teach the instruction further specifies a multibit writemask, each bit of which is to mask or otherwise to allow writing of a corresponding element (M, N) of the destination matrix. Ould-Ahmed-Vall teaches the instruction further specifies a multibit writemask, each bit of which is to mask or otherwise to allow writing of a corresponding element (M, N) of the destination matrix (Ould-Ahmed-Vall, [0033] figure 4A shows multibit masking, where each bit of the masking layer 403_A applies a mask; where the writemask bit is 1, the data is written to the resultant data 404_A).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the instruction of Boswell to have a bit mask to determine whether data should be written into memory or zeroed. This modification would have been obvious because it would allow the system to control whether the data element in the destination vector reflects the result of the base operation and augmentation operation, as recognized by Ould-Ahmed-Vall.
	
	Regarding claim 20, the combined system of Boswell in view of Ould-Ahmed-Vall discloses the invention as in the parent claim above, including each of the masked elements is to be either zeroed or merged (Ould-Ahmed-Vall [0037] the type of masking may be merged or zeroed).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Nishii – US 2003/0233384
Teaches an arithmetic unit using single instruction to perform multiplication and addition, performing the instruction in a pipeline and bypassing data between a pipeline, however, Nishii 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to HUY DUONG, whose telephone number is (571)272-2764. The examiner can normally be reached Monday-Friday, 7:30-5:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Aimee Li, can be reached at 571-272-4169. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

/HUY DUONG/Examiner, Art Unit 2182
(571)272-2764

/Aimee Li/Supervisory Patent Examiner, Art Unit 2183