DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Applicant’s amendment filed on June 10, 2022 has been considered and entered.
Accordingly, claims 1-20 are pending in this application. Claims 1-3, 5-10, 12-13, and 16-17 are currently amended; claims 4, 11, 14-15, and 18-20 are original.
Claim Objections
Claim 14 is objected to because of the following:
A. In claim 14 line 3, “a data type control signal” should read “the data type control signal” instead.
Appropriate correction is required.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 5-6, 8-10, 12, 14, and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Kaul et al. (EP 3396524 A1), hereinafter Kaul, in view of Siu et al. (US-PGPUB 2006/0149803 A1), hereinafter Siu, and Rash et al. (US-PGPUB 2021/0089316 A1), hereinafter Rash. Kaul is cited in the IDS submitted 04/22/2021.
Regarding claim 5, Kaul teaches 
a processing element  the processing element comprising (Kaul Fig. 17A and paragraphs [0166-0167] processing element – logic unit 1700):
a shared multiplier configured to support both integer multiplication and non-integer multiplication of an input data element and a weight, wherein a shared sub-circuit, of the shared multiplier, that is used to support the integer multiplication is also used to support the non-integer multiplication, wherein a data type control signal identifies the integer multiplication or the non-integer multiplication (Kaul Fig. 17A and paragraphs [0166-0167] shared multiplier – multiplier 1702A-1702B in mantissa unit 1709, and multiplexer 1707, and 1705 block in the exponent unit 1708; “FIG. 17A illustrates a logic unit 1700 including merged computation circuits to perform floating point and integer fused-multiply accumulate operations … Some of the illustrated circuits are shared between operational modes, including the signed multiplier 1702A-1702B and 32-bit adder 1704, which are used for both integer and floating-point modes … For both computation modes, multiplication is performed in the first cycle”; paragraph [0124] “The neurons compute a dot product between the weights of the neurons and the region in the local input to which the neurons are connected”; input data element – input a; weight – input b; data type control signal – operational mode or computation mode which indicates integer or floating-point mode);
one or more adders configured to receive one or more products from the shared multiplier, the one or more adders configured to support both integer addition and non-integer addition, wherein the one or more adders are configured to add an input partial sum to at least one product of the one or more products to result in at least one of an integer partial sum and a non-integer partial sum (Kaul Fig. 17A and paragraphs [0166-0167] one or more adders – circuits below shifter 1713 including adder 1704; one or more products from the shared multiplier – output of multiplier 1702A-1702B; input partial sum – input C to the adder; one of an integer partial sum and a non-integer partial sum – Int16 sum and float16 sum shown in Fig. 17A); and
a selector circuit configured to select from among at least the integer partial sum and the non-integer partial sum to provide to an output port as an output partial sum (Kaul Fig. 17A selector circuit – multiplexer above 1730; the floating point or integer result is provided to the output port based on the mode select signal input to the multiplexer).
Further, Kaul teaches that hardware accelerators for computer vision and machine learning can improve energy efficiency for applications such as object, face and speech recognition by orders of magnitude. These accelerators use interconnected processing element (PE) arrays, with multiply-add circuits being performance, area, and energy dominant for mapping key algorithms used for CNN compute operations. Further, Kaul discloses that the processing element shown in Fig. 17A can be used as a building block for a machine learning data processing system.
Kaul does not explicitly teach the processing element is coupled in a systolic array of processing elements and configured to communicate data with at least one neighboring processing element in the systolic array, and wherein a non-shared sub-circuit, of the shared multiplier, is disabled based at least partly on the data type control signal.
However, in the same field of endeavor, Siu discloses a processing element (MMAD unit 220) that is configurable to support floating-point and integer operations including floating point multiply and add (FMAD) and integer multiply and add (IMAD) (Siu Fig. 3 shows the operations supported by the MMAD unit; Fig. 4 shows the MMAD unit; paragraph [0060]). Furthermore, Siu discloses that the MMAD unit also includes a control path/control block that receives an opcode and generates opcode-dependent control signals which can include the opcode itself indicating the operation to be performed, and the opcode-dependent control signals are also used to enable, disable, and otherwise control the operation of various circuit blocks of MMAD unit 220 in response to the opcode so that different operations can be performed using the same pipeline elements (Siu paragraph [0071]). Furthermore, Siu discloses that for integer arithmetic which includes the IMAD operation, the exponent logic is not used because integer operands do not include exponent bits (Siu paragraphs [0178-0181]).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Kaul using Siu and configure the multiplier of Kaul to disable sub-circuits of the exponent unit that are not used for integer multiplication, such as the Precompute Bigger Mantissa & Alignment block, at least in response to the computation mode indicating integer mode, because integer operands do not include exponent bits as disclosed by Siu, in order to put sub-circuits not being used in an inactive state to reduce power consumption (Siu paragraph [0041]).
Therefore, the combination of Kaul as modified in view of Siu teaches wherein a non-shared sub-circuit, of the shared multiplier, is disabled based at least partly on the data type control signal.
Kaul as modified in view of Siu does not explicitly teach the processing element is coupled in a systolic array of processing elements and configured to communicate data with at least one neighboring processing element in the systolic array.
However, in the same field of endeavor, Rash discloses a systolic array of processing elements (PEs), each configured to communicate data with at least one neighboring processing element in the systolic array. Further, Rash discloses that at each PE in the systolic array, an element of A and an element of B are multiplied and added to the incoming summand (from above in the Figure) and the outgoing sum is passed to the next row of PEs (or the final output) (Rash Fig. 6 and paragraphs [0084-0089]).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Kaul in view of Siu using Rash and implement the processing element of Kaul in a systolic array by creating multiple copies of the PE of Kaul, arranging the PEs in a grid comprising multiple rows and columns, and configuring each PE to multiply the inputs a and b, add the result to an input partial sum from an upstream PE, and pass the outgoing partial sum to the next row of PEs, consistent with the teachings of Rash.
The motivation to do so is that using the processing element of Kaul provides a systolic array that supports both integer and floating-point computations within the same PE for computer vision and machine learning applications. Further, realizing multiply-add circuits for a merged int16/float16 datapath reduces total area by up to 29% compared to conventional designs with separate integer/floating-point datapaths (Kaul paragraphs [0151, 0162]).
Therefore, the combination of Kaul as modified in view of Siu and Rash teaches a processing element coupled in a systolic array of processing elements and configured to communicate data with at least one neighboring processing element in the systolic array.
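For illustration only, the proposed combination can be sketched behaviorally as follows. This is a minimal software model under stated assumptions, not circuitry taken from Kaul, Siu, or Rash: the names pe_step, column_dot, and float_mode are hypothetical, and Python arithmetic stands in for the int16/float16 datapaths.

```python
def pe_step(a, b, psum_in, float_mode):
    """One multiply-accumulate step of a hypothetical mixed-mode PE."""
    product = a * b  # the single multiplier is used in both modes
    int_sum = int(psum_in) + int(product)        # integer adder path
    float_sum = float(psum_in) + float(product)  # floating-point adder path
    # Selector: the mode signal picks which partial sum leaves the PE.
    return float_sum if float_mode else int_sum


def column_dot(inputs, weights, float_mode):
    """Pass a partial sum down one column of PEs, one PE per row."""
    psum = 0.0 if float_mode else 0
    for a, b in zip(inputs, weights):
        # Each PE receives the partial sum from the PE above it.
        psum = pe_step(a, b, psum, float_mode)
    return psum
```

In this sketch the partial sum leaving the last row is the dot product of the inputs and weights, in whichever data type the mode signal selects.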

Regarding claim 6, Kaul as modified in view of Siu and Rash teaches all the limitations of claim 5 as stated above. Further, Kaul as modified in view of Siu and Rash teaches wherein the shared multiplier is configured to perform at least floating-point multiplication and the integer multiplication (Kaul Fig. 17A and paragraphs [0166-0167] “FIG. 17A illustrates a logic unit 1700 including merged computation circuits to perform floating point and integer fused-multiply accumulate operations … Some of the illustrated circuits are shared between operational modes, including the signed multiplier 1702A-1702B and 32-bit adder 1704, which are used for both integer and floating-point modes”).

Regarding claim 8, Kaul as modified in view of Siu and Rash teaches all the limitations of claim 5 as stated above. Further, Kaul as modified in view of Siu and Rash teaches wherein the one or more adders include:
a shared adder configured to perform the integer addition and at least a first part of a floating-point addition (Kaul Fig. 17A shared adder – adder 1704); and
a separate circuit configured to perform at least a second part of the floating-point addition  (Kaul Fig. 17A separate circuit – normalize exponent block, 22b LZA, normalization shifter 1714, and negation incrementer 1742).

Regarding claim 9, Kaul as modified in view of Siu and Rash teaches all the limitations of claim 5 as stated above. Further, Kaul as modified in view of Siu and Rash teaches wherein the one or more adders include:
a floating-point adder configured to perform floating-point addition (Kaul Fig. 17A and paragraphs [0166-0167] floating-point adder – adder 1704, normalize exponent block, 22b LZA, normalization shifter 1714, and negation incrementer 1742); and
an integer adder configured to perform the integer addition (Kaul Fig. 17A and paragraphs [0166-0167] integer adder – adder 1704).

Regarding claim 10, Kaul as modified in view of Siu and Rash teaches all the limitations of claim 5 as stated above. Further, Kaul as modified in view of Siu and Rash teaches wherein
the processing element further comprises a delay register coupled between the shared multiplier and the one or more adders (Kaul Fig. 17A delay register – register above 1705 between multiplier 1702A-1702B and adder 1704); and
the one or more adders are configured to receive one or more products from the shared multiplier on a subsequent systolic cycle, wherein the one or more products are generated by the shared multiplier during a prior systolic cycle (Kaul paragraph [0166] “For both computation modes, multiplication is performed in the first cycle and addition/rounding in the second cycle”).
Regarding claim 12, Kaul as modified in view of Siu and Rash teaches all the limitations of claim 5 as stated above. Further, Kaul as modified in view of Siu and Rash teaches wherein the processing element further comprises a delay register coupled between the shared multiplier and the one or more adders (Kaul Fig. 17A delay register – register above 1705 between multiplier 1702A-1702B and adder 1704).
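For illustration only, the two-cycle timing relied upon for claims 10 and 12 (multiplication in a first cycle, addition in a second cycle, with a register between the two stages) can be sketched as follows. The class name TwoStageMac and its attribute product_reg are hypothetical, and the model is an assumption-laden simplification rather than the circuit of Kaul Fig. 17A.

```python
class TwoStageMac:
    """Hypothetical two-stage multiply-add with a register between stages."""

    def __init__(self):
        # Delay register: holds the product computed in the prior cycle.
        self.product_reg = None

    def cycle(self, a, b, psum_in):
        """Advance one cycle; returns the adder output for this cycle."""
        # Adder stage: consumes the product latched during the prior cycle.
        if self.product_reg is None:
            out = None  # pipeline not yet filled
        else:
            out = self.product_reg + psum_in
        # Multiplier stage: latch the new product for the next cycle.
        self.product_reg = a * b
        return out


mac = TwoStageMac()
results = [mac.cycle(2, 3, 0), mac.cycle(4, 5, 10), mac.cycle(0, 0, 1)]
```

In this sketch the product of 2 and 3 formed in the first cycle is not added until the second cycle, which mirrors the prior-cycle/subsequent-cycle relationship recited in claim 10.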

Regarding claim 14, Kaul as modified in view of Siu and Rash teaches all the limitations of claim 5 as stated above. Further, Kaul as modified in view of Siu and Rash teaches wherein the selector circuit is configured to select among at least the integer partial sum and the non-integer partial sum based at least in part on a data type control signal (Kaul paragraph [0151] “A single control signal is utilized to switch, on a per-cycle basis, between floating-point and integer compute modes”; Fig. 17A shows the computation mode signal is input to the multiplexer above 1730).

Regarding claim 16, Kaul as modified in view of Siu and Rash teaches all the limitations of claim 5 as stated above. Further, Kaul as modified in view of Siu and Rash teaches wherein the shared multiplier is selectable to perform at least one of: a single 17-bit integer multiplication, two or more parallel 9-bit integer multiplications, a single 16-bit brain floating-point multiplication, or a single 16-bit floating-point multiplication (Kaul paragraphs [0153-0154] and Fig. 17A “The merged floating-point units described herein can selectively perform 16-bit integer or floating-point operations on a per-cycle basis” where 16-bit floating-point operations correspond to the selectable single 16-bit floating-point multiplication).



Claims 17-18 are rejected under 35 U.S.C. 103 as being unpatentable over Siu in view of Rash. 
Regarding claim 17, Siu teaches 
obtaining a data type control signal indicating an integer data type (Siu paragraph [0012] “The input section is configured to receive … an opcode designating one of a number of supported operations to be performed and is further configured to generate control signals in response to the opcode”; Fig. 3 shows the operations supported including IMAD in table 304 which is an integer operation);
performing, by a shared multiplier, integer multiplication of a first input data element and an output of a weight register to generate an integer product (Siu Fig. 4 and paragraphs [0044, 0064, 0183-0185] shared multiplier – circuits in stage 0 to stage 3; first input data element – input A0; output of a weight register – input B0 obtained from, for example register file 224; integer product – output R3a);
a first non-shared sub-circuit of the shared multiplier (Siu Fig. 4 first non-shared sub-circuit – sub circuits in exponent path 415 in stage 0 to stage 3);
performing, by one or more adders, integer addition of a first input  (Siu paragraph [0064] “stages 4-6 perform the addition (P+C) portion”; Fig. 4 one or more adders – circuits in stage 4 to stage 6; paragraphs [0186-0188] first input – operand C; integer sum – result R5);
selecting, based at least partly on the data type control signal indicating the integer data type, the integer  (Siu Fig. 12 and paragraphs [0132-0133]);
obtaining a changed data type control signal to indicate a non-integer data type (Siu paragraph [0012] “The input section is configured to receive … an opcode designating one of a number of supported operations to be performed and is further configured to generate control signals in response to the opcode”; Fig. 3 shows the operations supported including FMAD in table 302 which is a floating point operation);
performing, by the shared multiplier, non-integer multiplication of a second input data element and the output of the weight register to generate a non-integer product (Siu Fig. 4 and paragraphs [0044, 0064, 0145-0151] second input data element – operand A0 of the FMAD operation; output of the weight register – input B0 obtained from, for example register file 224; non-integer product – output R3a);
a second non-shared sub-circuit, of the shared multiplier (Siu Figs. 4 and 5 and paragraph [0075] a second non-shared sub-circuit – sub-circuit in formatting block 400 such as 504, 505, 506);
performing, by the one or more adders, non-integer addition of a second input  (Siu paragraph [0064] “stages 4-6 perform the addition (P+C) portion”; Fig. 4 one or more adders – circuits in stage 4 to stage 6; paragraphs [0152-0154] second input – operand C; non-integer sum – result R5); and
selecting, based at least partly on the changed data type control signal indicating the non-integer data type, the non-integer  (Siu Fig. 12 and paragraphs [0134-0136, 0155]).
Furthermore, Siu discloses the MMAD unit also includes a control path/control block that receives an opcode and generates opcode-dependent control signals which can include the opcode itself indicating the operation to be performed, and the opcode-dependent control signals are also used to enable, disable, and otherwise control the operation of various circuit blocks of MMAD unit 220 in response to the opcode so that different operations can be performed using the same pipeline elements (Siu paragraph [0071]). Furthermore, Siu discloses that for integer arithmetic which includes the IMAD operation, the exponent logic is not used because integer operands do not include exponent bits (Siu paragraphs [0178-0181]). Furthermore, Siu discloses that for floating-point operations, the input operands are passed through the formatting block 400 without modification (Siu paragraph [0148]).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Siu and configure the MMAD unit to disable sub-circuits of the exponent path 415 that are not used for integer multiplication at least in response to the opcode/operation mode indicating integer mode, because integer operands do not include exponent bits as disclosed by Siu. Furthermore, it would have been obvious to enable the exponent path 415 in response to an opcode/operation mode indicating floating-point mode because the exponent path is used for floating-point computations, and to disable sections of the formatting block such as 504, 505, 506 during floating-point computations, since the formatting block only passes the operands A, B, and C through without modification, in order to put sub-circuits not being used in an inactive state to reduce power consumption (Siu paragraph [0041]).
Therefore, the combination of Siu as modified teaches wherein a first non-shared sub-circuit, of the shared multiplier, is disabled based at least partly on the data type control signal; and wherein the first non-shared sub-circuit is enabled and a second non-shared sub-circuit, of the shared multiplier, is disabled based at least partly on the changed data type control signal.
Siu as modified does not explicitly teach performing, by one or more adders, integer addition of a first input partial sum and the integer product to generate an integer partial sum; selecting, based at least partly on the data type control signal indicating the integer data type, the integer partial sum to provide as an output partial sum to another processing element; performing, by the one or more adders, non-integer addition of a second input partial sum and the non-integer product to generate a non-integer partial sum; and selecting, based at least partly on the changed data type control signal indicating the non-integer data type, the non-integer partial sum to provide as an output partial sum to another processing element.
However, in the same field of endeavor, Rash discloses a systolic array of processing elements (PEs), each configured to communicate data with at least one neighboring processing element in the systolic array. Further, Rash discloses that at each PE in the systolic array, an element of A and an element of B are multiplied and added to the incoming summand (from above in the Figure) and the outgoing sum is passed to the next row of PEs (or the final output) (Rash Fig. 6 and paragraphs [0084-0089]).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Siu using Rash and implement the processing element of Siu in a systolic array by creating multiple copies of the PE of Siu, arranging the PEs in a grid comprising multiple rows and columns, and configuring each PE to multiply the inputs a and b, add the result to an input partial sum from an upstream PE, and pass the outgoing partial sum to the next row of PEs, consistent with the teachings of Rash.
The motivation to do so is that using the processing element of Siu provides a systolic array that supports both integer and floating-point computations within the same PE, requiring reduced chip area, and that can be used more efficiently (Siu paragraph [0010]).
Therefore, the combination of Siu as modified in view of Rash teaches performing, by one or more adders, integer addition of a first input partial sum and the integer product to generate an integer partial sum; selecting, based at least partly on the data type control signal indicating the integer data type, the integer partial sum to provide as an output partial sum to another processing element; performing, by the one or more adders, non-integer addition of a second input partial sum and the non-integer product to generate a non-integer partial sum; and selecting, based at least partly on the changed data type control signal indicating the non-integer data type, the non-integer partial sum to provide as an output partial sum to another processing element.
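For illustration only, the mode-dependent enabling and disabling relied upon above can be sketched as a lookup from the data type control signal to the set of active blocks. The names Mode, enabled_blocks, and the block labels are hypothetical; the sketch merely restates the proposed modification and does not reproduce the control logic of Siu.

```python
from enum import Enum


class Mode(Enum):
    """Hypothetical data type control signal derived from the opcode."""
    INTEGER = "integer"
    FLOAT = "float"


def enabled_blocks(mode):
    """Map the control signal to which sub-circuits are active."""
    return {
        # Shared sub-circuit: used for both integer and floating-point.
        "mantissa_multiplier": True,
        # First non-shared sub-circuit: integers carry no exponent bits.
        "exponent_path": mode is Mode.FLOAT,
        # Second non-shared sub-circuit: floating-point operands pass
        # through formatting unmodified, so it can be idle in that mode.
        "operand_formatting": mode is Mode.INTEGER,
    }
```

Changing the control signal from Mode.INTEGER to Mode.FLOAT in this sketch enables the exponent path and disables the formatting sections, which corresponds to the first and second non-shared sub-circuits discussed above.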

Regarding claim 18, Siu as modified in view of Rash teaches all the limitations of claim 17 as stated above. Further, Siu as modified in view of Rash teaches further comprising: performing, by multiplier exponent logic in the shared multiplier, a computation of exponent bits in the non-integer product (Siu Fig. 4 and paragraphs [0149-0151] “Exponent product block 424 receives the exponent portions (Ea, Eb) of operands A and B and computes Ea+Eb, with bias advantageously being β being used to re-establish the correct fp16 or fp32 exponent bias in the sum”).
Claims 1, 2, and 4 are rejected under 35 U.S.C. 103 as being unpatentable over Kaul in view of Siu, Rash and Phelps et al. (US-PGPUB 2018/0336165 A1), hereinafter Phelps.
	Regarding claim 1, Kaul teaches 
	a processing element comprising:
a first input port for receiving an input data element (Kaul Fig. 17A and paragraph [0166] first input port receiving an input data element – input port receiving input a);
an output port for providing an output partial sum (Kaul Fig. 17A and paragraph [0166] output port providing an output partial sum – output port 1730);
a shared multiplier configured to multiply the input data element by the  (Kaul Fig. 17A and paragraphs [0166-0167] shared multiplier – circuits above 1713 including multiplier 1702A-1702B in mantissa unit 1709, and multiplexer 1707, and 1705 block in the exponent unit 1708; “FIG. 17A illustrates a logic unit 1700 including merged computation circuits to perform floating point and integer fused-multiply accumulate operations … Some of the illustrated circuits are shared between operational modes, including the signed multiplier 1702A-1702B and 32-bit adder 1704, which are used for both integer and floating-point modes”; paragraph [0124] “The neurons compute a dot product between the weights of the neurons and the region in the local input to which the neurons are connected”; input data element – input a; weight – input b):
a first sub-circuit configured to support integer multiplication (Kaul Fig. 17A and paragraphs [0166-0167] first sub-circuit - multiplier 1702A-1702B); and
a second sub-circuit configured to support floating-point multiplication, wherein at least a shared part of the first sub-circuit is shared with the second sub-circuit, wherein a data type control signal identifies the integer multiplication or the floating-point multiplication (Kaul Fig. 17A and paragraphs [0166-0167] second sub-circuit – multiplier 1702A-1702B in mantissa unit 1709, and multiplexer 1707, and 1705 block in the exponent unit 1708; “Some of the illustrated circuits are shared between operational modes, including the signed multiplier 1702A-1702B and 32-bit adder 1704, which are used for both integer and floating-point modes … For both computation modes, multiplication is performed in the first cycle”; data type control signal – operational mode or computation mode which indicates integer or floating-point mode);
one or more adders coupled to  (Kaul Fig. 17A and paragraphs [0166-0167] one or more adders – circuits below shifter 1713 including adder 1704):
generate an integer partial sum by performing integer addition on an integer product from the shared multiplier and the input partial sum (Kaul Fig. 17A and paragraphs [0166-0167] integer partial sum – Int16 sum in Fig. 17A; integer product – integer result of the integer multiplication; input partial sum – input C to the adder); and
generate a floating-point partial sum by performing floating-point addition on a floating-point product from the shared multiplier and the input partial sum (Kaul Fig. 17A and paragraphs [0166-0167] floating-point partial sum – float16 sum in Fig. 17A; floating-point product – floating-point result of the floating-point multiplication; input partial sum – input C to the adder); and
a selector circuit configured to select among at least the integer partial sum and the floating-point partial sum for providing to the output port as the output partial sum (Kaul Fig. 17A selector circuit – multiplexer above 1730; the floating point or integer result is provided to the output port 1730 based on the mode select signal input to the multiplexer).
Further, Kaul teaches that hardware accelerators for computer vision and machine learning can improve energy efficiency for applications such as object, face and speech recognition by orders of magnitude. These accelerators use interconnected processing element (PE) arrays, with multiply-add circuits being performance, area, and energy dominant for mapping key algorithms used for CNN compute operations. Further, Kaul discloses that the processing element shown in Fig. 17A can be used as a building block for a machine learning data processing system.
Kaul does not explicitly teach a systolic array of processing elements arranged in a first plurality of rows and a second plurality of columns, the systolic array of processing elements configured to operate on input data sets, each processing element of the processing elements comprising: a weight register for storing a stored weight value; a second input port for receiving an input partial sum; a shared multiplier configured to multiply the input data element by the stored weight value; wherein a non-shared sub-circuit, of the shared multiplier, is disabled based at least partly on the data type control signal; and one or more adders coupled to the second input port.
However, in the same field of endeavor, Siu discloses a processing element (MMAD unit 220) that is configurable to support floating-point and integer operations including floating point multiply and add (FMAD) and integer multiply and add (IMAD) (Siu Fig. 3 shows the operations supported by the MMAD unit; Fig. 4 shows the MMAD unit; paragraph [0060]). Furthermore, Siu discloses that the MMAD unit also includes a control path/control block that receives an opcode and generates opcode-dependent control signals which can include the opcode itself indicating the operation to be performed, and the opcode-dependent control signals are also used to enable, disable, and otherwise control the operation of various circuit blocks of MMAD unit 220 in response to the opcode so that different operations can be performed using the same pipeline elements (Siu paragraph [0071]). Furthermore, Siu discloses that for integer arithmetic which includes the IMAD operation, the exponent logic is not used because integer operands do not include exponent bits (Siu paragraphs [0178-0181]).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Kaul using Siu and configure the multiplier of Kaul to disable sub-circuits of the exponent unit that are not used for integer multiplication, such as the Precompute Bigger Mantissa & Alignment block, at least in response to the computation mode indicating integer mode, because integer operands do not include exponent bits as disclosed by Siu, in order to put sub-circuits not being used in an inactive state to reduce power consumption (Siu paragraph [0041]).
Therefore, the combination of Kaul as modified in view of Siu teaches wherein a non-shared sub-circuit, of the shared multiplier, is disabled based at least partly on the data type control signal.
Kaul as modified in view of Siu does not explicitly teach a systolic array of processing elements arranged in a first plurality of rows and a second plurality of columns, the systolic array of processing elements configured to operate on input data sets, each processing element of the processing elements comprising: a weight register for storing a stored weight value; a second input port for receiving an input partial sum; a shared multiplier configured to multiply the input data element by the stored weight value; and one or more adders coupled to the second input port.
However, in the same field of endeavor, Rash discloses a systolic array of processing elements arranged in a first plurality of rows and a second plurality of columns, the systolic array of processing elements configured to operate on input data sets, each processing element of the processing elements comprising: a second input port for receiving an input partial sum. Further, Rash discloses that at each PE in the systolic array, an element of A and an element of B are multiplied and added to the incoming summand (from above in the Figure) and the outgoing sum is passed to the next row of PEs (or the final output) (Rash Fig. 6 and paragraphs [0084-0089]).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Kaul in view of Siu using Rash and implement the processing element of Kaul in a systolic array by creating multiple copies of the PE of Kaul, arranging the PEs in a grid comprising multiple rows and columns, configuring each PE to multiply the inputs a and b, add the result to an input partial sum from an upstream PE, and pass the outgoing partial sum to the next row of PEs, and configuring each PE to include a second input port for receiving the input partial sum, consistent with the teachings of Rash.
The motivation to do so is that using the processing element of Kaul provides a systolic array that supports both integer and floating-point computations within the same PE for computer vision and machine learning applications. Further, realizing multiply-add circuits for a merged int16/float16 datapath reduces total area by up to 29% compared to conventional designs with separate integer/floating-point datapaths (Kaul paragraphs [0151, 0162]).
Therefore, the combination of Kaul as modified in view of Siu and Rash teaches a systolic array of processing elements arranged in a first plurality of rows and a second plurality of columns, the systolic array of processing elements configured to operate on input data sets, each processing element of the processing elements comprising: a second input port for receiving an input partial sum; and one or more adders coupled to the second input port.
Kaul as modified in view of Siu and Rash does not explicitly teach each processing element of the processing elements comprising: a weight register for storing a stored weight value; and the shared multiplier configured to multiply the input data element by the stored weight value.
However, in the same field of endeavor, Phelps teaches a processing element for a systolic array comprising a weight register for storing a weight input and multiplication circuitry that is used to multiply the weight input from the weight register 402 with the activation input from the activation register 406 (Phelps Fig. 4 and paragraphs [0077-0078]).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Kaul in view of Siu and Rash using Phelps and configure each PE to include a weight register for storing the weight value b and configure the shared multiplier to multiply the input data element by the stored weight value as taught by Phelps.
The motivation to include a weight register is to statically store the weight input such that as activation inputs are transferred to the cell, e.g., through the activation register 406, over multiple clock cycles, the weight input remains within the cell and is not transferred to an adjacent cell. Therefore, the weight input can be applied to multiple activation inputs, e.g., using the multiplication circuitry 408, and respective accumulated values can be transferred to an adjacent cell (Phelps paragraph [0082]).
Therefore, the combination of Kaul as modified in view of Siu, Rash and Phelps teaches each processing element of the processing elements comprising: a weight register for storing a stored weight value; and the shared multiplier configured to multiply the input data element by the stored weight value.
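The combined processing element described above can be sketched as follows. This is a minimal illustrative model only, assuming integer operands and a single column of PEs; the names ProcessingElement, load_weight, step, and column_dot are hypothetical and do not appear in Kaul, Siu, Rash, or Phelps.

```python
from dataclasses import dataclass


@dataclass
class ProcessingElement:
    """Hypothetical model of one PE: a weight register plus a multiply-add."""

    weight: int = 0  # stored weight value held in the weight register

    def load_weight(self, w: int) -> None:
        # The weight stays in the PE across cycles (weight-stationary).
        self.weight = w

    def step(self, x: int, partial_sum_in: int) -> int:
        # Multiply the input data element by the stored weight and add the
        # product to the partial sum arriving on the second input port.
        return partial_sum_in + x * self.weight


def column_dot(weights, inputs):
    """Pass a partial sum down one column of PEs, one PE per row."""
    pes = [ProcessingElement() for _ in weights]
    for pe, w in zip(pes, weights):
        pe.load_weight(w)
    partial_sum = 0
    for pe, x in zip(pes, inputs):
        partial_sum = pe.step(x, partial_sum)  # output feeds the next row
    return partial_sum
```

In this sketch, the value returned by the last PE in the column corresponds to the final output of that column.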

Regarding claim 2, Kaul as modified in view of Siu, Rash and Phelps teaches all the limitations of claim 1 as stated above. Further, Kaul as modified in view of Siu, Rash and Phelps teaches wherein:
the first sub-circuit is configured to generate a first product by performing the integer multiplication on the input data element and the stored weight value (Kaul Fig. 17A and paragraphs [0166-0167] first product – output of multiplier 1702A-1702B when the mode indicates an integer operation);
the second sub-circuit is configured to generate a second product by performing the floating-point multiplication on the input data element and the stored weight value, wherein the second product includes a significand and an exponent (Kaul Fig. 17A and paragraphs [0153 and 0166-0167] second product – output of multiplier 1702A-1702B when the mode indicates a floating-point operation; further, the floating-point data type includes a significand part and an exponent part; significand part – float16 [9:0] generated by the multiplier 1702A-1702B; exponent part – 5-bit exponent input into 1713); and
the shared part of the first sub-circuit that is shared with the second sub-circuit is used in generating the first product and is used in generating the significand of the second product (Kaul Fig. 17A and paragraphs [0153 and 0166-0167]; Fig. 17A shows that the fractional part of the second product, float16 [9:0], is generated by the multiplier 1702A-1702B and the exponent part is generated by the exponent unit input into 1713).
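The sharing of one integer multiplier between the integer product and the significand of the floating-point product can be illustrated with the sketch below. It is not taken from Kaul; it assumes IEEE 754 half-precision operands that are normal numbers (no zeros, subnormals, infinities, or NaNs), performs no rounding or normalization, and uses hypothetical helper names (f16_bits, decode_float16, shared_multiply, float16_product, to_float).

```python
import struct


def f16_bits(x: float) -> int:
    """Return the 16-bit pattern of x encoded as IEEE 754 half precision."""
    return struct.unpack("<H", struct.pack("<e", x))[0]


def decode_float16(bits: int):
    """Split a normal float16 into sign, unbiased exponent, 11-bit significand."""
    sign = (bits >> 15) & 1
    exp = (bits >> 10) & 0x1F
    frac = bits & 0x3FF
    if exp == 0 or exp == 0x1F:
        raise ValueError("this sketch handles normal numbers only")
    return sign, exp - 15, (1 << 10) | frac  # implicit leading 1 restored


def shared_multiply(a: int, b: int) -> int:
    """Stand-in for the shared integer multiplier array."""
    return a * b


def float16_product(a_bits: int, b_bits: int):
    """Significand from the shared multiplier; exponent from a separate add."""
    sa, ea, ma = decode_float16(a_bits)
    sb, eb, mb = decode_float16(b_bits)
    significand = shared_multiply(ma, mb)  # same multiplier as integer mode
    exponent = ea + eb                     # separate exponent path
    return sa ^ sb, significand, exponent


def to_float(sign: int, significand: int, exponent: int) -> float:
    # Each 11-bit significand carries 10 fraction bits, so the product has 20.
    return (-1.0) ** sign * significand * 2.0 ** (exponent - 20)
```

In integer mode the same shared_multiply call would be applied directly to the integer operands, which is the sense in which the multiplier is shared in this sketch.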

Regarding claim 4, Kaul as modified in view of Siu, Rash and Phelps teaches all the limitations of claim 1 as stated above. Further, Kaul as modified in view of Siu, Rash and Phelps teaches wherein the one or more adders includes:
a shared adder part configured to both (Kaul Fig. 17A and paragraphs [0166-0167] shared adder – adder 1704):
generate the integer partial sum that is an integer data type by adding the integer product generated to the input partial sum (Kaul Fig. 17A - integer partial sum – Int16 sum in output port 1730); and
calculate one or more portions of the floating-point partial sum (Kaul Fig. 17A - one or more portions of the floating-point partial sum – mantissa part of the float16 sum output port 1730); and
a separate circuit that is separate from the shared adder part, wherein the shared adder part and the separate circuit are used together in generating the floating-point partial sum that is a floating-point data type (Kaul Fig. 17A separate circuit - normalize exponent block, 22b LZA, normalization shifter 1714, and negation incrementer 1742).
Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Kaul in view of Siu and Rash as applied to claim 5 above, and further in view of Snelgrove et al (US-PGPUB 2021/0091794 A1), hereinafter Snelgrove.
Regarding claim 11, Kaul as modified in view of Siu and Rash teaches all the limitations of claim 5 as stated above.
	Kaul as modified in view of Siu and Rash does not explicitly teach further comprising a skip calculation generator configured to propagate a skip calculation signal to a plurality of processing elements, wherein the skip calculation signal is configured to prevent at least one of the shared multiplier or the one or more adders for a systolic cycle from contributing to arithmetic computations of the systolic array during the systolic cycle.
	However, in the same field of endeavor, Snelgrove teaches a processing element comprising a skip calculation generator configured to propagate a skip calculation signal to a plurality of processing elements, wherein the skip calculation signal is configured to prevent at least one of a multiplier or an adder for a processing/clock cycle from contributing to arithmetic computations (Snelgrove Fig. 21 and paragraphs [0143-0154] skip calculation generator – zero detect 2116 and zero disable 2110; multiplier – ALU 2102 “The ALU 2102 may include one or more levels of multiplexor and/or a multiplier 2108”).
	Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Kaul in view of Siu and Rash using Snelgrove and configure each processing element of Kaul to include circuitry for zero detect and zero disable to skip or disable zero multiplication and accumulation operations as taught by Snelgrove.
	The motivation to do so is that multiplication by zero produces a zero product, which does not need to be accumulated. As such, the zero disable saves energy, as each PE uses significantly more energy when an input changes as opposed to when the inputs do not change (Snelgrove paragraph [0152]).
	Therefore, the combination of Kaul as modified in view of Siu, Rash and Snelgrove teaches further comprising a skip calculation generator configured to propagate a skip calculation signal to a plurality of processing elements, wherein the skip calculation signal is configured to prevent at least one of the shared multiplier or the one or more adders for a systolic cycle from contributing to arithmetic computations of the systolic array during the systolic cycle.
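	The zero-detect and zero-disable behavior discussed above can be illustrated with the following sketch. It is a simplified software model, assuming the skip signal is asserted whenever either operand is zero; the function name accumulate_with_skip and the multiply counter are hypothetical and are not drawn from Snelgrove.

```python
def accumulate_with_skip(inputs, weights):
    """Accumulate products, skipping any cycle in which an operand is zero.

    Returns the final partial sum and the number of multiplications that
    were actually performed (a rough proxy for switching activity).
    """
    partial_sum = 0
    multiplies = 0
    for x, w in zip(inputs, weights):
        skip = (x == 0) or (w == 0)  # zero detect raises the skip signal
        if skip:
            continue  # multiplier and adder do not contribute this cycle
        partial_sum += x * w
        multiplies += 1
    return partial_sum, multiplies
```

	Because a zero product adds nothing to the partial sum, the result matches an accumulation without skipping while fewer multiplications are performed.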
Claims 13 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Kaul in view of Siu and Rash as applied to claims 5 and 17 above respectively, and further in view of Vantrease et al. (US-PGPUB 2019/0294413 A1), hereinafter Vantrease.
Regarding claim 15, Kaul as modified in view of Siu and Rash teaches all the limitations of claim 5 as stated above.
Kaul as modified in view of Siu and Rash does not explicitly teach further comprising a partial de-quantizer configured to partially de-quantize the input data element for a row of the processing elements, wherein partially de-quantizing the input data element includes shifting the input data element and adding an extra bit to the input data element.
However, in the same field of endeavor, Vantrease discloses a systolic array circuit further comprising a partial de-quantizer configured to partially de-quantize the input data element for a row of the processing elements, wherein partially de-quantizing the input data element includes shifting the input data element and adding an extra bit to the input data element (Vantrease Fig. 9 and paragraphs [0107-0112] partial de-quantizer – subtraction engines 930a-930d “Each row may also include a subtraction engine 930a, 930b, 930c, ... , or 930d that takes quantized inputs (e.g., in UINT8 format) and zero-point integers (e.g., in UINT8 format) from a memory or a buffer … The subtraction engine … subtract zero-point integer Xqz 938 from input data element Xq 932 to generate a shifted input data element Xq_shift 942 in INT9 format”).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Kaul in view of Siu and Rash using Vantrease and configure the systolic array of Rash to include a subtraction engine connected to each row of processing elements, as shown in Fig. 9 of Vantrease, for partially de-quantizing the input data element, including shifting the input data element and adding an extra bit to the input data element, such as generating 17-bit shifted integer input data a from the 16-bit integer input data a as taught by Vantrease, which can then be used as input to the multiplier 1702A-1702B.
The motivation to do so is that several benefits may be achieved by using subtraction engines in front of the PE array to shift the unsigned integer inputs to generate difference values. For example, the input data may be stored in the memory in UINT8 format, which may allow easier hardware design and more efficient storage and management than data in the INT9 format. Further, a real value 0 that is quantized asymmetrically to a non-zero unsigned integer (i.e., zero-point integer) with no quantization error may be converted to a signed integer value 0 by the subtraction engine before the multiplication (Vantrease paragraph [0112]).
Therefore, the combination of Kaul as modified in view of Siu, Rash and Vantrease teaches further comprising a partial de-quantizer configured to partially de-quantize the input data element for a row of the processing elements, wherein partially de-quantizing the input data element includes shifting the input data element and adding an extra bit to the input data element.
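The arithmetic of the subtraction engine quoted above, in which a UINT8 input and a UINT8 zero-point integer yield an INT9 difference, can be illustrated as follows. This is a sketch only; the function name partial_dequantize and its range checks are assumptions made for illustration and do not appear in Vantrease.

```python
def partial_dequantize(xq: int, zero_point: int) -> int:
    """Subtract a UINT8 zero point from a UINT8 input, giving a signed value.

    The difference lies in [-255, 255], which needs one bit more than the
    8-bit inputs and therefore fits a 9-bit two's-complement integer.
    """
    for value in (xq, zero_point):
        if not 0 <= value <= 255:
            raise ValueError("operands must be in the UINT8 range 0..255")
    shifted = xq - zero_point
    # Sanity check: the result is within the INT9 range -256..255.
    assert -256 <= shifted <= 255
    return shifted
```

An input equal to the zero point maps to a signed 0, which corresponds to the real value 0 being represented exactly before the multiplication.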

Regarding claim 13, Kaul as modified in view of Siu and Rash teaches all the limitations of claim 5 as stated above. Further, Kaul as modified in view of Siu and Rash teaches wherein:
the shared multiplier is selectable to perform at least one of: a single 16-bit brain floating-point multiplication or a single 16-bit floating-point multiplication (Kaul Fig. 17A and paragraphs [0153-0154 and 0162]; the shared multiplier is configured to support a single 16-bit floating-point multiplication).
Kaul as modified in view of Siu and Rash does not explicitly teach wherein: the shared multiplier is selectable to perform at least one of: a single 17-bit integer multiplication or two parallel 9-bit integer multiplications.
However, in the same field of endeavor, Vantrease discloses a systolic array circuit further comprising a partial de-quantizer configured to partially de-quantize the input data element for a row of the processing elements, wherein partially de-quantizing the input data element includes shifting the input data element and adding an extra bit to the input data element (Vantrease Fig. 9 and paragraphs [0107-0112] partial de-quantizer – subtraction engines 930a-930d “Each row may also include a subtraction engine 930a, 930b, 930c, ... , or 930d that takes quantized inputs (e.g., in UINT8 format) and zero-point integers (e.g., in UINT8 format) from a memory or a buffer … The subtraction engine … subtract zero-point integer Xqz 938 from input data element Xq 932 to generate a shifted input data element Xq_shift 942 in INT9 format”).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Kaul in view of Siu and Rash using Vantrease and configure the systolic array to include a subtraction engine connected to each row of processing elements, as shown in Fig. 9 of Vantrease, for partially de-quantizing the input data element, including shifting the input data element and adding an extra bit to the input data element, such as generating 17-bit shifted integer input data a from the 16-bit integer input data a as taught by Vantrease, which can then be used as input to the multiplier 1702A-1702B.
The motivation to combine is the same as claim 15.
Therefore, the combination of Kaul as modified in view of Siu, Rash and Vantrease teaches wherein: the shared multiplier is selectable to perform at least one of: a single 17-bit integer multiplication or two parallel 9-bit integer multiplications.
Claim 19 is rejected under 35 U.S.C. 103 as being unpatentable over Siu and Rash as applied to claim 17 above, and further in view of Vantrease.
Regarding claim 19, Siu as modified in view of Rash teaches all the limitations of claim 17 as stated above. Further, Siu as modified in view of Rash teaches further comprising: 
receiving the first input data element […] (Siu Fig. 4 shows receiving operand A0); and
performing multiplication on the first input data element and the output of the weight register by the shared multiplier, wherein the shared multiplier is configured to support multiplication (Siu Fig. 4 and paragraphs [0183-0185]);
Siu as modified in view of Rash does not explicitly teach wherein the first input data element is a 9-bit or 17-bit integer; and wherein the shared multiplier is configured to support multiplication of at least one of 9-bit or 17-bit integers.
However, in the same field of endeavor, Vantrease discloses a systolic array circuit further comprising a partial de-quantizer configured to partially de-quantize the input data element for a row of the processing elements, wherein partially de-quantizing the input data element includes shifting the input data element and adding an extra bit to the input data element (Vantrease Fig. 9 and paragraphs [0107-0112] partial de-quantizer – subtraction engines 930a-930d “Each row may also include a subtraction engine 930a, 930b, 930c, ... , or 930d that takes quantized inputs (e.g., in UINT8 format) and zero-point integers (e.g., in UINT8 format) from a memory or a buffer … The subtraction engine … subtract zero-point integer Xqz 938 from input data element Xq 932 to generate a shifted input data element Xq_shift 942 in INT9 format”).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Siu in view of Rash using Vantrease and configure the systolic array of Rash to include a subtraction engine connected to each row of processing elements, as shown in Fig. 9 of Vantrease, for partially de-quantizing the input data element, including shifting the input data element and adding an extra bit to the input data element, such as generating 17-bit shifted integer input data a from the 16-bit integer input data a as taught by Vantrease, which can then be used as input to the multiplier 1702A-1702B, and to further configure the multiplier to support 17-bit integer multiplication, which can be achieved by increasing the multiplier width by 1 bit.
The motivation to combine is the same as claim 15.
Therefore, the combination of Siu as modified in view of Rash and Vantrease teaches wherein the first input data element is a 9-bit or 17-bit integer; and wherein the shared multiplier is configured to support multiplication of at least one of 9-bit or 17-bit integers.
Allowable Subject Matter
Claims 3, 7 and 20 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims. 
The following is a statement of reasons for the indication of allowable subject matter:
In addition to the reasons for indication of allowable subject matter provided in the non-final office action submitted 3/10/2022, claim 3 would be allowable for substantially the same reason as claim 7 with respect to claimed feature of “a multiplexer configured to select between a first product and a second product for providing to the one or more adders based on the data type control signal selecting a data type”.
Response to Arguments
Applicant’s arguments, see remarks pages 3-7, filed 06/20/2022, with respect to the rejection of claims 1-2, 4-6, and 8-19 under 35 U.S.C. 103 have been fully considered and are persuasive.  Therefore, the rejection has been withdrawn.  However, upon further consideration, a new ground(s) of rejection is made in view of amendments made and newly found prior art reference.
In response to applicant’s argument with respect to the 35 U.S.C. 103 rejection of claims 1-2, 4-6, and 8-19, applicant amended independent claim 5 to include the features of “wherein a data type control signal identifies the integer multiplication or the non-integer multiplication, and wherein a non-shared sub-circuit, of the shared multiplier, is disabled based at least partly on the data type control signal”. Additionally applicant amended independent claim 1 to include the features of “wherein a data type control signal identifies the integer multiplication or the floating-point multiplication, and wherein a non-shared sub-circuit, of the shared multiplier, is disabled based at least partly on the data type control signal”, and applicant amended independent claim 17 to include the features of “wherein a first non-shared sub-circuit, of the shared multiplier, is disabled based at least partly on the data type control signal; and wherein the first non-shared sub-circuit is enabled and a second non-shared sub-circuit, of the shared multiplier, is disabled based at least partly on the changed data type control signal”. Applicant argues that Kaul and Rash, alone or in combination, do not teach or suggest the above-noted elements of Claim 5.
Examiner agrees in part. The feature of “wherein a data type control signal identifies the integer multiplication or the non-integer multiplication,” in claim 5 and “wherein a data type control signal identifies the integer multiplication or the floating-point multiplication” in claim 1 is disclosed at least by Kaul. Paragraph [0166] discloses floating-point and integer modes and switching between operation/computation modes. Furthermore, Fig. 17A shows that a mode control signal is input to various sub-circuits of the logic unit. However, examiner agrees that the feature of “wherein a non-shared sub-circuit, of the shared multiplier, is disabled based at least partly on the data type control signal” is not disclosed by Kaul or Rash. This feature is, however, disclosed in Siu. Siu teaches a functional unit, MMAD unit 220, that is configured to support both floating-point and integer operations in response to an opcode and opcode-generated control signals indicating the operation to be performed. Furthermore, Siu discloses using the opcode-generated control signals, which can include the opcode itself, to enable, disable, and otherwise control the operation of various circuit blocks of MMAD unit 220 in response to the opcode so that different operations can be performed using the same pipeline elements (paragraph [0071]), and placing sub-circuits into an inactive state to reduce power consumption (paragraph [0141]). Therefore, it would be obvious to modify the logic unit of Kaul and disable sub-circuits not used for an operation to reduce power consumption.
Furthermore, the feature of “wherein the first non-shared sub-circuit is enabled and a second non-shared sub-circuit, of the shared multiplier, is disabled based at least partly on the changed data type control signal” in claim 17 is also fairly disclosed in Siu because the exponent unit/path is used in floating-point operation as disclosed in paragraph [0145], therefore, it is implied or at least obvious to enable the exponent path when performing floating-point multiplication in order to calculate the correct result while disabling other sub-circuits that are bypassed or not used as disclosed in paragraph [0148] such as 504-506 in Fig. 5 in order to reduce power consumption. 
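The enabling and disabling of sub-circuits based on a data type control signal, as discussed in the two preceding paragraphs, can be illustrated with the sketch below. It is a simplified model under stated assumptions: the enumeration DataType and the sub-circuit labels shared_multiplier_array, exponent_path, and integer_only_path are hypothetical names chosen for illustration and are not taken from Kaul or Siu.

```python
from enum import Enum


class DataType(Enum):
    """Hypothetical values of the data type control signal."""

    INTEGER = "integer"
    FLOAT = "float"


def enabled_subcircuits(data_type: DataType) -> dict:
    """Map the control signal to an enable flag for each sub-circuit.

    The shared part is active in both modes; each non-shared part is
    active only in the mode that uses it and is disabled otherwise.
    """
    return {
        "shared_multiplier_array": True,
        "exponent_path": data_type is DataType.FLOAT,
        "integer_only_path": data_type is DataType.INTEGER,
    }
```

Changing the control signal from DataType.INTEGER to DataType.FLOAT in this sketch enables the exponent path and disables the integer-only path, while the shared part remains active.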
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Carlo Waje whose telephone number is (571)272-5767. The examiner can normally be reached 9:00-6:00 M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jyoti Mehta can be reached on (571) 270-3995. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/C.W./
Carlo Waje
Examiner, Art Unit 2182
(571)272-576


/JYOTI MEHTA/Supervisory Patent Examiner, Art Unit 2182