DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 04/22/2021 is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.
Drawings
The drawings are objected to as failing to comply with 37 CFR 1.84(p)(4) because reference characters “219” and “218” have both been used to designate the register receiving 235 in Figs. 2C-2D and 4-5.
The drawings are objected to as failing to comply with 37 CFR 1.84(p)(4) because:
A. reference character “218” has been used to designate both the input partial sum register and the data type register in Fig. 4; and
B. reference character “212” has been used to designate both the cached weight register and the skip calculation generator in Fig. 4.
The drawings are objected to as failing to comply with 37 CFR 1.84(p)(5) because they include the following reference character(s) not mentioned in the description: 
A. Reference characters 110 and 116 in Fig. 1
B. Reference character 252 in Figs. 2A-2D
Corrected drawing sheets in compliance with 37 CFR 1.121(d), or amendment to the specification to add the reference character(s) in the description in compliance with 37 CFR 1.121(b) are required in reply to the Office action to avoid abandonment of the application. Any amended 
Specification
Applicant is reminded of the proper content of an abstract of the disclosure.
A patent abstract is a concise statement of the technical disclosure of the patent and should include that which is new in the art to which the invention pertains. The abstract should not refer to purported merits or speculative applications of the invention and should not compare the invention with the prior art.
The abstract is objected to because the first three lines refer to purported merits of the invention. Further, the abstract contains speculative language such as “can have a shared multiplier” and “can have a separate and/or a shared circuitry”.
The specification is objected to under 37 C.F.R. 1.74, which requires the detailed description to refer to the different parts of the figures by use of reference letters or reference numerals. Implicit in this rule is that the detailed description correctly references the figures. In this application the figures and detailed description are inconsistent as explained below.
A. Starting at paragraph [0138] lines 4-5, reference 612 is referred to as an activation engine in multiple instances but is labeled as memory in Fig. 6.


Claim Objections
Claims 2, 5-16 are objected to because of the following:  
A. In claim 2 line 3, “integer multiplication” should read “the integer multiplication” instead.
B. In claim 2 line 5, “floating-point multiplication” should read “the floating-point multiplication” instead.
C. In claim 5 lines 7-8 “integer multiplication” should read “the integer multiplication” instead. Claims 6-16 inherit the same deficiency as claim 5 by reason of dependence.
D. In claim 6 line 2 “floating-point multiplication and integer multiplication” should read “the floating-point multiplication and the integer multiplication” instead.
E. In claim 7 line 7, “floating-point multiplication” should read “the floating-point multiplication” instead.
F. In claim 9 line 3, “integer addition” should read “the integer addition” instead.
Appropriate correction is required.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 3, 10, 13 and 16 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.


Claim 10 recites “wherein: each of the processing elements further includes a delay register coupled between the shared multiplier and the one or more adders” in lines 1-3. These limitations are unclear because of the phrase “wherein each of the processing elements further includes a delay register”. Independent claim 5 as currently written only recites a processing element comprising the claimed circuitry. Therefore, it is unclear whether each processing element in the systolic array includes the circuitry recited in claim 5 in addition to the delay register recited in claim 10 and, if not, what other circuitry is included in the other processing elements in the systolic array. Claim 12 recites “wherein each of the processing elements further includes a delay register coupled between the shared multiplier and the one or more adders” in lines 1-2. Claim 12 is rejected for the same reason. For purposes of examination, it is interpreted that each processing element includes the same circuitry.
Claim 13 recites “wherein: the shared multiplier is selectable to at least one of: a single 17-bit integer multiplication or two parallel 9-bit integer multiplications; and the shared multiplier is selectable to at least one of: a single 16-bit brain floating-point multiplication or a single 16-bit floating-point multiplication”. These limitations are unclear because it is unclear how “selectable to at least one of” is to be interpreted as this phrase seems to be incomplete. For example, is the shared multiplier selectable 
Claim 16 recites “wherein the shared multiplier is selectable to perform among at least: a single 17 bit integer multiplication, two or more parallel 9 bit integer multiplications, a single 16 bit brain floating-point multiplication, and a single 16 bit floating-point multiplication”. This limitation is unclear as a definition of “selectable” is “able to be selected” and the definition of “among” includes “in each of”, “in the group, class, or number of”, and “taken out of (a group)”. Therefore, it is unclear whether the claim is to be interpreted as the shared multiplier is able to be selected to perform each of or one of at least: a single 17 bit integer multiplication, two or more parallel 9 bit integer multiplications, a single 16 bit brain floating-point multiplication, and a single 16 bit floating-point multiplication. For purposes of examination, “among” is interpreted as “one of”.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 5-6, 8-10, 12, 14, and 16-18 are rejected under 35 U.S.C. 103 as being unpatentable over Kaul et al. (EP 3396524 A1), hereinafter Kaul, in view of Rash et al. (US-PGPUB 2021/0089316 A1), hereinafter Rash. Kaul is cited in the IDS submitted 04/22/2021.
Regarding claim 5, Kaul teaches 
a processing element  the processing element comprising (Kaul Fig. 17A and paragraphs [0166-0167] processing element – logic unit 1700):
a shared multiplier configured to support both integer multiplication and non-integer multiplication of an input data element and a weight, wherein a shared sub-circuit, of the shared multiplier, that is used to support integer multiplication is also used to support the non-integer multiplication (Kaul Fig. 17A and paragraphs [0166-0167] shared multiplier – multiplier 1702A-1702B; “FIG. 17A illustrates a logic unit 1700 including merged computation circuits to perform floating point and integer fused-multiply accumulate operations … Some of the illustrated circuits are shared between operational modes, including the signed multiplier 1702A-1702B and 32-bit adder 1704, which are used for both integer and floating-point modes”; paragraph [0124] “The neurons compute a dot product between the weights of the neurons and the region in the local input to which the neurons are connected”; input data element – input a; weight – input b);
one or more adders configured to receive one or more products from the shared multiplier, the one or more adders configured to support both integer addition and non-integer addition, wherein the one or more adders are configured to add an input partial sum to at least one product of the one or more products to result in at least one of an integer partial sum and a non-integer partial sum (Kaul Fig. 17A and paragraphs [0166-0167] one or more adders – circuits below shifter 1713 including adder 1704; one or more products from the shared multiplier – output of multiplier 1702A-; and
a selector circuit configured to select from among at least the integer partial sum and the non-integer partial sum to provide to an output port as an output partial sum (Kaul Fig. 17A selector circuit – multiplexer above 1730; the floating-point or integer result is provided to the output port based on the mode select signal input to the multiplexer).
Further, Kaul teaches that hardware accelerators for computer vision and machine learning can improve energy efficiency for applications such as object, face and speech recognition by orders of magnitude. These accelerators use interconnected processing element (PE) arrays, with multiply-add circuits being performance, area, and energy dominant for mapping key algorithms used for CNN compute operations. Further, Kaul discloses that the processing element shown in Fig. 17A can be used as a building block for a machine learning data processing system.
Kaul does not explicitly teach the processing element is coupled in a systolic array of processing elements and configured to communicate data with at least one neighboring processing element in the systolic array.
However, in the same field of endeavor, Rash discloses a systolic array of processing elements (PEs) in which each PE is configured to communicate data with at least one neighboring processing element in the systolic array. Further, Rash discloses that at each PE in the systolic array, an element of A and an element of B are multiplied and added to the incoming summand (from above in the Figure) and the outgoing sum is passed to the next row of PEs (or the final output) (Rash Fig. 6 and paragraphs [0084-0089]).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to combine the teachings of Kaul and Rash and implement the processing element of Kaul in a systolic array by creating multiple copies of the PE of Kaul and arranging the PEs in a grid comprising multiple rows and columns and configuring each PE to multiply the inputs 
The motivation to do so is that using the processing element of Kaul provides a systolic array that supports both integer and floating-point computations within the same PE for computer vision and machine learning applications. Further, realizing multiply-add circuits for a merged int16/float16 datapath reduces total area by up to 29% compared to conventional designs with separate integer/floating-point datapaths (Kaul paragraphs [0151] and [0162]).
Therefore, the combination of Kaul as modified in view of Rash teaches a processing element coupled in a systolic array of processing elements and configured to communicate data with at least one neighboring processing element in the systolic array.
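For illustration only, the data flow relied upon in the combination above can be sketched in Python as follows. This is a minimal behavioral sketch, not a model of the circuit of Kaul Fig. 17A or of Rash Fig. 6: the names ProcessingElement, run_column, and integer_mode are hypothetical, a single multiply stands in for the shared multiplier, and cycle-level timing of the systolic array is ignored.

```python
from dataclasses import dataclass


@dataclass
class ProcessingElement:
    """Hypothetical PE: one multiply-accumulate with a mode-selected output."""
    weight: float  # operand b held by this PE (illustrative assumption)

    def step(self, a, partial_sum_in, integer_mode):
        # One multiply stands in for the multiplier shared by both modes.
        product = a * self.weight
        # Integer path and floating-point path both add the incoming sum.
        int_sum = int(partial_sum_in) + int(product)
        float_sum = float(partial_sum_in) + float(product)
        # Selector: the mode signal decides which partial sum leaves the PE.
        return int_sum if integer_mode else float_sum


def run_column(pes, inputs, integer_mode):
    """Pass the outgoing partial sum of each PE to the next PE in the column."""
    partial_sum = 0
    for pe, a in zip(pes, inputs):
        partial_sum = pe.step(a, partial_sum, integer_mode)
    return partial_sum
```

In this sketch, a column of three PEs with weights 2, 3, and 4 and inputs 1, 2, and 3 accumulates 1*2 + 2*3 + 3*4 in integer mode, with each PE contributing one product to the running partial sum.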

Regarding claim 6, Kaul as modified in view of Rash teaches all the limitations of claim 5 as stated above. Further, Kaul teaches wherein the shared multiplier is configured to perform at least floating-point multiplication and integer multiplication (Kaul Fig. 17A and paragraphs [0166-0167] “FIG. 17A illustrates a logic unit 1700 including merged computation circuits to perform floating point and integer fused-multiply accumulate operations … Some of the illustrated circuits are shared between operational modes, including the signed multiplier 1702A-1702B and 32-bit adder 1704, which are used for both integer and floating-point modes”).

Regarding claim 8, Kaul as modified in view of Rash teaches all the limitations of claim 5 as stated above. Further, Kaul teaches 
a shared adder configured to perform an integer addition and at least a first part of a floating-point addition (Kaul Fig. 17A shared adder – adder 1704); and
a separate circuit configured to perform at least a second part of the floating-point addition (Kaul Fig. 17A separate circuit – normalize exponent block, 22b LZA, normalization shifter 1714, and negation incrementer 1742).

Regarding claim 9, Kaul as modified in view of Rash teaches all the limitations of claim 5 as stated above. Further, Kaul teaches wherein the one or more adders include:
a floating-point adder configured to perform floating-point addition (Kaul Fig. 17A and paragraphs [0166-0167] floating-point adder – adder 1704, normalize exponent block, 22b LZA, normalization shifter 1714, and negation incrementer 1742); and
an integer adder configured to perform integer addition (Kaul Fig. 17A and paragraphs [0166-0167] integer adder – adder 1704).

Regarding claim 10, Kaul as modified in view of Rash teaches all the limitations of claim 5 as stated above. Further, Kaul teaches 
each of the processing elements further includes a delay register coupled between the shared multiplier and the one or more adders (Kaul Fig. 17A delay register – register above 1705 between multiplier 1702A-1702B and adder 1704); and
the one or more adders are configured to receive one or more products from the shared multiplier on a subsequent systolic cycle, wherein the one or more products are generated by the shared multiplier during a prior systolic cycle (Kaul paragraph [0166] “For both computation modes, multiplication is performed in the first cycle and addition/rounding in the second cycle”).
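For illustration only, the two-cycle behavior quoted above (multiplication in a first cycle, addition in a second cycle, with a register between the two stages) can be sketched in Python as follows. The class PipelinedPE and its delay_register attribute are hypothetical names, and the sketch is a simplified behavioral model rather than a description of the circuit of Kaul Fig. 17A.

```python
class PipelinedPE:
    """Hypothetical two-stage PE: multiply in one cycle, add in the next."""

    def __init__(self):
        # Holds the product computed in the prior cycle (None while empty).
        self.delay_register = None

    def cycle(self, a, b, partial_sum_in):
        """Advance one cycle; returns a sum, or None while the pipeline fills."""
        out = None
        # Adder stage: consumes the product latched during the prior cycle.
        if self.delay_register is not None:
            out = partial_sum_in + self.delay_register
        # Multiplier stage: this product becomes visible to the adder next cycle.
        self.delay_register = a * b
        return out
```

In this sketch the first call returns no sum because the adder has nothing to consume yet; every later call adds the product formed one cycle earlier.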

Regarding claim 12, Kaul as modified in view of Rash teaches all the limitations of claim 5 as stated above. Further, Kaul teaches wherein each of the processing elements further includes a delay register coupled between the shared multiplier and the one or more adders (Kaul Fig. 17A delay register – register above 1705 between multiplier 1702A-1702B and adder 1704).

Regarding claim 14, Kaul as modified in view of Rash teaches all the limitations of claim 5 as stated above. Further, Kaul teaches wherein the selector circuit is configured to select among at least the integer partial sum and the non-integer partial sum based at least in part on a data type control signal (Kaul paragraph [0151] “A single control signal is utilized to switch, on a per-cycle basis, between floating-point and integer compute modes”; Fig. 17A data type control signal – mode select signal input to the multiplexer above 1730).

Regarding claim 16, Kaul as modified in view of Rash teaches all the limitations of claim 5 as stated above. Further, Kaul teaches wherein the shared multiplier is selectable to perform among at least: a single 17-bit integer multiplication, two or more parallel 9-bit integer multiplications, a single 16-bit brain floating-point multiplication, and a single 16-bit floating-point multiplication (Kaul paragraphs [0153-0154] and Fig. 17A “The merged floating-point units described herein can selectively perform 16-bit integer or floating-point operations on a per-cycle basis” where 16-bit floating-point operations correspond to the selectable single 16-bit floating-point multiplication).

Regarding claim 17, Kaul
obtaining a data type control signal indicating an integer data type (Kaul Fig. 17A and paragraphs [0166-0167] a data type control signal indicating an integer data type – computation mode signal indicating integer data type operations);
performing, by a shared multiplier, integer multiplication of a first input data element and an output of a weight register to generate an integer product (Kaul Fig. 17A shared multiplier – circuits ;
performing, by one or more adders, integer addition of a first input partial sum and the integer product to generate an integer partial sum (Kaul Fig. 17A paragraphs [0166-0167] one or more adders - circuits below shifter 1713 including adder 1704; first input partial sum – accumulator c value; integer partial sum – Int16 sum shown in Fig. 17A);
selecting, based at least partly on the data type control signal indicating the integer data type, the integer partial sum to provide as an output partial sum (Kaul Fig. 17A for integer operations indicated by the mode signal, the integer partial sum int16 would be output in output port 1730);
obtaining a changed data type control signal to indicate a non-integer data type (Kaul Fig. 17A and paragraphs [0166-0167] changed data type control signal to indicate a non-integer data type – computation mode signal indicating floating-point data type operations);
performing, by the shared multiplier, non-integer multiplication of a second input data element and the output of the weight register to generate a non-integer product (Kaul Fig. 17A and paragraphs [0166-0167] second input data element – input a during the floating-point data type operations; output of a weight register – input b during the floating-point data type operations; non-integer product – output of the multiplier during the floating-point data type operations);
performing, by the one or more adders, non-integer addition of a second input partial sum and the non-integer product to generate a non-integer partial sum (Kaul Fig. 17A and paragraphs [0166-0167] second input partial sum – accumulator c value; non-integer partial sum – float16 sum shown in Fig. 17A that is input to 1714 as sum [22:0]); and
selecting, based at least partly on the changed data type control signal indicating the non-integer data type, the non-integer partial sum to provide as an output partial sum (Kaul Fig. 17A for non-integer operations indicated by the mode signal, the floating-point partial sum float16 would be output in output port 1730).
Kaul does not explicitly teach the integer partial sum is provided as an output partial sum to another processing element and the non-integer partial sum is provided as an output partial sum to another processing element.
However, in the same field of endeavor, Rash discloses a systolic array of processing elements (PEs) in which each PE is configured to communicate data with at least one neighboring processing element in the systolic array. Further, Rash discloses that at each PE in the systolic array, an element of A and an element of B are multiplied and added to the incoming summand (from above in the Figure) and the outgoing sum is passed to the next row of PEs (or the final output) (Rash Fig. 6 and paragraphs [0084-0089]).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to combine the teachings of Kaul and Rash and implement the processing element of Kaul in a systolic array by creating multiple copies of the PE of Kaul and arranging the PEs in a grid comprising multiple rows and columns and configuring each PE to multiply the inputs a and b, add the result to an input partial sum from an upstream PE, and pass the outgoing partial sum to the next row of PEs consistent with the teachings of Rash. 
The motivation to do so is that using the processing element of Kaul provides a systolic array that supports both integer and floating-point computations within the same PE for computer vision and machine learning applications. Further, realizing multiply-add circuits for a merged int16/float16 datapath reduces total area by up to 29% compared to conventional designs with separate integer/floating-point datapaths (Kaul paragraphs [0151] and [0162]).
Therefore, the combination of Kaul as modified in view of Rash teaches the integer partial sum is provided as an output partial sum to another processing element and the non-integer partial sum is provided as an output partial sum to another processing element.
further comprising: performing, by multiplier exponent logic in the shared multiplier, a computation of exponent bits in the non-integer product (Kaul Fig. 17A multiplier exponent logic – precompute bigger mantissa and alignment 1705).

Claims 1, 2, and 4 are rejected under 35 U.S.C. 103 as being unpatentable over Kaul in view of Rash and Phelps et al. (US-PGPUB 2018/0336165 A1), hereinafter Phelps.
	Regarding claim 1, Kaul teaches 
	a processing element comprising:
a first input port for receiving an input data element (Kaul Fig. 17A and paragraph [0166] first input port receiving an input data element – input port receiving input a);
an output port for providing an output partial sum (Kaul Fig. 17A and paragraph [0166] output port providing an output partial sum – output port 1730);
a shared multiplier configured to multiply the input data element by the […] weight value, the shared multiplier comprising (Kaul Fig. 17A and paragraphs [0166-0167] shared multiplier – multiplier 1702A-1702B in mantissa unit 1709, and multiplexer 1707, and 1705 block in the exponent unit 1708; “FIG. 17A illustrates a logic unit 1700 including merged computation circuits to perform floating point and integer fused-multiply accumulate operations … Some of the illustrated circuits are shared between operational modes, including the signed multiplier 1702A-1702B and 32-bit adder 1704, which are used for both integer and floating-point modes”; paragraph [0124] “The neurons compute a dot product between the weights of the neurons and the region in the local input to which the neurons are connected”; input data element – input a; weight – input b):
a first sub-circuit configured to support integer multiplication (Kaul Fig. 17A and paragraphs [0166-0167] first sub-circuit - multiplier 1702A-1702B); and
a second sub-circuit configured to support floating-point multiplication, wherein at least a shared part of the first sub-circuit is shared with the second sub-circuit (Kaul Fig. 17A and paragraphs [0166-0167] second sub-circuit – multiplier 1702A-1702B in mantissa unit 1709, and multiplexer 1707, and 1705 block in the exponent unit 1708; “Some of the illustrated circuits are shared between operational modes, including the signed multiplier 1702A-1702B and 32-bit adder 1704, which are used for both integer and floating-point modes”);
one or more adders coupled to  (Kaul Fig. 17A and paragraphs [0166-0167] one or more adders – circuits below shifter 1713 including adder 1704):
generate an integer partial sum by performing integer addition on an integer product from the shared multiplier and the input partial sum (Kaul Fig. 17A and paragraphs [0166-0167] integer partial sum – Int16 sum in Fig. 17A; integer product – integer result of the integer multiplication; input partial sum – input C to the adder); and
generate a floating-point partial sum by performing floating-point addition on a floating-point product from the shared multiplier and the input partial sum (Kaul Fig. 17A and paragraphs [0166-0167] floating-point partial sum – float16 sum in Fig. 17A; floating-point product – floating-point result of the floating-point multiplication; input partial sum – input C to the adder); and
a selector circuit configured to select among at least the integer partial sum and the floating-point partial sum for providing to the output port as the output partial sum (Kaul Fig. 
Further, Kaul teaches that hardware accelerators for computer vision and machine learning can improve energy efficiency for applications such as object, face and speech recognition by orders of magnitude. These accelerators use interconnected processing element (PE) arrays, with multiply-add circuits being performance, area, and energy dominant for mapping key algorithms used for CNN compute operations. Further, Kaul discloses that the processing element shown in Fig. 17A can be used as a building block for a machine learning data processing system.
Kaul does not explicitly teach a systolic array of processing elements arranged in a first plurality of rows and a second plurality of columns, the systolic array of processing elements configured to operate on input data sets, each processing element of the processing elements comprising: a weight register for storing a stored weight value; a second input port for receiving an input partial sum; a shared multiplier configured to multiply the input data element by the stored weight value; one or more adders coupled to the second input port.
However, in the same field of endeavor, Rash discloses a systolic array of processing elements arranged in a first plurality of rows and a second plurality of columns, the systolic array of processing elements configured to operate on input data sets, each processing element of the processing elements comprising: a second input port for receiving an input partial sum. Further, Rash discloses that at each PE in the systolic array, an element of A and an element of B are multiplied and added to the incoming summand (from above in the Figure) and the outgoing sum is passed to the next row of PEs (or the final output) (Rash Fig. 6 and paragraphs [0084-0089]).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to combine the teachings of Kaul and Rash and implement the processing element of Kaul in a systolic array by creating multiple copies of the PE of Kaul and arranging 
The motivation to do so is that using the processing element of Kaul provides a systolic array that supports both integer and floating-point computations within the same PE for computer vision and machine learning applications. Further, realizing multiply-add circuits for a merged int16/float16 datapath reduces total area by up to 29% compared to conventional designs with separate integer/floating-point datapaths (Kaul paragraphs [0151] and [0162]).
Therefore, the combination of Kaul as modified in view of Rash teaches a systolic array of processing elements arranged in a first plurality of rows and a second plurality of columns, the systolic array of processing elements configured to operate on input data sets, each processing element of the processing elements comprising: a second input port for receiving an input partial sum; one or more adders coupled to the second input port.
Kaul as modified in view of Rash does not explicitly teach each processing element of the processing elements comprising: a weight register for storing a stored weight value; and the shared multiplier configured to multiply the input data element by the stored weight value.
However, in the same field of endeavor, Phelps teaches a processing element for a systolic array comprising a weight register for storing a weight input and multiplication circuitry that is used to multiply the weight input from the weight register 402 with the activation input from the activation register 406 (Phelps Fig. 4 and paragraphs [0077-0078]).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to modify Kaul in view of Rash using Phelps and configure each PE 
The motivation to include a weight register is to statically store the weight input such that as activation inputs are transferred to the cell, e.g., through the activation register 406, over multiple clock cycles, the weight input remains within the cell and is not transferred to an adjacent cell. Therefore, the weight input can be applied to multiple activation inputs, e.g., using the multiplication circuitry 408, and respective accumulated values can be transferred to an adjacent cell (Phelps paragraph [0082]).
Therefore, the combination of Kaul as modified in view of Rash and Phelps teaches each processing element of the processing elements comprising: a weight register for storing a stored weight value; and the shared multiplier configured to multiply the input data element by the stored weight value.
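For illustration only, the weight-stationary behavior relied upon from Phelps (the weight input stays in the cell while activation inputs stream through over multiple clock cycles) can be sketched in Python as follows. The class WeightStationaryCell and its method names are hypothetical and are not taken from Phelps; the sketch is behavioral only.

```python
class WeightStationaryCell:
    """Hypothetical cell that keeps its weight while activations stream in."""

    def __init__(self):
        self.weight_register = 0  # loaded once, then reused across cycles

    def load_weight(self, weight):
        # The weight stays in this cell; it is not forwarded to a neighbor.
        self.weight_register = weight

    def step(self, activation, sum_in):
        # The stored weight is applied to each new activation input, and the
        # accumulated value is what would be passed to the adjacent cell.
        return sum_in + self.weight_register * activation
```

In this sketch a single load_weight call is followed by any number of step calls, so the same stored weight is applied to successive activation inputs without being reloaded.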

Regarding claim 2, Kaul as modified in view of Rash and Phelps teaches all the limitations of claim 1 as stated above. Further, Kaul as modified in view of Rash and Phelps teaches wherein:
the first sub-circuit is configured to generate a first product by performing integer multiplication on the input data element and the stored weight value (Kaul Fig. 17A and paragraphs [0166-0167] first product – output of multiplier 1702A-1702B when the mode indicates an integer operation);
the second sub-circuit is configured to generate a second product by performing floating-point multiplication on the input data element and the stored weight value, wherein the second product includes a significand and an exponent (Kaul Fig. 17A and paragraphs [0153] and [0166-0167] second product – output of multiplier 1702A-1702B when the mode indicates a floating-point operation; further, the floating-point data type includes a significand part and an exponent part; significand part – float16 [9:0] generated by the multiplier 1702A-1702B; exponent part – 5-bit exponent input into 1713); and
the shared part of the first sub-circuit that is shared with the second sub-circuit is used in generating the first product and is used in generating the significand of the second product (Kaul Fig. 17A and paragraphs [0153, and 0166-0167]; Fig. 17A shows the fractional part of the second product float16 [9:0] is generated by the multiplier 1702A-1702B and exponent part is generated by the exponent unit input into 1713).
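For illustration only, the general idea that one integer multiplier can produce both an integer product and the significand of a floating-point product, with the exponent handled by separate logic, can be sketched in Python as follows. The names int_multiply, float_multiply, and SIG_BITS are hypothetical. The sketch assumes operands whose significands fit exactly in 11 bits, and it omits the rounding, normalization, and special-value handling that a real float16 datapath performs; it is not a model of Kaul Fig. 17A.

```python
import math

SIG_BITS = 11  # assumed significand width (float16 including the hidden bit)


def int_multiply(x, y):
    """Stand-in for the integer multiplier shared by both modes."""
    return x * y


def float_multiply(x, y):
    """Floating-point multiply built on the shared integer multiplier."""
    # Split each operand: x == mx * 2**ex with 0.5 <= |mx| < 1 (or mx == 0).
    mx, ex = math.frexp(x)
    my, ey = math.frexp(y)
    # Scale the fractional significands to integers (exact for 11-bit inputs).
    ix = int(mx * (1 << SIG_BITS))
    iy = int(my * (1 << SIG_BITS))
    # Significand path: reuse the same integer multiplier.
    significand = int_multiply(ix, iy)
    # Exponent path: separate addition, corrected for the two scalings.
    exponent = ex + ey - 2 * SIG_BITS
    return math.ldexp(significand, exponent)
```

In this sketch int_multiply(3, 4) serves the integer mode directly, while float_multiply(1.5, 2.5) routes the scaled significands 1536 and 1280 through the same function and adds the exponents separately.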

Regarding claim 4, Kaul as modified in view of Rash and Phelps teaches all the limitations of claim 1 as stated above. Further, Kaul as modified in view of Rash and Phelps teaches wherein the one or more adders includes:
a shared adder part configured to both (Kaul Fig. 17A and paragraphs [0166-0167] shared adder – adder 1704):
generate the integer partial sum that is an integer data type by adding the integer product generated to the input partial sum (Kaul Fig. 17A - integer partial sum – Int16 sum in output port 1730); and
calculate one or more portions of the floating-point partial sum (Kaul Fig. 17A - one or more portions of the floating-point partial sum – mantissa part of the float16 sum in output port 1730); and
a separate circuit that is separate from the shared adder part, wherein the shared adder part and the separate circuit are used together in generating the floating-point partial sum that is a floating-point data type (Kaul Fig. 17A separate circuit - normalize exponent block, 22b LZA, normalization shifter 1714, and negation incrementer 1742).

Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Kaul in view of Rash as applied to claim 5 above, and further in view of Snelgrove et al. (US-PGPUB 2021/0091794 A1), hereinafter Snelgrove.
Regarding claim 11, Kaul as modified in view of Rash teaches all the limitations of claim 5 as stated above.
	Kaul does not explicitly teach further comprising a skip calculation generator configured to propagate a skip calculation signal to a plurality of processing elements, wherein the skip calculation signal is configured to prevent at least one of the shared multiplier or the one or more adders for a systolic cycle from contributing to arithmetic computations of the systolic array during the systolic cycle.
	However, in the same field of endeavor, Snelgrove teaches a processing element comprising a skip calculation generator configured to propagate a skip calculation signal to a plurality of processing elements, wherein the skip calculation signal is configured to prevent at least one of a multiplier or an adder for a processing/clock cycle from contributing to arithmetic computations (Snelgrove Fig. 21 and paragraphs [0143-0154] skip calculation generator – zero detect 2116 and zero disable 2110; multiplier – ALU 2102 “The ALU 2102 may include one or more levels of multiplexor and/or a multiplier 2108”).
	Accordingly, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Kaul in view of Rash using Snelgrove and configure each processing element of Kaul to include circuitry for zero detect and zero disable to skip or disable zero multiplication and accumulation operations as taught by Snelgrove.
	The motivation to do so is that multiplication by zero produces a zero product, which does not need to be accumulated. As such, the zero disable saves energy, as each PE uses significantly more energy when an input changes than when the inputs do not change (Snelgrove paragraph [0152]).
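The zero-skip behavior relied on in this combination can be illustrated with a short sketch. It is only a behavioral model under stated assumptions: the class and function names are hypothetical, a single zero-detect result is shared by every processing element in a row, and a counter stands in for the switching activity that the skip signal is meant to avoid.

```python
class ProcessingElement:
    """Hypothetical weight-stationary PE with a skip input."""

    def __init__(self, weight: int):
        self.weight = weight
        self.mac_ops = 0  # counts multiply-accumulate operations actually performed

    def step(self, x: int, partial_sum_in: int, skip: bool) -> int:
        if skip:
            # Multiplier and adder are bypassed; the partial sum passes through.
            return partial_sum_in
        self.mac_ops += 1
        return partial_sum_in + x * self.weight


def run_row(pes, x: int, partial_sums_in):
    """Feed one input element to every PE in a row."""
    skip = (x == 0)  # zero detect performed once, then propagated along the row
    return [pe.step(x, p, skip) for pe, p in zip(pes, partial_sums_in)]
```

Feeding a zero input leaves every partial sum unchanged and performs no multiply-accumulate work, which is the energy-saving effect cited as the motivation.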
Claims 13, 15, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Kaul in view of Rash as applied to claims 5 and 17 above respectively, and further in view of Vantrease et al. (US-PGPUB 2019/0294413 A1), hereinafter Vantrease.
Regarding claim 15, Kaul as modified in view of Rash teaches all the limitations of claim 5 as stated above.
Kaul does not explicitly teach further comprising a partial de-quantizer configured to partially de-quantize the input data element for a row of the processing elements, wherein partially de-quantizing the input data element includes shifting the input data element and adding an extra bit to the input data element.
However, in the same field of endeavor, Vantrease discloses a systolic array circuit further comprising a partial de-quantizer configured to partially de-quantize the input data element for a row of the processing elements, wherein partially de-quantizing the input data element includes shifting the input data element and adding an extra bit to the input data element (Vantrease Fig. 9 and paragraphs [0107-0112] partial de-quantizer – subtraction engines 930a-930d “Each row may also include a subtraction engine 930a, 930b, 930c, ... , or 930d that takes quantized inputs (e.g., in UINT8 format) and zero-point integers (e.g., in UINT8 format) from a memory or a buffer … The subtraction engine … subtract zero-point integer Xqz 938 from input data element Xq 932 to generate a shifted input data element Xq_shift 942 in INT9 format”).

The motivation to do so is that several benefits may be achieved by using subtraction engines in front of the PE array to shift the unsigned integer inputs to generate difference values. For example, the input data may be stored in the memory in UINT8 format, which may allow easier hardware design and more efficient storage and management than data in the INT9 format. Further, a real value 0 that is quantized asymmetrically to a non-zero unsigned integer (i.e., zero-point integer) with no quantization error may be converted to a signed integer value 0 by the subtraction engine before the multiplication (Vantrease paragraph [0112]).
Therefore, the combination of Kaul as modified in view of Rash and Vantrease teaches further comprising a partial de-quantizer configured to partially de-quantize the input data element for a row of the processing elements, wherein partially de-quantizing the input data element includes shifting the input data element and adding an extra bit to the input data element.
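The arithmetic behind the cited subtraction engine can be shown with a brief sketch: subtracting an unsigned 8-bit zero-point from an unsigned 8-bit input yields a difference in the range -255 to 255, which needs one extra bit (a signed 9-bit value). The function names below are hypothetical and the sketch models only the arithmetic, not Vantrease's hardware.

```python
def partial_dequantize(xq: int, zero_point: int) -> int:
    """Shift a UINT8 input by its UINT8 zero-point, giving a signed difference."""
    if not (0 <= xq <= 255 and 0 <= zero_point <= 255):
        raise ValueError("both operands must fit in UINT8")
    return xq - zero_point  # lies in [-255, 255]


def fits_signed(value: int, bits: int) -> bool:
    """True if value is representable as a two's-complement integer of `bits` bits."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1
```

For instance, partial_dequantize(128, 128) returns 0, so a real value 0 stored as the zero-point maps to integer 0, while partial_dequantize(0, 255) returns -255, which fits in 9 signed bits but not in 8.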

Regarding claim 13, Kaul as modified in view of Rash teaches all the limitations of claim 5 as stated above. Further, Kaul as modified in view of Rash and Vantrease teaches wherein:
the shared multiplier is selectable to at least one of: a single 16-bit brain floating-point multiplication or a single 16-bit floating-point multiplication (Kaul Fig. 17A and paragraphs [0153-0154 and 0162] the shared multiplier is configured to support a single 16-bit floating-point multiplication).
Kaul does not explicitly teach wherein: the shared multiplier is selectable to at least one of: a single 17-bit integer multiplication or two parallel 9-bit integer multiplications.
However, in the same field of endeavor, Vantrease discloses a systolic array circuit further comprising a partial de-quantizer configured to partially de-quantize the input data element for a row of the processing elements, wherein partially de-quantizing the input data element includes shifting the input data element and adding an extra bit to the input data element (Vantrease Fig. 9 and paragraphs [0107-0112] partial de-quantizer – subtraction engines 930a-930d “Each row may also include a subtraction engine 930a, 930b, 930c, ... , or 930d that takes quantized inputs (e.g., in UINT8 format) and zero-point integers (e.g., in UINT8 format) from a memory or a buffer … The subtraction engine … subtract zero-point integer Xqz 938 from input data element Xq 932 to generate a shifted input data element Xq_shift 942 in INT9 format”).
Accordingly, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Kaul in view of Rash using Vantrease and configure the systolic array to include a subtraction engine connected to each row of processing elements, as shown in Fig. 9 of Vantrease, for partially de-quantizing the input data element, including shifting the input data element and adding an extra 1 bit to the input data element, such as generating 17-bit shifted integer input data a from the 16-bit integer input data a as taught by Vantrease, which can then be used as input to the multiplier 1702A-1702B.
The motivation to combine is the same as claim 15.
Therefore, the combination of Kaul as modified in view of Rash and Vantrease teaches wherein: the shared multiplier is selectable to at least one of: a single 17-bit integer multiplication or two parallel 9-bit integer multiplications.
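One way to picture a multiplier that is selectable between a single 17-bit integer multiplication and two parallel 9-bit integer multiplications is sketched below. The packing of two 9-bit lanes into one operand word, the two_lanes flag, and all function names are assumptions made for illustration; neither the claims nor the cited references are being characterized as using this layout.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def pack_lanes(hi: int, lo: int, bits: int = 9) -> int:
    """Pack two signed `bits`-wide lanes into one operand word (hypothetical layout)."""
    mask = (1 << bits) - 1
    return ((hi & mask) << bits) | (lo & mask)


def selectable_multiply(a: int, w: int, two_lanes: bool):
    """Return one 17-bit product, or two 9-bit lane products when two_lanes is set."""
    if two_lanes:
        a_hi, a_lo = sign_extend(a >> 9, 9), sign_extend(a, 9)
        w_hi, w_lo = sign_extend(w >> 9, 9), sign_extend(w, 9)
        return (a_hi * w_hi, a_lo * w_lo)
    return (sign_extend(a, 17) * sign_extend(w, 17),)
```

Under these assumptions, selectable_multiply(pack_lanes(-3, 5), pack_lanes(7, -2), True) yields the two lane products (-21, -10), while the same routine with two_lanes set to False treats each operand as one signed 17-bit integer.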

further comprising: 
receiving the first input data element […] (Kaul Fig. 17A shows receiving input a through the first input port); and
performing multiplication on the first input data element and the output of the weight register by the shared multiplier (Kaul Fig. 17A and paragraph [0166]).
Further, Kaul teaches that the processing element supports 16-bit integer operations (Kaul paragraphs [0154 and 0166]).
Kaul does not explicitly teach wherein the first input data element is a 9-bit or 17-bit integer; and wherein the shared multiplier is configured to support multiplication of at least one of 9-bit or 17-bit integers.
However, in the same field of endeavor, Vantrease discloses a systolic array circuit further comprising a partial de-quantizer configured to partially de-quantize the input data element for a row of the processing elements, wherein partially de-quantizing the input data element includes shifting the input data element and adding an extra bit to the input data element (Vantrease Fig. 9 and paragraphs [0107-0112] partial de-quantizer – subtraction engines 930a-930d “Each row may also include a subtraction engine 930a, 930b, 930c, ... , or 930d that takes quantized inputs (e.g., in UINT8 format) and zero-point integers (e.g., in UINT8 format) from a memory or a buffer … The subtraction engine … subtract zero-point integer Xqz 938 from input data element Xq 932 to generate a shifted input data element Xq_shift 942 in INT9 format”).
Accordingly, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Kaul in view of Rash using Vantrease and configure the systolic array of Rash to include a subtraction engine connected to each row of processing elements, as shown in Fig. 9 of Vantrease, for partially de-quantizing the input data element including shifting the 
The motivation to combine is the same as claim 15.
Therefore, the combination of Kaul as modified in view of Rash and Vantrease teaches wherein the first input data element is a 9-bit or 17-bit integer; and wherein the shared multiplier is configured to support multiplication of at least one of 9-bit or 17-bit integers.
Allowable Subject Matter
Claims 7 and 20 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims. 
The following is a statement of reasons for the indication of allowable subject matter:
Claim 7 is directed to a systolic circuit comprising, among other things, a multiplexer configured to select among a plurality of products generated by the shared multiplier for providing to the one or more adders based on a data type control signal, wherein the plurality of products includes at least an integer product and a floating-point product. Claim 20 is directed to a method of operating the systolic circuit comprising, among other things, performing the integer addition by an integer adder; and performing the non-integer addition by a non-integer adder that is separate from the integer adder.
Kaul et al. (EP 3396524 A1) discloses the claimed subject matter in accordance with the claim mappings discussed above. Further, Kaul discloses wherein the shared multiplier is configured to, based at least in part on the data type control signal, prevent circuitry for floating-point multiplication from contributing to an operation as shown in Fig. 17, where switches 1701A and 1702B route the upper 6 bits of the integer to the multiplier for integer operation while in the floating-point operation, the a multiplexer configured to select among a plurality of products generated by the shared multiplier for providing to the one or more adders based on a data type control signal, wherein the plurality of products include at least an integer product and a floating-point product as recited in claim 7. As shown in Fig. 17A, the multiplier only generates or outputs one product; therefore, it would not be obvious to modify Kaul and configure the multiplier to generate a floating-point product and an integer product on the same cycle and include a multiplexer to select which product to provide to the adder. Further, Kaul does not explicitly teach or suggest performing the integer addition by an integer adder; and performing the non-integer addition by a non-integer adder that is separate from the integer adder as recited in claim 20. As shown in Fig. 17A of Kaul and paragraph [0166], the integer and floating-point addition is performed using shared circuitry including the same adder 1704, which provides a single result that is routed to different circuitry of the adder circuit. Further, as shown in Figures 10B and 10D of the present application, using separate adders for the integer addition and the non-integer (i.e., floating-point) addition requires generating both the integer and non-integer products. 
Therefore, it is not obvious to modify Kaul to include separate adders, as this would require generating two partial sum results and would also require generating two products as inputs to each respective adder as shown in Figs. 10B and 10D. 
Choquette (US Patent No. 6,480,872 B1) discloses a device that supports floating-point and integer multiply-accumulate operations. The device includes a multiply array capable of performing floating-point and integer multiplication. The device performs floating-point multiplication when the operands are in floating-point format, or performs integer multiplication when the operands are in integer format or in response to a control signal. The device also includes a first adder and a second adder to add the multiplication result with a third operand. Further, Choquette discloses a configuration in which the second adder is configured to implement floating-point additions and integer a multiplexer configured to select among a plurality of products generated by the shared multiplier for providing to the one or more adders based on a data type control signal, wherein the plurality of products include at least an integer product and a floating-point product as recited in claim 7. Just like Kaul, the multiplier of Choquette only generates or outputs one product, which includes the sum and carry portion. Furthermore, although Choquette discloses using two separate adders, there is no motivation to modify Kaul to include separate adders for performing the integer addition and the non-integer addition, as this would require generating two partial sum results and would also require generating two products as inputs to each respective adder as shown in Figs. 10B and 10D. Choquette is cited in the IDS submitted 04/22/2021.
Wyland et al. (US Patent No. 6,205,462 B1) discloses a multiply-accumulate circuit that supports both integer and floating-point operands. The integer operands are represented using floating-point format with the exponent bits set to all zeros or ones to represent positive or negative integers. The circuit includes a multiplier configured to multiply the mantissa bits to produce a product. The circuit also includes exponent logic which detects whether the exponents are all zeroes or ones, which indicates integer operands, and if not, passes the exponent values of the operands to an adder to produce an exponent sum used for shifting the floating-point product. The circuit also includes an adder to add the product with a partial result stored in an accumulator. However, Wyland does not explicitly teach or suggest the circuit comprising a multiplexer configured to select among a plurality of products generated by the shared multiplier for providing to the one or more adders based on a data type control signal, wherein the plurality of products include at least an integer product and a floating-point product as recited in claim 7. The circuit is shown in Fig. 2. Similar to Kaul, the multiplier 142 of Wyland only generates or outputs one product 143. Further, Wyland does not explicitly teach or suggest performing the integer addition by an integer adder; and performing the non-integer addition by a non-integer adder that is separate from the integer adder as recited in claim 20. As shown in Fig. 2 of Wyland, the integer and floating-point addition is performed using shared circuitry including the same adder 148, which provides a single result that is routed to the accumulator. 
Pugh et al. (US-PGPUB 2021/0042087 A1) discloses an arithmetic circuit that supports both floating-point and integer multiply-accumulate operations. The circuit includes multipliers and adders configured to operate on integer operands or floating-point operands in response to a mode selection signal. Further, the circuit includes multiplexers, delay registers, and remap logic that remaps the operands to a format used by the multipliers and adders. However, Pugh does not explicitly teach or suggest the circuit comprising a multiplexer configured to select among a plurality of products generated by the shared multiplier for providing to the one or more adders based on a data type control signal, wherein the plurality of products include at least an integer product and a floating-point product as recited in claim 7. Further, Pugh does not explicitly teach or suggest performing the integer addition by an integer adder; and performing the non-integer addition by a non-integer adder that is separate from the integer adder as recited in claim 20.
Rash et al. (US-PGPUB 2021/0089316 A1) discloses a systolic array of processing elements (PE) that includes multiple rows and multiple columns. Further, Rash discloses the processing flow within the systolic array in which each processing element multiplies operands a and b, adds the result to an input partial sum, and outputs the outgoing sum to a neighboring PE. Phelps et al. (US-PGPUB 2018/0336165 A1) discloses a processing flow of a systolic array similar to Rash. Further, Phelps discloses a processing element that includes a weight register as shown in Fig. 4. Vantrease et al. (US-PGPUB US 2019/0294413 multiplexer configured to select among a plurality of products generated by the shared multiplier for providing to the one or more adders based on a data type control signal, wherein the plurality of products include at least an integer product and a floating-point product as recited in claim 7. Further, neither Rash, Phelps, Vantrease, nor Snelgrove explicitly teaches or suggests performing the integer addition by an integer adder; and performing the non-integer addition by a non-integer adder that is separate from the integer adder as recited in claim 20.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Carlo Waje whose telephone number is (571)272-5767. The examiner can normally be reached 9:00-6:00 M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jyoti Mehta can be reached on (571) 270-3995. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/C.W./
Carlo Waje
Examiner, Art Unit 2182
(571)272-576

/JYOTI MEHTA/
Supervisory Patent Examiner, Art Unit 2182