Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Status of Claims
This action is in reply to the amendments and remarks filed on 10/25/2022.
Claims 1-20 are pending.
Claims 2, 7, and 10-12 have been amended.  

Response to Arguments
Applicant’s arguments, with respect to the drawing objections, have been fully considered and are persuasive. Therefore, the objections set forth in the previous office action have been withdrawn.

Applicant’s arguments, with respect to select rejections of claim(s) 12 under 35 U.S.C. 112(b), have been fully considered and are persuasive. Therefore, the rejections have been withdrawn.

Applicant’s arguments, with respect to select rejection(s) of claim(s) 2-20 under 35 U.S.C. 112(b), have been fully considered but they are not persuasive. Examiner notes that not all of the rejection(s) of claims 2-20 were addressed; the unaddressed rejection(s) are therefore maintained. See 35 U.S.C. 112(b) section below for updated analysis.

Applicant’s arguments, with respect to the publication dates of specific prior art under 35 U.S.C. 103, have been considered but are not persuasive. The applicant argues that “the Office Action has not provided evidence that [the Kim and Gao references are]…a printed publication having a publication date that predates the effective filing date of the present application”. The examiner respectfully disagrees. 
In the “List of references cited by examiner” filed 07/25/2022, the listed publication dates predate not only the instant application’s EFD of 04/18/2019, but also the claimed priority date of 12/07/2018.
Further, the references in question are cited below for assistance to the applicant:
Kim et al., "A novel zero weight/activation-aware hardware architecture of convolutional neural network", 2017, Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1462-1467.
Gao et al., “TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory”, 2017, ASPLOS '17: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, Pages 751–764.

Applicant’s arguments, with respect to the rejection(s) of claim 1 under 35 U.S.C. 103, have been considered but are not persuasive. The applicant argues that no reference teaches the claim limitation of claim 1 stating “tensor computation dataflow accelerator semiconductor circuit”, since Das cited portions “do not mention a tensor computation dataflow accelerator semiconductor circuit”. The examiner respectfully disagrees. 
Das is cited as teaching the components making up the tensor computation dataflow accelerator semiconductor circuit, thus teaching the required language of the claim. Nonetheless, due to the broadness of the claim language, Das is found to teach the limitation above since paragraphs 0138-0140 teach a “Hardware acceleration for the machine learning application” on a “GPGPU” (accelerator semiconductor circuit) and executing “tensor convolutions” (tensor computation dataflow) applied to the embodiments of the disclosure as such.
See 35 U.S.C 103 section for full mapping of claim limitations.

Applicant’s arguments, with respect to the rejection(s) of claim 2 under 35 U.S.C. 103, have been considered but are not persuasive. The applicant argues that no reference teaches the claim limitations of claim 2 since “in Li’s FIG. 10…the MACs 131..are not adjacent to the second buffer 133”. The examiner respectfully disagrees. 
Due to the broadness of the claim language, Li has been found to teach the claim limitations as required by the claim language. Li, page 6, line 31 teaches the “MACs 131” in close proximity to the “second buffer 133”, thus meeting the claim language. Further, applicant points to Li Fig. 10; however, this figure also depicts the components in close proximity.
See 35 U.S.C 103 section for full mapping of claim limitations necessitated by applicant amendments.

Applicant’s arguments, with respect to the rejection(s) of claims 4-6 and 9 under 35 U.S.C. 103, have been considered but are not persuasive. The applicant argues that no reference teaches the claim limitations since “nowhere does Li teach a weight matrix vector”, “that the weights are initialized, which is different than merely having an initial value”, “that [Li’s] input is an input matrix vector”, and that “‘weights’ are not necessarily a weight matrix vector”. The examiner respectfully disagrees. 
Due to the broadness of the claim language and high-level arguments posed against the mappings without further details (i.e., how initialized weights are different than determining an initial weight value), Li has been found to teach amended claim limitations as required by the claim language. Li, page 6 and Fig. 11 teach a “matrix of input data” (input is an input matrix vector), a matrix representing “weights of the filter” (weight matrix vector), and that the weights are stored in a “weight data buffer”. Further, previously cited page 3, lines 19-21, 24-25, and 31, page 5, lines 1-2, and page 6 teach the limitations as previously mapped.
See 35 U.S.C 103 section for full mapping of claim limitations necessitated by applicant amendments.

Applicant’s arguments, with respect to the rejection(s) of claim 10 under 35 U.S.C. 103, have been considered but are not persuasive. The applicant argues that no reference teaches the amended claim limitations since “Nowhere do Li’s paragraphs 0007-0008 disclose the specific limitations of the amended claim”. The examiner respectfully disagrees. 
Due to the broadness of the claim language and high-level arguments of not teaching the amendments, Li has been found to teach amended claim limitations as required by the claim language. Li, (Pg. 3, Lines 21-22) “and the partial sum adder is configured to iteratively add the second intermediate data to the corresponding partial sum stored in the second buffer and store the partial sum calculated for each iteration as a partial sum of the output data into the second buffer.”
(Pg. 5, Paragraph 7 – Paragraph 8) In general, these two paragraphs discuss the processing elements being linked together to create a chain of processing elements that carry out the convolution operations. This, along with the citation above, teaches that the process can be iterated across multiple processing elements, with each subsequent processing element summing the outputs of the previous processing element(s).
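For illustration only, the iterative partial-sum accumulation across a chain of processing elements described in the cited passages of Li can be sketched as below. This is a minimal sketch and not Li's implementation; the function name `chain_accumulate` and the use of scalar inputs and weights are assumptions introduced solely for the example.

```python
def chain_accumulate(inputs, weights):
    """Model a chain of processing elements (hypothetical helper).

    Each element multiplies its own input by its own weight and adds the
    result to the partial sum received from the previous element.
    """
    partial_sum = 0
    trace = []  # partial sum leaving each processing element, in chain order
    for x, w in zip(inputs, weights):
        partial_sum = partial_sum + x * w
        trace.append(partial_sum)
    return partial_sum, trace


total, trace = chain_accumulate([1, 2, 3], [4, 5, 6])
print(total, trace)
```

In this sketch the last element of the chain holds the full accumulated sum, while `trace` records the intermediate partial sums that are passed from one element to the next.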
See 35 U.S.C 103 section for full mapping of claim limitations necessitated by applicant amendments.

Applicant’s arguments, with respect to the rejection(s) of claim 11 under 35 U.S.C. 103, have been considered but are not persuasive. The applicant argues that no reference teaches the claim limitations of claim 11 since “Kim does not teach a pipelined propagation of partial sums”. The examiner respectfully disagrees. 
Due to the broadness of the claim language, Kim has been found to teach the claim limitations as required by the claim language. Kim, page 1464, Col. 2, Paragraphs 1-3 and 6-7 teach “Moreover, in order to further increase the overlapping between current and next activation tiles, we adopt a zig-zag order in visiting the activation tiles in a WG. After the current activation tile is processed, we first move the window horizontally to the right or left and process the next activation tile. Then, when it reaches the horizontal edge of the input activation, we move the window down by a stride and again move horizontally in the opposite direction in a zigzag fashion to choose the next activation tile (in Fig. 3b)”.
This citation from Kim teaches the serpentine (zig-zag) pattern being used to propagate produced “partial sum” data through to the processing elements. Examiner notes that while the reference teaches using the partial sums, the primary reference also teaches the partial sums being propagated from one processing element to the next and this reference is relied upon to teach the serpentine fashion for data propagation to processing elements.
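For illustration only, the zig-zag (serpentine) visiting order quoted from Kim above can be sketched as below. This is a minimal sketch and not Kim's implementation; the function name `zigzag_tile_order` and the representation of tiles as (row, column) indices on a rectangular grid are assumptions introduced solely for the example.

```python
def zigzag_tile_order(rows, cols):
    """Return (row, col) tile indices in a zig-zag visiting order (hypothetical helper)."""
    order = []
    for r in range(rows):
        # Even rows sweep left to right; odd rows sweep right to left,
        # so the window reverses direction each time it moves down a row.
        if r % 2 == 0:
            col_range = range(cols)
        else:
            col_range = range(cols - 1, -1, -1)
        for c in col_range:
            order.append((r, c))
    return order


print(zigzag_tile_order(2, 3))
# [(0, 0), (0, 1), (0, 2), (1, 2), (1, 1), (1, 0)]
```

Consecutive tiles in the returned order always share an edge, which corresponds to the increased overlap between the current and next tiles described in the quoted passage.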
See 35 U.S.C 103 section for full mapping of claim limitations necessitated by applicant amendments.

Applicant’s arguments, with respect to the rejection(s) of claims 12-13 under 35 U.S.C. 103, have been considered but are not persuasive. The applicant argues that no reference teaches the claim limitations of the claims since “nowhere does Gao teach that the PE passes in a serpentine fashion”, Gao “being capable of performing MAC operations is not the same as a systolic MAC array”, Gao’s “HMC memory dies are not NDP-DF accelerator unit dies” as claimed. The examiner respectfully disagrees. 
Due to the broadness of the claim language and high-level arguments of the references not teaching the limitations with no further details as to how or why, Gao, Hyde, and newly cited Wu in view of the amendments have been found to teach amended claim limitations as required by the claim language. 
Applicant’s specification shows the serpentine fashion to still be a grid, but going from column to column and then to the next row. Gao’s diagram shows the same vault setup, with each PE in Fig. 3 passing in a serpentine fashion. Further, new reference Wu has been cited as teaching the amendments concerning the serpentine fashion. Additionally, Gao, in page 2, Col. 1, paragraph 1, and page 4, Col. 1, section 3.1, was found to teach an array of processing elements (systolic MAC array) that, as section 4.1 teaches, can be scheduled for certain rhythmic intervals, thus teaching the claimed language. Further still, new reference Wu has been cited for teaching these claim elements as amended.
Gao continues to teach (Pg. 2, Col. 1, Paragraph 1) “We combine these hardware and software optimizations in the design of TETRIS, an NN accelerator that uses an eight-die HMC memory stack”, to which the applicant offers no further details as to how this does not read on the claimed elements, but merely offers a high-level argument of Gao not teaching.
See 35 U.S.C 103 section for full mapping of claim limitations necessitated by applicant amendments.

Applicant’s arguments, with respect to the rejection(s) of claims 14-17 under 35 U.S.C. 103, have been considered but are not persuasive. The applicant argues that no reference teaches the claim limitations of the claims since “neither Hyde” or any reference “disclose NDP-DP accelerator unit die, let alone a stack of them atop a base die”, and “Nowhere does Hyde disclose a host, much less NDP-DP accelerator unit dies and the base die are configured to offload computation from a host that is separate from the tensor computation dataflow accelerator semiconductor circuit”. The examiner respectfully disagrees. 
Due to the broadness of the claim language and high-level arguments of the references not teaching the limitations with no further details as to how or why, Gao, Hyde, and newly cited Wu in view of the amendments have been found to teach amended claim limitations as required by the claim language. 
Hyde was cited as teaching the limitations of claims 14-17 in view of the teaching of the other references (as noted in the previous OA), and in view of the broadness of the claim language is maintained as reading on the claimed elements. (¶0013) “disclosed herein include at least one CPU die and at least one memory die stacked on top of each other (e.g., in vertical alignment). Additionally, or alternatively, in some examples, at least one GPU die is stacked in vertical alignment with one or more CPU die and/or one or more memory die” and “To further increase transfer rates and reduce an overall form factor for a multi-die package, the individual dies may be stacked on top of one another in vertical alignment and communicatively coupled using through silicon vias (TSVs)”. Here the reference teaches that the dies are stacked upon one another and connected. (¶0021) “The type of chip that is positioned closest to the die stack 106 may depend upon the intended use for the package 100”, thus the die stack is adjacent (positioned closest) to the chip (in this case the controller chip). (¶0021) “Thus, if a GPU is to be implemented for general-purpose workloads (rather than merely graphics) with demand for high computation performance, the directly adjacent die 124 may be a GPU die with the second additional die 126 being an ICH die”, thus the workload is offloaded as previously explained.
See 35 U.S.C 103 section for full mapping of claim limitations.

Applicant’s arguments, with respect to the rejection(s) of claim 1 under 35 U.S.C. 103, have been considered but are not persuasive. The applicant argues that “The reason provided as motivation to combine Das with Li falls far short of establishing a prima facie case for obviousness. Stating that the reason is to increase performance and efficiency is not sufficient articulated reasoning with some rational underpinning”. The examiner respectfully disagrees. 
Applicant poses a high-level argument that the motivation to combine Li and Das is not sufficient with no further reasoning as to why. As previously shown, Das paragraph 0004 explains that executing data in parallel in a distributed machine learning system “increase[s] processing efficiency”. Situating processors communicatively adjacent to the memory and executing data in parallel as taught in Das is obvious to improve the functionality of Li in order to increase processing speed and efficiency. See 35 U.S.C 103 section for full mapping of claim limitations.

Applicant’s arguments, with respect to the rejection(s) of claims 2-10 under 35 U.S.C. 103, have been considered but are not persuasive. The applicant argues that “[n]o reason is provided as motivation to combine Das with Li with respect to claim[s 2-10], and therefore, the Office Action falls far short of establishing a prima facie case for obviousness”. The examiner respectfully disagrees. 
The primary reference, Li, was cited as teaching claims 2-10 and is not required to have a motivation to combine the teachings of Li with itself. A statement of motivation to combine references is required when combining subsequent references with a primary reference, see MPEP §1504.03.
Further, in response to applicant’s argument that there is no teaching, suggestion, or motivation to combine the references, the examiner recognizes that obviousness may be established by combining or modifying the teachings of the prior art to produce the claimed invention where there is some teaching, suggestion, or motivation to do so found either in the references themselves or in the knowledge generally available to one of ordinary skill in the art.  See In re Fine, 837 F.2d 1071, 5 USPQ2d 1596 (Fed. Cir. 1988), In re Jones, 958 F.2d 347, 21 USPQ2d 1941 (Fed. Cir. 1992), and KSR International Co. v. Teleflex, Inc., 550 U.S. 398, 82 USPQ2d 1385 (2007). See 35 U.S.C 103 section for full mapping of claim limitations and motivations to combine references.

Applicant’s arguments, with respect to the rejection(s) of claims 11-13 and 18 under 35 U.S.C. 103, have been considered but are not persuasive. The applicant argues that “The reason[s] provided as motivation to combine Kim with Das and Li [regarding claim 11 and to combine Gao with Das and Li regarding claims 12, 13, and 18] falls far short of establishing a prima facie case for obviousness. Stating that the reason is to increase performance and efficiency is not sufficient articulated reasoning with some rational underpinning, while avoiding hindsight bias”. The examiner respectfully disagrees. 
Primarily, in response to applicant’s argument that there is no teaching, suggestion, or motivation to combine the references, the examiner recognizes that obviousness may be established by combining or modifying the teachings of the prior art to produce the claimed invention where there is some teaching, suggestion, or motivation to do so found either in the references themselves or in the knowledge generally available to one of ordinary skill in the art.  See In re Fine, 837 F.2d 1071, 5 USPQ2d 1596 (Fed. Cir. 1988), In re Jones, 958 F.2d 347, 21 USPQ2d 1941 (Fed. Cir. 1992), and KSR International Co. v. Teleflex, Inc., 550 U.S. 398, 82 USPQ2d 1385 (2007). See 35 U.S.C 103 section for full mapping of claim limitations and motivations to combine references.
Next, the applicant again merely offers a high-level argument that the previously stated motivation “to increase performance and efficiency” (as argued) is not sufficient, without offering any further details as to how or why. Nonetheless, Kim’s abstract demonstrates the value of adding the use of zig-zag scheduling through the processing system for offering a significant “speedup” against other architectures; and Gao’s abstract demonstrates the structuring of the NN accelerator system, software scheduling, and partitioning techniques leading to “increase performance and energy efficiency”. Thus, both Kim’s and Gao’s teachings are obvious to add to the combination of Li and Das as previously stated.
See 35 U.S.C 103 section for full mapping of claim limitations and motivations to combine references.

Claim Interpretation
The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. 
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier.  Such claim limitation(s) is/are: 
“Processing unit” in claims 1-20. The claim(s) language recites these processing units but does not reference any structure or algorithm for support of these units. The processing units perform a variety of operations throughout the claims with one example being forming a pipelined dataflow chain for partial output data as in claim 1. In light of the specification, the applicant seems to be referencing a hardware processor with no specific structure (¶0046). As a “processing unit” is a general term such that one of ordinary skill in the art would agree has no agreed upon one specific structure or definition, the term is now being interpreted under 112(f). For the purposes of this office action, the term is being interpreted as any hardware processor or logic processor that can perform the claim limitation(s).
“Processing engine” in claims 7-11. The claim(s) language recites these processing engines but does not reference any structure or algorithm for support of these units. The claims recite the processing engines containing multiply-and-add units and various buffers, but no structure for the engines themselves is disclosed. In claim 7 the engines are configured to output the product to adjacent engines and further claims like 8 have the engines performing similar data management actions without specifying a particular component within the engine to perform this action. In light of the specification, the applicant points to the processing engines in Fig. 12 (reference number 1215) with the diagram including the components within each processing engine but fails to disclose the structure of the engine itself and if it is hardware including these components or something else such as a software module. As a “processing engine” is a general term such that one of ordinary skill in the art would agree has no agreed upon one specific structure or definition, the term is now being interpreted under 112(f). For the purposes of this office action, the term is being interpreted as any hardware processor or logic processor that can perform the claim limitation(s).
“multiply-and-add unit” in claims 6-11. The claim(s) language recites these units but does not reference any structure or algorithm for support of these units. The claims recite the units being within the processing engines of the independent claim but fail to disclose whether the units are hardware components located on the processing engines or are integrated into the processing engines themselves. In claim 6 the units are configured to calculate a product of the input matrix and weight matrix vectors. In light of the specification, the examiner cannot find a clear indication of what the units’ structure is. As a “multiply-and-add unit” is a general term such that one of ordinary skill in the art would agree has no agreed upon one specific structure or definition, the term is now being interpreted under 112(f). For the purposes of this office action, the term is being interpreted as any hardware processor or logic processor that can perform the claim limitation(s).

Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may:  (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 2-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention.
In regards to claim 2, the claim recites: “wherein: the array of processing units is an array of multiply-and-add units; the first processing unit is a first multiply-and-add unit;” It is unclear if the processing units are each a multiply-and-add unit and form an array, if the array of processing units make up an array equivalent to an array of multiply-and-add units, or another structure altogether. Additionally, the claim recites processing engines each containing a multiply-and-add unit which continues to obfuscate the structure and where the units are located. Applicant’s specification does not provide any clear distinction for the structure or organization for these components and as such leaves the claim indefinite. For the purposes of this office action, the claim is interpreted to mean the processing engines contain a multiply-and-add unit/processing unit. Applicant is kindly asked to please fix this and all similar issues throughout the claims.
In regards to claims 7–11, the claims recite a processing engine with components including: “each including a multiply-and-add unit from among the peripheral array of multiply-and-add units” but fail to disclose the structure for the engine itself and whether it is a physical structure for including the components or something else such as a software module. Within the claims, the engines perform several actions like “configured to output the product to the partial sum buffer of a second processing engine” (claim 7) or “configured to receive a second input matrix vector from the memory bank in the streaming fashion” (claim 9) without reciting which component within the engine is performing these actions. If the engine itself, and not the components, is performing the action, then the structure of the engine that allows it to perform these and other limitations is unclear and leaves the claim indefinite. However, the components in question and the engines remain unclear as shown above. Applicant is kindly asked to please fix this and all similar issues throughout the claims.

In regards to claim 12, the claim recites “the circuit further comprising: a near-DRAM-processing dataflow (NDP-DF) accelerator unit die” but does not disclose what a near-DRAM-processing dataflow accelerator unit is. In light of the specification, the applicant states that near-data-processing places arithmetic logic units (ALU) outside of a memory core bank, but does not explain or disclose what an NDP-DF accelerator die would comprise, the differences between the NDP-DF accelerator unit die and a normal die, or what an NDP-DF accelerator unit would comprise. The applicant stated that the NDP “is a term used in the art”; however, the NDP-DF accelerator unit die continues to have an undefined structure, with the term seemingly serving as a placeholder. Without the context of differences for this component, one of ordinary skill in the art would not know the definition of said unit and as such, the claim is rendered indefinite. For the purposes of this office action, the claim is interpreted to mean the accelerator die is located on a DRAM memory bank and as such is equivalent to a near-DRAM-processing dataflow as in the claim limitation. However, the components in question and the engines remain unclear as shown above. Applicant is kindly asked to please fix this and all similar issues throughout the claims.

The rest of the claims are rejected for their dependence on the above claims. Applicant is kindly asked to fix all similar issues throughout the claims.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claim(s) 1-10 are rejected under 35 U.S.C. 103 as being unpatentable over Li (CN 111144545 A) (using a machine translation) and further in view of Das (US 20180322606 A1)
In regards to claim 1, Li teaches the following:
	wherein the peripheral array of processing units are configured to form a pipelined dataflow chain in which partial output data from a first processing unit from among the array of processing units is fed into another processing unit from among the array of processing units for data accumulation.
[ (Pg. 4, Line 44-47) “Referring to fig. 4, the spatial architecture of the hardware for processing convolutional neural networks as shown in fig. 4 uses a data flow (Dataflow) processing approach. In the spatial architecture, the ALUs form a data processing chain, so that data can be transferred directly between the ALUs.”
This citation from Li teaches the passing of data in a pipeline manner where the processed data gets passed from one processing element to another. Fig. 4 has been attached to this office action below for convenience. Examiner notes that the secondary reference is relied upon for teaching the plurality of processors (equivalent to processing units of claim language) being connected together.
    (Image: Li, Fig. 4; media_image1.png, greyscale) ]
[ (Pg. 3, Lines 23-26) “Wherein the processing element comprises: a first buffer configured to store input data and a weight corresponding to a convolution operation; a shift unit configured to perform a shift operation on input data to generate first intermediate data; a plurality of operation units configured to perform at least a part of the two-dimensional convolution operations based on the weight values and the first intermediate data, and generate output data. The shift operation performed by the shift unit includes: acquiring data from a neighboring processing element”
	Whereas this citation shows that the processing elements (which are made up of ALUs) contain the shift elements which pass the data from one processing element to another and also contain the intermediate data which will be used for data accumulation. ]
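For illustration only, the shift operation described in the cited passage of Li, in which each processing element acquires data from a neighboring processing element while accumulating intermediate data, can be sketched as below. This is a minimal sketch and not Li's implementation; the function name `shift_convolve`, the one-dimensional row of processing elements, the right-hand neighbor direction, and the zero padding at the edge are all assumptions introduced solely for the example.

```python
def shift_convolve(inputs, weights):
    """Model a row of processing elements using shift operations (hypothetical helper).

    Element i starts out holding inputs[i]. For each weight, every element
    multiplies its held value by that weight and accumulates the product,
    then acquires the value held by its right-hand neighbor.
    """
    n = len(inputs)
    held = list(inputs)  # value currently held by each processing element
    acc = [0] * n        # per-element accumulator (intermediate data)
    for w in weights:
        for i in range(n):
            acc[i] += w * held[i]
        # Shift step: each element takes its right-hand neighbor's value;
        # the last element receives zero padding (an assumption).
        held = held[1:] + [0]
    return acc


print(shift_convolve([1, 2, 3, 4], [1, 1]))
# [3, 5, 7, 4]
```

Under these assumptions, element i ends up holding the sum over k of weights[k] * inputs[i + k], so the data movement between neighbors stands in for the sliding window of a convolution.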
	What Li does not distinctly disclose and is instead taught by Das is seen below:
A tensor computation dataflow accelerator semiconductor circuit, comprising: a memory bank;
[ (Fig. 1) and (¶0006) “FIG. 1 is a block diagram illustrating a computer system configured to implement one or more aspects of the embodiments described herein”
This diagram from Das teaches the memory hub (reference number 105) which examiner notes is equivalent to a memory bank. ]
and a peripheral array of processing units in communication with the memory bank,
[ (Fig. 1) and (¶0006) “FIG. 1 is a block diagram illustrating a computer system configured to implement one or more aspects of the embodiments described herein”
	Continuing on, the same diagram also teaches a plurality of processors (equivalent to processing units) in reference numbers 112 and 102 that are connected to the memory hub via a communication link which has the reference number 113. This is equivalent to the applicant’s specification which cites the processing units as either a CPU or GPU [Applicant’s specification (¶0046) ] and is the structure that Das is relied upon for teaching. ]
	Therefore, it would be obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to combine a system for implementing convolution operations consisting of processing elements as taught by Li with the system for distributed machine learning with hardware that facilitates data parallelism as taught by Das. The reason it would be obvious is one of ordinary skill in the art would recognize, prior to the effective filing date, that combining the two would provide an increase in performance for the neural network processes [ Das (¶0004) ]. This would facilitate the recognized benefit of creating a much more efficient system overall which can provide an increase in speed and/or reduction in processing time to finish the calculations.

In regards to claim 2, the tensor computation dataflow accelerator semiconductor circuit of claim 1, is taught by Li/Das in the rejection for claim 1 above. Li continues teaching the claim as below:
the array of processing units is an array of multiply-and-add units;
[ (Pg. 3, Lines 19-20) “Optionally, the plurality of operation units includes a plurality of multiply-accumulate units, partial sum adders, and a second buffer. Wherein the plurality of multiply-accumulate units are configured to perform multiply-and-accumulate operations on the first intermediate data”
	This citation from Li teaches the plurality of operation units (which are/make-up the processing elements) containing a plurality of multiply-accumulate units. ]
the first processing unit is a first multiply-and-add unit; the second processing unit is a second multiply-and-add unit;
[ (Pg. 6, Line 31) “Further, the plurality of operation units include a plurality of multiply-accumulate units (MACs) 131”
	As cited above, each of the processing units of Li contains MAC units (equivalent to the multiply-and-add unit).  ]
 the array of multiply-and-add units is disposed adjacent to the memory bank;
[ (Pg. 6, Line 31) “Further, the plurality of operation units include a plurality of multiply-accumulate units (MACs) 131, a partial sum adder (PSUM)132, and a second buffer 133”
	This citation from Li teaches that the MAC units are connected to and therefore adjacent to buffers which are a part of the memory bank. ]
and the tensor computation dataflow accelerator semiconductor circuit further comprises a peripheral array of processing engines each including a multiply-and-add unit from among the peripheral array of multiply-and-add units.
[ (Pg. 3, Lines 23-24) “According to another aspect of an embodiment of the present invention, there is provided an apparatus for implementing a convolution operation, including a plurality of Processing Elements (PEs)”
	This citation from Li teaches the processing elements (equivalent to the engines) ]
[ (Pg. 3, Lines 24-25) “Wherein the processing element comprises: a first buffer configured to store input data and a weight corresponding to a convolution operation; a shift unit configured to perform a shift operation on input data to generate first intermediate data; a plurality of operation units”
	This teaches the processing elements containing the operation units which as seen above, contain the multiply-accumulate units. ]

In regards to claim 3, the tensor computation dataflow accelerator semiconductor circuit of claim 2, is taught by Li/Das in the rejection for claim 2 above. Li continues teaching the claim as below:
wherein each of the processing engines includes: an input buffer; a partial sum buffer; and a weight buffer.
[ (Pg. 3, Lines 24-25) “Wherein the processing element comprises: a first buffer configured to store input data and a weight corresponding to a convolution operation; a shift unit configured to perform a shift operation on input data to generate first intermediate data; a plurality of operation units”
	This teaches the processing elements containing the input buffer and the weight buffer. ]
[ (Pg. 3, Line 31) “Optionally, the plurality of operation units includes a plurality of multiply-accumulate units, partial sum adders, and a second buffer”
	The second buffer in this citation is the partial sum buffer. Examiner notes that the operation units are within the processing elements array and are therefore included in the processing engine. ]

In regards to claim 4, the tensor computation dataflow accelerator semiconductor circuit of claim 3, is taught by Li/Das in the rejection for claim 3 above. Li continues teaching the claim as below:
wherein the weight buffer of each of the processing engines is configured to store a weight matrix vector in an initialized state.
[ (Pg. 3, Lines 24-25) “Wherein the processing element comprises: a first buffer configured to store input data and a weight corresponding to a convolution operation; a shift unit configured to perform a shift operation on input data to generate first intermediate data; a plurality of operation units”
	As stated previously, there are a plurality of processing elements and each processing element contains a buffer that stores the weight matrix as seen above. Examiner notes that the plain meaning of an “initialized state” in this context, as one of ordinary skill in the art would define at the time of filing, is such that the variables/object (the weight buffer in this case) have the initial values (the weights) assigned to them. As the examiner cannot find a different definition in the applicant’s specification or anything that is contrary to this, the claim limitation is considered taught by the weight buffers of Li loading in the weights into the buffer at the time of operation. ]

In regards to claim 5, the tensor computation dataflow accelerator semiconductor circuit of claim 4, is taught by Li/Das in the rejection for claim 4 above. Li continues teaching the claim as below:
wherein the input buffer of a processing engine from among the peripheral array of processing engines is configured to receive an input matrix vector from the memory bank in a streaming fashion.
[ (Pg. 5, Lines 1-2) “Wherein the accumulation of partial sums in a Register File (RF) is kept constant by streaming input data in the PE array and then broadcasting weight data to the PE array, thereby minimizing the energy consumption of reading and writing the partial sums”
	This citation from Li teaches the constant streaming of input data into the processing element which is equivalent to the processing engine. Examiner notes that the input data includes weights and input data. Further, examiner notes that although the memory bank was taught by the secondary reference, Li still has the data streaming into the processing element from memory. With Das’ teaching of memory bank as a replacement for the memory of Li, the claim limitation is taught.]

In regards to claim 6, the tensor computation dataflow accelerator semiconductor circuit of claim 5, is taught by Li/Das in the rejection for claim 5 above. Li continues teaching the claim as below:
wherein the multiply-and-add unit of the processing engine is configured to calculate a product of the input matrix vector and the weight matrix vector stored in the weight buffer of the processing engine.
[ (Pg. 6, Lines 31-32) “Wherein the plurality of multiply-accumulate units are configured to perform multiply-and-accumulate operations on the first intermediate data according to the weights and output second intermediate data”
	This citation from Li teaches the multiply accumulate units calculating a product between the input data and weights. ]

In regards to claim 7, the tensor computation dataflow accelerator semiconductor circuit of claim 6, is taught by Li/Das in the rejection for claim 6 above. Li continues teaching the claim as below:
wherein: the processing engine is a first processing engine, and the multiply-and-add unit of the first processing engine is configured to output the product to the partial sum buffer of a second processing engine from among the peripheral array of processing engines, wherein the second processing engine is adjacent to the first processing engine.
[ (Pg. 5, Paragraph 7 – Paragraph 8 and Pg. 6, paragraphs 7-8) 
	In general, these two paragraphs discuss the processing elements being linked together to create a chain of processing elements that carry out the convolution operations. The citation explicitly mentions a processing element grabbing the output of the previous processing element and using it for the shifting operation prior to performing its own convolution calculation. Further, the elements include MAC and partial sum operations. Examiner also notes that further support of this is found at the bottom of paragraph 8, which states that multiple adjacent PEs may be combined together to perform a CNN calculation. ]

In regards to claim 8, the tensor computation dataflow accelerator semiconductor circuit of claim 7, is taught by Li/Das in the rejection for claim 7 above. Li continues teaching the claim as below:
wherein the second processing engine is configured to store the product in the partial sum buffer of the second processing engine.
[ (Pg. 3, Lines 19-21) “Wherein the plurality of multiply-accumulate units are configured to perform multiply-and-accumulate operations on the first intermediate data according to the weights and output second intermediate data; the partial sum adder is configured to iteratively add the second intermediate data to the corresponding partial sum stored in the second buffer and store the partial sum calculated for each iteration as a partial sum of the output data in the second buffer”
	This citation teaches the partial sum data being stored in the partial sum buffer (second buffer) of the operation unit which is within the processing element. In tandem with the citation from the rejection of claim 7, which shows the processing elements are able to grab data from other adjacent elements, this teaches the storage of the partial sum on a second processing engine. ]

In regards to claim 9, the tensor computation dataflow accelerator semiconductor circuit of claim 8, is taught by Li/Das in the rejection for claim 8 above. Li continues teaching the claim as below:
wherein: the input matrix vector is a first input matrix vector; the product is a first product; the input buffer of the second processing engine is configured to receive a second input matrix vector from the memory bank in the streaming fashion;
[ (Pg. 5, Lines 1-2) “Wherein the accumulation of partial sums in a Register File (RF) is kept constant by streaming input data in the PE array and then broadcasting weight data to the PE array, thereby minimizing the energy consumption of reading and writing the partial sums”
	This citation from Li teaches the constant streaming of input data into the processing element which is equivalent to the processing engine. Examiner notes that the input data includes weights and input data. Further, examiner notes that although the memory bank was taught by the secondary reference, Li still has the data streaming into the processing element from memory. With Das’ teaching of memory bank as a replacement for the memory of Li, the claim limitation is taught. ]
the multiply-and-add unit of the second processing engine is configured to calculate a second product of the second input matrix vector and the weight matrix vector stored in the weight buffer of the second processing engine;
[ (Pg. 3, Lines 19-20) “Wherein the plurality of multiply-accumulate units are configured to perform multiply-and-accumulate operations on the first intermediate data according to the weights and output second intermediate data”
	This citation from Li teaches the multiply accumulate units calculating a product between the input data and weights. Examiner notes that the previous citation(s) from Li, which disclose the use of multiple processing elements that are able to pass data from one to another, along with the citation teaching that multiple processing elements can be utilized to process one convolutional operation, teach that the process can be repeated a second, third, or any number of times. ]
 and the multiply-and-add unit of the second processing engine is configured to calculate a sum of the first product and the second product.
[ (Pg. 3, Lines 20-21) “and the partial sum adder is configured to iteratively add the second intermediate data to the corresponding partial sum stored in the second buffer and store the partial sum calculated for each iteration as a partial sum of the output data into the second buffer”
	Here, Li teaches that every iteration of a partial sum is stored in the partial sum buffer (second buffer) and accumulated with all the previous iterations. ]

In regards to claim 10, the tensor computation dataflow accelerator semiconductor circuit of claim 9, is taught by Li/Das in the rejection for claim 9 above. Li continues teaching the claim as below:
wherein: the multiply-and-add unit of the second processing engine is configured to output the sum of the first product and the second product to the partial sum buffer of a third processing engine from among the peripheral array of processing engines, wherein the third processing engine is adjacent to the second processing engine; and the third processing engine is configured to store the sum in the partial sum buffer of the third processing engine.
[ (Pg. 3, Lines 21-22) “and the partial sum adder is configured to iteratively add the second intermediate data to the corresponding partial sum stored in the second buffer and store the partial sum calculated for each iteration as a partial sum of the output data into the second buffer.” ]
[ (Pg. 5, Paragraph 7 – Paragraph 8) 
	In general, these two paragraphs discuss the processing elements being linked together to create a chain of processing elements that carry out the convolution operations. This along with the citation above teaches that multiple processing elements can be iterated through this process with the steps of the next processing element summing the previous processing element(s) outputs together. ]
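As an illustrative sketch only (it is not drawn from Li, Das, or the applicant's specification), the chained accumulation recited across claims 6-10 can be modeled in a few lines of Python. The class name `ProcessingEngine`, the function `run_chain`, and the example values are hypothetical and are used solely to show a running sum passing from one engine's multiply-and-add output into the partial sum buffer of its neighbor.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProcessingEngine:
    """Hypothetical model of one processing engine and its three buffers."""
    weights: List[float]                                # weight buffer, loaded once
    inputs: List[float] = field(default_factory=list)   # input buffer
    partial_sum: float = 0.0                            # partial sum buffer

    def receive_input(self, vector: List[float]) -> None:
        # Input vector streamed in from memory.
        self.inputs = list(vector)

    def multiply_and_add(self) -> float:
        # Product of the input vector and the stored weight vector, added to
        # the value the previous engine left in this engine's partial sum buffer.
        product = sum(x * w for x, w in zip(self.inputs, self.weights))
        return self.partial_sum + product


def run_chain(engines: List[ProcessingEngine],
              input_vectors: List[List[float]]) -> float:
    """Forward the running sum engine to engine; return the last engine's output."""
    carried = 0.0
    for engine, vector in zip(engines, input_vectors):
        engine.partial_sum = carried   # neighbor's output stored in this buffer
        engine.receive_input(vector)
        carried = engine.multiply_and_add()
    return carried


engines = [ProcessingEngine([1, 2]), ProcessingEngine([3, 4]), ProcessingEngine([1, 1])]
total = run_chain(engines, [[1, 1], [1, 0], [2, 2]])
```

In this sketch the second engine's partial sum buffer holds the first product, and the third engine's partial sum buffer holds the sum of the first and second products, which mirrors the ordering of the limitations in claims 7-10.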


Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Li/Das as applied above, and further in view of Kim (“A Novel Zero Weight/Activation-Aware Hardware Architecture of Convolutional Neural Network”).
In regards to claim 11, the tensor computation dataflow accelerator semiconductor circuit of claim 10, is taught by Li/Das in the rejection for claim 10 above. Li continues teaching the claim as below:
 and the peripheral array of processing engines is configured to receive a plurality of input matrix vectors at corresponding input buffers of corresponding processing engines from among the peripheral array of processing engines in a streaming fashion, and to propagate the plurality of input matrix vectors in a direction that is perpendicular to a data flow direction of the partial sums.
[ (Fig. 6 and 7) and (page 5, paragraph 1 and page 6, paragraphs 10-12) “FIG. 6 shows a schematic diagram of an output-fixed spatial architecture. Wherein the accumulation of partial sums in a Register File (RF) is kept constant by streaming input data in the PE array and then broadcasting weight data to the PE array, thereby minimizing the energy consumption of reading and writing the partial sums”
	This citation from Li and the corresponding drawings (which have been placed below) show that the processing elements create a data flow chain where the partial sums go from one processing element to the next (from left to right) while the input data comes down (from top to bottom), creating a perpendicular dataflow direction, and the “matrix of input data” of the “PE[‘s]” is received and stored in a “WBUF”. ]

(Image: media_image2.png)
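For illustration only, the perpendicular relationship described above (input data moving from top to bottom while partial sums move from left to right) can be sketched as a functional model in Python. The sketch is not cycle-accurate and is not taken from Li's figures. The function name `grid_matvec` and the assumption that each processing element holds a single weight are hypothetical simplifications.

```python
from typing import List


def grid_matvec(weights: List[List[float]], inputs: List[float]) -> List[float]:
    """Functional (not cycle-accurate) model of a grid of processing elements.

    inputs[c] enters at the top of column c and moves downward, so every
    element in that column sees it. Partial sums move left to right along
    each row, perpendicular to the direction of the inputs.
    """
    outputs = []
    for row in weights:                      # inputs reach successive rows, top to bottom
        partial_sum = 0.0                    # enters the row at its left edge
        for weight, value in zip(row, inputs):
            partial_sum += weight * value    # element adds its product, passes it right
        outputs.append(partial_sum)          # leaves the row at its right edge
    return outputs


result = grid_matvec([[1, 2], [3, 4]], [1, 1])
```

Each row's output is the accumulated sum of that row's products, so the grid as a whole computes a matrix-vector product under these assumptions.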

	What is not distinctly disclosed by Li/Das and is instead taught by Kim is seen below:
Wherein: the peripheral array of processing engines is a systolic array that is configured to propagate partial sums in a serpentine fashion, wherein the serpentine fashion includes a pipelined propagation of the partial sums;
[ (Pg. 1464, Col. 2, Paragraphs 1-3) “Moreover, in order to further increase the overlapping between current and next activation tiles, we adopt a zig-zag order in visiting the activation tiles in a WG. After the current activation tile is processed, we first move the window horizontally to the right or left and process the next activation tile. Then, when it reaches the horizontal edge of the input activation, we move the window down by a stride and again move horizontally in the opposite direction in a zigzag fashion to choose the next activation tile ( in Fig. 3b)”
	This citation from Kim teaches the serpentine (zig-zag) pattern being used to propagate produced “partial sum” data through to the processing elements. Examiner notes that while the reference teaches using the partial sums, the primary reference also teaches the partial sums being propagated from one processing element to the next and this reference is relied upon to teach the serpentine fashion for data propagation to processing elements. ]
	Therefore, it would be obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to combine a system for implementing convolution operations consisting of processing elements as taught by Li/Das with the system for distributed machine learning with architecture for accelerating a neural network as taught by Kim. The reason it would be obvious is one of ordinary skill in the art would recognize, prior to the effective filing date, that combining the two would provide an increase in performance for the neural network  processes [ Kim (Abstract) ]. This would facilitate the recognized benefit of creating a much more efficient system overall which can provide an increase in speed and/or reduction in processing time to finish the calculations.
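As a minimal sketch of the zig-zag visiting order quoted from Kim above (move horizontally, step down at the edge, then move horizontally in the opposite direction), the traversal can be written in Python as follows. The function name `zigzag_positions`, the use of integer grid positions in place of activation tiles, and the stride of one are assumptions made for illustration and do not appear in the reference.

```python
from typing import Iterator, Tuple


def zigzag_positions(rows: int, cols: int) -> Iterator[Tuple[int, int]]:
    """Yield (row, col) left to right on even rows and right to left on odd rows."""
    for r in range(rows):
        # Reverse the horizontal direction on every other row.
        columns = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
        for c in columns:
            yield (r, c)


order = list(zigzag_positions(2, 3))
```

For a grid of two rows and three columns, the first row is visited left to right and the second row right to left, so consecutive positions are always adjacent.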


Claim(s) 12, 13, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Li/Das as applied above, in view of Gao (“TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory”), and further in view of Wu et al. (US Patent 10346093).
In regards to claim 12, the tensor computation dataflow accelerator semiconductor circuit of claim 3, is taught by Li/Das in the rejection for claim 3 above. Gao continues teaching the claim as below:
wherein the memory bank is a DRAM memory bank,
[ (Pg. 1, Col. 2) “Ideally, we would like to continue scaling the performance and efficiency of NN accelerators in order to achieve real-time performance for increasingly complicated problems. In general, the size of state-of-the-art NNs has been increasing over the years, in terms of both the number of layers and the size of each layer [19, 26, 41]. As a result, the memory system is quickly becoming the bottleneck for NN accelerators”…. “Advances in through-silicon-via (TSV) technology have enabled 3D memory that includes a few DRAM dies on top of a logic chip [20, 22, 44]”
	These two citations from Gao and the cited column go over neural network accelerators and specifically discuss the memory system for the neural network located on the processing chip (equivalent to the memory bank) utilizing DRAM. ]
[ (Pg. 1, Col. 2, Paragraph 4 – Pg. 2, Col. 1) “This allows us to achieve both higher performance and better energy efficiency. We also move simple accumulation operations close to the data locations (DRAM banks) in order to reduce memory accesses and improve performance and energy”
	Further support of the DRAM bank. ]
 the circuit further comprising: a near-DRAM-processing dataflow (NDP-DF) accelerator unit die including a plurality of channels,
[ (Pg. 2, Col. 1, Paragraph 1) “We combine these hardware and software optimizations in the design of TETRIS, an NN accelerator that uses an eight-die HMC memory stack organized into 16 vaults (vertical channels)”
	This citation from Gao teaches the accelerator die for neural networks being improved upon by having these vaults (equivalent to plurality of channels). Examiner notes that in tandem with the previous limitation citations, the accelerator die is located on a DRAM memory bank and as such is equivalent to a near-DRAM-processing dataflow as in the claim limitation. ]
 	wherein: each of the channels includes a plurality of bank units arranged in a serpentine fashion, 
[ (Pg. 4, Col. 1, Section 3.1) and (Fig. 2) and (Fig. 3) “The HMC stack (Figure 2 left) is vertically divided into sixteen 32-bit-wide vaults [21], which are similar to conventional DDRx channels and can be accessed independently. The vault channel bus uses TSVs to connect all DRAM dies to the base logic die. Each DRAM die contains two banks per vault (Figure 2 right top)”
	This citation teaches the vaults which are equivalent to channels, containing the “bank” units. ]
[ (Pg. 4, Fig. 2) “TETRIS architecture. Left: HMC stack. Right top: per-vault DRAM die structure. Right bottom: per-vault logic die structure”
	This citation shows the structure of the channels and the corresponding banks. As the applicant’s specification shows the serpentine fashion to still be a grid but going from column to column and then the next row, the diagram from Gao shows the same vault set up with each PE in Fig. 3 passing in a serpentine fashion. ]
and each of the smart bank units includes a DRAM bank, an input buffer, a systolic MAC array, and an output buffer.
[ (Pg. 2, Col. 1, Paragraph 1) “Each vault is associated with an array of 14 × 14 NN processing elements and a small SRAM buffer”
	This citation teaches the vault (equivalent to channels) containing the processing elements (equivalent to systolic MAC array) and the SRAM buffer (equivalent to input/output buffer). ]
[ (Pg. 4, Col 1, Section 3.1) “Each DRAM die contains two banks per vault (Figure 2 right top). Each bank is an array of DRAM cells. On data access, the global datalines transfer data from the internal DRAM cell arrays to the global sense-amplifiers (SAs)”
	This citation teaches the DRAM bank. ]
	Therefore, it would be obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to combine a system for implementing convolution operations consisting of processing elements as taught by Li/Das with the system for distributed machine learning with architecture for scalable and efficient neural network acceleration as taught by Gao. The reason it would be obvious is one of ordinary skill in the art would recognize, prior to the effective filing date, that combining the two would provide an increase in performance for the neural network processes [ Gao (Abstract) ]. This would facilitate the recognized benefit of creating a much more efficient system overall which can provide an increase in speed and/or reduction in processing time to finish the calculations.
However, the combination, while at least implying a serpentine fashion, does not explicitly teach wherein the serpentine fashion includes a systolic dataflow in a first direction through a first row of bank units from among the plurality of bank units, and a systolic dataflow in a second direction opposite the first direction through a second row of bank units from among the plurality of bank units.
Wu teaches the serpentine fashion, wherein the serpentine fashion includes a systolic dataflow in a first direction through a first row of bank units from among the plurality of bank units, and a systolic dataflow in a second direction opposite the first direction through a second row of bank units from among the plurality of bank units.
[Col. 6, lines 60-67, Col. 7, line 41-Col. 8, line 54 and Fig. 5 teach “tensor banks” and “tensor buffers” arranged and the “time in the exemplary schedule progresses in a serpentine fashion. Operations are drawn from left to right across the two parallel pipelines.” ]
Therefore, it would be obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to combine a system for implementing convolution operations consisting of processing elements as taught by Li/Das/Gao with the system for neural network acceleration for serpentine process scheduling through tensor buffers and banks as taught by Wu. The reason it would be obvious is one of ordinary skill in the art would recognize, prior to the effective filing date, that combining the two would provide an improved processing flow due to “the primary pipeline does not have to wait for data and is kept fully utilized” [ Wu (Col. 6, lines 60-67 and Col. 8, lines 8-16) ]. This would facilitate the recognized benefit of creating a much more efficient system overall which can provide an increase in speed and/or reduction in processing time to finish the calculations.

In regards to claim 13, the tensor computation dataflow accelerator semiconductor circuit of claim 12, is taught by Li/Das/Gao/Wu in the rejection for claim 12 above. Gao continues teaching the claim as below:
wherein: the systolic MAC array includes the peripheral array of multiply-and-add units;
[ (Pg. 3, Col. 1, Paragraph 1) “Recent NN accelerators are typically spatial architectures with a large number of processing elements (PEs) for multiply-accumulate (MAC) operations [10, 30, 42]”
	This citation, in tandem with the previous citation that had Gao showing that the channels included processing elements (equivalent to the systolic MAC array), shows the processing elements are capable of performing the MAC operations. ]
 and the NDP-DF accelerator unit die is one of a plurality of NDP-DF accelerator unit dies that are stacked one atop another
[ (Pg. 2, Col. 1, Paragraph 1) “We combine these hardware and software optimizations in the design of TETRIS, an NN accelerator that uses an eight-die HMC memory stack”
	This citation shows that the dies are stacked upon one another. ]
	Please see the motivation to combine from claim 12.

In regards to claim 18, the tensor computation dataflow accelerator semiconductor circuit of claim 13, is taught by Li/Das/Gao/Wu in the rejection for claim 13 above. Gao continues teaching the claim as below:
wherein the plurality of stacked NDP-DF accelerator unit dies and the base die are configured to process the partial output data in parallel.
[ (Pg. 2, Col. 1, Paragraph 1) “Finally, the proposed hybrid partitioning scheme improves the performance and energy efficiency by more than 10% over simple heuristics as we parallelize NN computations across multiple stacks”
	This citation teaches that the computations being performed (including the processing of partial output data) can be configured in a parallel manner. ]
[ (Pg. 2, Col. 1, Paragraph 1) “We show that TETRIS improves computational density by optimally using area for processing elements and on-chip buffers, and that moving partial computations to DRAM dies is beneficial”
	Further support of the above. ]
	Please see the motivation to combine from claim 12. 

Claim(s) 14-17 are rejected under 35 U.S.C. 103 as being unpatentable over Li/Das/Gao/Wu as applied above, and further in view of Hyde (US 20190051642 A1).
In regards to claim 14, the tensor computation dataflow accelerator semiconductor circuit of claim 13, is taught by Li/Das/Gao/Wu in the rejection for claim 13 above. Hyde continues teaching the claim as below:
further comprising: a passive silicon interposer;
[ (¶0013) “In some examples, to reduce the concern of thermal issues and/or to increase performance of such systems, one or more logic and/or memory circuits are implemented in a silicon-based connector (e.g., an embedded silicon bridge or an interposer)”
	This citation from Hyde teaches the use of a passive silicon interposer for applications and/or apparatus trying to implement efficient memory storage in multi-die packages. ]
a processor disposed on the passive silicon interposer;
[ (¶0013) “In some examples, to reduce the concern of thermal issues and/or to increase performance of such systems, one or more logic and/or memory circuits are implemented in a silicon-based connector (e.g., an embedded silicon bridge or an interposer)” ]
[ (¶0013) “Example multi-die packages (also referred to as embedded systems) disclosed herein include at least one CPU die and at least one memory die”
	These two citations teach the logic unit, which is a processor, being disposed on the passive silicon interposer. ]
and a base die disposed on the passive silicon interposer adjacent to the processor, wherein the plurality of NDP-DF accelerator unit dies are stacked atop the base die.
[ (¶0013) “disclosed herein include at least one CPU die and at least one memory die stacked on top of each other (e.g., in vertical alignment). Additionally or alternatively, in some examples, at least one GPU die is stacked in vertical alignment with one or more CPU die and/or one or more memory die”
	Here the reference teaches that the dies are stacked upon one another. ]
[ (¶0013) “Placing logic and/or memory circuits within the silicon-based connector in this manner takes advantage of the space in the silicon-based connector beyond the basic function of interconnecting the adjacent dies”
	Further, this citation teaches the ability for the dies connected via the passive silicon interposer to be adjacent to the processor. ]
Therefore, it would be obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to combine a system for implementing convolution operations consisting of processing elements as taught by Li/Das/Gao/Wu with the system for distributed machine learning with the multi-die efficient hardware architecture as taught by Hyde. The reason it would be obvious is one of ordinary skill in the art would recognize, prior to the effective filing date, that combining the two would provide an improvement in thermal dissipation which can improve the thermal design power envelope for the main processor [ Hyde (¶0012) ]. This would facilitate the recognized benefit of allowing the processor to operate at a higher speed in a much more reliable fashion which would improve the processing speed of the system overall. 

In regards to claim 15, the tensor computation dataflow accelerator semiconductor circuit of claim 14, is taught by Li/Das/Gao/Wu/Hyde in the rejection for claim 14 above. Hyde continues teaching the claim as below:
further comprising: one or more through silicon vias (TSVs) disposed through the plurality of NDP-DF accelerator unit dies and the base die,
[ (¶0013) “To further increase transfer rates and reduce an overall form factor for a multi-die package, the individual dies may be stacked on top of one another in vertical alignment and communicatively coupled using through silicon vias (TSVs)”
	Examiner notes that although the NDP-DF accelerator units are taught by the previous references, Hyde is relied upon to teach the connection of said dies through TSVs. ]
 wherein the one or more TSVs are configured to interconnect the plurality of NDP-DF accelerator unit dies with the base die, and the base die with the processor;
[ (¶0016) and (Fig. 1) “Furthermore, the close proximity of the CPU and memory dies 116, 118 in the corresponding compute stacks 110, 112, 114, and communicatively interconnected using TSVs”
	The citation and corresponding drawing show the stacked dies (reference number 106) adjacent to the processor (reference number 138) with the TSVs (reference number 108) connecting the dies. ]
 	and wherein the plurality of NDP-DF accelerator unit dies and the base die are configured to offload computation from the processor.
[ (¶0021) “Thus, if a GPU is to be implemented for general-purpose workloads (rather than merely graphics) with demand for high computation performance, the directly adjacent die 124 may be a GPU die with the second additional die 126 being an ICH die”
	This citation teaches the processor (a graphics processor, or GPU, in this case) being utilized with a workload shared between the die stacks, which is equivalent to offloading the computation from the processor. ]
	Please refer to the motivation to combine from claim 14.

In regard to claim 16, the tensor computation dataflow accelerator semiconductor circuit of claim 13 is taught by Li/Das/Gao/Wu in the rejection of claim 13 above. Hyde further teaches the claim as follows:
further comprising: a passive silicon interposer;
[ (¶0013) “In some examples, to reduce the concern of thermal issues and/or to increase performance of such systems, one or more logic and/or memory circuits are implemented in a silicon-based connector (e.g., an embedded silicon bridge or an interposer)”
	This citation from Hyde teaches the use of a passive silicon interposer in applications and/or apparatuses that implement efficient memory storage in multi-die packages. ]
 a controller disposed on the passive silicon interposer;
[ (¶0021) “For example, the additional dies 124, 126 may correspond to another memory die, another CPU die, a graphics processing unit (GPU) chip, a 5G chip, an input/output (IO) controller hub (ICH) chip (e.g., a platform controller hub (PCH) chip or a fusion controller hub (FCH) chip)”
	This citation teaches the controller chip. ]
 and a base die disposed on the passive silicon interposer adjacent to the controller, wherein the plurality of NDP-DF accelerator unit dies are stacked atop the base die.
[ (¶0021) “The type of chip that is positioned closest to the die stack 106 may depend upon the intended use for the package 100”
	This citation follows directly from the last one and shows that the die stack is adjacent (positioned closest) to the chip (in this case the controller chip). ]
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to combine a system for implementing convolution operations consisting of processing elements as taught by Li/Das/Gao/Wu with the system for distributed machine learning with the multi-die efficient hardware architecture as taught by Hyde. One of ordinary skill in the art would have recognized, prior to the effective filing date, that combining the two would improve thermal dissipation, which can improve the thermal design power envelope for the main processor [ Hyde (¶0012) ]. This would provide the recognized benefit of allowing the processor to operate at a higher speed and more reliably, which would improve the overall processing speed of the system.

In regard to claim 17, the tensor computation dataflow accelerator semiconductor circuit of claim 16 is taught by Li/Das/Gao/Wu/Hyde in the rejection of claim 16 above. Hyde further teaches the claim as follows:
further comprising: one or more through silicon vias (TSVs) disposed through the plurality of NDP-DF accelerator unit dies and the base die,
[ (¶0013) “To further increase transfer rates and reduce an overall form factor for a multi-die package, the individual dies may be stacked on top of one another in vertical alignment and communicatively coupled using through silicon vias (TSVs)”
	Examiner notes that although the NDP-DF accelerator units are taught by the previous references, Hyde is relied upon to teach the connection of said dies through TSVs. ]
wherein the one or more TSVs are configured to interconnect the plurality of NDP-DF accelerator unit dies with the base die, and the base die with the controller;
[ (¶0016) and (Fig. 1) “Furthermore, the close proximity of the CPU and memory dies 116, 118 in the corresponding compute stacks 110, 112, 114, and communicatively interconnected using TSVs”
	The citation and corresponding drawing show the stacked dies (reference number 106) adjacent to the processor (reference number 138) with the TSVs (reference number 108) connecting the dies. ]
[ (¶0048) “For example, the processor 812 can be implemented by one or more integrated circuits, logic circuits, microprocessors, GPUs, DSPs, or controllers from any desired family or manufacturer”
	This citation, along with the previous citations showing that the chip may be a processor, a GPU, or a controller chip, teaches that the controller can be connected to the dies and the base die in a similar manner. ]
and wherein the plurality of NDP-DF accelerator unit dies and the base die are configured to offload computation from a host that is separate from the tensor computation dataflow accelerator semiconductor circuit.
[ (¶0021) “Thus, if a GPU is to be implemented for general-purpose workloads (rather than merely graphics) with demand for high computation performance, the directly adjacent die 124 may be a GPU die with the second additional die 126 being an ICH die”
	This citation teaches the processor (a graphics processor, or GPU, in this case) being utilized with a workload shared between the die stacks, which is equivalent to offloading the computation from the processor. Examiner notes that, in view of the previous citations, a controller chip may be the secondary chip, which would facilitate the transfer. ]
	Please refer to the motivation to combine from the rejection of claim 16.


Claim(s) 19 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Li/Das/Gao/Wu as above, and further in view of Lau (US 20190392297 A1).
In regard to claim 19, the tensor computation dataflow accelerator semiconductor circuit of claim 13 is taught by Li/Das/Gao/Wu in the rejection of claim 13 above. Lau further teaches the claim as follows:
wherein the plurality of stacked NDP-DF accelerator unit dies and the base die are configured to propagate partial output data in a backward direction.
[ (¶0054) “Each HBM interface (e.g., 320) may support a single HBM die stack (e.g., 310 a-d) up to the currently supported maximum HBM capacity (in one example it could be 8 GB per stack). Each HIM block 320 may be independent of the other HIM blocks on the chip. Data between the multiple interfaces is to be managed carefully by software to ensure that the storage capacity as well as the bandwidth is utilized effectively by the processing clusters of the DLH device. For instance, a HBM controller 415, arbiter circuitry 420 (connected to various client data buffers (e.g., 425, 430)), and other logic may be provided to manage data across the HIM block 320”
	This citation from Lau teaches the HBM (high bandwidth memory) interface supporting the die stack(s) and being able to manage data across the circuit(s). ]
[ (¶0491) “In one example embodiment of a method, the partial matrix data includes a partial result matrix determined by a first processing element in a particular stage of the partial matrix operations, and where the partial result matrix is used by a second processing element in a subsequent stage of the partial matrix operations. In one example embodiment of a method, the matrix operation is associated with a forward propagation operation in a neural network. In one example embodiment of a method, the matrix operation is associated with a backward propagation operation in a neural network” (emphasis added)
	This citation shows that the teachings of Lau further include processing operations such as partial sums being used for backwards propagation. ]
	Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to combine a system for implementing convolution operations consisting of processing elements as taught by Li/Das/Gao/Wu with the system for deep learning hardware as taught by Lau. One of ordinary skill in the art would have recognized, prior to the effective filing date, that combining the two would provide the ability to perform backward propagation, allowing the neural network to learn from errors and be trained to a higher accuracy. This would provide the recognized benefit of a more accurate system overall.

In regard to claim 20, the tensor computation dataflow accelerator semiconductor circuit of claim 13 is taught by Li/Das/Gao/Wu in the rejection of claim 13 above. Lau further teaches the claim as follows:
wherein the plurality of stacked NDP-DF accelerator unit dies and the base die are configured to perform a partial matrix transposition.
[ (¶0054) “Each HBM interface (e.g., 320) may support a single HBM die stack (e.g., 310 a-d) up to the currently supported maximum HBM capacity (in one example it could be 8 GB per stack). Each HIM block 320 may be independent of the other HIM blocks on the chip. Data between the multiple interfaces is to be managed carefully by software to ensure that the storage capacity as well as the bandwidth is utilized effectively by the processing clusters of the DLH device. For instance, a HBM controller 415, arbiter circuitry 420 (connected to various client data buffers (e.g., 425, 430)), and other logic may be provided to manage data across the HIM block 320”
	This citation from Lau teaches the HBM (high bandwidth memory) interface supporting the die stack(s) and being able to manage data across the circuit(s). ]
[ (¶0465) “In one example embodiment of an apparatus, the memory controller is further configured to perform a transpose operation on the matrix. In one example embodiment of an apparatus, each of the plurality of storage locations are configured to store a particular number of matrix elements. In one example embodiment of an apparatus, each of the plurality of storage locations are further configured to store an error correction code” (emphasis added)
	This citation teaches the memory controller which is responsible for the data controls of the die stacks being able to perform a transpose operation on a matrix. ]
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to combine a system for implementing convolution operations consisting of processing elements as taught by Li/Das/Gao/Wu with the system for deep learning hardware as taught by Lau. One of ordinary skill in the art would have recognized, prior to the effective filing date, that combining the two would provide the ability to perform backward propagation, allowing the neural network to learn from errors and be trained to a higher accuracy. This would provide the recognized benefit of a more accurate system overall.

Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to CLINT MULLINAX whose telephone number is 571-272-3241.  The examiner can normally be reached on Mon - Fri 8:00-4:30 PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Alexey Shmatov, can be reached at 571-270-3428.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/C.M./Examiner, Art Unit 2123

/ALEXEY SHMATOV/Supervisory Patent Examiner, Art Unit 2123