DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Van Luntenen (US 2017/0139629, Van hereinafter) in view of Lau et al. (US 2019/0392297, Lau hereinafter).

As to claim 1, Van teaches a system (e.g., see FIG. 10, “1000”, para 68, “a computing system 1000 which is implemented as a 3-dimensional chip stack 1010”), comprising:
 	a plurality of memory units (e.g., see FIG. 10, para 68, “The vertical layers include a plurality of memory layers 1020” and “The proposed combination of 3D stacking with reconfigurable computing devices” in para 70), wherein each of the plurality of memory units includes a request processing unit (e.g., para 68, “The logic layer 1040 comprises an “memory bank”); and
 	a processor (e.g., “210”, FIG. 2) coupled to the plurality of memory units, wherein the processor includes a plurality of processing elements (e.g., “211”, FIG. 2) and a communication network (e.g., “230”, FIG. 2) communicatively connecting the plurality of processing elements to the plurality of memory units (e.g., para 65, “an operation of the computing system 200 comprising the host processor 210 as described with reference to FIG. 2.” and “The host processor 210 is communicatively coupled with the access processor 140 via an interconnect system 230” in para 47), and wherein at least a first processing element (e.g., one of “core 211”) of the plurality of processing elements includes a control logic unit (e.g., “Logic layer 1040”, FIG. 10).
 	However, Van does not explicitly teach that the first processing element includes a matrix compute engine, or that the control logic unit is configured to access data from the plurality of memory units using a dynamically programmable distribution scheme.
 	Lau teaches wherein at least a first processing element of a plurality of processing elements includes a control logic unit and a matrix compute engine (e.g., para 239, see FIG. 29, “the processing elements may include multiple matrix processing chips (e.g., matrix processing chips), multiple matrix processing clusters on each matrix processing chip (e.g., matrix processing clusters), and/or multiple matrix processing units (MPUs) on each matrix processing cluster (e.g., matrix processing units (MPUs)).” and “Although only one processor 3700 is illustrated in FIG. 37, a processing element may alternatively include more than one of processor 3700 illustrated in FIG. 37. Processor 3700 may be a single-threaded core or, for at least one embodiment, the processor 3700 may be multi-threaded. The processing element may include I/O control logic and/or may include I/O control logic integrated with memory control logic.” para 398, see FIG. 37), the control logic unit is configured to access data from the plurality of memory units using a dynamically programmable distribution scheme (e.g., para 71-76, “DLH device may support communication between clusters to allow mapping of distributed algorithms across many processing clusters. These clusters can be on the same chip, or different chips, or both. The control flow needs to support both on-chip and inter-chip cluster communication. Turning to FIG. 12, a diagram 1200 is shown illustrating the example operation of a DLH device”, “matrix multiplication may utilize techniques such as SUMMA and Cannon's algorithm”, “Various algorithms may be used to distribute matrix multiplication across multiple nodes. Each algorithm has a different cost, and implied interconnect architecture. Algorithms may employ 2D grid interconnects, and 3D grid interconnects, among other examples. For instance, Cannon's Algorithm and Scalable Universal Matrix Multiplication Algorithm (SUMMA) may use a two-dimensional grid on interconnected nodes to distribute matrix multiplication. Data rotates or is broadcast east to west and north to south”, see FIG. 12, and “synchronous dynamic random access memory (SDRAM)” in para 368. Also, see para 44-45, “DLH device includes support for high-bandwidth and high-capacity off-chip memory so that large data sets can be loaded from the CPU into the PCIe adapter card, and re-used many times”). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Van by adopting the teachings of Lau to “enables each processing resource to perform matrix computations while simultaneously obtaining matrix operands and data from both of its neighboring processing

As to claim 2, Van teaches wherein the request processing unit of each of the plurality of memory units is configured to receive a broadcasted memory request (e.g., para 61, “access processor 140 is adapted to provide a broadcast function. The broadcast function may be used for transferring the same configuration data simultaneously to multiple partitions. Such a broadcast function allows parallel partitioning of several partitions for one and the same processing operation, and subsequently parallel processing of the operand data. This enhances the speed of the corresponding processing operation”. Also, see para 70-72, “Access to these devices may involve arbitration by the I/O device for scheduling and fairness. In addition, requests to these I/O devices must not saturate the on-chip network in a way that causes congestion to stay within the network rather than the I/O devices should the I/O device create back-pressure”).

As to claim 3, Van teaches wherein the broadcasted memory request references data stored in each of the plurality of memory units (para 61).  

As to claim 4, Van does not explicitly teach wherein the request processing unit of each of the plurality of memory units is configured to decompose the broadcasted memory request into a corresponding plurality of partial requests. However, Lau teaches wherein the request processing unit of each of the plurality of memory units is configured to decompose the broadcasted memory request into a corresponding plurality of partial requests (e.g., para 101-

As to claim 5, Van does not teach wherein the request processing unit of each of the plurality of memory units is configured to determine whether each of the corresponding plurality of partial requests corresponds to data stored in a corresponding one of the plurality of memory banks associated with the corresponding request processing unit. However, Lau teaches wherein the request processing unit of each of the plurality of memory units is configured to determine whether each of the corresponding plurality of partial requests corresponds to data stored in a corresponding one of the plurality of memory banks associated with the corresponding request processing unit (e.g., para 104, “a particular matrix processing cluster may use its associated matrix processing engine 1700 to perform matrix-based processing and operations, such as partial matrix operations associated with a particular matrix operation

As to claim 6, Van does not teach wherein the request processing unit of each of the plurality of memory units is configured to provide a partial response associated with a different one of the corresponding plurality of partial requests. However, Lau teaches wherein the request processing unit of each of the plurality of memory units is configured to provide a partial response associated with a different one of the corresponding plurality of partial requests (e.g., para 72, “instance, a particular processing cluster (or client) 305 may send a request to an IO device (e.g., an HBM (e.g.,)). The request (at 1) may be routed to a particular processing cluster (e.g., 305) through the on-chip control network. The I/O device (e.g., 310a) may buffer (at 2) the various requests it receives and perform arbitration and scheduling of responses to the requests”, see FIG. 12). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Van by adopting the teachings of Lau to “enables each processing resource to perform matrix computations while simultaneously obtaining matrix operands and data from both of its neighboring processing resources 1610, which significantly reduces the latency for communicating matrix operands, and thus avoids any idle processing time” (See Lau, para 100).


As to claim 8, Van does not teach wherein each of the partial responses includes a corresponding sequence identifier used to order the partial responses. However, Lau teaches wherein each of the partial responses includes a corresponding sequence identifier used to order the partial responses (e.g., para 338 “the partial matrix data may include a partial result matrix” for “perform 

As to claim 9, Van does not teach wherein the complete response is stored in a local memory of the first processing element. However, Lau teaches wherein the complete response is stored in a local memory of the first processing element (para 72, see FIG. 12). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Van by adopting the teachings of Lau to “enables each processing resource to perform matrix computations while simultaneously obtaining matrix operands and data from both of its neighboring processing resources 1610, which significantly reduces the latency for communicating matrix operands, and thus avoids any idle processing time” (See Lau, para 100).

As to claim 10, Van does not teach wherein the plurality of memory units includes a north memory unit, an east memory unit, a south memory unit, and a west memory unit. However, Lau teaches wherein the plurality of memory units includes a north memory unit, an east memory unit, a south memory unit, and a west memory unit (e.g., para 75, “Data rotates or is broadcast east to west and north to south”). Thus, it would have been obvious to one of

As to claim 11, Van does not teach wherein the dynamically programmable distribution scheme utilizes an identifier associated with a workload of the first processing element. However, Lau teaches wherein the dynamically programmable distribution scheme utilizes an identifier associated with a workload of the first processing element (e.g., para 95, “master control CPU 1632 may receive an instruction to perform a matrix multiplication operation, such as C=A*B. The instruction may include the handles or identifiers for each matrix, and may also indicate how the matrices should be stored in memory resource blocks (MRBs) 1638”). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Van by adopting the teachings of Lau to “enables each processing resource to perform matrix computations while simultaneously obtaining matrix operands and data from both of its neighboring processing resources 1610, which significantly reduces the latency for communicating matrix operands, and thus avoids any idle processing time” (See Lau, para 100).

As to claim 12, Van teaches wherein two or more processing elements of the plurality of processing elements share the identifier (See FIG. 8, para 65).  



As to claim 14, Van does not teach wherein the control logic unit of the first processing element is further configured with an access unit size for distributing data across the plurality of memory units. However, Lau teaches wherein the control logic unit of the first processing element is further configured with an access unit size for distributing data across the plurality of memory units (e.g., para 54, “Each HBM interface (e.g., 320) may support a single HBM die stack (e.g., 310a-d) up to the currently supported maximum HBM capacity (in one example it could be 8 GB per stack)”). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Van by adopting the teachings of Lau to “enables each processing resource to perform matrix computations while simultaneously obtaining matrix operands and data from both of its neighboring processing resources 1610, which significantly reduces the latency for communicating matrix operands, and thus avoids any idle processing time” (See Lau, para 100).
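
For illustration only, the role an access unit size can play in distributing data across several memory units may be sketched as follows. This is a minimal sketch and not the applicant's implementation nor the disclosure of Van or Lau; the function name `locate`, its parameters, and the round-robin placement rule are hypothetical assumptions.

```python
def locate(address, access_unit_size, num_units):
    """Map a byte address to (memory unit index, access-unit number, byte offset).

    Hypothetical helper: assumes access units are placed round-robin
    across the memory units.
    """
    if address < 0 or access_unit_size <= 0 or num_units <= 0:
        raise ValueError("address must be >= 0; sizes must be positive")
    block = address // access_unit_size   # which access unit holds the byte
    unit = block % num_units              # assumed round-robin placement
    offset = address % access_unit_size   # position inside that access unit
    return unit, block, offset
```

Under these assumptions, with a 128-byte access unit and four memory units, byte address 300 falls in access unit 2, which is held by memory unit 2, at byte offset 44.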

As to claim 15, Van does not teach wherein data elements of a machine learning weight matrix are distributed across the plurality of memory units using the dynamically programmable distribution scheme.  However, Lau teaches wherein data elements of a machine learning weight matrix are distributed across the plurality of memory units using the dynamically programmable distribution scheme (e.g., para 196-200, “FIGS. 25, 26A-26C, 27A-27C, and 28A-28C”). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Van by adopting the teachings of Lau  to   
 
As to claim 16, Van teaches a method (See FIG. 3) comprising: 
 	receiving a memory configuration setting associated with a workload (e.g., para 49, “retrieval of the configuration data 3a and 3b and a quicker configuration of the respective partition for the complex simulation task compared with a scenario where the configuration data 3a and 3b are stored in one and the same memory bank. This will be explained below in more detail”, see FIG. 3); 
 	creating a memory access request that includes a workload identifier (e.g., para 39, “The access processor 140 is then adapted to execute this predefined program and perform address generation” and “the address generation for the retrieval/read operations and for the storage/write operations is usually done by the host processor while the address mapping and access scheduling is done by the memory controller”, “sends configuration data comprising a configuration data identifier or only a configuration data identifier to the computing system 100. The configuration data identifier comprises information regarding the type and identity of the configuration data” in para 43 and 65);
 	broadcasting the memory access request to a plurality of memory units (e.g., para 61, “access processor 140 is adapted to provide a broadcast function. The broadcast function may be used for transferring the same configuration data simultaneously to multiple partitions. Such a broadcast function allows parallel partitioning of several partitions for 
However, Van does not teach receiving a plurality of partial responses associated with the memory access request; and combining the plurality of partial responses to create a complete response to the memory access request. Lau teaches receiving a plurality of partial responses associated with the memory access request; and combining the plurality of partial responses to create a complete response to the memory access request (e.g., para 72, “a particular processing cluster (or client) 305 may send a request to an IO device (e.g., an HBM (e.g.,)). The request (at 1) may be routed to a particular processing cluster (e.g., 305) through the on-chip control network. The I/O device (e.g., 310a) may buffer (at 2) the various requests it receives and perform arbitration and scheduling of responses to the requests” for “a distributed matrix operation, the respective partial results determined by each processing resource may be consolidated on a particular memory component, such as a particular HBM 1740b of a matrix processing chip. For example, in some cases, the respective partial results determined by each cluster of a matrix processing chip may be consolidated on a particular HBM 1740b of the matrix processing chip. Moreover, the partial results may be stored on an HBM 1740b using a particular arrangement that collectively forms the complete result of the matrix operation” in para 378). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Van by adopting the teachings of Lau to “enables each processing resource to perform matrix computations while simultaneously obtaining matrix operands and data from both of its neighboring processing resources 1610, which significantly reduces the latency for communicating matrix operands, and thus avoids any idle processing time” (See Lau, para 100).
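
For illustration only, the steps recited in claim 16 (creating a request that carries a workload identifier, broadcasting it to a plurality of memory units, receiving partial responses, and combining them into a complete response) may be sketched as follows. This is a minimal sketch and not the applicant's implementation nor the disclosure of Van or Lau; the names `MemoryRequest`, `MemoryUnit`, `broadcast`, and `combine`, and the workload-dependent round-robin placement rule, are hypothetical assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryRequest:
    workload_id: int  # hypothetical workload identifier
    start: int        # first access-unit number requested
    count: int        # number of access units requested


class MemoryUnit:
    def __init__(self, unit_index, num_units):
        self.unit_index = unit_index
        self.num_units = num_units
        self.store = {}  # (workload_id, block) -> data

    def owns(self, workload_id, block):
        # Assumed placement rule: round-robin, shifted by the workload id.
        return (block + workload_id) % self.num_units == self.unit_index

    def write(self, workload_id, block, value):
        self.store[(workload_id, block)] = value

    def handle(self, request):
        # Partial response: only the (block, data) pairs this unit holds.
        blocks = range(request.start, request.start + request.count)
        return [(b, self.store[(request.workload_id, b)])
                for b in blocks if self.owns(request.workload_id, b)]


def broadcast(request, units):
    # Every memory unit receives the same request.
    return [unit.handle(request) for unit in units]


def combine(partials):
    # Order the pieces by access-unit number to form the complete response.
    merged = sorted(pair for partial in partials for pair in partial)
    return [value for _, value in merged]
```

In this sketch each memory unit answers only for the access units it holds, so the requester must merge the partial responses before the data is usable.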


As to claim 18, Van does not teach wherein the memory access request has a memory request size that is a multiple of the access unit size configuration setting. However, Lau teaches wherein the memory access request has a memory request size that is a multiple of the access unit size configuration setting (e.g., para 98, “each memory resource block (MRB) 1638 may be capable of storing a matrix of a certain size (e.g., a 256×512 matrix). In some embodiments, memory resource blocks (MRBs) 1638 may be shared by the matrix processing units (MPUs) 1634 of a particular matrix processing cluster 1630”). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Van by adopting the teachings of Lau to “enables each processing resource to perform matrix computations while simultaneously obtaining matrix operands and data from both of its neighboring processing resources 1610, which significantly

As to claim 19, Van teaches a method comprising: 
receiving a broadcasted memory request associated with a processing element workload (e.g., para 61, “access processor 140 is adapted to provide a broadcast function. The broadcast function may be used for transferring the same configuration data simultaneously to multiple partitions. Such a broadcast function allows parallel partitioning of several partitions for one and the same processing operation, and subsequently parallel processing of the operand data. This enhances the speed of the corresponding processing operation”). However, Van does not teach decomposing the broadcasted memory request into a plurality of partial requests; determining for each of the plurality of partial requests whether the partial request is to be served from an associated memory bank; discarding a first group of partial requests that is not to be served from the associated memory bank; for each partial request of a second group of partial requests that is to be served from the associated memory bank, retrieving data of the partial request; preparing one or more partial responses using the retrieved data; and providing the prepared one or more partial responses. Lau teaches decomposing the broadcasted memory request into a plurality of partial requests; determining for each of the plurality of partial requests whether the partial request is to be served from an associated memory bank (e.g., para 237, “command to perform a matrix operation. The matrix operation may comprise an operation associated with a plurality of input matrices (e.g., matrix operands), such as one or more matrix multiplication operations. In some embodiments, the matrix operation may be associated with an operation in a neural network, such as a forward propagation retires the request”); for each partial request of a second group of partial requests that is to be served from the associated memory bank, retrieving data of the partial request; preparing one or more partial responses using the retrieved data; and providing the prepared one or more partial responses (e.g., para 240-241, “to transmit partial matrix data between processing elements while performing the partial matrix operations”, “each processing element may transmit partial matrix data to its neighbor processing elements while performing a particular stage of the partial matrix operations”, “while in other matrix operations the partial matrix data may include a partial result matrix”, “determine a result of the matrix operation. For example, the result of the matrix operation may be determined based on the partial results collectively computed by the processing elements. At this point, the flowchart may be complete. In some embodiments, however, the flowchart may restart and/or certain blocks may be repeated. For example, in some embodiments, the flowchart may restart at block 2902 to continue receiving and processing commands to perform matrix operations”, see FIG. 29). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Van by adopting the teachings of Lau to “enables each processing resource to perform matrix computations while simultaneously obtaining matrix operands and data from both of its neighboring processing resources 1610, which significantly reduces the

As to claim 20, Van does not teach wherein preparing the one or more partial responses using the retrieved data includes inserting a sequence identifier into each of the one or more partial responses. However, Lau teaches wherein preparing the one or more partial responses using the retrieved data includes inserting a sequence identifier (e.g., “identifiers for each matrix”) into each of the one or more partial responses (e.g., para 95, “The instruction may include the handles or identifiers for each matrix, and may also indicate how the matrices should be stored in memory resource blocks (MRBs) 1638. Matrices A and B may then be broken down into a series of smaller matrices (e.g., 32×32 matrices). Matrix operations may then be performed on the smaller matrices, and the partial results may be stored in memory resource blocks (MRBs) 1638, until the output matrix C has been fully computed” and “the remaining partial calculations identified above (e.g., the 2nd-4th partial calculations for the partial results corresponding to partitions p2-p4 of OFM 3106) may be executed in parallel and in a similar manner as the 1st partial calculation” in para 267). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Van by adopting the teachings of Lau to “enables each processing resource to perform matrix computations while simultaneously obtaining matrix operands and data from both of its neighboring processing resources 1610, which significantly

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ABDOU K SEYE whose telephone number is (571)270-1062. The examiner can normally be reached M-F 9-5:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Dennis Chow, can be reached at (571)272-7767. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/ABDOU K SEYE/Examiner, Art Unit 2194
/CHARLES E ANYA/Primary Examiner, Art Unit 2194