DETAILED ACTION

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
The present Office action is in response to Applicant’s amendment/request for reconsideration submitted on October 12, 2020, hereinafter “Reply”, after non-final rejection of July 10, 2020, hereinafter “Non-Final Rejection”.  Claims 1, 12, and 17 have been amended.  No claims have been added or cancelled.  Claims 1-20 remain pending in the application.


Response to Amendments and Arguments
The Reply has been fully considered, with the Examiner’s response set forth below.
In view of the amendments to the claims, the objections to claims 1 and 12 in the Non-Final Rejection have been withdrawn.  However, please refer to the additional objections to claims 12 and 17 below, which arise from the amendments to the claims.
In view of the electronic terminal disclaimer filed on October 12, 2020, the provisional rejection of claims 1-19 of the current application on the ground of nonstatutory double patenting has been withdrawn.
Applicant's arguments in the Reply have been fully considered but they are not persuasive.
In re Van Geuns, 988 F.2d 1181, 26 USPQ2d 1057 (Fed. Cir. 1993).
Further, assuming Applicant refers to bolded limitations of claim 1 on p. 9 of the Reply, please refer to the rejection below.  Specifically, for the limitation of “wherein, in response to the first request, the arithmetic compute element matrix is configured to access a plurality of lists of operands stored in first memory regions in the plurality of memory regions, generate a list of results from the plurality of lists of operands, and store the list of results in a second memory region in the plurality of memory regions”, Nurvitadhi teaches the limitation in FIGs. 7 and 10, and ¶¶ 44 and 56-57.  For the limitation of “in response to the second request and during the time period, the integrated circuit memory device is configured to provide, in parallel, memory access to the first memory regions and the second memory region to the arithmetic compute element matrix in facilitating a computation of the list of results and memory access to 
As such, the rejections of the independent claims 1, 12, and 18 have been maintained due to the reasons stated above.  Accordingly, the rejections of dependent claims 2-11, 13-17, and 19-20 of the Non-Final Rejection have been maintained because the dependent claims are rejected as dependent on and do not cure the deficiency of the independent claims 1, 12, and 18.
Another iteration of claim analysis has been made due to the amendments to the claims in the Reply. Refer to the corresponding sections of the claim analysis below for details. 


Claim Objections
Claims 12 and 17 are objected to because of the following informalities:
In claim 12, line 22, “a computation of the list of results” may be amended to follow proper antecedent basis.  It appears that Applicant intends to recite “wherein the computing the list of results” or “wherein the computation of the list of results” to follow proper antecedent basis based on a limitation of “computing, by the arithmetic compute element matrix, a list of results” in claim 12, line 10.
In claim 17, lines 1-2, “wherein a computing of an output” may be amended to follow proper antecedent basis.  It appears that Applicant intends to recite “wherein the computing the list of results” or “wherein the computation of the 
Other claims with informalities that are the same as those above and not included here should be amended due to the same reasons set forth above.
Appropriate correction is required. 


Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1-2, 6-9, and 12-19 are rejected under 35 U.S.C. 103 as being unpatentable over Nurvitadhi et al. (US 2019/0042251 A1), hereinafter “Nurvitadhi”, in view of Yu et al. (US 2020/0004514 A1), hereinafter “Yu”.

	Regarding claim 1, Nurvitadhi teaches:
An integrated circuit memory device (FIGs. 1-2; ¶ 28, “Using the system 10, a designer may implement a circuit design functionality on an integrated circuit, such as a reconfigurable programmable logic device 12 [integrated circuit memory device], such as a field programmable gate array (FPGA)”; ¶ 29, “the programmable logic device 12 [integrated circuit memory device] may include a fabric die 22 that communicates with a base die 24. The base die 24 may perform compute-in-memory arithmetic computations in the memory of the base die 24, while the fabric die 22 may be used for general purposes”), comprising: 
a plurality of memory regions (FIG. 3; ¶ 43, “The FPGA 40 of FIG. 3 is shown to be sectorized, meaning that programmable logic resources may be distributed through a number of discrete programmable logic sectors 48 (e.g., region, portion) [memory regions]”); 
an arithmetic compute element matrix coupled to access the plurality of memory regions in parallel (FIGs. 6, 8, 9A-9B, 10; ¶ 43, “there may be N regions of sector-aligned memory 92 [memory regions] that can be accessible by N corresponding fabric sectors 80 at the same time (e.g., in parallel). … The sector-aligned memory 92 is shown in FIG. 6 as vertically stacked memory. This may allow a large amount of memory to be located within the base die 24”; ¶ 47, “the on-chip memory 126 may include memory banks divided into multiple memory sectors 136, which may include dedicated blocks of random access memory (RAM), such as the sector-aligned memory 92 [memory regions]. Some of the sector-aligned memory 92 may be integrated with compute-in-memory circuitry 71 [arithmetic compute element matrix]. The compute-in-memory circuitry 71 [arithmetic compute element matrix] associated with sector-aligned memory 92 [memory regions] may have a corresponding controller 138 … the controller 138 may control a sequence of compute-in-memory operations using multiple integrated sector-aligned memory 92 units [memory regions] and compute-in-memory circuitry 71 [arithmetic compute element matrix]”; ¶ 50, “the application 122 may communicate with the base die 24 to scatter specific different data to multiple instances of the compute-in-memory circuitry 71 [arithmetic compute element matrix] via multiple interfaces, performing a parallel scatter operation, as shown in FIG. 9B. In this manner, the compute-in-memory circuitry 71 may receive the multiple different data in parallel. Thus, the interconnect paths between the dies 22, 24, the multiple sectors 136, and the sector-aligned memory 92 [memory regions] may allow the compute-in-memory circuitry 71 to efficiently receive scattered data from the application 122 to perform calculations in the sector-aligned memory 92”; ¶ 54, “FIG. 10 depicts using the compute-in-memory circuitry 71 [arithmetic compute element matrix] to perform a compute-in-memory operation that may be used for tensor operations. Tensors are data structures, such as matrices and vectors, that may be used to calculate arithmetic operations. Particularly, dot product of vectors and matrices (matrix multiplication) may be used for deep learning or training an algorithm”); and 
a communication interface coupled to the arithmetic compute element matrix and configured to receive a first request (FIGs. 2, 7; ¶ 44, “the on-chip memory 126 stores computational data 131 that may be used in computations by the compute-in-memory circuitry 71 [arithmetic compute element matrix] to carry out requests [first request] by the application 122. The application 122 may communicate with the on-chip memory 126 via an interconnect 132 [communication interface], which may represent the silicon bridge 36 of FIG. 2”); 
wherein, in response to the first request, the arithmetic compute element matrix is configured to access a plurality of lists of operands stored in first memory regions in the plurality of memory regions, generate a list of results from the plurality of lists of operands, and store the list of results in a second memory region in the plurality of memory regions (FIGs. 7, 10; ¶ 44, “much of the data may reside in on-chip memory 126 (e.g., which may represent memory of the sector-aligned memory 92) in the base die 24 (which may be understood to be off-chip from the fabric die 22) and/or in off-chip memory 127 located elsewhere. In the example of FIG. 7, the on-chip memory 126 stores computational data 131 that may be used in computations by the compute-in-memory circuitry 71 [arithmetic compute element matrix] to carry out requests [first request] by the application 122”; ¶ 56, “the controller 138 may control the compute-in-memory circuitry 71 [arithmetic compute element matrix] to perform the arithmetic computations. In this example, the compute-in-memory circuitry 71 is operated as a dot product engine (DPE) 142. The dot product engine 142 may compute the dot product of vectors and matrices [lists of operands] stored in the sector-aligned memory 91 [first memory regions]”; ¶ 57, “After the data is received by the dot product engine 142 and/or dot product has been computed, the dot product engine 142 may send the computed data [list of results] to the sector-aligned memory 92 [second memory region] to store the data for future use or additional computations”; note that the first memory region is where the vectors and matrices [lists of operands] are stored in the sector-aligned memory 91; a second memory region is where the computed data [list of results] are stored in the sector-aligned memory 92 [memory regions]); 
wherein, during a time period after the first request and before completion of storing the list of results into the second memory region (FIG. 11A; ¶ 58, “To illustrate the type of dot product operations that may be performed using the compute-in-memory architecture described above, FIG. 11A shows a sequence of computations to perform matrix operations and FIG. 11B shows a sequence of computations to perform convolution operations. In FIG. 11A, multiple vectors may be simultaneously sent from the application 122 to the base die 24. As shown, the base die 24 memory may be grouped into multiple sectors 136, such as a first sector 150 (sector 0) and a second sector 152 (sector 1). The application 122 may send [first request] a first vector input 154 (Vi0) to a first sector 0 aligned memory 158 and second sector 0 aligned memory 159”; ¶ 59, “the dot product engines 142 corresponding to sector 0 aligned memories 158, 159 may compute a product of the first vector input 154 and the first matrix 162, and a product of the first vector input 154 and the second matrix 164, to determine M0,0 and M0,1. These partial computations may be gathered or accumulated by the accumulator 144, and reduced using the techniques described above, and read to the application 122 to be stored as a partial sum, first vector output 166, Vo0.”; note that a time period is after the application 122 sends [first request] a first vector input 154 (Vi0) to a first sector 0 aligned memory 158 and second sector 0 aligned memory 159 and before the generating and the sending of the first vector output 166, Vo0 to the application as shown in FIG. 11A), 
the communication interface is configured to receive a second request to access a third memory region in the plurality of memory regions (FIG. 11A; ¶ 44, “The application 122 may communicate with the on-chip memory 126 via an interconnect 132 [communication interface], which may represent the silicon bridge 36 of FIG. 2”; ¶ 58, “The application may also send [second request] a second vector input (Vi1) to first sector 1 aligned memory 160 and a second sector 1 aligned memory 161. The sector 0 aligned memories 158, 159 and the sector 1 aligned memories 160, 161 may already store matrix data, such as a first matrix 162 (M0) and a second matrix 164 (M1)”; ¶ 59, “the dot product engines 142 corresponding to sector 1 aligned memories 160, 161 [third memory region] may compute a product of the second vector input 156 and the first matrix 162, and a product of the second vector input 156 and the second matrix 164, to determine M1,0 and M1,1”); and 
in response to the second request and during the time period, the integrated circuit memory device is configured to provide, in parallel, memory access to the first memory regions and the second memory region to the arithmetic compute element matrix in facilitating a computation of the list of results and memory access to the third memory region to service the second request through the communication interface (FIG. 11A; ¶ 44, “The application 122 may communicate with the on-chip memory 126 via an interconnect 132 [communication interface], which may represent the silicon bridge 36 of FIG. 2”; ¶ 56, “the controller 138 may control the compute-in-memory circuitry 71 [arithmetic compute element matrix] to perform the arithmetic computations. In this example, the compute-in-memory circuitry 71 is operated as a dot product engine (DPE) 142. The dot product engine 142 may compute the dot product of vectors and matrices [lists of operands] stored in the sector-aligned memory 91 [first memory regions]”; ¶ 57, “After the data is received by the dot product engine 142 and/or dot product has been computed, the dot product engine 142 may send the computed data [list of results] to the sector-aligned memory 92 [second memory region] to store the data for future use or additional computations. Additionally or alternatively, the dot product engine 142 may send the computed data to an accumulator 148”; ¶ 58, “multiple vectors may be simultaneously [parallel] sent from the application 122 to the base die 24. As shown, the base die 24 memory may be grouped into multiple sectors 136, such as a first sector 150 (sector 0) and a second sector 152 (sector 1). The application 122 may send [second request] a first vector input 154 (Vi0) to a first sector 0 aligned memory 158 and second sector 0 aligned memory 159. The application may also send [second request] a second vector input (Vi1) to first sector 1 aligned memory 160 and a second sector 1 aligned memory 161. The sector 0 aligned memories 158, 159 and the sector 1 aligned memories 160, 161 [third memory region] may already store matrix data, such as a first matrix 162 (M0) and a second matrix 164 (M1)”); and 
wherein the integrated circuit memory device is encapsulated within an integrated circuit package (FIG. 13; ¶ 63, “The programmable logic device 12 [integrated circuit memory device] may be, or may be a component of, a data processing system. … The data processing system 220 may include several different packages or may be contained within a single package on a single package substrate.”).  



However, Yu teaches:
a time period after the first request and before completion of storing the list of results into the second memory region (FIGs. 8, 11B; ¶ 132, “The data loading engine 831 can execute a data loading instruction that loads data for performing neural network computation from the external memory to the internal buffer. The loaded data may include parameter data and feature map data. The parameter data may include weight data (e.g., convolution kernels) and other parameters such as biases. The feature map data may include input image data, and may also include intermediate calculation results of the respective convolutional layers. The data operation engine 832 can execute a data operation instruction that reads the weight data and the feature map data from the internal buffer 820 to perform an operation and stores the operational result back to the internal buffer 820. The data storage engine 833 can then execute a data storage instruction that stores the operational result from internal buffer 820 back to the external memory 840”; ¶ 133, “the acquired instructions for neural network computation may include: a data loading instruction that loads data for neural network computation from the external memory to the internal buffer, the data for neural network computation includes parameter data and feature map data; a data operation instruction [first request] that reads the parameter data and the feature map data from the internal buffer to perform an operation and stores the result of the operation back to the internal buffer [second memory region]; and a data storage instruction [second request] that stores the operational result from the internal buffer back to the external memory”; ¶ 135, “in a neural network specialized processor, the execution of the subsequent instruction [second request] may be started using other engines before the execution of the current instruction [first request] is completed, as shown in FIG. 11B. Thus, the overall computational efficiency of the computing system is improved by temporally partially superimposing the execution of the instructions that originally have dependency relationships”; note that a time period includes a duration of execution of the data operation instruction [first request], and the data storage instruction [second request] as the subsequent instruction that stores the operational result from the internal buffer back to the external memory using the data storage engine 833 occurs before the data operation instruction [first request] as the current instruction executed by the data operation engine 832 is completed as shown in FIG. 11B).

	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Nurvitadhi to incorporate the teachings of Yu to provide an integrated circuit device having circuitry to perform arithmetic computations in memory of a first integrated circuit die accessible by a separate integrated circuit die of Nurvitadhi that may be used for artificial intelligence (AI) matrix multiplication operations, with a high parallelism computing system for artificial intelligence applications of Yu having the data loading engine 831, the data operation engine 832, and the data storage engine 833 implement respective instruction functions under the scheduling of internal instruction reading and distribution 810.  Doing so with the high parallelism computing system of Yu would make full use of the parallel execution capability of each module in the computing platform to improve the system computing efficiency to optimize high parallelism computation.  (Yu, ¶¶ 3-5) 

Regarding claim 12, the claimed method comprises substantially the same steps or elements as those in claim 1.  Accordingly, the claim is also rejected for the same reasons as set forth for those in claim 1 above.

	Regarding claim 2, the combination of Nurvitadhi and Yu teaches the integrated circuit memory device of claim 1.

Nurvitadhi further teaches:
wherein the plurality of memory regions provides dynamic random access memory (DRAM), cross point memory, or flash memory, or any combination therein (FIG. 13; ¶ 63, “The programmable logic device 12 may be, or may be a component of, a data processing system. For example, the programmable logic device 12 may be a component of a data processing system 220, shown in FIG. 13. … The memory and/or storage circuitry 224 may include random access memory (RAM), read-only memory (ROM), one or more hard drives, flash memory”).  

Regarding claim 6, the combination of Nurvitadhi and Yu teaches the integrated circuit memory device of claim 1.

Nurvitadhi further teaches:
wherein the arithmetic compute element matrix comprises: 
an array of arithmetic logic units configured to perform an operation on a plurality of data sets in parallel (FIG. 11A; ¶ 58, “To illustrate the type of dot product operations that may be performed using the compute-in-memory architecture described above, FIG. 11A shows a sequence of computations to perform matrix operations and FIG. 11B shows a sequence of computations to perform convolution operations. In FIG. 11A, multiple vectors may be simultaneously sent from the application 122 to the base die 24”), wherein each of the data sets includes one data element from each of the lists of operands (FIG. 11A; ¶ 59, “the dot product engines 142 corresponding to sector 0 aligned memories 158, 159 may compute a product of the first vector input 154 and the first matrix 162, and a product of the first vector input 154 and the second matrix 164, to determine M0,0 and M0,1.”).  

Regarding claim 7, the combination of Nurvitadhi and Yu teaches the integrated circuit memory device of claim 6.

Nurvitadhi further teaches:
wherein the arithmetic compute element matrix comprises: 
a state machine configured to control the array of arithmetic logic units to perform different computations identified by different codes of operations (FIG. 8; ¶ 47, “The compute-in-memory circuitry 71 associated with sector-aligned memory 92 may have a corresponding controller 138 (e.g., a state machine, an instruction set architecture (ISA) based processor, a reduced instruction set computer (RISC) processor, or the like). The controller 138 may be used to move computational data 131 between the sectors 136 and dies 22, 24. … the controller 138 may control a sequence of compute-in-memory operations using multiple integrated sector-aligned memory 92 units and compute-in-memory circuitry 71. In this manner, the fabric die 22 may offload application-specific commands to the compute-in-memory circuitry 71 in the base die 24”).  

Regarding claim 8, the combination of Nurvitadhi and Yu teaches the integrated circuit memory device of claim 7.

Nurvitadhi further teaches:
wherein the state machine is further configured to control the array of arithmetic logic units to perform computations for the lists of operands that have more data sets than the plurality of data sets that can be processed in parallel by the array of arithmetic logic units (FIG. 11A; ¶ 58, “multiple vectors may be simultaneously sent from the application 122 to the base die 24. As shown, the base die 24 memory may be grouped into multiple sectors 136, such as a first sector 150 (sector 0) and a second sector 152 (sector 1). The application 122 may send a first vector input 154 (Vi0) to a first sector 0 aligned memory 158 and second sector 0 aligned memory 159.”; ¶ 59, “the dot product engines 142 corresponding to sector 0 aligned memories 158, 159 may compute a product of the first vector input 154 and the first matrix 162, and a product of the first vector input 154 and the second matrix 164, to determine M0,0 and M0,1.”; note that for each sector (such as sector 0), there are only 2 dot product engines 142 to compute a product of the first vector input 154 and the first matrix 162, and a product of the first vector input 154 and the second matrix 164, but there are a total of 2 vector inputs 154 and 156 in the data sets to be processed, and thus one sector is not enough to process both vectors).  

Regarding claim 9, the combination of Nurvitadhi and Yu teaches the integrated circuit memory device of claim 7.

Nurvitadhi further teaches:
wherein the arithmetic compute element matrix further comprises: 
a cache memory configured to store a list of results generated in parallel by the array of arithmetic logic units (FIGs. 7, 10; ¶ 57, “After the data is received by the dot product engine 142 and/or dot product has been computed, the dot product engine 142 may send the computed data [list of results] to the sector-aligned memory 92 [second memory region] to store the data for future use or additional computations”; note that the computed data [list of results] is stored in the sector-aligned memory 92 [second memory region], which is thus considered a cache memory).  

Regarding claim 13, the combination of Nurvitadhi and Yu teaches the method of claim 12.

Nurvitadhi further teaches:
wherein the first request is a memory access command configured to access a memory location in the integrated circuit memory device (FIGs. 9A-9E; ¶ 48, “To illustrate some different application-specific compute-in-memory calculations that may be performed using the integrated sector-aligned memory 92 and compute-in-memory circuitry 71 architecture, FIGS. 9A, 9B, 9C, 9D, and 9E depict various operation sequences that may support the computations, such as gather and scatter operations. Briefly, gather and scatter operations are two data-transfer operations, transferring a number of data items by reading from (gathering) or writing to (scattering) a given location.”).  

Regarding claim 14, the combination of Nurvitadhi and Yu teaches the method of claim 13.

Nurvitadhi further teaches:
wherein the memory location stores a code identifying a computation to be performed by the arithmetic compute element matrix to generate the list of results (FIGs. 11A-11B; ¶ 60, “The sector-aligned memories 172, 174, 176, 178 may store functions, f1 and f2, which may be used by the dot product engines 142 for convolution computations”).  

Regarding claim 15, the combination of Nurvitadhi and Yu teaches the method of claim 14.

Nurvitadhi further teaches:
wherein the memory location is predefined to store the code (FIGs. 11A-11B; ¶ 60, “The sector-aligned memories 172, 174, 176, 178 may store functions, f1 and f2, which may be used by the dot product engines 142 for convolution computations”; note that since locations in the sector-aligned memories 172, 174, 176, 178 must be determined prior to storing the functions, f1 and f2, the locations are considered as predefined).  

Regarding claim 16, the combination of Nurvitadhi and Yu teaches the method of claim 14.

Nurvitadhi further teaches:
wherein the second request is a memory read command, or a memory write command, or any combination thereof (FIGs. 9A-9E, 11A; ¶ 44, “The application 122 may communicate with the on-chip memory 126 via an interconnect 132 [communication interface], which may represent the silicon bridge 36 of FIG. 2”; ¶ 58, “The application may also send [second request] a second vector input (Vi1) to first sector 1 aligned memory 160 and a second sector 1 aligned memory 161. The sector 0 aligned memories 158, 159 and the sector 1 aligned memories 160, 161 may already store matrix data, such as a first matrix 162 (M0) and a second matrix 164 (M1)”; ¶ 48, “To illustrate some different application-specific compute-in-memory calculations that may be performed using the integrated sector-aligned memory 92 and compute-in-memory circuitry 71 architecture, FIGS. 9A, 9B, 9C, 9D, and 9E depict various operation sequences that may support the computations, such as gather and scatter operations. Briefly, gather and scatter operations are two data-transfer operations, transferring a number of data items by reading from (gathering) or writing to (scattering) a given location.”).  

Regarding claim 17, the combination of Nurvitadhi and Yu teaches the method of claim 12.

Nurvitadhi further teaches:
wherein a computing of an output comprises: 
performing an operation on a plurality of data sets in parallel to generate a plurality of results respectively (FIGs. 7, 10, 11A; ¶ 56, “the controller 138 may control the compute-in-memory circuitry 71 [arithmetic compute element matrix] to perform the arithmetic computations. In this example, the compute-in-memory circuitry 71 is operated as a dot product engine (DPE) 142. The dot product engine 142 may compute the dot product of vectors and matrices [lists of operands] stored in the sector-aligned memory 91 [first memory regions]”; ¶ 57, “After the data is received by the dot product engine 142 and/or dot product has been computed, the dot product engine 142 may send the computed data [results] to the sector-aligned memory 92 [second memory region] to store the data for future use or additional computations”; ¶ 58, “To illustrate the type of dot product operations that may be performed using the compute-in-memory architecture described above, FIG. 11A shows a sequence of computations to perform matrix operations and FIG. 11B shows a sequence of computations to perform convolution operations. In FIG. 11A, multiple vectors may be simultaneously sent from the application 122 to the base die 24”), wherein each of the data sets includes one data element from each of the lists of operands (FIG. 11A; ¶ 59, “the dot product engines 142 corresponding to sector 0 aligned memories 158, 159 may compute a product of the first vector input 154 and the first matrix 162, and a product of the first vector input 154 and the second matrix 164, to determine M0,0 and M0,1.”).  

	Regarding claim 18, Nurvitadhi teaches:
A computing apparatus, comprising: 
a processing device (FIGs. 1-2; ¶ 29, “the programmable logic device 12 may include a fabric die 22 [processing device] that communicates with a base die 24”); 
a memory device encapsulated within an integrated circuit package (FIG. 13; ¶ 30, “the programmable logic device 12 includes the fabric die 22 and the base die 24 [memory device] that are connected to one another via microbumps 26”; ¶ 63, “The programmable logic device 12 may be, or may be a component of, a data processing system. … The data processing system 220 may include several different packages or may be contained within a single package on a single package substrate.”); and 
a communication connection between the memory device and the processing device (FIGs. 1-2; ¶ 29, “the programmable logic device 12 may include a fabric die 22 [processing device] that communicates with a base die 24 [memory device]”; ¶ 30, “The base die 24 may attach to a package substrate 32 via C4 bumps 34. In the example of FIG. 2, two pairs of fabric die 22 and base die 24 are shown communicatively connected to one another via a silicon bridge 36 (e.g., an embedded multi-die interconnect bridge (EMIB)) and microbumps 38 at a silicon bridge interface 39.”; ¶ 30, “the programmable logic device 12 includes the fabric die 22 and the base die 24 that are connected to one another via microbumps 26”); 
wherein the memory device comprises: 
a plurality of memory regions (FIG. 3; ¶ 43, “The FPGA 40 of FIG. 3 is shown to be sectorized, meaning that programmable logic resources may be distributed through a number of discrete programmable logic sectors 48 (e.g., region, portion) [memory regions]”); 
an arithmetic compute element matrix coupled to access the plurality of memory regions in parallel (FIGs. 6, 8, 9A-9B, 10; ¶ 43, “there may be N regions of sector-aligned memory 92 [memory regions] that can be accessible by N corresponding fabric sectors 80 at the same time (e.g., in parallel). … The sector-aligned memory 92 is shown in FIG. 6 as vertically stacked memory. This may allow a large amount of memory to be located within the base die 24”; ¶ 47, “the on-chip memory 126 may include memory banks divided into multiple memory sectors 136, which may include dedicated blocks of random access memory (RAM), such as the sector-aligned memory 92 [memory regions]. Some of the sector-aligned memory 92 may be integrated with compute-in-memory circuitry 71 [arithmetic compute element matrix]. The compute-in-memory circuitry 71 [arithmetic compute element matrix] associated with sector-aligned memory 92 [memory regions] may have a corresponding controller 138 … the controller 138 may control a sequence of compute-in-memory operations using multiple integrated sector-aligned memory 92 units [memory regions] and compute-in-memory circuitry 71 [arithmetic compute element matrix]”; ¶ 50, “the application 122 may communicate with the base die 24 to scatter specific different data to multiple instances of the compute-in-memory circuitry 71 [arithmetic compute element matrix] via multiple interfaces, performing a parallel scatter operation, as shown in FIG. 9B. In this manner, the compute-in-memory circuitry 71 may receive the multiple different data in parallel. Thus, the interconnect paths between the dies 22, 24, the multiple sectors 136, and the sector-aligned memory 92 [memory regions] may allow the compute-in-memory circuitry 71 to efficiently receive scattered data from the application 122 to perform calculations in the sector-aligned memory 92”; ¶ 54, “FIG. 10 depicts using the compute-in-memory circuitry 71 [arithmetic compute element matrix] to perform a compute-in-memory operation that may be used for tensor operations. Tensors are data structures, such as matrices and vectors, that may be used to calculate arithmetic operations. Particularly, dot product of vectors and matrices (matrix multiplication) may be used for deep learning or training an algorithm”); and 
a communication interface coupled to the arithmetic compute element matrix and configured to receive a first request from the processing device through the communication connection (FIGs. 2, 7; ¶ 44, “A circuit design define an application 122 (e.g., an accelerator function such as an artificial intelligence (AI) function) that may involve a large amount of data, as in the example shown in FIG. 7. In this case, much of the data may reside in on-chip memory 126 (e.g., which may represent memory of the sector-aligned memory 92) in the base die 24 (which may be understood to be off-chip from the fabric die 22 [processing device]) … the on-chip memory 126 stores computational data 131 that may be used in computations by the compute-in-memory circuitry 71 [arithmetic compute element matrix] to carry out requests [first request] by the application 122. The application 122 may communicate with the on-chip memory 126 via an interconnect 132 [communication interface], which may represent the silicon bridge 36 of FIG. 2”; note that the application 122 is in the fabric die 22 [processing device] as illustrated in FIG. 7); 
wherein, in response to the first request, the arithmetic compute element matrix is configured to access a plurality of lists of operands stored in first memory regions in the plurality of memory regions, generate a list of results from the plurality of lists of operands, and store the list of results in a second memory region in the plurality of memory regions (FIGs. 7, 10; ¶ 44, “much of the data may reside in on-chip memory 126 (e.g., which may represent memory of the sector-aligned memory 92) in the base die 24 (which may be understood to be off-chip from the fabric die 22) and/or in off-chip memory 127 located elsewhere. In the example of FIG. 7, the on-chip memory 126 stores computational data 131 that may be used in computations by the compute-in-memory circuitry 71 [arithmetic compute element matrix] to carry out requests [first request] by the application 122”; ¶ 56, “the controller 138 may control the compute-in-memory circuitry 71 [arithmetic compute element matrix] to perform the arithmetic computations. In this example, the compute-in-memory circuitry 71 is operated as a dot product engine (DPE) 142. The dot product engine 142 may compute the dot product of vectors and matrices [lists of operands] stored in the sector-aligned memory 91 [first memory regions]”; ¶ 57, “After the data is received by the dot product engine 142 and/or dot product has been computed, the dot product engine 142 may send the computed data [list of results] to the sector-aligned memory 92 [second memory region] to store the data for future use or additional computations”; note that the first memory region is where the vectors and matrices [lists of operands] are stored in the sector-aligned memory 91; a second memory region is where the computed data [list of results] are stored in the sector-aligned memory 92 [memory regions]); 
wherein, during a time period after the first request and before completion of storing the list of results into the second memory region (FIG. 11A; ¶ 58, “To illustrate the type of dot product operations that may be performed using the compute-in-memory architecture described above, FIG. 11A shows a sequence of computations to perform matrix operations and FIG. 11B shows a sequence of computations to perform convolution operations. In FIG. 11A, multiple vectors may be simultaneously sent from the application 122 to the base die 24. As shown, the base die 24 memory may be grouped into multiple sectors 136, such as a first sector 150 (sector 0) and a second sector 152 (sector 1). The application 122 may send [first request] a first vector input 154 (Vi0) to a first sector 0 aligned memory 158 and second sector 0 aligned memory 159”; ¶ 59, “the dot product engines 142 corresponding to sector 0 aligned memories 158, 159 may compute a product of the first vector input 154 and the first matrix 162, and a product of the first vector input 154 and the second matrix 164, to determine M0,0 and M0,1. These partial computations may be gathered or accumulated by the accumulator 144, and reduced using the techniques described above, and read to the application 122 to be stored as a partial sum, first vector output 166, Vo0.”; note that a time period is after the application 122 sends [first request] a first vector input 154 (Vi0) to a first sector 0 aligned memory 158 and second sector 0 aligned memory 159 and before the generating and the sending of the first vector output 166, Vo0 to the application as shown in FIG. 11A), the communication interface is configured to receive from the processing device, a second request to access a third memory region in the plurality of memory regions (FIG. 11A; ¶ 44, “The application 122 [in the fabric die 22, which is the processing device] may communicate with the on-chip memory 126 via an interconnect 132 [communication interface], which may represent the silicon bridge 36 of FIG. 2”; ¶ 58, “The application may also send [second request] a second vector input (Vi1) to first sector 1 aligned memory 160 and a second sector 1 aligned memory 161. The sector 0 aligned memories 158, 159 and the sector 1 aligned memories 160, 161 may already store matrix data, such as a first matrix 162 (M0) and a second matrix 164 (M1)”; ¶ 59, “the dot product engines 142 corresponding to sector 1 aligned memories 160, 161 [third memory region] may compute a product of the second vector input 156 and the first matrix 162, and a product of the second vector input 156 and the second matrix 164, to determine M1,0 and M1,1”); and 
wherein, in response to the second request and during the time period, the memory device is configured to provide, in parallel, memory access to the first memory regions and the second memory region to the arithmetic compute element matrix in facilitating the computation of the list of results and memory access to the third memory region to service the second request through the communication interface (FIG. 11A; ¶ 44, “The application 122 may communicate with the on-chip memory 126 via an interconnect 132 [communication interface], which may represent the silicon bridge 36 of FIG. 2”; ¶ 56, “the controller 138 may control the compute-in-memory circuitry 71 [arithmetic compute element matrix] to perform the arithmetic computations. In this example, the compute-in-memory circuitry 71 is operated as a dot product engine (DPE) 142. The dot product engine 142 may compute the dot product of vectors and matrices [lists of operands] stored in the sector-aligned memory 91 [first memory regions]”; ¶ 57, “After the data is received by the dot product engine 142 and/or dot product has been computed, the dot product engine 142 may send the computed data [list of results] to the sector-aligned memory 92 [second memory region] to store the data for future use or additional computations. Additionally or alternatively, the dot product engine 142 may send the computed data to an accumulator 148”; ¶ 58, “multiple vectors may be simultaneously [parallel] sent from the application 122 to the base die 24. As shown, the base die 24 memory may be grouped into multiple sectors 136, such as a first sector 150 (sector 0) and a second sector 152 (sector 1). The application 122 may send [second request] a first vector input 154 (Vi0) to a first sector 0 aligned memory 158 and second sector 0 aligned memory 159. The application may also send [second request] a second vector input (Vi1) to first sector 1 aligned memory 160 and a second sector 1 aligned memory 161. The sector 0 aligned memories 158, 159 and the sector 1 aligned memories 160, 161 [third memory region] may already store matrix data, such as a first matrix 162 (M0) and a second matrix 164 (M1)”).  

Nurvitadhi teaches a time period.  Nevertheless, Nurvitadhi does not explicitly teach a time period after the first request and before completion of storing the list of results into the second memory region.

However, Yu teaches:
a time period after the first request and before completion of storing the list of results into the second memory region (FIGs. 8, 11B; ¶ 132, “The data loading engine 831 can execute a data loading instruction that loads data for performing neural network computation from the external memory to the internal buffer. The loaded data may include parameter data and feature map data. The parameter data may include weight data (e.g., convolution kernels) and other parameters such as biases. The feature map data may include input image data, and may also include intermediate calculation results of the respective convolutional layers. The data operation engine 832 can execute a data operation instruction that reads the weight data and the feature map data from the internal buffer 820 to perform an operation and stores the operational result back to the internal buffer 820. The data storage engine 833 can then execute a data storage instruction that stores the operational result from internal buffer 820 back to the external memory 840”; ¶ 133, “the acquired instructions for neural network computation may include: a data loading instruction that loads data for neural network computation from the external memory to the internal buffer, the data for neural network computation includes parameter data and feature map data; a data operation instruction [first request] that reads the parameter data and the feature map data from the internal buffer to perform an operation and stores the result of the operation back to the internal buffer [second memory region]; and a data storage instruction [second request] that stores the operational result from the internal buffer back to the external memory”; ¶ 135, “in a neural network specialized processor, the execution of the subsequent instruction [second request] may be started using other engines before the current instruction [first request] is completed, as shown in FIG. 11B. Thus, the overall computational efficiency of the computing system is improved by temporally partially superimposing the execution of the instructions that originally have dependency relationships”; note that a time period includes a duration of execution of the data operation instruction [first request], and the data storage instruction [second request] as the subsequent instruction that stores the operational result from the internal buffer back to the external memory using the data storage engine 833 occurs before the data operation instruction [first request] as the current instruction executed by the data operation engine 832 is completed as shown in FIG. 11B).

	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Nurvitadhi to incorporate the teachings of Yu to provide an integrated circuit device having circuitry to perform arithmetic computations in memory of a first integrated circuit die accessible by a separate integrated circuit die of Nurvitadhi that may be used for artificial intelligence (AI) matrix multiplication operations, with a high parallelism computing system for artificial intelligence applications of Yu having the data loading engine 831, the data operation engine 832, and the data storage engine 833 implement respective instruction functions under the scheduling of internal instruction reading and distribution module 810.  Doing so with the high parallelism computing system of Yu would make full use of the parallel execution capability of each module in the computing platform to improve the system computing efficiency to optimize high parallelism computation.  (Yu, ¶¶ 3-5)

Regarding claim 19, the combination of Nurvitadhi teaches the computing apparatus of claim 18.

Nurvitadhi further teaches:
wherein the processing device is configured to load input data into the third memory region via the second request during the time period in which the list of results are computed in the arithmetic compute element matrix (FIG. 11A; ¶ 44, “The application 122 [in the fabric die 22, which is the processing device] may communicate with the on-chip memory 126 via an interconnect 132 [communication interface], which may represent the silicon bridge 36 of FIG. 2”; ¶ 58, “The application may also send [second request] a second vector input (Vi1) [input data] to first sector 1 aligned memory 160 and a second sector 1 aligned memory 161. The sector 0 aligned memories 158, 159 and the sector 1 aligned memories 160, 161 may already store matrix data, such as a first matrix 162 (M0) and a second matrix 164 (M1)”; ¶ 59, “the dot product engines 142 corresponding to sector 1 aligned memories 160, 161 [third memory region] may compute a product of the second vector input 156 and the first matrix 162, and a product of the second vector input 156 and the second matrix 164, to determine M1,0 and M1,1”).  


Claims 3-4 are rejected under 35 U.S.C. 103 as being unpatentable over Nurvitadhi et al. (US 2019/0042251 A1), hereinafter “Nurvitadhi”, in view of Yu et al. (US 2020/0004514 A1), hereinafter “Yu”, as applied to claim 2 above, and further in view of Jayasena et al. (US 2015/0106574 A1), hereinafter “Jayasena”.

Regarding claim 3, the combination of Nurvitadhi teaches the integrated circuit memory device of claim 2.

	The combination of Nurvitadhi does not teach wherein the plurality of memory regions is formed on a first integrated circuit die; and the arithmetic compute element matrix is formed on a second integrated circuit die different from the first integrated circuit die.

However, Jayasena teaches:
wherein the plurality of memory regions is formed on a first integrated circuit die; and the arithmetic compute element matrix is formed on a second integrated circuit die different from the first integrated circuit die (FIG. 1; ¶ 31, “processor 102, logic 104 [arithmetic compute element matrix], and memory 106 [plurality of memory regions] are each implemented using one or more integrated circuit dies (or, more simply, “dies”). In other words, processor 102, logic 104, and memory 106 are implemented as semiconductor integrated circuits that are fabricated on one or more corresponding dies”; note that a first integrated circuit die is for the memory 106 [plurality of memory regions] and a second integrated circuit die is for the logic 104 [arithmetic compute element matrix]).  

	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Nurvitadhi to 

Regarding claim 4, the combination of Nurvitadhi teaches the integrated circuit memory device of claim 3.

Jayasena further teaches:
further comprising: 
a set of through-silicon vias (TSVs) coupled between the first integrated circuit die and the second integrated circuit die to connect the arithmetic compute element matrix to the plurality of memory regions (FIG. 7; ¶ 31, supra; ¶ 51, “the dies in stack 700 are communicatively coupled using through-silicon vias (TSVs)”).  

	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Nurvitadhi to incorporate the teachings of Jayasena to provide an integrated circuit device having circuitry to perform arithmetic computations in memory of a first integrated circuit die accessible by a separate integrated circuit die of Nurvitadhi that may be used for artificial intelligence (AI) matrix multiplication operations, with the memory die processing circuits and the logic die processing circuits used to offload a portion of the operations from the processor of Jayasena.  Doing so with the offloading of Jayasena would be beneficial because, in comparison to existing computing devices, the processor is freed to perform other computational operations and a communication link between the processor, the logic die, and/or the memory die may carry less traffic, which generally improves the performance and energy efficiency of the computing device.  (Jayasena, ¶ 22)


Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Nurvitadhi et al. (US 2019/0042251 A1), hereinafter “Nurvitadhi”, in view of Yu et al. (US 2020/0004514 A1), hereinafter “Yu”, and Jayasena et al. (US 2015/0106574 A1), hereinafter “Jayasena”, as applied to claim 3 above, and further in view of Ye et al. (US 2016/0148918 A1), hereinafter “Ye”.

Regarding claim 5, the combination of Nurvitadhi teaches the integrated circuit memory device of claim 3.

	The combination of Nurvitadhi does not teach further comprising: wires encapsulated within the integrated circuit package and coupled between the first integrated circuit die and the second integrated circuit die to connect the arithmetic compute element matrix to the plurality of memory regions.

However, Ye teaches:
further comprising: 
wires encapsulated within the integrated circuit package and coupled between the first integrated circuit die and the second integrated circuit die to connect the arithmetic compute element matrix to the plurality of memory regions (FIG. 2; ¶ 14, “The memory device 100 can further include a package casing 115 comprising an encapsulant 116 that at least partially encapsulates the memory packages 108 and the wire bonds 142.”; ¶ 15, “FIG. 2 is a cross-sectional view of a memory package 108 … The package substrate 202 can include a plurality of first bond pads 208 a and a plurality of second bond pads 208 b. The first bond pads 208 a can be coupled (e.g., wire bonded) to corresponding bond pads 209 a (one identified) of a first group of the semiconductor dies 200 (e.g., two sets of four dies)”).  

	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Nurvitadhi to 100 having a package casing 115 comprising an encapsulant 116 that at least partially encapsulates the memory packages 108 and the wire bonds 142 of Ye.  Doing so with the memory device 100 having the memory packages 108 of Ye would increase product yields because individual components can be tested before assembly.  (Ye, ¶ 25)


Claims 10-11 are rejected under 35 U.S.C. 103 as being unpatentable over Nurvitadhi et al. (US 2019/0042251 A1), hereinafter “Nurvitadhi”, in view of Yu et al. (US 2020/0004514 A1), hereinafter “Yu”, as applied to claim 9 above, and further in view of Elliott et al. (US 6,279,088 B1), hereinafter “Elliott”.

Regarding claim 10, the combination of Nurvitadhi teaches the integrated circuit memory device of claim 9.

The combination of Nurvitadhi does not teach wherein the third memory region is the same as the second memory region.

However, Elliott teaches:
wherein the third memory region is the same as the second memory region (FIG. 2; col. 6, ln. 7-10, “The processor elements 12 can then store the result of the process instruction back into the same memory elements as provided the sensed bits, all in one cycle”).  

	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Nurvitadhi to incorporate the teachings of Elliott to provide an integrated circuit device having circuitry to perform arithmetic computations in memory of a first integrated circuit die accessible by a separate integrated circuit die of Nurvitadhi that may be used for artificial intelligence (AI) matrix multiplication operations, with the processors locating on the same chip as the memory of Elliott.  Doing so by locating the processors on the same chip as the memory of Elliott would exploit the extremely wide data path and high data bandwidth available at the sense amplifiers.  (Elliott, col. 2, ln. 38-41)

Regarding claim 11, the combination of Nurvitadhi teaches the integrated circuit memory device of claim 9.

Nurvitadhi further teaches:
wherein the third memory region is different from the second memory region (FIG. 11A; ¶ 57, “After the data is received by the dot product engine 142 and/or dot product has been computed, the dot product engine 142 may send the computed data [list of results] to the sector-aligned memory 92 [second memory region] to store the data for future use or additional computations. Additionally or alternatively, the dot product engine 142 may send the computed data to an accumulator 148”; ¶ 58, “To illustrate the type of dot product operations that may be performed using the compute-in-memory architecture described above, FIG. 11A shows a sequence of computations to perform matrix operations and FIG. 11B shows a sequence of computations to perform convolution operations. In FIG. 11A, multiple vectors may be simultaneously sent from the application 122 to the base die 24. As shown, the base die 24 memory may be grouped into multiple sectors 136, such as a first sector 150 (sector 0) and a second sector 152 (sector 1). The application 122 may send [first request] a first vector input 154 (Vi0) to a first sector 0 aligned memory 158 and second sector 0 aligned memory 159 [second memory region]”; ¶ 59, “the dot product engines 142 corresponding to sector 1 aligned memories 160, 161 [third memory region] may compute a product of the second vector input 156 and the first matrix 162, and a product of the second vector input 156 and the second matrix 164, to determine M1,0 and M1,1”; note that the first sector 0 aligned memory 158 and second sector 0 aligned memory 159 [second memory region] are different from the sector 1 aligned memories 160, 161 [third memory region]).  


Allowable Subject Matter
Claim 20 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.


Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Tong B Vo whose telephone number is (571)272-7568.  The examiner can normally be reached on M-F 8:00 AM - 4:00 PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/T.B.V./Patent Examiner, Art Unit 2136


/CHARLES RONES/Supervisory Patent Examiner, Art Unit 2136