Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION
           This action is in response to the communication filed on 4/9/2020. 
Claims 2-6 and 8-20 are allowed. 
Claims 1 and 7 are cancelled. 

Allowable Subject Matter
Claims 2-6 and 8-20 are allowed.
				
Information Disclosure Statement
The Information Disclosure Statements (IDSs) submitted on 8/10/2021, 4/8/2022 and 5/23/2022 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the IDSs have been considered by the Examiner.

EXAMINER’S AMENDMENT
An examiner’s amendment to the record appears below. Should the changes and/or additions be unacceptable to applicant, an amendment may be filed as provided by 37 CFR 1.312. To ensure consideration of such an amendment, it MUST be submitted no later than the payment of the issue fee.  
Authorization for this examiner’s amendment was given in a telephone interview with the applicant’s representative, Mr. Lehua Wang (Reg No 48,023) on 5/27/2022. 

AMENDMENTS TO THE CLAIMS
The listing of claims will replace all prior versions, and listings, of claims in the application:

Listing of Claims:

1.	(Canceled) 
2.	(Currently Amended) The device of claim 8, further comprising:
an interposer affixed to the substrate, wherein the central processing unit is coupled to the at least one processing unit or the random access memory, or both, via the interposer.
3.	(Currently Amended) The device of claim 8, wherein the at least one processing unit is configured to store an output from the artificial neural network in the random access memory as input to the application; and the central processing unit includes a memory controller configured to map logical memory allocated during execution of the application to physical memory in the random access memory.
4.	(Currently Amended) The device of claim 3, wherein the memory controller is configured to load, through a first connection between the central processing unit and the random access memory, first instructions from random access memory for execution by the at least one processing unit.
5.	(Currently Amended) The device of claim 4, wherein the at least one processing unit is configured to load, through a second connection to the random access memory, matrix operands from the random access memory.
6.	(Currently Amended) The device of claim 4, wherein the central processing unit includes at least one arithmetic-logic unit; and the memory controller is configured to load, through the first connection between the central processing unit and the random access memory, second instructions from random access memory for execution by the at least one arithmetic-logic unit.
7.	(Canceled) 
8.	(Currently Amended) A device to accelerate computations in deep learning, the device  comprising:
an integrated circuit package comprising a substrate and enclosing the device;
a central processing unit configured on the substrate;
at least one processing unit configured on the substrate and configured to execute instructions having matrix operands;
random access memory coupled to the at least one processing unit and the central processing unit via the substrate and configured to store:
matrices of an artificial neural network;
instructions executable by the at least one processing unit to implement the artificial neural network; and
at least one application programmed for execution by the central processing unit; and
an interface coupled to the central processing unit and the random access memory via the substrate and couplable to a bus that is external to the device;
a first integrated circuit die having configured thereon a deep learning accelerator having the at least one processing unit, a control unit, local memory configured to store matrix operands, and a memory interface to the random access memory; 
at least one second integrated circuit die having configured thereon the random access memory; and
a third integrated circuit die having configured thereon the central processing unit;
wherein the memory interface is connected to the random access memory in the at least one second integrated circuit die by through-silicon vias;
wherein the central processing unit is connected to the random access memory in the at least one second integrated circuit die by through-silicon vias; and
wherein the at least one second integrated circuit die is stacked between the first integrated circuit die and the third integrated circuit die.
9.	(Currently Amended) The device of claim 8, wherein the central processing unit is configured on the first integrated circuit die; and the central processing unit and the deep learning accelerator share an interface to the random access memory.
10.	(Currently Amended) The device of claim 9, wherein the central processing unit and the deep learning accelerator further share a logic circuit to load instructions from the random access memory.
11.	(Currently Amended) The device of claim 8, further comprising:
a fourth integrated circuit die having configured thereon wires;
wherein the deep learning accelerator in the first integrated circuit die and the random access memory in the at least one second integrated circuit die are connected using the wires in the fourth integrated circuit die and through-silicon vias from the fourth integrated circuit die.
12.	(Currently Amended) The device of claim 11, further comprising:
a third integrated circuit die having configured thereon the central processing unit;
wherein the first integrated circuit die, the at least one second integrated circuit die, and the third integrated circuit die are stacked on the fourth integrated circuit die and connected to the fourth integrated circuit die using separate sets of through-silicon vias.
13.	(Original) The device of claim 12, wherein a circuit of the interface is configured on the fourth integrated circuit die to process signals on the bus; and the bus is in accordance with a protocol of Universal Serial Bus (USB), Serial Advanced Technology Attachment (SATA) bus, or Peripheral Component Interconnect express (PCIe).
14.	(Currently Amended) The device of claim 8, wherein the at least one processing unit includes a matrix-matrix unit configured to operate on two matrix operands of an instruction;
wherein the matrix-matrix unit includes a plurality of matrix-vector units configured to operate in parallel;
wherein each of the plurality of matrix-vector units includes a plurality of vector-vector units configured to operate in parallel; and
wherein each of the plurality of vector-vector units includes a plurality of multiply-accumulate units configured to operate in parallel.
15.	(Currently Amended) A method to accelerate computations in deep learning, the method comprising: 
storing, in random access memory configured on at least one second integrated circuit die in an integrated circuit device:
matrices of an artificial neural network;
first instructions executable by at least one processing unit of a deep learning accelerator on a first integrated circuit die enclosed within the integrated circuit device to implement the artificial neural network using the matrices, wherein the deep learning accelerator has the at least one processing unit, a control unit, local memory configured to store matrix operands, and a memory interface to the random access memory, and wherein the memory interface is connected to the random access memory in the at least one second integrated circuit die by through-silicon vias; and
second instructions of at least one application programmed for execution by a central processing unit enclosed within the integrated circuit device;
loading, through an interface of the integrated circuit device couplable to a bus that is external to the integrated circuit device, sensor data into the random access memory as input to the artificial neural network;
executing, by the at least one processing unit, the first instructions to generate output from the artificial neural network based on the input;
storing, in the random access memory, the output from the artificial neural network; and
executing, by the central processing unit on a third integrated circuit die and connected to the random access memory in the at least one second integrated circuit die by through-silicon vias, the second instructions of the at least one application that uses output from the artificial neural network;
wherein the at least one second integrated circuit die is stacked between the first integrated circuit die and the third integrated circuit die.
16.	(Currently Amended) The method of claim 15, wherein the at least one processing unit includes at least a matrix-matrix unit configured to execute an instruction having two matrix operands; the matrix-matrix unit includes a plurality of matrix-vector units configured to operate in parallel; each of the matrix-vector units includes a plurality of vector-vector units configured to operate in parallel; and each of the vector-vector units includes a plurality of multiply-accumulate units configured to operate in parallel.
17.	(Currently Amended) The method of claim 16, wherein the deep learning accelerator executes the first instructions in parallel with the central processing unit executing the second instructions.
18.	(Currently Amended) The method of claim 16, wherein execution of the first instructions by the deep learning accelerator includes a call to a routine executed in the central processing unit.
19.	(Currently Amended) An apparatus to accelerate computations in deep learning, the apparatus comprising:
random access memory configured on at least one second integrated circuit die;
a central processing unit on a third integrated circuit die and connected to the random access memory in the at least one second integrated circuit die by through-silicon vias, the central processing unit having at least one arithmetic-logic unit;
a deep learning accelerator configured on a first integrated circuit die and having: 
at least one processing unit configured to operate on two matrix operands of an instruction executable in the deep learning accelerator;
a control unit; 
local memory configured to store matrix operands; and 
a memory interface to the random access memory;
wherein the memory interface is connected to the random access memory in the at least one second integrated circuit die by through-silicon vias; and 
wherein the at least one second integrated circuit die is stacked between the first integrated circuit die and the third integrated circuit die;
an interface configured to be connected to a peripheral bus;
wherein the apparatus is configured to receive sensor data from the peripheral bus using the interface, store the sensor data as input to first instructions executed in the deep learning accelerator, and store output generated from execution of the first instructions in the random access memory as input to an application executed in the central processing unit.
20.	(Currently Amended) The apparatus of claim 19, wherein the random access memory includes non-volatile memory configured to store model data of an artificial neural network; the model data includes the first instructions executable by the deep learning accelerator; and the central processing unit and the deep learning accelerator are configured to operate in parallel.
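
Illustrative note (not part of the claims): the nested structure recited in claims 14 and 16, where a matrix-matrix unit is built from matrix-vector units, which are built from vector-vector units, which are built from multiply-accumulate units, can be sketched in software as follows. This is a minimal sketch under stated assumptions: all function names are hypothetical, matrices are plain lists of rows, and the loops run sequentially, whereas the claimed units are configured to operate in parallel in hardware.

```python
# Hypothetical software model of the unit hierarchy in claims 14 and 16.
# Each level only delegates to the level below it.

def multiply_accumulate(acc, a, b):
    """One multiply-accumulate step: acc + a * b."""
    return acc + a * b


def vector_vector(u, v):
    """Dot product of two vectors, built from multiply-accumulate steps."""
    acc = 0
    for a, b in zip(u, v):
        acc = multiply_accumulate(acc, a, b)
    return acc


def matrix_vector(m, v):
    """Matrix times vector: one vector-vector operation per matrix row."""
    return [vector_vector(row, v) for row in m]


def matrix_matrix(a, b):
    """Matrix times matrix: one matrix-vector operation per column of b."""
    columns_of_b = [list(col) for col in zip(*b)]
    result_columns = [matrix_vector(a, col) for col in columns_of_b]
    # Transpose the list of result columns back into a list of rows.
    return [list(row) for row in zip(*result_columns)]
```

For example, `matrix_matrix([[1, 2], [3, 4]], [[5, 6], [7, 8]])` evaluates to `[[19, 22], [43, 50]]`.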

AMENDMENTS TO THE SPECIFICATION
Please use the following amended paragraph(s) to replace the paragraph(s) of the same number(s) in the specification.
[0049]	The Deep Learning Accelerator (103) in FIG. 1 includes processing units (111), a control unit (113), and local memory (115).  When vector and matrix operands are in the local memory (115), the control unit (113) can use the processing units (111) to perform vector and matrix operations in accordance with instructions.  Further, the control unit (113) can load instructions and operands from the random access memory (105) through a memory interface (117) and a high speed/bandwidth connection (119).
[0055]	In some implementations, the number of data elements of a vector or matrix that can be accessed in parallel over the connection (119) corresponds to the granularity of the Deep Learning Accelerator (DLA) operating on vectors or matrices.  For example, when the processing units (111) can operate on a number of vector/matrix elements in parallel, the connection (119) is configured to load or store the same number, or multiples of the number, of elements via the connection (119) in parallel.
[0058]	In one embodiment, when the input data is loaded or updated in the random access memory (105), the control unit (113) of the Deep Learning Accelerator (DLA) (103) can automatically execute the instructions for the Artificial Neural Network (ANN) to generate an output of the Artificial Neural Network (ANN).  The output is stored into a predefined region in the random access memory (105).  The Deep Learning Accelerator (DLA) (103) can execute the instructions without help from a Central Processing Unit (CPU).  Thus, communications for the coordination between the Deep Learning Accelerator (DLA) (103) and a processor outside of the integrated circuit device (101) (e.g., a Central Processing Unit (CPU)) can be reduced or eliminated.
[0062]	The random access memory (105) can be volatile memory or non-volatile memory, or a combination of volatile memory and non-volatile memory.  Examples of non-volatile memory include flash memory, memory cells formed based on negative-and (NAND) logic gates, negative-or (NOR) logic gates, Phase-Change Memory (PCM), magnetic memory (MRAM), resistive random-access memory, cross point storage and memory devices.  A cross point memory device can use transistor-less memory elements, each of which has a memory cell and a selector that are stacked together as a column.  Memory element columns are connected via two layers of wires running in perpendicular directions, where wires of one layer run in one direction in the layer that is located above the memory element columns, and wires of the other layer run in another direction and are located below the memory element columns.  Each memory element can be individually selected at a cross point of one wire on each of the two layers.  Cross point memory devices are fast and non-volatile and can be used as a unified memory pool for processing and storage.  Further examples of non-volatile memory include Read-Only Memory (ROM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM) and Electronically Erasable Programmable Read-Only Memory (EEPROM) memory, etc.  Examples of volatile memory include Dynamic Random-Access Memory (DRAM) and Static Random-Access Memory (SRAM).
[0085]	In FIG. 5, after the results of the DLA compiler (203) are stored in the random access memory (105), the application of the trained ANN (201) to process an input (211) to the trained ANN (201) to generate the corresponding output (213) of the trained ANN (201) can be triggered by the presence of the input (211) in the random access memory (105), or another indication provided in the random access memory (105).
[0095]	In a method according to one embodiment, random access memory (105) of a computing device (e.g., 101) can be accessed using an interface (107) of the computing device (e.g., 101) to a memory controller.  The computing device (e.g., 101) can have processing units (e.g., 111) configured to perform at least computations on matrix operands, such as a matrix operand stored in maps banks (151 to 153) and a matrix operand stored in kernel buffers (131 to 133).
[0115]	In one embodiment, the connections (108 and 109) have separate sets of buses or wires.  Thus, the external interfaces (106 and 107) do not share buses or wires in accessing the different portions of the random access memory (105) for the input (211) and the output (213).  Alternatively, an access controller is configured to use separate buffers for the interfaces (106 and 107) and use the high bandwidth connection (119) to transfer data between the random access memory (105) and the buffers for the interfaces (106 and 107) such that the interfaces (106 and 107) can service write and read requests concurrently.  Since the bandwidth of the connection (119) is substantially higher than the bandwidth used by the connections (108 and 109) to the external interfaces (106 and 107) of the integrated circuit device (101), a small portion of the bandwidth can be allocated to the connections (108 and 109).  For example, the interfaces (106 and 107) can be connected to the memory interface (117) of the Deep Learning Accelerator (DLA) (103) to access the random access memory (105) via the connection (119).
[0124]	For example, a compiler (203) can be used to convert a description of the Artificial Neural Network (201) into the instructions (205) and the matrices (207) to implement the Artificial Neural Network (201) using the Deep Learning Accelerator (103).
[0126]	For example, the random access memory (105) can have multiple portions that are capable of being used concurrently and independent from each other.  A first portion is configured to store the first output from the Artificial Neural Network (201); a second portion configured to store third input to the Artificial Neural Network (201); a third portion configured to store the second output from the Artificial Neural Network (201); and a fourth portion configured to store the second input to the Artificial Neural Network (201).  When the third and fourth portions are being used by the Deep Learning Accelerator (103) in execution of the instructions (205), the first interface and the second interface can be connected concurrently to the first portion and second portion respectively.
[0168]	Additional external devices can be connected to the bus accessible to the input/output interface (236).  Such devices can include a communication device configured to communicate over a wired or wireless computer connection, such as a wired or wireless local area network, a wireless personal area network, a wireless wide area network, a cellular communications network, and/or the Internet.  Such devices can also include a display device, a monitor, a touch screen, a speaker, a keyboard, a mouse, a touch pad, and/or a track ball, etc. to present a user interface of the application (215).  Through the input/output interface (236), the application (215) executed in the Central Processing Unit (225) can access the devices connected on the bus.
[0175]	For example, the Central Processing Unit (225) can have a logic circuit configured to load instructions (e.g., 215 and/or 205) from the random access memory (105) for execution.  Matrix/vector instructions are dispatched to processing units (111); and other instructions are dispatched to the Arithmetic-Logic Units (ALUs) of the Central Processing Unit (225) for execution.  The processing units (111) can have additional circuits to load matrix/vector operands from the random access memory (105) and/or store results to the random access memory (105).  Thus, the Deep Learning Accelerator (103) and the Central Processing Unit (225) can cooperate with each other in executing the instructions (205) of the Artificial Neural Network (201).
[0180]	At block 305, the at least one processing unit (111) executes the first instructions (205) to generate output (213) from the Artificial Neural Network (201) based on the input (211).
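
Illustrative note (not part of the amended specification): the dispatch described in paragraph [0175], where matrix/vector instructions go to the processing units (111) and other instructions go to the Arithmetic-Logic Units, can be sketched as follows. This is a minimal sketch under stated assumptions: the opcode names, the tuple encoding of an instruction, and the function name are all hypothetical and do not appear in the application.

```python
# Hypothetical opcode set standing in for "matrix/vector instructions".
MATRIX_VECTOR_OPCODES = {"matmul", "matvec", "vecdot"}


def dispatch(instructions):
    """Split a loaded instruction stream into two queues.

    Each instruction is assumed to be a tuple of (opcode, *operands).
    Matrix/vector instructions are routed to the processing units;
    every other instruction is routed to the ALUs.
    """
    to_processing_units = []
    to_alus = []
    for instruction in instructions:
        opcode = instruction[0]
        if opcode in MATRIX_VECTOR_OPCODES:
            to_processing_units.append(instruction)
        else:
            to_alus.append(instruction)
    return to_processing_units, to_alus
```

Program order is preserved within each queue; ordering between the two queues is not modeled here.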

Prior Art of Record
            The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Lekivetz et al., US Patent 10,754,764, teaches generating a model to predict responses based on different group identifiers in order to distribute test cases for a system, with generated data sets used for validation.
Liu et al., US Patent 10,885,314, teaches face identification (edge identification) using an artificial intelligence chip, a first light signal and a second light signal reflected by the object, and training data.
Lai et al., US Patent 10,978,382, teaches an integrated circuit die with an encapsulant, a redistribution structure on the encapsulant, and a module socket and other components attached to an interposer.
Garegrat et al., US Publication 2019/0391811, teaches a matrix processor that reads a matrix operand from memory in the correct order using a strided read sequence and performs a first operation on the matrix operand.
Hou et al., US Publication 2021/0150317, teaches a neuron circuit and an artificial neural network chip with a memristor and an integrator, with multiple layers, for analyzing sensor data.
Venkatesh et al., US Publication 2021/0019591, teaches receiving input data to generate a plurality of outputs for a layer of a neural network, using a processing unit of multiple dimensions with multiple arrays and subarrays.
           		
REASONS FOR ALLOWANCE
          The following is an examiner’s statement of reasons for allowance:
The Examiner finds the amended claims dated 5/27/2022 persuasive for purposes of allowance.
The prior art found in the search on the amended claims does not explicitly disclose, in light of the other features recited in independent claims 8, 15 and 19, the following:
ROA: none of the prior art references, alone or in combination, teaches, in the domain of three integrated circuits, a deep learning accelerator having the at least one processing unit, a control unit, local memory configured to store matrix operands, and a memory interface to the random access memory, with deep learning and an artificial neural network for the purpose of identification of edges between multiple layers, as further described in the amended claims of 5/27/2022.
The claims recite: ‘... random access memory coupled to the at least one processing unit and the central processing unit via the substrate and configured to store:
matrices of an artificial neural network;
instructions executable by the at least one processing unit to implement the artificial neural network; and
at least one application programmed for execution by the central processing unit; and
an interface coupled to the central processing unit and the random access memory via the substrate and couplable to a bus that is external to the device;
a first integrated circuit die having configured thereon a deep learning accelerator having the at least one processing unit, a control unit, local memory configured to store matrix operands, and a memory interface to the random access memory; 
at least one second integrated circuit die having configured thereon the random access memory; and
a third integrated circuit die having configured thereon the central processing unit;
wherein the memory interface is connected to the random access memory in the at least one second integrated circuit die by through-silicon vias;
wherein the central processing unit is connected to the random access memory in the at least one second integrated circuit die by through-silicon vias; and
wherein the at least one second integrated circuit die is stacked between the first integrated circuit die and the third integrated circuit die.’, together with the additional detailed limitations recited in the independent claims as amended on 5/27/2022.
Each of the cited references, and each reference from the updated search, fails to teach or suggest at least the above limitations in combination with the rest of the limitations recited in the independent claims.
The dependent claims depend on allowed independent claims and are therefore also allowed.
Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee.  Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance.”

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to VIRAL S LAKHIA, whose telephone number is (571)270-3363.  The examiner can normally be reached from 8 am to 6 pm.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Lynn Feild, can be reached at 571-272-2092.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/VIRAL S LAKHIA/Examiner, Art Unit 2431