DETAILED ACTION
Claims 1-26 are pending in the present application and Claims 16-26 have been withdrawn.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The information disclosure statements (IDS) submitted on 09/01/2021 and 01/13/2022 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.

	Election/Restrictions
Applicant’s election without traverse of claims 1-15 in the reply filed on 06/22/2022 is acknowledged.
Claims 16-26 are withdrawn from further consideration pursuant to 37 CFR 1.142(b) as being drawn to a nonelected species, there being no allowable generic or linking claim.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2, 6-12, and 15 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. PGPub 2018/0293701 to Appu et al. in view of U.S. PGPub 2018/0284186 to Chadha et al.

	Regarding claim 1, Appu et al. teach a graphics multiprocessor (Fig 8, par 0092-0094, “graphics processor 800 includes a graphics pipeline 820, a media pipeline 830, a display engine 840, thread execution logic 850, and a render output pipeline 870”, Fig 28C, par 0242-0245, “The illustrated graphics multiprocessor 2834 is an exemplary instance of a SIMT parallel processor. However, various types of SIMT parallel processors of differing architectures may be included within the processing cluster 2814.”), comprising: 
a plurality of compute engines to perform first computations to generate a first set of data (par 0092-0094, “execution units 852A-852B are an array of vector processors having an instruction set for performing graphics and media operations”, par 0242-0245, “Each graphics multiprocessor 2834 within the processing cluster 2814 can include an identical set of functional execution logic (e.g., arithmetic logic units, load-store units, etc.). The functional execution logic can be configured in a pipelined manner in which new instructions can be issued before previous instructions are complete. The functional execution logic supports a variety of operations including integer and floating point arithmetic, comparison operations, Boolean operations, bit-shifting, and computation of various algebraic functions”); 
cache for storing data (par 0094, “execution units 852A-852B have an attached L1 cache 851 that is specific for each array or shared between the arrays. The cache can be configured as a data cache, an instruction cache, or a single cache that is partitioned to contain data and instructions in different partitions”, par 0246-0248, “the graphics multiprocessor 2834 can forego an internal cache and use a cache memory (e.g., L1 cache 308) within the processing cluster 2814. Each graphics multiprocessor 2834 also has access to L2 caches within the partition units (e.g., partition units 2820A-2820N of FIG. 28) that are shared among all processing clusters 2814 and may be used to transfer data between threads”); and 
a memory that is integrated on chip with the plurality of compute engines and the cache (par 0126, “FIG. 12 is a block diagram illustrating an exemplary system on a chip integrated circuit 1200 that may be fabricated using one or more IP cores, according to an embodiment … Memory interface may be provided via a memory controller 1265 for access to SDRAM or SRAM memory devices” … i.e., a system-on-a-chip (SoC) integrated circuit includes a CPU, a GPU, memory, and so on, par 0232, “The parallel processing unit 2802 can transfer data from system memory via the I/O unit 2804 for processing. During processing the transferred data can be stored to on-chip memory (e.g., parallel processor memory 2822) during processing, then written back to system memory”, par 0246-0248, “The graphics multiprocessor 2834 may also access off-chip global memory, which can include one or more of local parallel processor memory and/or system memory”), the memory to receive the first set of data, to temporarily store the first set of data, and to provide the first set of data to the cache during a first time period that is prior to a second time period when the plurality of compute engines will use the first set of data for second computations (par 0246-0248, “Embodiments in which the processing cluster 2814 includes multiple instances of the graphics multiprocessor 2834 can share common instructions and data, which may be stored in the L1 cache 2908 … a processing cluster 2814 may be configured such that each graphics multiprocessor 2834 is coupled to a texture unit 2836 for performing texture mapping operations, e.g., determining texture sample positions, reading texture data, and filtering the texture data. Texture data is read from an internal texture L1 cache (not shown) or in some embodiments from the L1 cache within graphics multiprocessor 2834 and is fetched from an L2 cache, local parallel processor memory, or system memory, as needed. Each graphics multiprocessor 2834 outputs processed tasks to the data crossbar 2840 to provide the processed task to another processing cluster 2814 for further processing or to store the processed task in an L2 cache, local parallel processor memory, or system memory via the memory crossbar 2816” … it would have been obvious that the data would be moved from the L2 cache, local parallel processor memory, or system memory to the L1 cache before the multiprocessor receives the data from the L1 cache for processing, in order to improve running performance).
	However, Appu et al. are silent as to the memory being a high density memory.

	In a related endeavor, Chadha et al. teach that the memory is a high density memory (par 0003) and a high density memory that is integrated on chip with the plurality of compute engines and the cache (par 0003, “In 2D IC packages, multiple chips are mounted on a printed circuit board, where high-performance logic, lower-performance logic, memory, and analog/RF functions, and other functional elements are presented as discrete devices in separate chip packages. By contrast, in 2.5D ICs and 3D IC packages, multiple IC chips are mounted on a silicon interposer instead of a conventional package substrate. The silicon interposer, which is typically a silicon wafer, allows very small and high-density conductive traces to be formed between the multiple IC chips because the fabrication processes used to form the conductive traces are the same processes used to form the metal interconnects in the metalization layers of a silicon chip”, par 0049-0055, “IC chip 410 is a logic chip, such as a CPU or GPU, and IC chips 421-423 are memory chips associated with IC chip 410. In such embodiments, 3D IC chip stack 420 may include identical dynamic random-access memory (DRAM) or other random access memory chips that are each electrically coupled to IC chip 410 via a plurality of conductive traces 435 (described below) formed in interposer 430 … the wide-interface architecture of a high-bandwidth DRAM system may have one thousand or more conductive traces 435 for each IC chip in 3D IC chip stack 420. Thus, when 3D IC chip stack 420 includes four such DRAM chips, four thousand or more conductive traces 435 are formed between IC chip 410 and 3D IC chip stack 420, and are necessarily closely spaced, e.g., having a line pitch on the order of 10-100 microns”).
	It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify Appu et al. so that the memory is a high density memory that is integrated on chip with the plurality of compute engines and the cache, as taught by Chadha et al., in order to build a 3D IC stack package that implements a high-bandwidth, high-density memory bus residing between a processor and a high-bandwidth memory chip.

Regarding claim 2, Appu et al. as modified by Chadha et al. teach all the limitations of claim 1, and further teach wherein the plurality of compute engines use the first set of data for second computations to generate a second set of data (Appu et al.: par 0248, “Each graphics multiprocessor 2834 outputs processed tasks to the data crossbar 2840 to provide the processed task to another processing cluster 2814 for further processing or to store the processed task in an L2 cache, local parallel processor memory, or system memory via the memory crossbar 2816.”, Chadha et al.: par 0046, “each SM 310 transmits a processed task to work distribution crossbar 330 in order to provide the processed task to another GPC 208 for further processing or to store the processed task in an L2 cache (not shown), parallel processing memory 204, or system memory 104 via crossbar unit 210”).

Regarding claim 6, Appu et al. as modified by Chadha et al. teach all the limitations of claim 1, but are silent as to the first time period being approximately 5 to 10 milliseconds prior to the second time period. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to configure the system of Appu et al. as modified by Chadha et al., which fetches data into the cache before the data is processed by the processor, so that the first time period is approximately 5 to 10 milliseconds prior to the second time period, because Applicant has not disclosed that this timing provides an advantage, solves any stated problem, or serves any particular purpose. Accordingly, the recited timing is considered a design consideration that fails to patentably distinguish over the prior art of Appu et al. as modified by Chadha et al.

Regarding claim 7, Appu et al. as modified by Chadha et al. teach all the limitations of claim 1, and further teach wherein the high density memory comprises an embedded dynamic random access memory (DRAM) (Appu et al.: par 0047, par 0055, “interconnect which facilitates communication between various processor components and a high-performance embedded memory module 218, such as an eDRAM module. In some embodiments, each of the processor cores 202A-202N and graphics processor 208 use embedded memory modules 218 as a shared Last Level Cache”, par 0236, par 0264, “each multi-core processor 3005-3006 is communicatively coupled to a processor memory 3001-3002, via memory interconnects 3030-3031, respectively, and each GPU 3010-3013 is communicatively coupled to GPU memory 3020-3023 over GPU memory interconnects 3050-3053, respectively. The memory interconnects 3030-3031 and 3050-3053 may utilize the same or different memory access technologies. By way of example, and not limitation, the processor memories 3001-3002 and GPU memories 3020-3023 may be volatile memories such as dynamic random access memories (DRAMs) (including stacked DRAMs), Graphics DDR SDRAM (GDDR) (e.g., GDDR5, GDDR6), or High Bandwidth Memory (HBM) and/or may be non-volatile memories such as 3D XPoint or Nano-Ram. In one embodiment, some portion of the memories may be volatile memory and another portion may be non-volatile memory (e.g., using a two-level memory (2LM) hierarchy)”, Chadha et al.: par 0034-0035, “Each partition unit 215 is coupled to one or more dynamic random access memories (DRAMs) 220 residing within PPM memory 204. In one embodiment, the number of partition units 215 equals the number of DRAMs 220, and each partition unit 215 is coupled to a different DRAM 220”, par 0052-0055, “IC chip 410 is a logic chip, such as a CPU or GPU, and IC chips 421-423 are memory chips associated with IC chip 410. In such embodiments, 3D IC chip stack 420 may include identical dynamic random-access memory (DRAM) or other random access memory chips that are each electrically coupled to IC chip 410 via a plurality of conductive traces 435 (described below) formed in interposer 430”).

Regarding claim 8, Appu et al. as modified by Chadha et al. teach all the limitations of claim 1, and further teach wherein the cache comprises an embedded dynamic random access memory (DRAM) (Appu et al.: par 0047, par 0055, “interconnect which facilitates communication between various processor components and a high-performance embedded memory module 218, such as an eDRAM module. In some embodiments, each of the processor cores 202A-202N and graphics processor 208 use embedded memory modules 218 as a shared Last Level Cache”, par 0236, par 0264, “each multi-core processor 3005-3006 is communicatively coupled to a processor memory 3001-3002, via memory interconnects 3030-3031, respectively, and each GPU 3010-3013 is communicatively coupled to GPU memory 3020-3023 over GPU memory interconnects 3050-3053, respectively. The memory interconnects 3030-3031 and 3050-3053 may utilize the same or different memory access technologies. By way of example, and not limitation, the processor memories 3001-3002 and GPU memories 3020-3023 may be volatile memories such as dynamic random access memories (DRAMs) (including stacked DRAMs), Graphics DDR SDRAM (GDDR) (e.g., GDDR5, GDDR6), or High Bandwidth Memory (HBM) and/or may be non-volatile memories such as 3D XPoint or Nano-Ram. In one embodiment, some portion of the memories may be volatile memory and another portion may be non-volatile memory (e.g., using a two-level memory (2LM) hierarchy)”, Chadha et al.: par 0034-0035, par 0043-0044, “Each partition unit 215 is coupled to one or more dynamic random access memories (DRAMs) 220 residing within PPM memory 204. In one embodiment, the number of partition units 215 equals the number of DRAMs 220, and each partition unit 215 is coupled to a different DRAM 220”, par 0052-0055, “IC chip 410 is a logic chip, such as a CPU or GPU, and IC chips 421-423 are memory chips associated with IC chip 410. In such embodiments, 3D IC chip stack 420 may include identical dynamic random-access memory (DRAM) or other random access memory chips that are each electrically coupled to IC chip 410 via a plurality of conductive traces 435 (described below) formed in interposer 430”).

Regarding claim 9, claim 9 is similar in scope to claim 1 and is rejected under the same rationale (the only difference is that the high density memory is replaced with a stack-based memory, and Chadha et al. disclose a stack-based memory (par 0049-0055)).

Regarding claims 10 and 15, Appu et al. as modified by Chadha et al. teach all the limitations of claim 9. Claims 10 and 15 are similar in scope to claims 2 and 8, respectively, and are rejected under the same rationale.

Regarding claim 11, Appu et al. as modified by Chadha et al. teach all the limitations of claim 1, and Chadha et al. further teach wherein the stack-based memory comprises a last-in-first-out (LIFO) stack with data being added or removed in a last-in-first-out manner (Fig 4A, par 0049-0052, “IC chip 410 is a logic chip, such as a CPU or GPU, and IC chips 421-423 are memory chips associated with IC chip 410. In such embodiments, 3D IC chip stack 420 may include identical dynamic random-access memory (DRAM) or other random access memory chips that are each electrically coupled to IC chip 410 via a plurality of conductive traces 435 (described below) formed in interposer 430”). This would have been obvious for the same reason given in the rejection of claim 1.
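
For illustration only, the last-in-first-out behavior recited in claim 11 can be sketched as follows. This is a minimal sketch of generic LIFO semantics; the class name `LifoStack` and its methods are hypothetical and are not taken from the claims or from the cited references.

```python
class LifoStack:
    """Generic last-in-first-out container (hypothetical illustration)."""

    def __init__(self):
        self._items = []

    def push(self, item):
        # The newest item is always placed on top of the stack.
        self._items.append(item)

    def pop(self):
        # The most recently pushed item is the first one removed.
        if not self._items:
            raise IndexError("pop from empty stack")
        return self._items.pop()

    def __len__(self):
        return len(self._items)


if __name__ == "__main__":
    stack = LifoStack()
    for layer in ("layer-1 data", "layer-2 data", "layer-3 data"):
        stack.push(layer)
    # Items come back in the reverse of the order in which they were pushed.
    while len(stack):
        print(stack.pop())
```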

Regarding claim 12, Appu et al. as modified by Chadha et al. teach all the limitations of claim 11, and further teach wherein instructions or data are prefetched from the stack-based memory into the cache before the instructions or data are needed in order to save bandwidth for reading and writing to off chip memory (Appu et al.: par 0246-0248, “Embodiments in which the processing cluster 2814 includes multiple instances of the graphics multiprocessor 2834 can share common instructions and data, which may be stored in the L1 cache 2908 … a processing cluster 2814 may be configured such that each graphics multiprocessor 2834 is coupled to a texture unit 2836 for performing texture mapping operations, e.g., determining texture sample positions, reading texture data, and filtering the texture data. Texture data is read from an internal texture L1 cache (not shown) or in some embodiments from the L1 cache within graphics multiprocessor 2834 and is fetched from an L2 cache, local parallel processor memory, or system memory, as needed. Each graphics multiprocessor 2834 outputs processed tasks to the data crossbar 2840 to provide the processed task to another processing cluster 2814 for further processing or to store the processed task in an L2 cache, local parallel processor memory, or system memory via the memory crossbar 2816” … it would have been obvious that the data would be moved from the L2 cache, local parallel processor memory, or system memory to the L1 cache before the multiprocessor receives the data from the L1 cache for processing, in order to improve running performance, Chadha et al.: par 0043-0046, “each SM 310 contains a level one (L1) cache or uses space in a corresponding L1 cache outside of the SM 310 to support, among other things, load and store operations performed by the execution units. Each SM 310 also has access to level two (L2) caches (not shown) that are shared among all GPCs 208 in PPU 202. The L2 caches may be used to transfer data between threads. Finally, SMs 310 also have access to off-chip “global” memory, which may include PP memory 204 and/or system memory 104. It is to be understood that any memory external to PPU 202 may be used as global memory. Additionally, as shown in FIG. 3, a level one-point-five (L1.5) cache 335 may be included within GPC 208 and configured to receive and hold data requested from memory via memory interface 214 by SM 310. Such data may include, without limitation, instructions, uniform data, and constant data. In embodiments having multiple SMs 310 within GPC 208, the SMs 310 may beneficially share common instructions and data cached in L1.5 cache 335”).

Claims 3-5 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. PGPub 2018/0293701 to Appu et al. in view of U.S. PGPub 2018/0284186 to Chadha et al., and further in view of U.S. PGPub 2008/0225603 to Hein.

Regarding claim 3, Appu et al. as modified by Chadha et al. teach all the limitations of claim 1, but do not explicitly teach wherein the high density memory comprises a first serial port to receive the first set of data and a second serial port to provide the first set of data.
In a related endeavor, Hein teaches wherein the high density memory comprises a first serial port to receive the first set of data (Fig 6A, par 0091, “Output 130 of output buffer 110 works with the core speed of 500 MHz at a bus width of 72 bits and is coupled to an eight-fold parallel/serial converter 580 (x8 Par2Ser) which, in turn, performs a conversion of the incoming 72 data signals of the core domain into a 9 bits wide data stream in the WCK domain, a transmission speed of 4 Gps being achieved again per pin. Converter 580 is then coupled to data interface 140 via transmit driver circuit 410”) and a second serial port to provide the first set of data (Fig 6A, par 0086-0087, “Conversion circuit 510 then reduces the transmission frequency of 2 GHz to the core speed of, for example, 500 MHz, and at the same time represents a transition in an SDR architecture, so that a total of eight data signals are generated on eight data lines from each incoming data line. This data which is present at the output of conversion circuit 510 is also referred to as write data”).
	It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify Appu et al. as modified by Chadha et al. so that the high density memory comprises a first serial port to receive the first set of data, as taught by Hein, in order to transmit data from a memory core coupled to the input of the output buffer through a serial port, causing data stored within the output buffer to be output to the data interface upon reception of a first signal and data stored within the memory core to be output to the input of the output buffer upon reception of a second signal, so as to synchronize the data interface to a clock on the basis of the transmit data pattern and the receive data pattern between memories and processors.

Regarding claim 4, Appu et al. as modified by Chadha et al. teach all the limitations of claim 1, but do not explicitly teach wherein the first set of data circulates through the high density memory in a serial manner.
In a related endeavor, Hein teaches wherein the first set of data circulates through the high density memory in a serial manner (Fig 3, par 0040-0041, “Examples of such application-specific memory systems are, for example, cache memory systems, which, having a particularly high system clock and/or a particularly fast data storage/reading speed, allow latching of data frequently accessed by a processor, for example a CPU (central processing unit) or GPU (graphics processing unit)”, Fig 6A, par 0091, “Output 130 of output buffer 110 works with the core speed of 500 MHz at a bus width of 72 bits and is coupled to an eight-fold parallel/serial converter 580 (x8 Par2Ser) which, in turn, performs a conversion of the incoming 72 data signals of the core domain into a 9 bits wide data stream in the WCK domain, a transmission speed of 4 Gps being achieved again per pin. Converter 580 is then coupled to data interface 140 via transmit driver circuit 410”, Fig 6A, par 0086-0087, “Conversion circuit 510 then reduces the transmission frequency of 2 GHz to the core speed of, for example, 500 MHz, and at the same time represents a transition in an SDR architecture, so that a total of eight data signals are generated on eight data lines from each incoming data line. This data which is present at the output of conversion circuit 510 is also referred to as write data”).
	It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify Appu et al. as modified by Chadha et al. so that the first set of data circulates through the high density memory in a serial manner, as taught by Hein, in order to transmit data from a memory core coupled to the input of the output buffer through a parallel to serial converter, causing data stored within the output buffer to be output to the data interface upon reception of a first signal and data stored within the memory core to be output to the input of the output buffer upon reception of a second signal, so as to synchronize the data interface to a clock on the basis of the transmit data pattern and the receive data pattern between memories and processors.

Regarding claim 5, Appu et al. as modified by Chadha et al. and Hein teach all the limitations of claim 1, and Hein further teaches a parallel to serial converter to receive the first set of data in a parallel format from the plurality of compute engines and to provide the first set of data in a serial format to the high density memory (Fig 3, par 0040-0041, “Examples of such application-specific memory systems are, for example, cache memory systems, which, having a particularly high system clock and/or a particularly fast data storage/reading speed, allow latching of data frequently accessed by a processor, for example a CPU (central processing unit) or GPU (graphics processing unit)”, Fig 6A, par 0091, “Output 130 of output buffer 110 works with the core speed of 500 MHz at a bus width of 72 bits and is coupled to an eight-fold parallel/serial converter 580 (x8 Par2Ser) which, in turn, performs a conversion of the incoming 72 data signals of the core domain into a 9 bits wide data stream in the WCK domain, a transmission speed of 4 Gps being achieved again per pin. Converter 580 is then coupled to data interface 140 via transmit driver circuit 410”) and a serial to parallel converter to receive the first set of data in a serial format from the high density memory and to provide the first set of data in a parallel format to the cache (Fig 6A, par 0086-0087, “Conversion circuit 510 then reduces the transmission frequency of 2 GHz to the core speed of, for example, 500 MHz, and at the same time represents a transition in an SDR architecture, so that a total of eight data signals are generated on eight data lines from each incoming data line. This data which is present at the output of conversion circuit 510 is also referred to as write data”). This would have been obvious for the same reason given in the rejection of claim 4.
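
For illustration only, the eight-fold parallel/serial conversion quoted from Hein (a 72-bit word carried as a 9-bit-wide stream) and the reverse conversion can be sketched as follows. The function names `serialize` and `deserialize` and the least-significant-beat-first ordering are assumptions made for this sketch and are not taken from Hein. The quoted rates are mutually consistent: 72 bits at 500 MHz and 9 pins at 4 Gbit/s each both amount to 36 Gbit/s.

```python
def serialize(word: int, width: int = 72, lanes: int = 9) -> list[int]:
    """Split a width-bit word into width // lanes beats of `lanes` bits each.

    Beat order (least-significant beat first) is an assumption of this sketch.
    """
    if width % lanes != 0:
        raise ValueError("width must be a multiple of lanes")
    if not 0 <= word < (1 << width):
        raise ValueError("word does not fit in width bits")
    mask = (1 << lanes) - 1
    return [(word >> (i * lanes)) & mask for i in range(width // lanes)]


def deserialize(beats: list[int], lanes: int = 9) -> int:
    """Reassemble the parallel word from the beats produced by serialize()."""
    mask = (1 << lanes) - 1
    word = 0
    for i, beat in enumerate(beats):
        word |= (beat & mask) << (i * lanes)
    return word


if __name__ == "__main__":
    word = 0x123456789ABCDEF012  # an arbitrary 72-bit example value
    beats = serialize(word)
    print(len(beats))                  # number of 9-bit beats per 72-bit word
    print(deserialize(beats) == word)  # round trip recovers the original word
```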

Allowable Subject Matter
Claims 13-14 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
The following is a statement of reasons for the indication of allowable subject matter: The cited prior art fails to teach the combination of elements recited in claim 13, including "wherein the plurality of computation engines generate first and second set of data for machine learning layers with data for a first layer being pushed into the stack-based memory while data for a second layer is not being pushed into the stack-based memory".

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Jin Ge whose telephone number is (571)272-5556. The examiner can normally be reached 8:00 to 5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kee M Tung, can be reached at (571)272-7794. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/JIN GE/
Primary Examiner, Art Unit 2616