DETAILED ACTION
This Office Action is in response to Applicant's communication filed on May 20, 2022, which amends dependent claims 2-3, 5, and 12 and presents arguments. The communication is hereby acknowledged. Claims 1-20 are currently pending and have been examined.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment
Applicant’s arguments filed on May 20, 2022, have been fully considered.
	Applicant argues in this response that independent claims 1, 11, and 18 recite the limitation “a resource manager configured to selectively allocate a first memory portion or a second memory portion to the at least one processing unit based on memory access characteristics” and that the prior art of record does not teach this limitation:
“Claim 1 recites “a resource manager configured to selectively allocate a first memory portion or a second memory portion to the at least one processing unit based on memory access characteristics.” Claims 11 and 18 recite similar features. At pages 3 and 4 of the Office Action, the Office asserts that these features are taught at FIG. 11 and paragraphs [0109] and [0140] of Newburn. However, the cited portions of Newburn teach only that “affinitizing tasks” and managing where memory is allocated can impact performance, and that data collection can include a number of different properties, including a type of memory. The cited portions of Newburn do not teach allocating different memory portions to a particular processing unit based on memory access characteristics, as provided by claim 1. Further, Dwivedi does not remedy the deficiencies of Newburn. Accordingly, the cited references, individually and in combination, fail to disclose or suggest at least the above-cited features of claim 1, and the similar features of claims 11 and 18”.
Examiner replies that the limitation at issue, “a resource manager configured to selectively allocate a first memory portion or a second memory portion to the at least one processing unit based on memory access characteristics”, is broad in several respects: 1) the term “selectively allocate” may encompass allocating a memory in a static or dynamic fashion, selecting memory chips at system design, allocating memory during application initialization, etc.; 2) “based on memory access characteristics” may encompass local memory, system memory, memory types, memory properties, etc.; 3) the limitation does not specify how the first memory portion or the second memory portion is allocated, or on which memory access characteristics the allocation is based. Thus, allocating some local memory to the processing unit for fast data access and some other memory (system memory) to the processing unit for other data or processing tasks reasonably maps to the cited limitation, as set forth in the Office action. Therefore, Applicant’s arguments are not persuasive.
Examiner respectfully further replies that Applicant's arguments have been fully considered but are not persuasive. The present action is made final.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Newburn (US 20170262320 A1) in view of Dwivedi et al. (US 20210192287 A1).
Regarding claim 1, Newburn teaches that an apparatus (See Newburn: Figs. 1-3, and [0024], "FIG. 2 is a block diagram illustrating an exemplary system 200 upon which the functionality of any of the processes and methodologies discussed herein may be performed, whether in whole, in part, or in combination with each other. The system 200 may operate as a standalone device or may be connected (e.g., networked) to other machines. In a networked deployment, the system 200 may operate in the capacity of either a server or a client machine in server-client network environments, or it may act as a peer machine in peer-to-peer (or distributed) network environments. The system 200 may be (or include) a personal computer (PC), a tablet PC, a set-top box (STB), a Personal Digital Assistant (PDA), a mobile telephone, a web appliance, a network router, a switch, a bridge, and/or any machine capable of executing instructions (sequential or otherwise) that specify actions to be taken by that machine. Further, while only a single machine is illustrated, the term "machine" or "system" shall also be taken to include any collection of machines or and/or systems that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein") comprising:
at least one processing unit (See Newburn: Fig. 2, and [0025], "In the example depicted in FIG. 2, computer system 200 includes a processor 202 (e.g., a central processing unit (CPU), a graphics processing unit (GPU) or both), a main memory 201 and a static memory 206, which communicate with each other via a bus 208. The computer system 200 may further include a display unit 210, an alphanumeric input device 217 (e.g., a keyboard), and a user interface (UI) navigation device 211 (e.g., a mouse). In one embodiment, the display unit 210, input device 217 and UI navigation device 211 are a touch screen display. The computer system 200 may additionally include a storage device (e.g., drive unit) 216, a signal generation device 218 (e.g., a speaker), a network interface device 220, and one or more sensors 221, such as a global positioning system (GPS) sensor, compass, accelerometer, or other sensor"); and
a resource manager configured to selectively allocate a first memory portion or a second memory portion to the at least one processing unit based on memory access characteristics (See Newburn: Fig. 11, and [0109], "Each year, with new memory types to choose from (including high bandwidth and non-volatile memory), platform complexity and heterogeneity is increasing, with different node types, different amounts, and different kinds of memory per node. Furthermore, some platforms have memory hierarchies where some computing resources have lower latencies and higher bandwidth to some memory components than other computing resources. On such platforms, appropriately affinitizing tasks and their data can have a double-digit performance impact, e.g. in sub-NUMA clustering on KNL. Embodiments of the present disclosure may be used to help take optimal advantage of these memory resources by managing where memory is allocated and how it is affinitized"; and [0140], "The data collection may include any number and type of properties, such as one or more of: writeability of the data collection, whether local storage needs to be updated on a write, whether a partial result needs to be combined with other distributed operations, whether access is atomic, whether access requires strict ordering, a reference pattern, a hardware characteristic, a type of memory associated with the data collection, a memory allocation policy, a size of the data collection, and a state of the data collection. In some embodiments, the properties may include information on a type of memory (which may include one or more of volatile or persistent memory, large or small capacity, high- or low-bandwidth memory, streamed or random-accessed, cached memory, HW-coherent or software-managed memory, special structures for gather/scatter, and structures customized to particular access patterns)". 
Note that different hardware platforms have different memories, and an application whose tasks require different memory access properties may be affinitized to the different memories, which maps to allocating the memory to the processing units (processing tasks)), wherein the first memory portion has a first latency that is lower than a second latency of the second memory portion.
However, Newburn fails to explicitly disclose that the first memory portion has a first latency that is lower than a second latency of the second memory portion.
Dwivedi, on the other hand, teaches that the first memory portion has a first latency that is lower than a second latency of the second memory portion (See Dwivedi: Figs. 22A-D, and [0349], "In at least one embodiment, memory and cache interconnect 2268 is an interconnect network that connects each functional unit of graphics multiprocessor 2234 to register file 2258 and to shared memory 2270. In at least one embodiment, memory and cache interconnect 2268 is a crossbar interconnect that allows load/store unit 2266 to implement load and store operations between shared memory 2270 and register file 2258. In at least one embodiment, register file 2258 can operate at a same frequency as GPGPU cores 2262, thus data transfer between GPGPU cores 2262 and register file 2258 is very low latency. In at least one embodiment, shared memory 2270 can be used to enable communication between threads that execute on functional units within graphics multiprocessor 2234. In at least one embodiment, cache memory 2272 can be used as a data cache for example, to cache texture data communicated between functional units and texture unit 2236. In at least one embodiment, shared memory 2270 can also be used as a program managed cached. In at least one embodiment, threads executing on GPGPU cores 2262 can programmatically store data within shared memory in addition to automatically cached data that is stored within cache memory 2272").
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Newburn so that the first memory portion has a first latency that is lower than a second latency of the second memory portion, as taught by Dwivedi, in order to improve efficiency for memory ranges shared between processors (See Dwivedi: Fig. 37, and [0483], "Combining data cache and shared memory functionality into a single memory block provides improved performance for both types of memory accesses, in at least one embodiment. In at least one embodiment, capacity is used or is usable as a cache by programs that do not use shared memory, such as if shared memory is configured to use half of capacity, texture and load/store operations can use remaining capacity. Integration within shared memory/L1 cache 3718 enables shared memory/L1 cache 3718 to function as a high-throughput conduit for streaming data while simultaneously providing high-bandwidth and low-latency access to frequently reused data, in accordance with at least one embodiment. In at least one embodiment, when configured for general purpose parallel computation, a simpler configuration can be used compared with graphics processing. In at least one embodiment, fixed function graphics processing units are bypassed, creating a much simpler programming model. In general purpose parallel computation configuration, work distribution unit assigns and distributes blocks of threads directly to DPCs, in at least one embodiment. In at least one embodiment, threads in a block execute same program, using a unique thread ID in calculation to ensure each thread generates unique results, using SM 3700 to execute program and perform calculations, shared memory/L1 cache 3718 to communicate between threads, and LSU 3714 to read and write global memory through shared memory/L1 cache 3718 and memory partition unit. In at least one embodiment, when configured for general purpose parallel computation, SM 3700 writes commands that scheduler unit 3704 can use to launch new work on DPCs").
Newburn teaches a method and system that may dynamically allocate computer resources to the processor to execute the set of actions and order each respective action for execution on the computing resources by the software component, and Dwivedi teaches a system and method that may allocate different memory (cache, register file, and/or shared memory) with different latencies to the processor based on the operations/functions/data executed on the memory by the programs. Therefore, it would have been obvious to one of ordinary skill in the art to modify Newburn in view of Dwivedi to allocate different memories with different latencies to the processor according to the characteristics of the memory access and the data to be processed. The motivation to modify Newburn in view of Dwivedi is "Use of known technique to improve similar devices (methods, or products) in the same way".
Regarding claim 2, Newburn and Dwivedi teach all the features with respect to claim 1 as outlined above. Further, Dwivedi teaches that the apparatus of claim 1, wherein the memory access characteristics indicate a latency sensitivity of an application (See Dwivedi: Fig. 37, and [0483], "Combining data cache and shared memory functionality into a single memory block provides improved performance for both types of memory accesses, in at least one embodiment. In at least one embodiment, capacity is used or is usable as a cache by programs that do not use shared memory, such as if shared memory is configured to use half of capacity, texture and load/store operations can use remaining capacity. Integration within shared memory/L1 cache 3718 enables shared memory/L1 cache 3718 to function as a high-throughput conduit for streaming data while simultaneously providing high-bandwidth and low-latency access to frequently reused data, in accordance with at least one embodiment. In at least one embodiment, when configured for general purpose parallel computation, a simpler configuration can be used compared with graphics processing. In at least one embodiment, fixed function graphics processing units are bypassed, creating a much simpler programming model. In general purpose parallel computation configuration, work distribution unit assigns and distributes blocks of threads directly to DPCs, in at least one embodiment. In at least one embodiment, threads in a block execute same program, using a unique thread ID in calculation to ensure each thread generates unique results, using SM 3700 to execute program and perform calculations, shared memory/L1 cache 3718 to communicate between threads, and LSU 3714 to read and write global memory through shared memory/L1 cache 3718 and memory partition unit. In at least one embodiment, when configured for general purpose parallel computation, SM 3700 writes commands that scheduler unit 3704 can use to launch new work on DPCs").
Regarding claim 3, Newburn and Dwivedi teach all the features with respect to claim 1 as outlined above. Further, Newburn teaches that the apparatus of claim 1, wherein the resource manager allocates the first memory portion in response to memory access requests having a low degree of locality or irregular memory access patterns (See Newburn: Fig. 13, and [0140], "The data collection may include any number and type of properties, such as one or more of: writeability of the data collection, whether local storage needs to be updated on a write, whether a partial result needs to be combined with other distributed operations, whether access is atomic, whether access requires strict ordering, a reference pattern, a hardware characteristic, a type of memory associated with the data collection, a memory allocation policy, a size of the data collection, and a state of the data collection. In some embodiments, the properties may include information on a type of memory (which may include one or more of volatile or persistent memory, large or small capacity, high- or low-bandwidth memory, streamed or random-accessed, cached memory, HW-coherent or software-managed memory, special structures for gather/scatter, and structures customized to particular access patterns)").
Regarding claim 4, Newburn and Dwivedi teach all the features with respect to claim 3 as outlined above. Further, Dwivedi teaches that the apparatus of claim 3, wherein the resource manager allocates the second memory portion in response to the memory access requests having a relatively high degree of locality or regular memory access patterns (See Dwivedi: Figs. 17A-F, and [0286], "In at least one embodiment, a bias table entry associated with each access to GPU-attached memory 1720-1723 is accessed prior to actual access to a GPU memory, causing the following operations. First, local requests from GPU 1710-1713 that find their page in GPU bias are forwarded directly to a corresponding GPU memory 1720-1723. Local requests from a GPU that find their page in host bias are forwarded to processor 1705 (e.g., over a high-speed link as discussed above). In one embodiment, requests from processor 1705 that find a requested page in host processor bias complete a request like a normal memory read. Alternatively, requests directed to a GPU-biased page may be forwarded to GPU 1710-1713. In at least one embodiment, a GPU may then transition a page to a host processor bias if it is not currently using a page. In at least one embodiment, bias state of a page can be changed either by a software-based mechanism, a hardware-assisted software-based mechanism, or, for a limited set of cases, a purely hardware-based mechanism").
Regarding claim 5, Newburn and Dwivedi teach all the features with respect to claim 3 as outlined above. Further, Dwivedi teaches that the apparatus of claim 3, wherein the irregularity of memory access requests is determined based on hints included in corresponding program code (See Dwivedi: Figs. 12A-D, and [0139], "In at least one embodiment, GPU(s) 1208 may include any number of access counters that may keep track of frequency of access of GPU(s) 1208 to memory of other processors. In at least one embodiment, access counter(s) may help ensure that memory pages are moved to physical memory of processor that is accessing pages most frequently, thereby improving efficiency for memory ranges shared between processors"; and Figs. 22A-D, and [0345], "In at least one embodiment, instruction cache 2252 receives a stream of instructions to execute from pipeline manager 2232. In at least one embodiment, instructions are cached in instruction cache 2252 and dispatched for execution by instruction unit 2254. In at least one embodiment, instruction unit 2254 can dispatch instructions as thread groups (e.g., warps), with each thread of thread group assigned to a different execution unit within GPGPU core 2262. In at least one embodiment, an instruction can access any of a local, shared, or global address space by specifying an address within a unified address space. In at least one embodiment, address mapping unit 2256 can be used to translate addresses in a unified address space into a distinct memory address that can be accessed by load/store units 2266").
Regarding claim 6, Newburn and Dwivedi teach all the features with respect to claim 1 as outlined above. Further, Dwivedi teaches that the apparatus of claim 1, wherein the resource manager is configured to monitor memory access requests and measure statistics for the memory access requests (See Dwivedi: Fig. 35, and [0466], "In at least one embodiment, MMU 3518 provides an interface between GPC 3500 and memory partition unit (e.g., partition unit 3422 of FIG. 34) and MMU 3518 provides translation of virtual addresses into physical addresses, memory protection, and arbitration of memory requests. In at least one embodiment, MMU 3518 provides one or more translation lookaside buffers ("TLBs") for performing translation of virtual addresses into physical addresses in memory").
Regarding claim 7, Newburn and Dwivedi teach all the features with respect to claim 6 as outlined above. Further, Dwivedi teaches that the apparatus of claim 6, wherein the statistics represent a cache miss rate or a row buffer miss rate for the monitored memory access requests (See Dwivedi: Figs. 22A-D, and [0332], "FIG. 22B is a block diagram of a partition unit 2220 according to at least one embodiment. In at least one embodiment, partition unit 2220 is an instance of one of partition units 2220A-2220N of FIG. 22A. In at least one embodiment, partition unit 2220 includes an L2 cache 2221, a frame buffer interface 2225, and a ROP 2226 (raster operations unit). L2 cache 2221 is a read/write cache that is configured to perform load and store operations received from memory crossbar 2216 and ROP 2226. In at least one embodiment, read misses and urgent write-back requests are output by L2 cache 2221 to frame buffer interface 2225 for processing. In at least one embodiment, updates can also be sent to a frame buffer via frame buffer interface 2225 for processing. In at least one embodiment, frame buffer interface 2225 interfaces with one of memory units in parallel processor memory, such as memory units 2224A-2224N of FIG. 22 (e.g., within parallel processor memory 2222)"; and Fig. 25, and [0368], "In at least one embodiment, uop schedulers 2502, 2504, 2506, dispatch dependent operations before parent load has finished executing. In at least one embodiment, as uops may be speculatively scheduled and executed in processor 2500, processor 2500 may also include logic to handle memory misses. In at least one embodiment, if a data load misses in data cache, there may be dependent operations in flight in pipeline that have left scheduler with temporarily incorrect data. In at least one embodiment, a replay mechanism tracks and re-executes instructions that use incorrect data. In at least one embodiment, dependent operations might need to be replayed and independent ones may be allowed to complete. In at least one embodiment, schedulers and replay mechanism of at least one embodiment of a processor may also be designed to catch instruction sequences for text string comparison operations").
Regarding claim 8, Newburn and Dwivedi teach all the features with respect to claim 6 as outlined above. Further, Dwivedi teaches that the apparatus of claim 6, wherein the resource manager is configured to allocate or reallocate the first memory portion or the second memory portion based on the statistics (See Dwivedi: Fig. 37, and [0483], "Combining data cache and shared memory functionality into a single memory block provides improved performance for both types of memory accesses, in at least one embodiment. In at least one embodiment, capacity is used or is usable as a cache by programs that do not use shared memory, such as if shared memory is configured to use half of capacity, texture and load/store operations can use remaining capacity. Integration within shared memory/L1 cache 3718 enables shared memory/L1 cache 3718 to function as a high-throughput conduit for streaming data while simultaneously providing high-bandwidth and low-latency access to frequently reused data, in accordance with at least one embodiment. In at least one embodiment, when configured for general purpose parallel computation, a simpler configuration can be used compared with graphics processing. In at least one embodiment, fixed function graphics processing units are bypassed, creating a much simpler programming model. In general purpose parallel computation configuration, work distribution unit assigns and distributes blocks of threads directly to DPCs, in at least one embodiment. In at least one embodiment, threads in a block execute same program, using a unique thread ID in calculation to ensure each thread generates unique results, using SM 3700 to execute program and perform calculations, shared memory/L1 cache 3718 to communicate between threads, and LSU 3714 to read and write global memory through shared memory/L1 cache 3718 and memory partition unit. In at least one embodiment, when configured for general purpose parallel computation, SM 3700 writes commands that scheduler unit 3704 can use to launch new work on DPCs").
Regarding claim 9, Newburn and Dwivedi teach all the features with respect to claim 1 as outlined above. Further, Dwivedi teaches that the apparatus of claim 1, further comprising:
at least one of a heterogeneous memory chip or a heterogeneous memory stack that comprises the first memory portion and the second memory portion (See Dwivedi: Figs. 9A-B, and [0083], "In at least one embodiment, inference and/or training logic 915 may include, without limitation, code and/or data storage 901 to store forward and/or output weight and/or input/output data, and/or other parameters to configure neurons or layers of a neural network trained and/or used for inferencing in aspects of one or more embodiments. In at least one embodiment, training logic 915 may include, or be coupled to code and/or data storage 901 to store graph code or other software to control timing and/or order, in which weight and/or other parameter information is to be loaded to configure, logic, including integer and/or floating point units (collectively, arithmetic logic units (ALUs). In at least one embodiment, code, such as graph code, loads weight or other parameter information into processor ALUs based on an architecture of a neural network to which the code corresponds. In at least one embodiment code and/or data storage 901 stores weight parameters and/or input/output data of each layer of a neural network trained or used in conjunction with one or more embodiments during forward propagation of input/output data and/or weight parameters during training and/or inferencing using aspects of one or more embodiments. In at least one embodiment, any portion of code and/or data storage 901 may be included with other on-chip or off-chip data storage, including a processor's L1, L2, or L3 cache or system memory").
Regarding claim 10, Newburn and Dwivedi teach all the features with respect to claim 1 as outlined above. Further, Dwivedi teaches that the apparatus of claim 1, further comprising:
a plurality of chiplets comprising a plurality of processing units and a plurality of coprocessors configured to implement instances of the resource manager, wherein a first subset of the chiplets comprises the first memory portion and a second subset of the chiplets comprises the second memory portion (See Dwivedi: Fig. 21, and [0313], "FIG. 21 is a block diagram illustrating a computing system 2100 according to at least one embodiment. In at least one embodiment, computing system 2100 includes a processing subsystem 2101 having one or more processor(s) 2102 and a system memory 2104 communicating via an interconnection path that may include a memory hub 2105. In at least one embodiment, memory hub 2105 may be a separate component within a chipset component or may be integrated within one or more processor(s) 2102. In at least one embodiment, memory hub 2105 couples with an I/O subsystem 2111 via a communication link 2106. In at least one embodiment, I/O subsystem 2111 includes an I/O hub 2107 that can enable computing system 2100 to receive input from one or more input device(s) 2108. In at least one embodiment, I/O hub 2107 can enable a display controller, which may be included in one or more processor(s) 2102, to provide outputs to one or more display device(s) 2110A. In at least one embodiment, one or more display device(s) 2110A coupled with I/O hub 2107 can include a local, internal, or embedded display device").
Regarding claim 11, Newburn and Dwivedi teach all the features with respect to claim 1 as outlined above. Further, Newburn and Dwivedi teach that a method (See Newburn: Figs. 1-3, and [0024], "FIG. 2 is a block diagram illustrating an exemplary system 200 upon which the functionality of any of the processes and methodologies discussed herein may be performed, whether in whole, in part, or in combination with each other. The system 200 may operate as a standalone device or may be connected (e.g., networked) to other machines. In a networked deployment, the system 200 may operate in the capacity of either a server or a client machine in server-client network environments, or it may act as a peer machine in peer-to-peer (or distributed) network environments. The system 200 may be (or include) a personal computer (PC), a tablet PC, a set-top box (STB), a Personal Digital Assistant (PDA), a mobile telephone, a web appliance, a network router, a switch, a bridge, and/or any machine capable of executing instructions (sequential or otherwise) that specify actions to be taken by that machine. Further, while only a single machine is illustrated, the term "machine" or "system" shall also be taken to include any collection of machines or and/or systems that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein") comprising:
selectively allocating a first memory portion or a second memory portion to at least one processing unit based on memory access characteristics (See Newburn: Fig. 11, and [0109], "Each year, with new memory types to choose from (including high bandwidth and non-volatile memory), platform complexity and heterogeneity is increasing, with different node types, different amounts, and different kinds of memory per node. Furthermore, some platforms have memory hierarchies where some computing resources have lower latencies and higher bandwidth to some memory components than other computing resources. On such platforms, appropriately affinitizing tasks and their data can have a double-digit performance impact, e.g. in sub-NUMA clustering on KNL. Embodiments of the present disclosure may be used to help take optimal advantage of these memory resources by managing where memory is allocated and how it is affinitized"; and [0140], "The data collection may include any number and type of properties, such as one or more of: writeability of the data collection, whether local storage needs to be updated on a write, whether a partial result needs to be combined with other distributed operations, whether access is atomic, whether access requires strict ordering, a reference pattern, a hardware characteristic, a type of memory associated with the data collection, a memory allocation policy, a size of the data collection, and a state of the data collection. In some embodiments, the properties may include information on a type of memory (which may include one or more of volatile or persistent memory, large or small capacity, high- or low-bandwidth memory, streamed or random-accessed, cached memory, HW-coherent or software-managed memory, special structures for gather/scatter, and structures customized to particular access patterns)". 
Note that different hardware platforms have different memories, and an application whose tasks require different memory access properties may be affinitized to the different memories, which maps to allocating the memory to the processing units (processing tasks)), wherein the first memory portion has a first latency that is lower than a second latency of the second memory portion (See Dwivedi: Figs. 22A-D, and [0349], "In at least one embodiment, memory and cache interconnect 2268 is an interconnect network that connects each functional unit of graphics multiprocessor 2234 to register file 2258 and to shared memory 2270. In at least one embodiment, memory and cache interconnect 2268 is a crossbar interconnect that allows load/store unit 2266 to implement load and store operations between shared memory 2270 and register file 2258. In at least one embodiment, register file 2258 can operate at a same frequency as GPGPU cores 2262, thus data transfer between GPGPU cores 2262 and register file 2258 is very low latency. In at least one embodiment, shared memory 2270 can be used to enable communication between threads that execute on functional units within graphics multiprocessor 2234. In at least one embodiment, cache memory 2272 can be used as a data cache for example, to cache texture data communicated between functional units and texture unit 2236. In at least one embodiment, shared memory 2270 can also be used as a program managed cached. In at least one embodiment, threads executing on GPGPU cores 2262 can programmatically store data within shared memory in addition to automatically cached data that is stored within cache memory 2272"); and
executing, on the at least one processing unit, at least one of an application or a kernel using the allocated first or second memory portion (See Dwivedi: Figs. 34-36, and [0495], "In at least one embodiment, a host processor executes a driver kernel that implements an application programming interface ("API") that enables one or more applications executing on host processor to schedule operations for execution on PPU 3400. In at least one embodiment, multiple compute applications are simultaneously executed by PPU 3400 and PPU 3400 provides isolation, quality of service ("QoS"), and independent address spaces for multiple compute applications. In at least one embodiment, an application generates instructions (e.g., in form of API calls) that cause driver kernel to generate one or more tasks for execution by PPU 3400 and driver kernel outputs tasks to one or more streams being processed by PPU 3400. In at least one embodiment, each task comprises one or more groups of related threads, which may be referred to as a warp. In at least one embodiment, a warp comprises a plurality of related threads (e.g., 32 threads) that can be executed in parallel. In at least one embodiment, cooperating threads can refer to a plurality of threads including instructions to perform task and that exchange data through shared memory. In at least one embodiment, threads and cooperating threads are described in more detail, in accordance with at least one embodiment, in conjunction with FIG. 36").
Regarding claim 12, Newburn and Dwivedi teach all the features with respect to claim 11 as outlined above. Further, Newburn teaches that the method of claim 11, wherein 
selectively allocating the first memory portion or the second memory portion comprises allocating the first memory portion in response to memory access requests having a low degree of locality or irregular memory access patterns (See Newburn: Fig. 13, and [0140], "The data collection may include any number and type of properties, such as one or more of: writeability of the data collection, whether local storage needs to be updated on a write, whether a partial result needs to be combined with other distributed operations, whether access is atomic, whether access requires strict ordering, a reference pattern, a hardware characteristic, a type of memory associated with the data collection, a memory allocation policy, a size of the data collection, and a state of the data collection. In some embodiments, the properties may include information on a type of memory (which may include one or more of volatile or persistent memory, large or small capacity, high- or low-bandwidth memory, streamed or random-accessed, cached memory, HW-coherent or software-managed memory, special structures for gather/scatter, and structures customized to particular access patterns)").
Regarding claim 13, Newburn and Dwivedi teach all the features with respect to claim 12 as outlined above. Further, Dwivedi teaches that the method of claim 12, wherein selectively allocating the first memory portion or the second memory portion comprises allocating the second memory portion in response to the memory access requests having a relatively high degree of locality or regular memory access patterns (See Dwivedi: Figs. 17A-F, and [0286], "In at least one embodiment, a bias table entry associated with each access to GPU-attached memory 1720-1723 is accessed prior to actual access to a GPU memory, causing the following operations. First, local requests from GPU 1710-1713 that find their page in GPU bias are forwarded directly to a corresponding GPU memory 1720-1723. Local requests from a GPU that find their page in host bias are forwarded to processor 1705 (e.g., over a high-speed link as discussed above). In one embodiment, requests from processor 1705 that find a requested page in host processor bias complete a request like a normal memory read. Alternatively, requests directed to a GPU-biased page may be forwarded to GPU 1710-1713. In at least one embodiment, a GPU may then transition a page to a host processor bias if it is not currently using a page. In at least one embodiment, bias state of a page can be changed either by a software-based mechanism, a hardware-assisted software-based mechanism, or, for a limited set of cases, a purely hardware-based mechanism").
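For illustration only, the locality-based selection recited in claims 12 and 13 can be sketched as follows. This sketch is not taken from Newburn, Dwivedi, or the application; the function names, the 64-byte line size, and the 0.5 threshold are all hypothetical assumptions chosen to make the example concrete.

```python
from typing import Sequence

CACHE_LINE_BYTES = 64       # assumed line size, illustration only
LOCALITY_THRESHOLD = 0.5    # assumed cut-off, illustration only


def locality_score(addresses: Sequence[int]) -> float:
    """Fraction of consecutive accesses that fall within one cache line of each other."""
    if len(addresses) < 2:
        return 1.0  # too few accesses to call the pattern irregular
    near = sum(
        1
        for prev, cur in zip(addresses, addresses[1:])
        if abs(cur - prev) < CACHE_LINE_BYTES
    )
    return near / (len(addresses) - 1)


def select_memory_portion(addresses: Sequence[int]) -> str:
    """Return "first" (lower latency) for low locality, otherwise "second"."""
    if locality_score(addresses) < LOCALITY_THRESHOLD:
        return "first"
    return "second"
```

Under these assumptions, a unit-stride trace such as [0, 8, 16, 24, 32] selects the second memory portion, while a scattered trace such as [0, 4096, 128, 65536, 512] selects the first memory portion.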
Regarding claim 14, Newburn and Dwivedi teach all the features with respect to claim 12 as outlined above. Further, Dwivedi teaches that the method of claim 12, further comprising:
determining the irregularity of memory access requests from the application based on hints included in corresponding program code (See Dwivedi: Figs. 12A-D, and [0139], "In at least one embodiment, GPU(s) 1208 may include any number of access counters that may keep track of frequency of access of GPU(s) 1208 to memory of other processors. In at least one embodiment, access counter(s) may help ensure that memory pages are moved to physical memory of processor that is accessing pages most frequently, thereby improving efficiency for memory ranges shared between processors"; and Figs. 22A-D, and [0345], "In at least one embodiment, instruction cache 2252 receives a stream of instructions to execute from pipeline manager 2232. In at least one embodiment, instructions are cached in instruction cache 2252 and dispatched for execution by instruction unit 2254. In at least one embodiment, instruction unit 2254 can dispatch instructions as thread groups (e.g., warps), with each thread of thread group assigned to a different execution unit within GPGPU core 2262. In at least one embodiment, an instruction can access any of a local, shared, or global address space by specifying an address within a unified address space. In at least one embodiment, address mapping unit 2256 can be used to translate addresses in a unified address space into a distinct memory address that can be accessed by load/store units 2266").
Regarding claim 15, Newburn and Dwivedi teach all the features with respect to claim 11 as outlined above. Further, Dwivedi teaches that the method of claim 11, further comprising: monitoring memory access requests; and measuring statistics for the memory access requests (See Dwivedi: Fig. 35, and [0466], "In at least one embodiment, MMU 3518 provides an interface between GPC 3500 and memory partition unit (e.g., partition unit 3422 of FIG. 34) and MMU 3518 provides translation of virtual addresses into physical addresses, memory protection, and arbitration of memory requests. In at least one embodiment, MMU 3518 provides one or more translation lookaside buffers ("TLBs") for performing translation of virtual addresses into physical addresses in memory").
Regarding claim 16, Newburn and Dwivedi teach all the features with respect to claim 15 as outlined above. Further, Dwivedi teaches that the method of claim 15, wherein measuring the statistics comprises measuring a cache miss rate or a row buffer miss rate for the monitored memory access requests (See Dwivedi: Figs. 22A-D, and [0332], "FIG. 22B is a block diagram of a partition unit 2220 according to at least one embodiment. In at least one embodiment, partition unit 2220 is an instance of one of partition units 2220A-2220N of FIG. 22A. In at least one embodiment, partition unit 2220 includes an L2 cache 2221, a frame buffer interface 2225, and a ROP 2226 (raster operations unit). L2 cache 2221 is a read/write cache that is configured to perform load and store operations received from memory crossbar 2216 and ROP 2226. In at least one embodiment, read misses and urgent write-back requests are output by L2 cache 2221 to frame buffer interface 2225 for processing. In at least one embodiment, updates can also be sent to a frame buffer via frame buffer interface 2225 for processing. In at least one embodiment, frame buffer interface 2225 interfaces with one of memory units in parallel processor memory, such as memory units 2224A-2224N of FIG. 22 (e.g., within parallel processor memory 2222)"; and Fig. 25, and [0368], "In at least one embodiment, uop schedulers 2502, 2504, 2506, dispatch dependent operations before parent load has finished executing. In at least one embodiment, as uops may be speculatively scheduled and executed in processor 2500, processor 2500 may also include logic to handle memory misses. In at least one embodiment, if a data load misses in data cache, there may be dependent operations in flight in pipeline that have left scheduler with temporarily incorrect data. In at least one embodiment, a replay mechanism tracks and re-executes instructions that use incorrect data. 
In at least one embodiment, dependent operations might need to be replayed and independent ones may be allowed to complete. In at least one embodiment, schedulers and replay mechanism of at least one embodiment of a processor may also be designed to catch instruction sequences for text string comparison operations").
Regarding claim 17, Newburn and Dwivedi teach all the features with respect to claim 15 as outlined above. Further, Dwivedi teaches that the method of claim 15, further comprising: allocating or reallocating the first memory portion or the second memory portion based on the statistics (See Dwivedi: Fig. 37, and [0483], "Combining data cache and shared memory functionality into a single memory block provides improved performance for both types of memory accesses, in at least one embodiment. In at least one embodiment, capacity is used or is usable as a cache by programs that do not use shared memory, such as if shared memory is configured to use half of capacity, texture and load/store operations can use remaining capacity. Integration within shared memory/L1 cache 3718 enables shared memory/L1 cache 3718 to function as a high-throughput conduit for streaming data while simultaneously providing high-bandwidth and low-latency access to frequently reused data, in accordance with at least one embodiment. In at least one embodiment, when configured for general purpose parallel computation, a simpler configuration can be used compared with graphics processing. In at least one embodiment, fixed function graphics processing units are bypassed, creating a much simpler programming model. In general purpose parallel computation configuration, work distribution unit assigns and distributes blocks of threads directly to DPCs, in at least one embodiment. In at least one embodiment, threads in a block execute same program, using a unique thread ID in calculation to ensure each thread generates unique results, using SM 3700 to execute program and perform calculations, shared memory/L1 cache 3718 to communicate between threads, and LSU 3714 to read and write global memory through shared memory/L1 cache 3718 and memory partition unit. 
In at least one embodiment, when configured for general purpose parallel computation, SM 3700 writes commands that scheduler unit 3704 can use to launch new work on DPCs").
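For illustration only, the sequence recited in claims 15 through 17 (monitoring memory access requests, measuring a statistic such as a cache miss rate, and allocating or reallocating a memory portion based on that statistic) can be sketched as follows. This sketch is not taken from Newburn, Dwivedi, or the application; the class name, function name, and the 0.3 threshold are hypothetical assumptions.

```python
from dataclasses import dataclass


@dataclass
class AccessMonitor:
    """Counts cache hits and misses for monitored memory access requests."""
    hits: int = 0
    misses: int = 0

    def record(self, hit: bool) -> None:
        # Monitoring step: tally each observed request.
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    def miss_rate(self) -> float:
        # Measuring step: misses divided by all observed requests.
        total = self.hits + self.misses
        return self.misses / total if total else 0.0


def choose_portion(monitor: AccessMonitor, threshold: float = 0.3) -> str:
    """Allocation step: a high miss rate selects the lower-latency "first" portion.

    The threshold is an assumed value for illustration only.
    """
    return "first" if monitor.miss_rate() > threshold else "second"
```

Calling choose_portion again after further requests have been recorded models reallocation: the returned portion changes once the measured miss rate crosses the assumed threshold.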
Regarding claim 18, Newburn and Dwivedi teach all the features with respect to claim 1 as outlined above. Further, Newburn and Dwivedi teach that a method (See Newburn: Figs. 1-3, and [0024], "FIG. 2 is a block diagram illustrating an exemplary system 200 upon which the functionality of any of the processes and methodologies discussed herein may be performed, whether in whole, in part, or in combination with each other. The system 200 may operate as a standalone device or may be connected (e.g., networked) to other machines. In a networked deployment, the system 200 may operate in the capacity of either a server or a client machine in server-client network environments, or it may act as a peer machine in peer-to-peer (or distributed) network environments. The system 200 may be (or include) a personal computer (PC), a tablet PC, a set-top box (STB), a Personal Digital Assistant (PDA), a mobile telephone, a web appliance, a network router, a switch, a bridge, and/or any machine capable of executing instructions (sequential or otherwise) that specify actions to be taken by that machine. Further, while only a single machine is illustrated, the term "machine" or "system" shall also be taken to include any collection of machines or and/or systems that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein") comprising:
selectively allocating a first memory portion or a second memory portion to at least one processing unit based on a latency sensitivity (See Newburn: Fig. 11, and [0109], "Each year, with new memory types to choose from (including high bandwidth and non-volatile memory), platform complexity and heterogeneity is increasing, with different node types, different amounts, and different kinds of memory per node. Furthermore, some platforms have memory hierarchies where some computing resources have lower latencies and higher bandwidth to some memory components than other computing resources. On such platforms, appropriately affinitizing tasks and their data can have a double-digit performance impact, e.g. in sub-NUMA clustering on KNL. Embodiments of the present disclosure may be used to help take optimal advantage of these memory resources by managing where memory is allocated and how it is affinitized"; and [0140], "The data collection may include any number and type of properties, such as one or more of: writeability of the data collection, whether local storage needs to be updated on a write, whether a partial result needs to be combined with other distributed operations, whether access is atomic, whether access requires strict ordering, a reference pattern, a hardware characteristic, a type of memory associated with the data collection, a memory allocation policy, a size of the data collection, and a state of the data collection. In some embodiments, the properties may include information on a type of memory (which may include one or more of volatile or persistent memory, large or small capacity, high- or low-bandwidth memory, streamed or random-accessed, cached memory, HW-coherent or software-managed memory, special structures for gather/scatter, and structures customized to particular access patterns)". 
Note that different hardware platforms have different memories, and an application whose tasks require different memory access properties may be affinitized to different memories, which maps to allocating the memory to the processing units (processing tasks)), wherein the first memory portion has a first latency that is lower than a second latency of the second memory portion (See Dwivedi: Figs. 22A-D, and [0349], "In at least one embodiment, memory and cache interconnect 2268 is an interconnect network that connects each functional unit of graphics multiprocessor 2234 to register file 2258 and to shared memory 2270. In at least one embodiment, memory and cache interconnect 2268 is a crossbar interconnect that allows load/store unit 2266 to implement load and store operations between shared memory 2270 and register file 2258. In at least one embodiment, register file 2258 can operate at a same frequency as GPGPU cores 2262, thus data transfer between GPGPU cores 2262 and register file 2258 is very low latency. In at least one embodiment, shared memory 2270 can be used to enable communication between threads that execute on functional units within graphics multiprocessor 2234. In at least one embodiment, cache memory 2272 can be used as a data cache for example, to cache texture data communicated between functional units and texture unit 2236. In at least one embodiment, shared memory 2270 can also be used as a program managed cached. In at least one embodiment, threads executing on GPGPU cores 2262 can programmatically store data within shared memory in addition to automatically cached data that is stored within cache memory 2272"); and
executing, on at least one processing unit, at least one of an application or a kernel using the allocated first or second memory portion (See Dwivedi: Figs. 34-36, and [0495], "In at least one embodiment, a host processor executes a driver kernel that implements an application programming interface ("API") that enables one or more applications executing on host processor to schedule operations for execution on PPU 3400. In at least one embodiment, multiple compute applications are simultaneously executed by PPU 3400 and PPU 3400 provides isolation, quality of service ("QoS"), and independent address spaces for multiple compute applications. In at least one embodiment, an application generates instructions (e.g., in form of API calls) that cause driver kernel to generate one or more tasks for execution by PPU 3400 and driver kernel outputs tasks to one or more streams being processed by PPU 3400. In at least one embodiment, each task comprises one or more groups of related threads, which may be referred to as a warp. In at least one embodiment, a warp comprises a plurality of related threads (e.g., 32 threads) that can be executed in parallel. In at least one embodiment, cooperating threads can refer to a plurality of threads including instructions to perform task and that exchange data through shared memory. In at least one embodiment, threads and cooperating threads are described in more detail, in accordance with at least one embodiment, in conjunction with FIG. 36").
Regarding claim 19, Newburn and Dwivedi teach all the features with respect to claim 18 as outlined above. Further, Dwivedi teaches that the method of claim 18, further comprising: 
determining latency sensitivity based on hints included in corresponding program code (See Dwivedi: Fig. 37, and [0483], "Combining data cache and shared memory functionality into a single memory block provides improved performance for both types of memory accesses, in at least one embodiment. In at least one embodiment, capacity is used or is usable as a cache by programs that do not use shared memory, such as if shared memory is configured to use half of capacity, texture and load/store operations can use remaining capacity. Integration within shared memory/L1 cache 3718 enables shared memory/L1 cache 3718 to function as a high-throughput conduit for streaming data while simultaneously providing high-bandwidth and low-latency access to frequently reused data, in accordance with at least one embodiment. In at least one embodiment, when configured for general purpose parallel computation, a simpler configuration can be used compared with graphics processing. In at least one embodiment, fixed function graphics processing units are bypassed, creating a much simpler programming model. In general purpose parallel computation configuration, work distribution unit assigns and distributes blocks of threads directly to DPCs, in at least one embodiment. In at least one embodiment, threads in a block execute same program, using a unique thread ID in calculation to ensure each thread generates unique results, using SM 3700 to execute program and perform calculations, shared memory/L1 cache 3718 to communicate between threads, and LSU 3714 to read and write global memory through shared memory/L1 cache 3718 and memory partition unit. In at least one embodiment, when configured for general purpose parallel computation, SM 3700 writes commands that scheduler unit 3704 can use to launch new work on DPCs").
Regarding claim 20, Newburn and Dwivedi teach all the features with respect to claim 18 as outlined above. Further, Dwivedi teaches that the method of claim 18, further comprising:
monitoring memory access requests (See Dwivedi: Fig. 7, and [0077], "In at least one embodiment, a resource monitoring engine 716 monitors available memory used during transform profiling 714 on specific input data 704. In at least one embodiment, a resource monitoring engine 716 imposes restrictions on transform resource usage in order to determine compute and memory performance impact of restrictions for transform profiling 714. In at least one embodiment, a resource monitoring engine 716 provides other resource usage information utilized by a transform controller to determine resource consumption profiles for individuals and sequences of data transforms when applied to specific input data 704");
measuring a cache miss rate or a row buffer miss rate for the monitored memory access requests (See Dwivedi: Figs. 22A-D, and [0332], "FIG. 22B is a block diagram of a partition unit 2220 according to at least one embodiment. In at least one embodiment, partition unit 2220 is an instance of one of partition units 2220A-2220N of FIG. 22A. In at least one embodiment, partition unit 2220 includes an L2 cache 2221, a frame buffer interface 2225, and a ROP 2226 (raster operations unit). L2 cache 2221 is a read/write cache that is configured to perform load and store operations received from memory crossbar 2216 and ROP 2226. In at least one embodiment, read misses and urgent write-back requests are output by L2 cache 2221 to frame buffer interface 2225 for processing. In at least one embodiment, updates can also be sent to a frame buffer via frame buffer interface 2225 for processing. In at least one embodiment, frame buffer interface 2225 interfaces with one of memory units in parallel processor memory, such as memory units 2224A-2224N of FIG. 22 (e.g., within parallel processor memory 2222)"; and Fig. 25, and [0368], "In at least one embodiment, uop schedulers 2502, 2504, 2506, dispatch dependent operations before parent load has finished executing. In at least one embodiment, as uops may be speculatively scheduled and executed in processor 2500, processor 2500 may also include logic to handle memory misses. In at least one embodiment, if a data load misses in data cache, there may be dependent operations in flight in pipeline that have left scheduler with temporarily incorrect data. In at least one embodiment, a replay mechanism tracks and re-executes instructions that use incorrect data. In at least one embodiment, dependent operations might need to be replayed and independent ones may be allowed to complete. 
In at least one embodiment, schedulers and replay mechanism of at least one embodiment of a processor may also be designed to catch instruction sequences for text string comparison operations"); and
determining a latency sensitivity based on the cache miss rate or the row buffer miss rate (See Dwivedi: Fig. 37, and [0483], "Combining data cache and shared memory functionality into a single memory block provides improved performance for both types of memory accesses, in at least one embodiment. In at least one embodiment, capacity is used or is usable as a cache by programs that do not use shared memory, such as if shared memory is configured to use half of capacity, texture and load/store operations can use remaining capacity. Integration within shared memory/L1 cache 3718 enables shared memory/L1 cache 3718 to function as a high-throughput conduit for streaming data while simultaneously providing high-bandwidth and low-latency access to frequently reused data, in accordance with at least one embodiment. In at least one embodiment, when configured for general purpose parallel computation, a simpler configuration can be used compared with graphics processing. In at least one embodiment, fixed function graphics processing units are bypassed, creating a much simpler programming model. In general purpose parallel computation configuration, work distribution unit assigns and distributes blocks of threads directly to DPCs, in at least one embodiment. In at least one embodiment, threads in a block execute same program, using a unique thread ID in calculation to ensure each thread generates unique results, using SM 3700 to execute program and perform calculations, shared memory/L1 cache 3718 to communicate between threads, and LSU 3714 to read and write global memory through shared memory/L1 cache 3718 and memory partition unit. In at least one embodiment, when configured for general purpose parallel computation, SM 3700 writes commands that scheduler unit 3704 can use to launch new work on DPCs").





Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to GORDON G LIU whose telephone number is (571)270-0382. The examiner can normally be reached Monday - Friday 8:00-5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jennifer Mehmood, can be reached at 571-272-2976. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/GORDON G LIU/Primary Examiner, Art Unit 2612