DETAILED ACTION
It is hereby acknowledged that the following papers have been received and placed of record in the file:
Amended Claims						-Receipt Date 03/23/2021
Applicant Arguments						-Receipt Date 03/23/2021		
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment
This office action is in response to the amendment filed on 03/23/2021. Claims 1-35 are pending. Claims 1, 6-8, 10-11, 16-19, 24, 28-30, and 34 are amended. Claim 35 is new. Applicant’s amendments to the claims have overcome the previous claim objections and the previous 112(a) written description rejections given in the Non-Final mailed on 12/23/2020. 

Response to Arguments
Applicant's arguments filed 03/23/2021 have been fully considered but they are not persuasive. 
Firstly, Examiner notes that no indication of allowable subject matter was given for claims 4-7, 21, and 22 as Applicant incorrectly acknowledges on page 11 of the Remarks. Applicant acknowledges the rejections of these claims on pages 12-13 of the Remarks, so it is unclear why Applicant would gather that these claims were indicated allowable. 
Applicant submits:
“The Office Action alleges that the "barrier between a compute phase and an exchange phase of the computer subsystem" is taught by the memory barriers in Gadre. Office Action, 5. The memory barriers in Gadre are barriers between different memory transactions. When a 
However, this argument is not persuasive because the broadest reasonable interpretation of a compute/exchange phase is a phase in which computing/exchanging is performed. Gadre teaches performing a memory transaction such that its results are visible to memory transactions after the barrier instruction ([0067]). The Office Action maps the time period before the barrier, in which the memory transaction is performed to produce the results, to the “compute phase”, and maps the time period after the barrier, when the results of the previous memory transaction are visible to other transactions, to the “exchange phase”.
Examiner suggests that adding further details that define the compute phase and exchange phase, such as details found in the Specification at [0055], would narrow the broadest reasonable interpretation of these terms and help to distinguish the claims from the applied references.

Applicant submits:
“Williams does not teach to pre-load data in advance of a barrier. Rather, Williams teaches to convert memory access operations that follow a barrier operation into a prefetch request. Williams, [0065]. A prefetch is therefore performed after a barrier operation has been established (see paras [0062] to [0064]).” (Remarks, page 14)

Examiner suggests that adding further language defining the pre-compiled data exchange point, and details about using a register to indicate a sync zone for an upcoming synchronization, as described in the Specification at [0063] and [0066], would narrow the broadest reasonable interpretation and distinguish the claims from the applied references.

Applicant submits:
“There is no teaching, however, that the MMU 328 in Gadre is external to a chip on which a general processing cluster 208 is formed. In para [0035] of Gadre it is taught that the parallel processing unit 202 is implemented on a single chip. As shown in Figure 2 of Gadre, the parallel processing unit 202 includes the general processing clusters 208. As explained in para [0051], each MMU 328 is included with one of the general processing clusters 208. Therefore, in the apparatus of Gadre, the MMU 328 (which the office has mapped to the claimed "gateway") is included on the same chip as the general processing cluster 208 (which the office has mapped to the claimed "computer subsystem"). For this reason, there is no teaching of "wherein the 
However, this argument is not persuasive. Gadre’s MMU is clearly shown as being external to the GPC in Fig. 3A. Gadre at [0035] only discloses “In other embodiments, a PPU 202 can be integrated on a single chip”, which does not limit the PPU to being integrated on a single chip. Gadre at [0051] discloses “In other embodiments, MMU(s) 328 may reside within the memory interface 214” and “The MMU 328 may include address translation lookaside buffers (TLB) or caches which may reside within multiprocessor SPM 310 or the L1 cache or GPC 208.” Further, Fig. 2 shows the MMU to be external to the GPC and only connected via crossbar as disclosed in [0041]. This indicates that the MMU of Gadre may be external to the chip that the GPC is implemented on.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2, 5-7, 9-11, 15-16, 18, and 35 is/are rejected under 35 U.S.C. 103 as being unpatentable over Gadre et al. US 2012/0198214 (hereinafter, Gadre) in view of Williams III US 2014/0025892 (hereinafter, Williams) and "Efficient Data Supply for Hardware Accelerators with Prefetching and Access/Execute Decoupling" (hereinafter, Chen).
Regarding claim 1, Gadre teaches:
1. A computer system comprising: 
a computer subsystem configured to act as a work accelerator ([0034], [0037], and Fig. 2: a GPC 208 of parallel processing subsystem 112 acts as a work accelerator for CPU 102), and 
a gateway connected to the computer subsystem ([0051]: MMU 328 is connected to an SPM of a GPC of the subsystem), the gateway enabling data transfer of data to the computer subsystem from external storage in relation to pre-compiled data exchange synchronization points attained by the computer subsystem ([0051], [0067], and [0107]-[0108]: the MMU enables transfer of data, i.e. loads, from L2/external storage in relation to barriers, i.e. pre-compiled data exchange synchronization points, attained by the subsystem; the MMU enables these loads by translating virtual addresses and by indicating MEMBAR completion which enables loads waiting on the MEMBAR to be performed), which act as a barrier between a compute phase and an exchange phase of the computer subsystem ([0067]: the MEMBAR instruction acts as a barrier so that results of memory transactions issued before the MEMBAR, i.e. a compute phase, are sufficiently performed and are visible to memory transactions issued after the MEMBAR, i.e. an exchange phase), 
wherein the computer subsystem comprises a plurality of processing units and a plurality of memories associated with the processing units ([0046] and [0060]: the GPC/computer subsystem comprises a plurality of SPMs, i.e. processing units, and a plurality of L1 caches associated with the SPMs, see also Fig. 3C), at least one of the memories including a first compiled code sequence comprising at least one instruction executable by at least one of the plurality of processing units to pull data from a gateway transfer memory of the gateway ([0047], [0060], and [0064]: the SPMs include an instruction cache for storing instructions, i.e. a first compiled code sequence, and the instructions include load instructions to be executed by the LSU of the SPM, i.e. at least one instruction executable by at least one of the processing units to pull data, where the load instruction may load data from L1.5 cache, i.e. a gateway transfer memory of/for the gateway, see also [0050], [0065], and Figs 3A, 3C) in response to one of the pre-compiled data exchange synchronization points ([0051], [0067], [0102], and [0107]-[0108]: a load instruction after the MEMBAR instruction attained by an SPM of the GPC will pull data in response to the MEMBAR completing), 
wherein the gateway is configured to perform at least one operation to load at least some of the data from a first memory of the gateway to the gateway transfer memory ([0050], [0052], [0054]: the MMU 328 loads data from L2, i.e. a first memory of the gateway, to the L1.5, see also Figs 3A and 3B).
	Although Gadre teaches loading in response to a barrier completing and the MMU loading data from an L2 into an L1.5, Gadre does not teach the SPMs pulling data in response to attaining the barrier instruction, the MMU comprising at least one processor, or the MMU pre-loading at least some of the data from L2. That is, Gadre does not teach:
at least one instruction to pull data from a gateway transfer memory of the gateway in response to one of the pre-compiled data exchange synchronization points attained by the computer subsystem, 
wherein the gateway comprises at least one processor configured to perform at least one operation to pre-load at least some of the data from a first memory of the gateway to the gateway transfer memory in advance of the one of the pre-compiled data exchange synchronization points being attained by the computer subsystem.
	However, in the analogous art of accessing memory with barriers, Williams teaches:
at least one instruction to pull data from a memory in response to the pre-compiled data exchange synchronization point attained by the subsystem ([0030], [0039]-[0040], and [0049]-[0050]: a memory access instruction is converted into a prefetch instruction in response to a barrier operation, i.e. a pre-compiled data exchange synchronization point, being established/attained), 
to perform at least one operation to pre-load at least some of the data in advance of the one of the pre-compiled data exchange synchronization points being attained by the computer subsystem ([0030] and [0039]-[0040]: the prefetch requests preload requested data in advance of the barrier instruction completing/being attained by the computer subsystem).
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the MMU of Gadre to prefetch data into the L1 from L1.5/L2 in response to a barrier attained by the system as taught by Williams. One of ordinary skill in the art would have been motivated to make this modification to reduce memory latency (Williams [0030]).
	Further, in the analogous art of accessing memory, Chen teaches: 
wherein the gateway comprises at least one processor (page 5 section A paragraphs 1-2: an access unit generates memory addresses and sends them to a memory unit to be forwarded to memory, the access unit may employ an OOO core)
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the MMU of Gadre in view of Williams to include a decoupled access unit including a core as taught by Chen such that the MMU includes an access unit and a decoupled memory unit. One of ordinary skill in the art would have been motivated to make this modification to reduce pipeline stalling due to load misses. 

	Regarding claim 2, Gadre in view of Williams and Chen teaches:
2. A computer system as claimed in claim 1, wherein the data to be pulled from the gateway transfer memory belongs to a plurality of streams (Gadre [0133] and Chen page 4 section III paragraph 1: the stream of memory commands in Williams may be interleaved instructions for different thread groups, i.e. a plurality of streams, see discussion of global/local streams in Chen, thus, in the combination of Gadre in view of Williams and Chen, the prefetched data pulled from L2 belongs to the plurality of streams interleaved in the stream of memory commands for the thread groups).

Regarding claim 5, Gadre in view of Williams and Chen teaches:
5. A computer system as claimed in claim 2, wherein the first compiled code sequence is configured to cause only one of the plurality of processing units to issue read requests to pull data of a first of the plurality of streams from the gateway transfer memory (Gadre [0060] and [0133]: the instructions/compiled code sequence in the instruction cache 370 of an SPM/processing unit may include load instructions which cause only that SPM to issue those load instructions/read requests to pull data from a stream from L1.5/L2).

	Regarding claim 6, Gadre in view of Williams and Chen teaches:
6. A computer system as claimed in claim 1, wherein the at least one processor is configured to, in advance of the one of the pre-compiled data exchange synchronization points attained by the computer subsystem, pre-load data to be pulled from the gateway transfer memory in response to each of a plurality of upcoming pre-compiled data exchange synchronization points attained by the computer subsystem (Williams [0048]-[0050]: load instructions waiting on barriers are converted into prefetches to preload data to be pulled from L2 in advance of the barriers/pre-compiled data exchange synchronization points attained by the subsystem; in the case that a plurality of barriers are attained, the load instructions that come after each barrier will be converted into prefetches for the data).

	Regarding claim 7, Gadre in view of Williams and Chen teaches:
7. A computer system as claimed in claim 1, wherein the computer subsystem is configured to execute the at least one instruction to pull data from the gateway transfer memory by issuing at least one read request to the gateway (Gadre [0050]-[0051], [0071], and [0133]: the SPM executes loads to pull data from L2 by issuing reads to memory via MMU 328).

Regarding claim 9, Gadre in view of Williams and Chen teaches:
9. A computer system as claimed in claim 7, wherein the at least one read request comprises at least one of: 
an address of the first memory (Gadre [0051]-[0052], [0064] and Williams [0049]: the read request caused by the load instruction includes an address of L2/first memory to load the data when it is located in L2); and 
a number of bytes to be pulled from the gateway transfer memory.

	Regarding claim 10, Gadre in view of Williams and Chen teaches:
10. A computer system as claimed in claim 1, wherein the first compiled code sequence comprises at least one instruction executable by the computer subsystem to pull a second set of data from the first memory in response to the one of the pre-compiled data exchange synchronization points attained by the computer subsystem (Gadre [0060], [0133], and [0158]: a second load may be included in the instruction memory of an SPM, see loads 807 and 808, where, in the combination with Williams, the second load will pull a second set of data from L2 in response to the barrier 806 being attained).

	Regarding claim 11, Gadre in view of Williams and Chen teaches:
11. A computer system as claimed in claim 10, wherein the at least one processor is configured to pre-load data of a first data stream from the first memory to the gateway transfer memory in advance of the one of the pre-compiled data exchange synchronization points attained by the computer subsystem, wherein the second set of data pulled from the first memory comprises data of a second data stream (Gadre [0060], [0133], [0158], and Williams [0048]-[0050]: the SPM prefetches/preloads data of a first data stream from L2 in advance of a barrier and the second set of data pulled from L2 may be data for a second stream for a different thread group of the SPM).

	Regarding claim 15, Gadre in view of Williams and Chen teaches: 
15. A computer system as claimed in claim 1, wherein the gateway comprises a streaming engine configured to execute a set of data transfer instructions to stream data through the gateway from the external storage to the computer subsystem, wherein the streaming engine comprises the at least one processor (Gadre [0042] and [0050], Chen page 5 section A paragraphs 1-2: Gadre teaches loading data from L2, DRAM, and system memory to on-chip memory and, in the combination with Chen, will use the processor as a streaming engine to execute the load instructions, i.e. a set of data transfer instructions, to load/stream data through the MMU into the SPMs).

	Regarding claim 16, Gadre in view of Williams and Chen teaches:
16. A computer system as claimed in claim 1, wherein the computer subsystem is configured to, in response to attaining the one of the pre-compiled data exchange synchronization points, transmit a synchronization request to the gateway (Gadre [0106]-[0108]: the tracking unit of the SPM outputs a MEMBAR command to the MMU, i.e. transmits a synchronization request to the gateway in response to attaining the pre-compiled data exchange synchronization point), 
wherein the gateway is configured to, in response to receiving the synchronization request, transmit a synchronization acknowledgment to the computer subsystem (Gadre [0107]: the MMU/gateway sends a MEMBAR ACK to the tracking unit in the SPM/computer subsystem), 
wherein the computer subsystem is configured to pull the at least some of the data from the gateway transfer memory in response to receiving the synchronization acknowledgement (Gadre [0108]: execution of threads that were waiting on the MEMBAR is resumed, which includes execution of load instructions that were waiting on a MEMBAR.SYS or GL, in the combination with Williams, loads waiting on a MEMBAR.CTA may be prefetched and loads waiting on lower level MEMBARs will load/pull data after the barrier completion).

Regarding claim 18, Gadre in view of Williams and Chen teaches:
18. A computer system as claimed in claim 1, wherein the gateway is configured to interface the computer subsystem with a host to enable the computer subsystem to act as a work accelerator to the host (Gadre [0034]-[0035], [0041], and [0051]: the MMU interfaces the SPMs with CPU 102, i.e. a host, via the I/O unit and crossbar to enable the SPMs of the PPU to act as a work accelerator to the CPU by executing the commands the CPU pushes to the PPUs), wherein the computer system comprises an accelerator interface configured to connect the computer subsystem to the gateway to enable the transfer of the data from the gateway to the computer subsystem (Gadre [0051] and Fig. 3A: the connection between the SPMs and the MMU is an accelerator interface since it enables the transfer of data from the MMU to the SPMs).

	Regarding claim 35, Gadre in view of Williams and Chen teaches:
35. A computer system as claimed in claim 1, wherein the computer subsystem is implemented on a chip, and wherein the gateway is external to the chip (Gadre [0041] and [0051]: the MMU resides in memory interface 214 which is connected to the GPC via crossbar, i.e. external to the chip the GPC is implemented on).

Claims 3 and 8 is/are rejected under 35 U.S.C. 103 as being unpatentable over Gadre et al. US 2012/0198214 (hereinafter, Gadre) in view of Williams III US 2014/0025892 (hereinafter, Williams), "Efficient Data Supply for Hardware Accelerators with Prefetching and Access/Execute Decoupling" (hereinafter, Chen), and Goodwin et al. US 5,659,713 (hereinafter, Goodwin)
Regarding claim 3, Gadre in view of Williams and Chen teaches:
3. A computer system as claimed in claim 2, 
	Gadre in view of Williams and Chen does not explicitly teach:
wherein the gateway transfer memory comprises a plurality of buffers, wherein each of the buffers is configured to store data belonging to an associated one of the plurality of streams.
	However, Goodwin further teaches:
wherein the gateway transfer memory comprises a plurality of buffers (Abstract and col 6 lines 15-16: a memory controller includes a stream buffer that comprises four FIFO buffers)
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of 
wherein each of the buffers is configured to store data belonging to an associated one of the plurality of streams (the buffers of Goodwin will store data belonging to the streams corresponding to the thread groups in Gadre)
	One of ordinary skill in the art would have been motivated to make this modification to reduce access time for multiple streams (Goodwin col 3 line 47- col 4 line 7).

	Regarding claim 8, Gadre in view of Williams and Chen teaches:
8.  A computer system as claimed in claim 7, 
	Gadre in view of Williams and Chen does not explicitly teach:
wherein the gateway is configured to pre-load some of the data, wherein the gateway is configured to receive the at least one read request and, in response to the at least one read request, load remaining data into the gateway transfer memory from the first memory to be pulled from the gateway transfer memory in response to the one of the pre-compiled data exchange synchronization points attained by the computer subsystem.
	However, Goodwin further teaches:
wherein the gateway is configured to pre-load some of the data, wherein the gateway is configured to receive the at least one read request and, in response to the at least one read request, load remaining data into the gateway transfer memory from the first memory to be pulled from the gateway transfer memory (col 3 lines 47-49 and col 9 lines 20-25: the memory controller/gateway prefetches/preloads a stream buffer/gateway transfer memory, and in response to a read request emptying the buffer remaining data is loaded from DRAM to be pulled from the buffer).

Claim 4 is/are rejected under 35 U.S.C. 103 as being unpatentable over Gadre et al. US 2012/0198214 (hereinafter, Gadre) in view of Williams III US 2014/0025892 (hereinafter, Williams), "Efficient Data Supply for Hardware Accelerators with Prefetching and Access/Execute Decoupling" (hereinafter, Chen), Goodwin et al. US 5,659,713 (hereinafter, Goodwin), and "Impulse: Building a Smarter Memory Controller" (hereinafter, Carter).
	Regarding claim 4, Gadre in view of Williams, Chen, and Goodwin teaches:
4.  A computer system as claimed in claim 3, 
	Gadre in view of Williams, Chen, and Goodwin does not explicitly teach:
wherein each of the buffers is a virtual data buffer, wherein at least one of the virtual data buffers store data in a physically discontiguous space in the gateway transfer memory.
	However, Carter teaches:
a virtual data buffer that stores data in a physically discontiguous space in the gateway transfer memory (section 1 paragraph 3, section 2.1 paragraph 2, section 3.2 paragraph 3: shadow memory, i.e. a virtual data buffer, store data in a physically discontiguous space in physical memory)
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the buffers of Gadre in view of Williams, Chen, and Goodwin to be .
	
Claim 12 is/are rejected under 35 U.S.C. 103 as being unpatentable over Gadre et al. US 2012/0198214 (hereinafter, Gadre) in view of Williams III US 2014/0025892 (hereinafter, Williams), "Efficient Data Supply for Hardware Accelerators with Prefetching and Access/Execute Decoupling" (hereinafter, Chen), and Maiyuran et al. US 6,898,674 (hereinafter, Maiyuran).
Regarding claim 12, Gadre in view of Williams and Chen teaches:
12.  A computer system as claimed in claim 10, 
	Gadre in view of Williams and Chen does not teach:
wherein the at least one processor of the gateway is configured to: 
check whether memory availability requirements are met for pre-loading the at least some of the data and the second set of data into the gateway transfer memory; and 
pre-load the at least some of the data from a first memory of the gateway to the gateway transfer memory in response to determining that the memory availability requirements are met for pre-loading the at least some of the data.
	However, Maiyuran teaches:
		a memory controller to:
check whether memory availability requirements are met for pre-loading the at least some of the data and the second set of data into the gateway transfer memory (col 5 line 66- col 6 line 5: the memory controller monitors/checks the available memory bandwidth, i.e. memory availability requirements, and schedules prefetches based on there being available memory bandwidth); and 
pre-load the at least some of the data from a first memory of the gateway to the gateway transfer memory in response to determining that the memory availability requirements are met for pre-loading the at least some of the data (col 5 line 66- col 6 line 5: the memory controller monitors/checks the available memory bandwidth, i.e. memory availability requirements, and schedules prefetches based on there being available memory bandwidth).
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the MMU of Gadre in view of Williams and Chen to schedule prefetches based on available memory bandwidth as taught by Maiyuran. One of ordinary skill in the art would have been motivated to make this modification to utilize bandwidth prefetch information without significant impact on overall system performance (Maiyuran col 5 line 66- col 6 line 5).

Claim 13 is/are rejected under 35 U.S.C. 103 as being unpatentable over Gadre et al. US 2012/0198214 (hereinafter, Gadre) in view of Williams III US 2014/0025892 (hereinafter, Williams), "Efficient Data Supply for Hardware Accelerators with Prefetching and Access/Execute Decoupling" (hereinafter, Chen), and Kulkarni et al. US 8,356,138 (hereinafter, Kulkarni).
	Regarding claim 13, Gadre in view of Williams and Chen teaches: 
13.  A computer system as claimed in claim 1, 
	Gadre in view of Williams and Chen does not teach:
wherein the at least one processor of the gateway comprises a field programmable gate array.
	However, Kulkarni teaches:
wherein the at least one processor of the gateway comprises a field programmable gate array (col 1 lines 16-26: an FPGA may be used to provide a variety of memory controllers for various processor systems).
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the processor of the MMU of Gadre in view of Williams and Chen to be an FPGA as taught by Kulkarni. One of ordinary skill in the art would have been motivated to make this modification to enable compatibility of the MMU with a variety of processor systems (Kulkarni col 1 lines 16-26).

Claim 14 is/are rejected under 35 U.S.C. 103 as being unpatentable over Gadre et al. US 2012/0198214 (hereinafter, Gadre) in view of Williams III US 2014/0025892 (hereinafter, Williams), "Efficient Data Supply for Hardware Accelerators with Prefetching and Access/Execute Decoupling" (hereinafter, Chen), "Dynamic Access Ordering for Streamed Computations" (hereinafter, McKee), and Bowater et al. US 5,301,278 (hereinafter, Bowater).
	Regarding claim 14, Gadre in view of Williams and Chen teaches:
14.  A computer system as claimed in claim 1, 
	Gadre in view of Williams and Chen does not teach:
wherein the gateway comprises at least one instruction memory configured to store a second compiled code sequence expressing the at least one operation, wherein the first and second compiled code sequences are generated as a related set at compile time.
	However, McKee teaches:
wherein the first and second compiled code sequences are generated as a related set at compile time (section 3.1 paragraph 3: a compiler determines the sequence of references to be issued and buffered while the actual access issue is executed by the memory controller, i.e. a compiler generates a second compiled code sequence that is decoupled from the sequence of accesses generated by the processor, i.e. a first compiled code sequence generated as a related set).
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the MMU of Gadre in view of Williams and Chen to be decoupled from the processor and to generate a compiled code sequence for the memory accesses of the MMU in addition to the compiled code sequence generated for the processor as taught by McKee. One of ordinary skill in the art would have been motivated to make this modification to automate prefetching and relieve register pressure (McKee, section 3.1 paragraph 3).
	Further, Bowater teaches:
wherein the gateway comprises at least one instruction memory configured to store a second compiled code sequence expressing the at least one operation (col 6 lines 34-38: the memory controller includes a control store, i.e. an instruction memory, that contains microcode that describes the sequences the memory controller can perform, i.e. a second compiled code sequence expressing the operation).
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the MMU of Gadre in view of Williams, Chen, and McKee to include the control store of Bowater such that the MMU will store a compiled code sequence that expresses a read operation. One of ordinary skill in the art would have been motivated to make this modification to increase the flexibility of the memory controller interfacing with different memories (Bowater lines 52-57).

Claim 17 is/are rejected under 35 U.S.C. 103 as being unpatentable over Gadre et al. US 2012/0198214 (hereinafter, Gadre) in view of Williams III US 2014/0025892 (hereinafter, Williams), "Efficient Data Supply for Hardware Accelerators with Prefetching and Access/Execute Decoupling" .
Regarding claim 17, Gadre in view of Williams and Chen teaches:
17. A computer system as claimed in claim 16, 
	Gadre in view of Williams and Chen does not teach:
wherein the gateway is configured to: 
store a number of credits indicating availability of data for transfer to the computer subsystem at each of the pre-compiled data exchange synchronization point; and 
transmit the synchronization acknowledgment to the computer subsystem in response to determining that the number of credits comprises a non-zero number of credits.
	However, Gusat teaches: 
store a number of credits indicating availability of data for transfer ([0027] and [0030]-[0031]: a number of credits is stored indicating availability of data for transfer by a sender); and 
transmit in response to determining that the number of credits comprises a non-zero number of credits ([0027] and [0030]-[0031]: the data is transferred in credited mode in response to determining there is a non-zero number of credits).
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the MMU of Gadre in view of Williams and Chen to include a sender buffer that allows the transfer of data when there are available credits as taught by Gusat such that credits are used to indicate if data is available for transfer when a barrier completes, i.e. at each precompiled data exchange synchronization point, and the MMU will transmit the acknowledgement in response to the barrier completing and there being available data as indicated by a non-zero number of credits. One of ordinary skill in the art would have been motivated to make this modification to .

Claims 19, 21-22, 24-26, and 29-32 is/are rejected under 35 U.S.C. 103 as being unpatentable over Gadre et al. US 2012/0198214 (hereinafter, Gadre) in view of Goodwin et al. US 5,659,713 (hereinafter, Goodwin) and Williams III US 2014/0025892 (hereinafter, Williams).
	Regarding claim 19, Gadre teaches:
19.  A method performed by a system having a gateway connected to a work accelerator ([0034], [0037], and Fig. 2: a GPC 208 of parallel processing subsystem 112 acts as a work accelerator for CPU 102; [0051]: MMU 328 is a gateway connected to an SPM of a GPC), the gateway enabling transfer of data to the work accelerator from external storage in relation to data exchange synchronization points attained by the work accelerator ([0051], [0067], and [0107]-[0108]: the MMU enables transfer of data, i.e. loads, from L2/external storage in relation to barriers, i.e. pre-compiled data exchange synchronization points, attained by the accelerator; the MMU enables these loads by translating virtual addresses and by indicating MEMBAR completion which enables loads waiting on the MEMBAR to be performed), the data exchange synchronization points acting as a barrier between a compute phase and an exchange phase of the work accelerator ([0067]: the MEMBAR instruction acts as a barrier so that results of memory transactions issued before the MEMBAR, i.e. a compute phase, are sufficiently performed and are visible to memory transactions issued after the MEMBAR, i.e. an exchange phase), the method comprising: 
loading the data from a first memory in the gateway to a L1.5 memory ([0050], [0052], [0054]: the MMU 328 loads data from L2, i.e. a first memory of the gateway in the embodiment in which the MMU is part of the memory interface 214, to the L1.5, see also Figs 3A and 3B);
receiving a synchronization request at the gateway from the work accelerator after the first data exchange synchronization point is attained (Gadre [0106]-[0108]: the tracking unit of the SPM outputs a MEMBAR command to the MMU, i.e. a synchronization request is received at the gateway from the accelerator after attaining the pre-compiled data exchange synchronization point) and responding to the synchronization request with a synchronization acknowledgement (Gadre [0107]: the MMU/gateway responds with a MEMBAR ACK to the tracking unit in the SPM); and 
at the work accelerator, executing an instruction to pull the data from the L1.5 memory to the work accelerator after receiving the synchronization acknowledgement (Gadre [0108]: execution of threads that were waiting on the MEMBAR is resumed, which includes execution of load instructions, i.e. an instruction to pull data, of threads that were waiting on the MEMBAR, where the load instruction may pull data from L1.5).
	Gadre does not teach:
pre-loading the data from a first memory in the gateway to a gateway transfer memory in advance of a first one of the data exchange synchronization points being attained by the work accelerator; 
at the work accelerator, executing an instruction to pull the data from the gateway transfer memory to the work accelerator after receiving the synchronisation acknowledgement.
	However, Goodwin teaches:
a gateway transfer memory (Abstract and col 6 lines 15-16: a memory controller includes a stream buffer, i.e. a gateway transfer memory)
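	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the MMU of Gadre to include the stream buffer of Goodwin. One of ordinary skill in the art would have been motivated to make this modification to conserve memory interconnect bandwidth (Goodwin, col 2 lines 9-15).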
Further, in the analogous art of accessing memory with barriers, Williams teaches:
pre-loading the data from a first memory in the gateway to a gateway transfer memory in advance of a first data exchange synchronization point being attained by the work accelerator ([0030] and [0039]-[0040]: the prefetch requests preload requested data in advance of the barrier instruction)
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the MMU of Gadre in view of Goodwin to prefetch data into the stream buffer from L2 in response to a barrier attained by the system as taught by Williams. This combination would teach: 
pre-loading the data from a first memory in the gateway to a gateway transfer memory in advance of a first one of the data exchange synchronization points being attained by the work accelerator (data would be prefetched from L2 to the stream buffer of the MMU in advance of a MEMBAR attained by an SPM); 
receiving a synchronization request at the gateway from the work accelerator after the first data exchange synchronization point is attained and responding to the synchronization request with a synchronization acknowledgement (the MMU receives a MEMBAR command from an SPM after the MEMBAR instruction is attained and responds to it with a MEMBAR ACK); and 
at the work accelerator, executing an instruction to pull the data from the gateway transfer memory to the work accelerator after receiving the synchronization acknowledgement (the SPM will pull data from the stream buffer after receiving the ACK).


	Regarding claim 21, Gadre in view of Goodwin and Williams teaches: 
21.  A method as claimed in claim 19, wherein the data pulled from the gateway transfer memory belongs to a plurality of streams (Gadre [0133]: the stream of memory commands may be interleaved instructions for different thread groups, i.e. a plurality of streams).

	Regarding claim 22, Gadre in view of Goodwin and Williams teaches:
22.  A method as claimed in claim 19, wherein the gateway transfer memory comprises a plurality of buffers, the method further comprising: 
each of the buffers storing data belonging to an associated one of a plurality of streams (Goodwin Abstract and col 6 lines 15-16: the memory controller/gateway includes a stream buffer/gateway transfer memory that comprises four FIFO buffers, each stream in Gadre is associated with one of the buffers).

	Regarding claim 24, Gadre in view of Goodwin and Williams teaches:
24.  A method as claimed in claim 19, further comprising: 
pre-loading further data to the gateway transfer memory in response to an upcoming subsequent data exchange synchronization point (Williams [0030] and [0048]-[0050]: load instructions waiting on barriers are converted into prefetches to preload data to be pulled from lower level memory in advance of the barriers/pre-compiled data exchange synchronization points attained by the subsystem; in the case that another upcoming barrier is attained, the load instructions that come after each barrier will be converted into prefetches to preload further data into the buffers).

	Regarding claim 25, Gadre in view of Goodwin and Williams teaches:
25.  A method as claimed in claim 19, further comprising: 
pre-loading partial additional data (Goodwin col 3 lines 47-49 and col 9 lines 20-25: the memory controller/gateway prefetches/preloads the stream buffer/gateway transfer memory with partial additional data);
receiving a read request from the work accelerator (Goodwin col 3 lines 47-49 and col 9 lines 20-25: a read request from a continuing stream may be received and empty the buffer, causing additional data to be loaded into the buffer); and 
in response to the read request, loading remaining additional data into the gateway transfer memory (Goodwin col 3 lines 47-49 and col 9 lines 20-25: a read request from a continuing stream may be received and empty the buffer, causing additional data to be loaded into the buffer).

	Regarding claim 26, Gadre in view of Goodwin and Williams teaches:
26.  A method as claimed in claim 25, wherein the read request comprises an item selected from the list consisting of: 
a memory address (Gadre [0051]-[0052], [0064] and Williams [0049]: the read request caused by the load instruction includes an address of L2/first memory to load the data when it is located in L2); and 
a number of bytes to be pulled from the gateway transfer memory.


	Regarding claim 29, Gadre teaches:
29.  A plurality of non-transitory machine-readable media having stored thereon instructions for performing a method for enabling data transfer from a gateway to a work accelerator ([0034], [0037], and Fig. 2: a GPC 208 of parallel processing subsystem 112 acts as a work accelerator for CPU 102; [0051]: MMU 328 is a gateway connected to an SPM of a GPC) in relation to data exchange synchronization points that act as a barrier between a compute phase and an exchange phase of the work accelerator ([0067]: the MEMBAR instruction acts as a barrier so that results of memory transactions issued before the MEMBAR, i.e. a compute phase, are sufficiently performed and are visible to memory transactions issued after the MEMBAR, i.e. an exchange phase), the machine-readable media comprising machine executable code which when executed by at least one machine, causes the machine to: 
load the data from a first memory of the gateway to L1.5 memory ([0050], [0052], [0054]: the MMU 328 loads data from L2, i.e. a first memory of the gateway in the embodiment the MMU is part of the interface 214, to the L1.5, see also Figs 3A and 3B);
receive a synchronization request from the work accelerator after the first data exchange synchronization point is attained (Gadre [0106]-[0108]: the tracking unit of the SPM outputs a MEMBAR command to the MMU, i.e. a synchronization request is received at the gateway from the accelerator after attaining the pre-compiled data exchange synchronization point) and generate a synchronization acknowledgement (Gadre [0107]: the MMU/gateway generates/responds with a MEMBAR ACK to the tracking unit in the SPM); and 
pull the data, by the work accelerator, from L1.5 memory after the synchronization acknowledgement (Gadre [0108]: execution of threads that were waiting on the MEMBAR is resumed, which includes execution of load instructions, i.e. an instruction to pull data, of threads that were waiting on the MEMBAR, where the load instruction may pull data from L1.5, i.e. gateway transfer memory, see also [0050] and [0052]).
	Gadre does not teach:
pre-load the data from a first memory of the gateway to a second memory of the gateway in advance of a first one of the data exchange synchronization points attained by the work accelerator; 
pull the data, by the work accelerator, from the second memory after the synchronisation acknowledgement.
However, Goodwin teaches:
a second memory (Abstract and col 6 lines 15-16: a memory controller includes a stream buffer, i.e. a second memory)
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the MMU of Gadre to include the stream buffer of Goodwin. One of ordinary skill in the art would have been motivated to make this modification to conserve memory interconnect bandwidth (Goodwin, col 2 lines 9-15).
Further, in the analogous art of accessing memory with barriers, Williams teaches:
pre-load the data from a first memory of the gateway to a second memory of the gateway in advance of a first one of the data exchange synchronization points attained by the work accelerator ([0030] and [0039]-[0040]: the prefetch requests preloads requested data in advance of the barrier instruction)
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the MMU of Gadre in view of Goodwin to prefetch data into the stream buffer from L2 in response to a barrier attained by the system as taught by Williams.

	Regarding claim 30, Gadre in view of Goodwin and Williams teaches:
30.  The non-transitory machine-readable media of claim 29, further comprising machine executable code, which causes the machine to: 
pre-load further data to the second memory in response to an upcoming subsequent data exchange synchronization point (Williams [0030] and [0048]-[0050]: load instructions waiting on barriers are converted into prefetches to preload data to be pulled from lower level memory in advance of the barriers/pre-compiled data exchange synchronization points attained by the subsystem; in the case that another upcoming barrier is attained, the load instructions that come after each barrier will be converted into prefetches to preload further data into the buffers).

	Regarding claim 31, Gadre in view of Goodwin and Williams teaches:
31.  The non-transitory machine-readable media of claim 29, further comprising machine executable code, which causes the machine to: 
pre-load partial additional data (Goodwin col 3 lines 47-49 and col 9 lines 20-25: the memory controller/gateway prefetches/preloads the stream buffer/second memory with partial additional data);
receiving a read request from the work accelerator (Goodwin col 3 lines 47-49 and col 9 lines 20-25: a read request from a continuing stream may be received and empty the buffer, causing additional data to be loaded into the buffer); and 
in response to the read request, loading remaining additional data into the second memory (Goodwin col 3 lines 47-49 and col 9 lines 20-25: a read request from a continuing stream may be received and empty the buffer, causing additional data to be loaded into the buffer).

	Regarding claim 32, Gadre in view of Goodwin and Williams teaches:
32.  The non-transitory machine-readable media of claim 31, wherein the read request comprises an item selected from the list consisting of: 
a memory address (Gadre [0051]-[0052], [0064] and Williams [0049]: the read request caused by the load instruction includes an address of L2/first memory to load the data when it is located in L2); and 
a number of bytes to be pulled from the second memory.

Claim 20 is rejected under 35 U.S.C. 103 as being unpatentable over Gadre et al. US 2012/0198214 (hereinafter, Gadre) in view of Goodwin et al. US 5,659,713 (hereinafter, Goodwin), Williams III US 2014/0025892 (hereinafter, Williams), and "Rack-Scale In-Memory Join Processing using RDMA" (hereinafter, Barthels).
	Regarding claim 20, Gadre in view of Goodwin and Williams teaches:
20.  A method as claimed in claim 19, wherein executing an instruction to pull comprises: 
	Gadre in view of Goodwin and Williams does not teach:
pulling the data via remote direct memory access (RDMA).
	However, Barthels teaches:
pulling the data via remote direct memory access (RDMA) (section 1 paragraph 3: RDMA is a lightweight communication mechanism to transfer/pull data).
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the SPMs of Gadre in view of Goodwin and Williams to pull data using RDMA as taught by Barthels. One of ordinary skill in the art would have been motivated to make this modification to reduce costs of large data transfers (Barthels section 1 paragraph 3).

Claim 23 is rejected under 35 U.S.C. 103 as being unpatentable over Gadre et al. US 2012/0198214 (hereinafter, Gadre) in view of Goodwin et al. US 5,659,713 (hereinafter, Goodwin), Williams III US 2014/0025892 (hereinafter, Williams), and "Impulse: Building a Smarter Memory Controller" (hereinafter, Carter).
	Regarding claim 23, Gadre in view of Goodwin and Williams teaches:
23.  A method as claimed in claim 22, 
	Gadre in view of Goodwin and Williams does not teach:
wherein each of the buffers comprises a virtual data buffer, wherein at least one of the virtual data buffers stores data in a physically discontiguous space in the gateway transfer memory.
However, Carter teaches:
a virtual data buffer that stores data in a physically discontiguous space in the gateway transfer memory (section 1 paragraph 3, section 2.1 paragraph 2, section 3.2 paragraph 3: shadow memory, i.e. a virtual data buffer, store data in a physically discontiguous space in physical memory)
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the buffers of Gadre in view of Goodwin and Williams to be virtual buffers that store data in a discontiguous space as taught by Carter. One of ordinary skill in the art .

Claims 27 and 33 are rejected under 35 U.S.C. 103 as being unpatentable over Gadre et al. US 2012/0198214 (hereinafter, Gadre) in view of Goodwin et al. US 5,659,713 (hereinafter, Goodwin), Williams III US 2014/0025892 (hereinafter, Williams), and Maiyuran et al. US 6,898,674 (hereinafter, Maiyuran).
	Regarding claim 27, Gadre in view of Goodwin and Williams teaches:
27.  A method as claimed in claim 19, wherein pre-loading the data comprises: 
	Gadre in view of Goodwin and Williams does not teach:
pre-loading the data to the gateway transfer memory in response to determining that memory availability requirements are met for pre-loading the data.
	However, Maiyuran teaches:
pre-loading the data in response to determining that memory availability requirements are met for pre-loading the data (col 5 line 66- col 6 line 5: the memory controller monitors/checks the available memory bandwidth, i.e. memory availability requirements, and schedules prefetches based on there being available memory bandwidth).
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the MMU of Gadre in view of Goodwin and Williams to schedule prefetches based on available memory bandwidth as taught by Maiyuran. One of ordinary skill in the art would have been motivated to make this modification to utilize bandwidth prefetch information without significant impact on overall system performance (Maiyuran col 5 line 66- col 6 line 5).

Regarding claim 33, Gadre in view of Goodwin and Williams teaches:
33.  The non-transitory machine-readable media of claim 29, 
	Gadre in view of Goodwin and Williams does not teach:
wherein pre-loading the data comprises: 
pre-loading the data to the second memory in response to determining that memory availability requirements are met for pre-loading the data.
	However, Maiyuran teaches:
pre-loading the data in response to determining that memory availability requirements are met for pre-loading the data (col 5 line 66- col 6 line 5: the memory controller monitors/checks the available memory bandwidth, i.e. memory availability requirements, and schedules prefetches based on there being available memory bandwidth).
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the MMU of Gadre in view of Goodwin and Williams to schedule prefetches based on available memory bandwidth as taught by Maiyuran. One of ordinary skill in the art would have been motivated to make this modification to utilize bandwidth prefetch information without significant impact on overall system performance (Maiyuran col 5 line 66- col 6 line 5).

Claims 28 and 34 are rejected under 35 U.S.C. 103 as being unpatentable over Gadre et al. US 2012/0198214 (hereinafter, Gadre) in view of Goodwin et al. US 5,659,713 (hereinafter, Goodwin), Williams III US 2014/0025892 (hereinafter, Williams), and Gusat et al. US 2007/0274215 (hereinafter, Gusat).
	Regarding claim 28, Gadre in view of Goodwin and Williams teaches:
28.  A method as claimed in claim 19, further comprising: 
	Gadre in view of Goodwin and Williams does not teach:
storing N credits indicating availability of data transfer to the work accelerator at each of the data exchange synchronization point; and 
wherein transferring the data to the work accelerator is performed in response to determining that N comprises a non-zero number of credits.
	However, Gusat teaches:
store N credits indicating availability of data for transfer ([0027] and [0030]-[0031]: a number of credits is stored indicating availability of data for transfer by a sender); and 
transferring the data in response to determining that the N credits comprise a non-zero number of credits ([0027] and [0030]-[0031]: the data is transferred in credited mode in response to determining there is a non-zero number of credits).
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the MMU of Gadre in view of Goodwin and Williams to include a sender buffer that allows the transfer of data when there are available credits as taught by Gusat such that credits are used to indicate if data is available for transfer when a barrier completes, i.e. at each precompiled data exchange synchronization point, and the MMU will transmit the acknowledgement in response to the barrier completing and there being available data as indicated by a non-zero number of credits. One of ordinary skill in the art would have been motivated to make this modification to efficiently transfer data to the SPM (Gusat [0065]).
	
Regarding claim 34, Gadre in view of Goodwin and Williams teaches:
34.  The non-transitory machine-readable media of claim 29, further comprising machine executable code, which causes the machine to: 
Gadre in view of Goodwin and Williams does not teach:
store N credits indicating availability of data transfer to the work accelerator at each data exchange synchronization point; and 
wherein allowing the work accelerator to pull the data is performed in response to determining that N comprises a non-zero number of credits.
	However, Gusat teaches:
store N credits indicating availability of data for transfer ([0027] and [0030]-[0031]: a number of credits is stored indicating availability of data for transfer by a sender); and 
pulling the data in response to determining that the N credits comprise a non-zero number of credits ([0027] and [0030]-[0031]: the data is transferred in credited mode in response to determining there is a non-zero number of credits).
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the MMU of Gadre in view of Goodwin and Williams to include a sender buffer that allows the transfer of data when there are available credits as taught by Gusat such that credits are used to indicate if data is available for transfer when a barrier completes, i.e. at each precompiled data exchange synchronization point, and the MMU will transmit the acknowledgement in response to the barrier completing and there being available data as indicated by a non-zero number of credits. One of ordinary skill in the art would have been motivated to make this modification to efficiently transfer data to the SPM (Gusat [0065]).

Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
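A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.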
Any inquiry concerning this communication or earlier communications from the examiner should be directed to KASIM ALLI whose telephone number is (571)270-1476.  The examiner can normally be reached on Monday - Friday, 9am - 5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Aimee Li, can be reached on (571)272-4169.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/K.A./Examiner, Art Unit 2183
/William B Partridge/Primary Examiner, Art Unit 2183