DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

	This office action is in response to applicant’s amendment filed on 12/07/2020.
	Claims 1, 3, 5-8, 14-16, 18-27 are pending and examined.
	Claims 2, 4, 9-13 and 17 are cancelled.
	Claims 21-27 are newly added.
	The amendments filed on 12/07/2020 to the specification and drawings are accepted.
	
Response to Arguments
Applicant’s arguments filed on 12/07/2020 have been fully considered but they are moot in light of new grounds of rejection with a new reference (NPL) applied.
The examiner is available for a phone interview with applicant.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 3, 5, 8, 16, 18-19, 21-24 and 27 are rejected under 35 U.S.C. 103 as being unpatentable over Nickolls et al. (US PGPUB 2011/0087860) hereinafter Nickolls, in view of Nickolls et al. (Scalable Parallel Programming with CUDA: Is CUDA the parallel programming model that application developers have been waiting for? ACM Queue Magazine, March 2008) hereinafter NPL, in view of Dostert et al. (US PGPUB 2006/0143608) hereinafter Dostert.

Per claim 1, Nickolls discloses “a general-purpose graphics processor comprising: a set of processing elements to execute one or more thread groups of a second kernel to be executed by the general-purpose graphics processor; an on-chip memory coupled to the set of processing elements” (Figs. 1 and 12; paragraphs [0044][0045][0066]; a GPU with multiple cores for concurrent execution of programs; paragraphs [0009]-[0012][0014][0165]; multiple threads concurrently execute the same program (kernel) using a plurality of cores, on an input data set to produce an output data set; input data set is loaded onto a shared memory).
Nickolls does not explicitly teach “a scheduler coupled with the set of processing elements, the scheduler to schedule the second thread groups of the kernel to the set of processing elements, wherein the scheduler is to schedule a second thread group of the second kernel to execute subsequent to a first thread group of a first kernel and, in response to a determination that the second kernel is dependent upon the first kernel and that the first thread group and the second thread group have a same number of threads, the second thread group is configured to access a region of the on-chip memory that contains data written by the first thread group”. However, Nickolls discloses “a determination that the second kernel is dependent upon the first kernel and that the first thread group and the second thread group have a same number of threads, the second thread group is configured to access a region of the on-chip memory that contains data written by the first thread group” (paragraphs [0009]-[0012][0129]; each thread has a unique identifier (thread ID) assigned at thread launch time that controls various aspects of the thread's processing behavior; identify one or more other threads with which a given thread is to share an intermediate result (dependency relationship); each thread group has a single thread; thread executions are synchronized, one thread produces an intermediate result that will be consumed by another thread; the producing thread writes the intermediate result to the designated location in shared memory, then the consuming thread reads it).
NPL further suggests “a scheduler coupled with the set of processing elements, the scheduler to schedule the second thread groups of the kernel to the set of processing elements, wherein the scheduler is to schedule a second thread group of the second kernel to execute subsequent to a first thread group of a first kernel and, in response to a determination that the second kernel is dependent upon the first kernel and that the first thread group and the second thread group have a same number of threads” (page 46, “an application may execute multiple grids either independently or dependently. Independent grids may execute concurrently given sufficient hardware resources. Dependent grids execute sequentially, with an implicit inter-kernel barrier between them, thus guaranteeing that all blocks of the first grid will complete before any block of the second dependent grid is launched”; i.e. two thread groups execute sequentially, the second thread group is dependent on the first thread group; page 45, left column; the dimension of each thread group is indicated by the dimGrid variable, the default being 1). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Nickolls and NPL to synchronize thread execution by scheduling execution of thread A after execution of thread B (which thread A is dependent on), to ensure the correct result is produced (because execution of thread A is dependent on results from execution of thread B).
Nickolls does not explicitly teach “wherein the scheduler is to clear at least a portion of the on-chip memory before execution of a third thread group of a third kernel in response to a determination that the third thread group has a different number of threads than the first thread group and the second thread group”. However, NPL discloses (page 46, right column, Fig. 2; shared memory between different thread groups/grids, using function cudaMalloc() to allocate memory, cudaFree() to clear memory; page 47, left column, Fig. 3; executions of thread groups/grids with dynamically defined number of threads; also an example of sequentially executed thread groups, in which the first thread group contains 3 X 2 thread blocks, the second thread group contains 1 X 4 thread blocks, and a thread block also contains a different number of threads). Dostert further suggests (paragraphs [0028][0029]; thread manager may clean shared memory after completion of a thread, so the cleaned memory may be used by the next thread). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Nickolls, NPL and Dostert to execute thread groups with different sizes (to meet different needs of an application) and to clear a portion of the shared memory used by previous thread groups for execution of a third thread group (to reuse resources and increase resource utilization).
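For illustration only, the clear-or-reuse decision recited in the quoted claim language can be sketched as follows. This is a minimal sketch under stated assumptions, not an implementation taken from Nickolls, NPL, Dostert, or the application; the names ThreadGroup, must_clear, and run are hypothetical, and the on-chip memory is modeled as a plain dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThreadGroup:
    # Hypothetical record for a thread group; not taken from any cited reference.
    kernel: str
    num_threads: int
    depends_on: tuple = ()  # names of kernels whose output this kernel consumes


def must_clear(prev, nxt):
    """Return True when the modeled on-chip memory is cleared before nxt runs.

    Memory is kept only when nxt's kernel depends on prev's kernel AND both
    thread groups have the same number of threads.
    """
    dependent = prev.kernel in nxt.depends_on
    same_size = prev.num_threads == nxt.num_threads
    return not (dependent and same_size)


def run(groups):
    """Process thread groups in order and log the memory decision for each."""
    on_chip = {}  # stand-in for on-chip memory (assumption)
    log = []
    prev = None
    for group in groups:
        if prev is not None:
            if must_clear(prev, group):
                on_chip.clear()
                log.append((group.kernel, "cleared"))
            else:
                log.append((group.kernel, "reused"))
        on_chip[group.kernel] = group.num_threads  # the group "writes" its data
        prev = group
    return log


groups = [
    ThreadGroup("k1", 64),
    ThreadGroup("k2", 64, depends_on=("k1",)),  # dependent, same size
    ThreadGroup("k3", 32, depends_on=("k2",)),  # different number of threads
]
print(run(groups))
```

In this sketch the second group reuses the data written by the first, while the third group triggers a clear because its thread count differs.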

Per claim 3, Nickolls does not explicitly teach “wherein the scheduler is to clear at least a portion of the on-chip memory before execution of a third thread group of the third thread group in response to a determination that the third kernel is not dependent upon the first kernel or the second kernel”. However, Nickolls suggests (paragraph [0128]; upon completion of a CTA, core interface can initiate execution of a next CTA, reusing the resources that became free when the first CTA was completed; i.e. reusing freed up shared memory for execution of next thread group). Dostert further suggests (paragraphs [0028][0029]; thread manager may clean shared memory after completion of a thread, so the cleaned memory may be used by the next thread; i.e. next thread can use the cleaned shared memory as it does not depend on data produced by the previous threads). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Nickolls, NPL and Dostert to clear the shared memory for execution of third thread group if the third thread group is not dependent on the first thread group or the second thread group, so the shared memory can be freed up and reused (increase resource utilization).

Per claim 5, Nickolls further suggests “the scheduler is to bypass a clear of the region of the on-chip memory that contains data written by the first thread group in response to the determination that the second kernel is dependent upon the first kernel” (paragraphs [0009]-[0012][0129]; identify one or more other threads with which a given thread is to share an intermediate result; thread executions are synchronized, one thread produces an intermediate result that will be consumed by another thread; the producing thread writes the intermediate result to the designated location in shared memory, then the consuming thread reads it; it would be obvious to not clear the region of shared memory, if the consuming thread needs the result stored in the shared memory and is dependent on the producing thread).

Per claim 8, Nickolls further suggests “wherein the on-chip memory includes an implicitly managed cache memory and an explicitly managed shared memory” (paragraphs [0046][0128]; graphics memory includes a command buffer which is not explicitly managed; shared memory may be freed up for use by the next CTA program, or may store values to be used by the next CTA program).

Per claim 16, Nickolls discloses “a circuit board comprising: a host interconnect; a general-purpose graphics processor coupled to the host interconnect, the general-purpose graphics processor including a set of processing elements to execute one or more thread groups of a second kernel to be executed by the general-purpose graphics processor, an on-chip memory coupled to the set of processing elements, a memory coupled to the host interconnect and the general-purpose graphics processor” (Figs. 1 and 12; paragraphs [0044][0045][0066]; a GPU with multiple cores for concurrent execution of programs; the GPU with access interface to graphics memory; paragraphs [0009]-[0012][0014][0165]; multiple threads concurrently execute the same program (kernel) using a plurality of cores, on an input data set to produce an output data set; input data set is loaded onto a shared memory).
Nickolls does not explicitly teach “a scheduler coupled with the set of processing elements, the scheduler to schedule the second thread groups of the kernel to the set of processing elements, wherein the scheduler is to schedule a second thread group of the second kernel to execute subsequent to a first thread group of a first kernel and, in response to a determination that the second kernel is dependent upon the first kernel and that the first thread group and the second thread group have a same number of threads, the second thread group is configured to access a region of the on-chip memory that contains data written by the first thread group”. However, Nickolls discloses “a determination that the second kernel is dependent upon the first kernel and that the first thread group and the second thread group have a same number of threads, the second thread group is configured to access a region of the on-chip memory that contains data written by the first thread group” (paragraphs [0009]-[0012][0129]; each thread has a unique identifier (thread ID) assigned at thread launch time that controls various aspects of the thread's processing behavior; identify one or more other threads with which a given thread is to share an intermediate result (dependency relationship); each thread group has a single thread; thread executions are synchronized, one thread produces an intermediate result that will be consumed by another thread; the producing thread writes the intermediate result to the designated location in shared memory, then the consuming thread reads it). 
NPL further suggests “a scheduler coupled with the set of processing elements, the scheduler to schedule the second thread groups of the kernel to the set of processing elements, wherein the scheduler is to schedule a second thread group of the second kernel to execute subsequent to a first thread group of a first kernel and, in response to a determination that the second kernel is dependent upon the first kernel and that the first thread group and the second thread group have a same number of threads” (page 46, “an application may execute multiple grids either independently or dependently. Independent grids may execute concurrently given sufficient hardware resources. Dependent grids execute sequentially, with an implicit inter-kernel barrier between them, thus guaranteeing that all blocks of the first grid will complete before any block of the second dependent grid is launched”; i.e. two thread groups execute sequentially, the second thread group is dependent on the first thread group; page 45, left column; the dimension of each thread group is indicated by the dimGrid variable, the default being 1). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Nickolls and NPL to synchronize thread execution by scheduling execution of thread A after execution of thread B (which thread A is dependent on), to ensure the correct result is produced (because execution of thread A is dependent on results from execution of thread B).
Nickolls does not explicitly teach “wherein the scheduler is to clear at least a portion of the on-chip memory before execution of a third thread group of a third kernel in response to a determination that the third thread group has a different number of threads than the first thread group and the second thread group”. However, NPL discloses (page 46, right column, Fig. 2; shared memory between different thread groups/grids, using function cudaMalloc() to allocate memory, cudaFree() to clear memory; page 47, left column, Fig. 3; executions of thread groups/grids with dynamically defined number of threads; also an example of sequentially executed thread groups, in which the first thread group contains 3 X 2 thread blocks, the second thread group contains 1 X 4 thread blocks, and a thread block also contains a different number of threads). Dostert further suggests (paragraphs [0028][0029]; thread manager may clean shared memory after completion of a thread, so the cleaned memory may be used by the next thread). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Nickolls, NPL and Dostert to execute thread groups with different sizes (to meet different needs of an application) and to clear a portion of the shared memory used by previous thread groups for execution of a third thread group (to reuse resources and increase resource utilization).

Per claim 18, Nickolls does not explicitly teach “wherein the scheduler is to clear at least a portion of the on-chip memory before execution of a third thread group of a third kernel in response to a determination that the third kernel is not dependent upon the first kernel or the second kernel”. However, Nickolls suggests (paragraph [0128]; upon completion of a CTA, core interface can initiate execution of a next CTA, reusing the resources that became free when the first CTA was completed; i.e. reusing freed up shared memory for execution of next thread group). Dostert further suggests (paragraphs [0028][0029]; thread manager may clean shared memory after completion of a thread, so the cleaned memory may be used by the next thread; i.e. next thread can use the cleaned shared memory as it does not depend on data produced by the previous threads). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Nickolls, NPL and Dostert to clean the shared memory for execution of third thread group if the third thread group is not dependent on the first thread group or the second thread group, so the shared memory can be freed up and reused (increase resource utilization).

Per claim 19, Nickolls further suggests “the scheduler is to bypass a clear of the region of the on-chip memory that contains data written by the first thread group in response to the determination that the second kernel is dependent upon the first kernel” (paragraphs [0009]-[0012][0129]; identify one or more other threads with which a given thread is to share an intermediate result; thread executions are synchronized, one thread produces an intermediate result that will be consumed by another thread; the producing thread writes the intermediate result to the designated location in shared memory, then the consuming thread reads it; it would be obvious to not clear the region of shared memory, if the consuming thread needs the result stored in the shared memory and is dependent on the producing thread).

Per claim 21, Nickolls further suggests “wherein the on-chip memory includes an implicitly managed cache memory and an explicitly managed shared memory” (paragraphs [0046][0128]; graphics memory includes a command buffer which is not explicitly managed; shared memory may be freed up for use by the next CTA program, or may store values to be used by the next CTA program).

Claims 22-24 and 27 are rejected under similar rationales as claims 1, 3, 5, and 8.

Claims 6-7, 20 and 25-26 are rejected under 35 U.S.C. 103 as being unpatentable over Nickolls, in view of NPL, in view of Dostert, in view of Puri (US PGPUB 2007/0047802).
Per claim 6, Nickolls does not explicitly teach “wherein the first kernel is to compute output of a first layer of a neural network and write output data to the on-chip memory”. However, Puri suggests the above (claims 1, 8-9; utilizing a GPU, compute outputs for different layers of a neural network using different programs; the first layer is connected to a second layer). Nickolls further discloses “wherein the first kernel is to compute output and write output data to the on-chip memory” (paragraphs [0009]-[0012][0129]; identify one or more other threads with which a given thread is to share an intermediate result; thread executions are synchronized, one thread produces an intermediate result that will be consumed by another thread; the producing thread writes the intermediate result to the designated location in shared memory, then the consuming thread reads it to produce an output). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Nickolls, NPL, Dostert and Puri to utilize a GPU to compute outputs for different layers of a neural network using different programs (threads), to increase the utilization and versatility of a GPU.

Per claim 7, Puri further suggests “wherein the second kernel is to read the output data from the on-chip memory and compute output of a second layer of a neural network, the first layer of the neural network connected to the second layer of the neural network” (claims 1, 8-9; utilizing a GPU, compute outputs for different layers of a neural network using different programs; the first layer is connected to a second layer).

Per claim 20, Nickolls discloses “wherein the first kernel is to compute output and write output data to the on-chip memory, and wherein the second kernel is to read the output data from the on-chip memory and compute output” (paragraphs [0009]-[0012][0129]; identify one or more other threads with which a given thread is to share an intermediate result; thread executions are synchronized, one thread produces an intermediate result that will be consumed by another thread; the producing thread writes the intermediate result to the designated location in shared memory, then the consuming thread reads it to produce an output). But Nickolls does not explicitly teach “wherein the first kernel is to compute output of a first layer of a neural network and wherein the second kernel compute output of a second layer of a neural network, the first layer of the neural network connected to the second layer of the neural network”. However, Puri suggests the above (claims 1, 8-9; utilizing a GPU, compute outputs for different layers of a neural network using different programs; the first layer is connected to a second layer). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Nickolls, NPL, Dostert and Puri to utilize a GPU to compute outputs for different layers of a neural network using different programs (threads), to increase the utilization and versatility of a GPU.

Claims 25-26 are rejected under similar rationales as claims 6-7.

Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Goel et al. (US PGPUB 2016/0055667) hereinafter Goel, in view of Nickolls, in view of NPL and in view of Dostert.
Per claim 14, Goel discloses “a non-transitory machine-readable medium storing instructions to cause one or more processors to perform operations comprising:” (paragraph [0091]; memory to store instructions for execution); “loading shader program code for compilation” (paragraph [0150]; a compiler generating a shader program for execution on a GPU).
Goel does not explicitly teach “detecting that the shader program calls a first set of multiple interdependent kernels using a same grid size; and marking the interdependent kernels as executable without clearing shared local memory between execution of kernels in the first set of the multiple interdependent kernels”. However, Nickolls suggests the above (paragraph [0044]; executing shader programs on GPU; paragraphs [0009]-[0012][0128][0129]; multiple threads concurrently execute the same program (kernel) using a plurality of cores, on the same input data set to produce an output data set; each thread has a unique identifier (thread ID) assigned/marked at thread launch time that controls various aspects of the thread's processing behavior; identify one or more other threads with which a given thread is to share an intermediate result (detect dependency); thread executions are synchronized, one thread produces an intermediate result that will be consumed by another thread; the producing thread writes the intermediate result to the designated location in shared memory, then the consuming thread reads it; on completion of the CTA program, the output data produced by the threads are advantageously placed in memory for use by a subsequent CTA program; it would be obvious not to clear the shared memory because another thread needs the data stored in the shared memory). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to 
Nickolls does not explicitly teach “detecting that the shader program calls a second set of multiple interdependent kernels using differing grid sizes; and configuring the shared local memory to be cleared between execution of kernels in the second set of multiple interdependent kernels”. However, NPL further suggests (page 46, “an application may execute multiple grids either independently or dependently. Independent grids may execute concurrently given sufficient hardware resources. Dependent grids execute sequentially, with an implicit inter-kernel barrier between them, thus guaranteeing that all blocks of the first grid will complete before any block of the second dependent grid is launched”; i.e. thread groups/grids execute independently; page 45, left column; the dimension of each thread group is indicated by the dimGrid variable; page 46, right column, Fig. 2; shared memory between different thread groups/grids, using function cudaMalloc() to allocate memory, cudaFree() to clear memory; page 47, left column, Fig. 3; executions of thread groups/grids with dynamically defined number of threads; also an example of sequentially executed thread groups, in which the first thread group contains 3 X 2 thread blocks, the second thread group contains 1 X 4 thread blocks, and a thread block also contains a different number of threads). Dostert further suggests (paragraphs [0028][0029]; thread manager may clean shared memory after completion of a thread, so the cleaned memory may be used by the next thread). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Goel, Nickolls, NPL and Dostert to execute thread groups with different grid sizes (to meet different needs of an application) and to clear a portion of the shared memory used by previous thread groups for execution of a third thread group (to reuse resources and increase resource utilization).
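For illustration only, the two detecting steps quoted from claim 14 can be sketched as a compile-time pass that labels each set of interdependent kernels by whether its grid sizes match. This is a minimal sketch under stated assumptions, not an implementation drawn from Goel, Nickolls, NPL, Dostert, or the application; the function name mark_kernel_sets, the labels "no_clear" and "clear_between", and the representation of a grid size as a tuple are all hypothetical.

```python
def mark_kernel_sets(kernel_sets):
    """Label each set of interdependent kernels by its grid sizes.

    kernel_sets maps a set name to the list of grid sizes used by the
    kernels in that set (each grid size is a tuple such as (3, 2)).
    A set whose kernels all use one grid size is marked "no_clear";
    a set with differing grid sizes is marked "clear_between".
    Both labels are hypothetical names chosen for this sketch.
    """
    marks = {}
    for name, grid_sizes in kernel_sets.items():
        same_grid = len(set(grid_sizes)) == 1
        marks[name] = "no_clear" if same_grid else "clear_between"
    return marks


# Hypothetical input: grid sizes borrowed from the 3 X 2 and 1 X 4 example above.
shader_calls = {
    "first_set": [(3, 2), (3, 2), (3, 2)],
    "second_set": [(3, 2), (1, 4)],
}
print(mark_kernel_sets(shader_calls))
```

The sketch treats the grid-size comparison as the only criterion; dependency detection is assumed to have already grouped the kernels into sets.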

Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Goel, Nickolls, NPL, Dostert, in view of Puri.
Per claim 15, Goel and Nickolls do not explicitly teach “wherein the first set of multiple interdependent kernels are to compute output of multiple successive layers of a neural network”. However, Puri suggests the above (claims 1, 8-9; utilizing a GPU, compute outputs for different layers of a neural network using different programs). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Goel, Nickolls, NPL, Dostert and Puri to utilize a GPU to compute outputs for different layers of a neural network using different programs (threads), to increase the utilization and versatility of a GPU.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to HANG PAN whose telephone number is (571)270-7667.  The examiner can normally be reached on 9 AM to 5 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Chat Do can be reached on 571-272-3721.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-






/HANG PAN/Primary Examiner, Art Unit 2193