DETAILED ACTION
The present application is being examined under the pre-AIA  first to invent provisions. 
Claims 1-21 are presented for examination.
This action is in response to the Amendment/Remarks on 1/21/22.  Applicant’s arguments have been fully considered but were not found to be persuasive.

Double Patenting
Claim 1 is rejected on the ground of nonstatutory double patenting as being unpatentable over claim 1 of U.S. Patent No. 9,513,975 B2 in view of Baliga et al. (hereinafter Baliga) (US 2013/0238938 A1). 
Although the claims at issue are not identical, they are not patentably distinct from each other because all limitations of instant claim 1 are contained in claim 1 of US 9,513,975 B2 except for being in a different statutory category of invention (see table below).  However, it would have been obvious to one of ordinary skill in the art before the invention was made to have a computer-readable storage medium that could store instructions to implement the computer-implemented method for execution.
INSTANT APPLICATION (claim 1)
1. A computer-readable storage medium having stored thereon one or more application programming interfaces (APIs), which if performed by one or more processors, cause the one or more processors to at least:
          execute a parent thread within a first multiprocessor; 
          launch a grid of child threads within a second multiprocessor; and 
          in response to a synchronization function call, block execution of the parent thread while waiting for the child thread to complete.

US 9,513,975 B2 (claim 1)
1. A computer-implemented method for executing a child thread grid that is associated with a parent thread within a parallel processor, the method comprising: 
          the parent thread executes within a first streaming multiprocessor within the parallel processor; 
          launching the child thread grid within a second streaming multiprocessor within the parallel processor independently of a central processing unit coupled to the parallel processor by performing a memory barrier operation to flush all pending write data from the parent thread to memory in order to ensure memory consistency between the parent thread and the child thread grid; 
          receiving a thread synchronization barrier request from the parent thread, wherein the parent thread is configured to block a first programming instruction of the parent thread corresponding to the thread synchronization barrier request from executing; 
          receiving a notification that the child thread grid has completed executing; and 
          causing the parent thread to resume executing.

Furthermore, US 9,513,975 B2 does not teach to launch a kernel to execute its parent thread.  However, Baliga teaches issuing a Kernel launch for its workload processing to occur ([0075]-[0076]) and launching child threads/tasks in a grid or cooperative thread array (CTA) in a multiple general processing cluster execution environment, thereby providing nested parallelism ([0030]; [0037]-[0040]; [0045]; [0051]-[0058]).  It would have been obvious to one of ordinary skill in the art to modify US 9,513,975 B2 such that it would launch a kernel to execute its parent thread and launch a grid of child threads, as taught and suggested in Baliga.  The suggestion/motivation for doing so would have been to provide the predicted result of being able to have a workload created as well as keeping a group of threads organized together as a grid/CTA based on cooperative behavior (Baliga - [0075]-[0076]; [0051]-[0052]; [0056]).
As to dependent claims 2-10, they are also rejected as being obvious in view of the prior art rejections below.

Claim 11 is rejected on the ground of nonstatutory double patenting as being unpatentable over claim 1 of U.S. Patent No. 9,513,975 B2 in view of Schuster (US 2013/0125133 A1), and further in view of Aingaran et al. (hereinafter Aingaran) (US 2006/0136915 A1), and further in view of Baliga.
Although the claims at issue are not identical, they are not patentably distinct from each other (see table above) because claim 11 of U.S. Patent No. 9,513,975 B2 does not explicitly contain a plurality of cores, a register file, an L1 cache, a crossbar unit, an instruction cache, and a scheduler.  Schuster teaches a processor, comprising: a plurality of cores (multi-core) ([0013]; [0040]; [0118]); an L1 cache ([0122]); an instruction cache ([0122]); and a scheduler ([0003]), and Aingaran teaches a multiprocessor that includes items such as a plurality of cores 36a-h, a crossbar 34, L1 cache 42, L1 instruction cache 43, scheduler 216, register files 210, etc. (Figs. 3 and 8).  It would have been obvious to one of ordinary skill in the art before the invention was made to modify instant claim 1 such that it would include a plurality of cores, a register file, an L1 cache, a crossbar unit, an instruction cache, and a scheduler, as taught in Schuster and Aingaran.  The suggestion/motivation for doing so would have been to provide the predicted result of having the computer architectural structure needed for scheduling multiple threads for execution. 
Furthermore, Schuster does not teach to launch a kernel to execute its parent thread.  However, Baliga teaches issuing a Kernel launch for its workload processing to occur ([0075]-[0076]) and launching child threads/tasks in a grid or cooperative thread array (CTA) in a multiple general processing cluster execution environment, thereby providing nested parallelism (Figs. 2 and 3C; [0030]; [0037]-[0040]; [0045]; [0051]-[0058]).  It would have been obvious to one of ordinary skill in the art to modify US 9,513,975 B2 such that it would launch a kernel to execute its parent thread and launch a grid of child threads, as taught and suggested in Baliga.  The suggestion/motivation for doing so would have been to provide the predicted result of being able to have a workload created as well as keeping a group of threads organized together as a grid/CTA based on cooperative behavior (Baliga - [0075]-[0076]; [0051]-[0052]; [0056]).
As to dependent claims 12-21, they are also rejected as being obvious in view of the prior art rejections below.

Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

The following is a quotation of the first paragraph of pre-AIA  35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.

Claims 1 and 21 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention. The limitations of having stored one or more application programming interfaces (APIs) to be performed by one or more processors (independent claim 1) and wherein the one or more APIs are to a thread-oriented programming environment to program a parallel processing subsystem comprising at least one of the first multiprocessor the second multiprocessor (dependent claim 21) are not disclosed or described in the specification.

Claim Rejections - 35 USC § 103
The following is a quotation of pre-AIA  35 U.S.C. 103(a) which forms the basis for all obviousness rejections set forth in this Office action:
(a) A patent may not be obtained though the invention is not identically disclosed or described as set forth in section 102, if the differences between the subject matter sought to be patented and the prior art are such that the subject matter as a whole would have been obvious at the time the invention was made to a person having ordinary skill in the art to which said subject matter pertains. Patentability shall not be negatived by the manner in which the invention was made.

Claims 1-10 and 21 are rejected under pre-AIA  35 U.S.C. 103(a) as being unpatentable over Schuster in view of Baliga.

As to claim 1, Schuster teaches a computer-readable storage medium having stored thereon one or more application programming interfaces (APIs), which if performed by one or more processors, cause the one or more processors to at least (one or more GPUs 540, which can be multi-core or multi-threaded processors, may implement one or more application programmer interfaces (APIs) that permit programmers to invoke the functionality of the GPU) ([0013]; [0023]; [0119]): 
execute a parent thread within a first multiprocessor (executing a given/parent thread, wherein each number of threads, including the given/parent thread, may execute concurrently with the same number of processors or cores in a multi-core processor) (Abstract; Figs. 1, 3, and 5, items 105 and 305; [0040]; [0123]); 
launch a child thread within a second multiprocessor (nested parallelism by parent/given thread spawning one or more children with concurrent/parallel execution of one or more CPUs and/or GPUs, wherein a CPU or GPU processor that is a multi-core or multi-threaded processor can be the second multiprocessor) (Abstract; [0013]; [0023]; [0119]; Figs. 1, 3, and 5, items 105 and 305; [0040]; [0123]); and 
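in response to a synchronization function call (sync function call), block execution of the parent thread while waiting for the child thread to complete (in response to a sync function call, the parent thread is suspended and waits until all of its child threads are completed before continuing/resuming execution) ([0031]; [0095]).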
It is noted that the claimed one or more application programming interfaces (APIs) is in the preamble and not given patentable weight.  In addition, although Schuster does not literally disclose to block execution of the parent thread, one of ordinary skill in the art before the invention was made would know that Schuster’s teaching of the parent thread being suspended and to wait until all of its child threads are completed before resuming execution serves the same function as blocking.  It would be obvious to include this feature of blocking execution of the parent thread because it would provide the predicted result of having fully strict thread-level parallel programs with load balancing between concurrently executing threads that efficiently distribute work among themselves.
Furthermore, Schuster does not teach to launch a kernel to execute its parent thread and to launch a grid of child threads.  However, Baliga teaches issuing a Kernel launch for its workload processing to occur ([0075]-[0076]) and launching child threads/tasks in a grid or cooperative thread array (CTA) in a multiple general processing cluster execution environment, thereby providing nested parallelism involving its parallel processing subsystem 112 and/or GPC 208 (General Processing Clusters) that includes a number M of streaming processors (SMs 310) (Figs 2 and 3C; [0022]-[0023]; [0030]; [0034]; [0037]-[0040]; [0042]-[0045]; [0056]).
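Schuster and Baliga are analogous art with the claimed invention because they are all in the same field of endeavor of thread processing.  It would have been obvious to one of ordinary skill in the art to modify Schuster’s thread processing such that it would launch a kernel to execute its parent thread and launch a grid of child threads, as taught and suggested in Baliga.  The suggestion/motivation for doing so would have been to provide the predicted result of being able to have a workload created as well as keeping a group of threads organized together as a grid/CTA based on cooperative behavior (Baliga - [0075]-[0076]; [0051]-[0052]; [0056]).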
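
For illustration only, and not as a characterization of the code of any reference of record, the launch-then-synchronize behavior discussed with respect to claim 1 may be sketched in Python as follows. The names run_parent, child, and grid_size are hypothetical, and ordinary host threads stand in for threads executing on multiprocessors; the sketch assumes only that a synchronization call blocks the parent until every launched child has completed.

```python
import threading


def run_parent(grid_size):
    """Hypothetical parent-thread body: launch a grid of children, then block at sync."""
    results = [None] * grid_size
    log = []

    def child(index):
        # Each child writes only to its own slot, so no lock is needed here.
        results[index] = index * index

    # "Launch": start one child thread per element of the grid.
    children = [threading.Thread(target=child, args=(i,)) for i in range(grid_size)]
    for t in children:
        t.start()
    log.append("parent: children launched")

    # "Synchronization function call": the parent blocks here until
    # every child thread has completed.
    for t in children:
        t.join()
    log.append("parent: resumed after sync")

    return results, log


if __name__ == "__main__":
    results, log = run_parent(4)
    print(results)
    print(log)
```

In this sketch the parent cannot reach the statement following the join loop until all children have finished, which is the blocking/suspending behavior at issue.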

As to claim 2, Schuster teaches wherein the one or more processors comprise a graphics processing unit (GPU) (Computer System 500 includes a plurality of GPU(s) 540 and CPU(s) 530) (Fig. 5).

As to claim 3, Schuster teaches wherein the instructions, if performed by the one or more processors, cause the one or more processors to resume execution of the parent thread after completion of execution of the child thread (in response to a sync function call, the parent thread is blocked and waits until all of its child threads are completed before continuing/resuming execution) ([0031]).

As to claim 4, Schuster teaches wherein the instructions, if performed by the one or more processors, cause the one or more processors to store execution state of the parent thread in response to the synchronization function call ([0034]; [0031]).

As to claim 5, Schuster teaches wherein the instructions that cause the one or more processors to block execution of the parent thread, if performed by the one or more processors, cause the one or more processors to ensure memory coherence between the parent thread and the child thread ([0062]).

As to claim 6, Schuster teaches wherein the instructions, if performed by the one or more processors, cause the one or more processors to resume execution of the parent thread in response to notification that the child thread has completed execution (in response to a sync function call, the parent thread is blocked and waits until notified that all of its child threads are completed before continuing/resuming execution) ([0031]).

As to claim 7, Schuster teaches wherein: the one or more processors comprise a graphics processing unit (GPU) (GPU(s) 540); and the instructions, if performed by the one or more processors, cause the one or more processors to: store execution state of the parent thread in response to the synchronization function call ([0034]; [0031]); receive a notification that execution of the child thread completed (in response to a sync function call, the parent thread is blocked and waits until notified that all of its child threads are completed before continuing/resuming execution) ([0031]); and resume execution of the parent thread in response to notification that the child thread has completed execution (in response to a sync function call, the parent thread is blocked and waits until notified that all of its child threads are completed before continuing/resuming execution) ([0031]).
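
For illustration only, the three steps recited in claim 7 (storing execution state at the synchronization call, receiving a completion notification, and resuming the parent) may be sketched in Python as follows. SyncPoint, child_done, parent, and the contents of the state dictionary are hypothetical names chosen for this sketch and are not taken from Schuster or from the instant specification; a condition variable stands in for whatever notification mechanism an actual implementation would use.

```python
import threading


class SyncPoint:
    """Hypothetical helper: counts outstanding children and wakes the waiting parent."""

    def __init__(self, child_count):
        self._pending = child_count
        self._cond = threading.Condition()

    def child_done(self):
        # Called by each child on completion; this acts as the notification.
        with self._cond:
            self._pending -= 1
            if self._pending == 0:
                self._cond.notify_all()

    def wait(self):
        # Called by the parent; blocks until every child has reported completion.
        with self._cond:
            self._cond.wait_for(lambda: self._pending == 0)


def parent(child_count):
    trace = []
    state = {"next_step": "after_sync", "partial_sum": 10}
    sync = SyncPoint(child_count)
    outputs = [0] * child_count

    def child(i):
        outputs[i] = i + 1
        sync.child_done()

    threads = [threading.Thread(target=child, args=(i,)) for i in range(child_count)]
    for t in threads:
        t.start()

    saved = dict(state)      # store execution state at the synchronization call
    trace.append("state saved")
    sync.wait()              # parent is blocked until notified of completion
    trace.append("notified")
    state = saved            # restore state and resume at the following instruction
    total = state["partial_sum"] + sum(outputs)
    trace.append(state["next_step"])

    for t in threads:
        t.join()
    return total, trace
```

Calling parent(3) in this sketch returns a total of 16 together with the trace of the three steps in order.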

As to claim 8, Schuster teaches wherein the one or more processors comprise a graphics processing unit (GPU) and wherein the GPU comprises the first multiprocessor and second multiprocessor (Fig. 5; [0013]; [0040]; [0118]).

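As to claim 9, Schuster teaches wherein the parent thread comprises an instruction following the synchronization function call and wherein the instructions of the computer-readable storage medium, if performed by the one or more processors, cause the one or more processors to continue execution at the instruction following the synchronization function call (in response to a sync function call, the parent thread is blocked and waits until notified that all of its child threads are completed before continuing/resuming execution) ([0031]).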

As to claim 10, Schuster teaches wherein the first multiprocessor and second multiprocessor are in the same parallel processing unit (PPU) (Fig. 5; [0013]; [0040]; [0118]).

As to claim 21, Schuster teaches wherein the one or more APIs are to a thread-oriented programming environment to program a parallel processing subsystem comprising at least one of the first multiprocessor the second multiprocessor (one or more GPUs 540, which can be multi-core or multi-threaded processors, may implement one or more application programmer interfaces (APIs) that permit programmers to invoke the functionality of the GPU) ([0013]; [0023]; [0119]).

Claims 11-20 are rejected under pre-AIA  35 U.S.C. 103(a) as being unpatentable over Schuster in view of Aingaran, and further in view of Baliga.

As to claim 11, Schuster teaches a processor, comprising: 
a plurality of cores (multi-core) ([0013]; [0040]; [0118]); 
an L1 cache ([0122]); 
an instruction cache ([0122]); 
a scheduler ([0003]); and 

execute a parent thread within a first multiprocessor (executing a given/parent thread, wherein each number of threads, including the given/parent thread, may execute concurrently with the same number of processors or cores in a multi-core processor) (Abstract; Figs. 1, 3, and 5, items 105 and 305; [0040]; [0123]); 
launch a child thread within a second multiprocessor (nested parallelism by parent/given thread spawning one or more children with concurrent/parallel execution of one or more CPUs and/or GPUs, wherein a CPU or GPU processor that is a multi-core or multi-threaded processor can be the second multiprocessor) (Abstract; [0013]; [0023]; [0119]; Figs. 1, 3, and 5, items 105 and 305; [0040]; [0123]); and 
in response to a synchronization function call (sync function call), block execution of the parent thread while waiting for the child thread to complete (in response to a sync function call, the parent thread is suspended and waits until all of its child threads are completed before continuing/resuming execution) ([0031]; [0095]).
Although Schuster does not literally disclose to block execution of the parent thread, one of ordinary skill in the art before the invention was made would know that Schuster’s teaching of the parent thread being suspended and to wait until all of its child threads are completed before resuming execution serves the same function as blocking.  It would be obvious to include this feature of blocking execution of the parent thread because it would provide the predicted result of having fully strict thread-level parallel programs with load balancing between concurrently executing threads that efficiently distribute work among themselves.
Schuster does not explicitly teach its processor to have a register file and a crossbar unit.  However, Aingaran teaches a multiprocessor that includes items such as a plurality of cores 36a-h, a crossbar 34, L1 cache 42, L1 instruction cache 43, scheduler 216, register files 210, etc. (Figs. 3 and 8).  Schuster and Aingaran are analogous art with the claimed invention because they are all in the same field of endeavor of thread processing.  It would have been obvious to one of ordinary skill in the art before the invention was made to modify Schuster’s processor such that it would include a register file, crossbar unit, etc., as taught in Aingaran.  The suggestion/motivation for doing so would have been to provide the predicted result of having the computer architectural structure needed for scheduling multiple threads for execution.
Furthermore, Schuster does not teach to launch a kernel to execute its parent thread and to launch a grid of child threads.  However, Baliga teaches issuing a Kernel launch for its workload processing to occur ([0075]-[0076]) and launching child threads/tasks in a grid or cooperative thread array (CTA) in a multiple general processing cluster execution environment, thereby providing nested parallelism involving its parallel processing subsystem 112 and/or GPC 208 (General Processing Clusters) that includes a number M of streaming processors (SMs 310) (Figs 2 and 3C; [0022]-[0023]; [0030]; [0034]; [0037]-[0040]; [0042]-[0045]; [0056]).
Schuster, Aingaran, and Baliga are analogous art with the claimed invention because they are all in the same field of endeavor of thread processing.  It would have been obvious to one of ordinary skill in the art to modify Schuster in view of Aingaran’s thread processing such that it would launch a kernel to execute its parent thread and launch a grid of child threads, as taught and suggested in Baliga.  The suggestion/motivation for doing so would have been to provide the predicted result of being able to have a workload created as well as keeping a group of threads organized together as a grid/CTA based on cooperative behavior (Baliga - [0075]-[0076]; [0051]-[0052]; [0056]).

As to claim 12, Schuster teaches wherein the processor comprises a graphics processing unit (GPU) to execute the instructions (Computer System 500 includes a plurality of GPU(s) 540 and CPU(s) 530) (Fig. 5).

As to claim 13, Schuster teaches wherein the instructions, if performed by the processor, cause the processor to resume execution of the parent thread after completion of execution of the child thread (in response to a sync function call, the parent thread is blocked and waits until all of its child threads are completed before continuing/resuming execution) ([0031]).

As to claim 14, Schuster teaches wherein the instructions, if performed by the processor, cause the processor to store execution state of the parent thread in response to the synchronization function call ([0034]; [0031]).

As to claim 15, Schuster teaches wherein the instructions, if executed by the processor, cause the processor to ensure memory coherence between the parent thread and the child thread ([0062]).

As to claim 16, Schuster teaches wherein the instructions, if executed by the processor, cause the processor to resume execution of the parent thread in response to notification that the child thread has completed execution (in response to a sync function call, the parent thread is blocked and waits until notified that all of its child threads are completed before continuing/resuming execution) ([0031]).

As to claim 17, Schuster teaches wherein: the processor comprises a graphics processing unit (GPU) (GPU(s) 540); and the instructions, if performed by the processor, cause the processor to: store execution state of the parent thread in response to the synchronization function call ([0034]; [0031]); receive a notification that execution of the child thread completed (in response to a sync function call, the parent thread is blocked and waits until notified that all of its child threads are completed before continuing/resuming execution) ([0031]); and resume execution of the parent thread in response to notification that the child thread has completed execution (in response to a sync function call, the parent thread is blocked and waits until notified that all of its child threads are completed before continuing/resuming execution) ([0031]).

As to claim 18, Schuster teaches wherein the processor comprises a graphics processing unit (GPU) and wherein the GPU comprises the first multiprocessor and second multiprocessor (Fig. 5; [0013]; [0040]; [0118]).

As to claim 19, Schuster teaches wherein the parent thread comprises an instruction following the synchronization function call and wherein the instructions, if performed by the processor, cause the processor to continue execution at the instruction following the synchronization function call (in response to a sync function call, the parent thread is blocked and waits until notified that all of its child threads are completed before continuing/resuming execution) ([0031]).

As to claim 20, Schuster teaches wherein the first multiprocessor and second multiprocessor are in the same parallel processing unit (PPU) (Fig. 5; [0013]; [0040]; [0118]).

Response to Arguments
As to independent claim 1, Applicant argues that the prior art references fail to teach or suggest "one or more application programming interfaces (APIs)" that "launch a kernel to execute a parent thread within a first multiprocessor."

In response, the limitation of “one or more application programming interfaces (APIs)” is in the preamble and not given patentable weight.  Nonetheless, Schuster discloses that one or more GPUs 540, which can be multi-core or multi-threaded processors, may implement one or more application programmer interfaces (APIs) that permit programmers to invoke the functionality of the GPU ([0013]; [0023]; [0119]).

As to independent claims 1 and 11, Applicant argues that the prior art references do not teach the feature to "launch a grid of child threads comprising a child thread within a second multiprocessor" (emphasis by Examiner) as recited in claims 1 and 11.
In response, both Schuster and Baliga teach this claimed limitation:
Schuster teaches to launch a child thread within a second multiprocessor through its nested parallelism by its parent/given thread spawning one or more children with concurrent/parallel execution of one or more CPUs and/or GPUs, wherein a CPU or GPU processor that is a multi-core or multi-threaded processor can be the second multiprocessor.
Baliga teaches issuing a Kernel launch for its workload processing to occur ([0075]-[0076]) and launching child threads/tasks in a grid or cooperative thread array (CTA) in a multiple general processing cluster execution environment, thereby providing nested parallelism involving its parallel processing subsystem 112 and/or GPC 208 (General Processing Clusters) that includes a number M of streaming processors (SMs 310) (Figs. 2 and 3C; [0022]-[0023]; [0030]; [0034]; [0037]-[0040]; [0042]-[0045]; [0056]).


Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
Demetriou (US 20120042303)
Vorbach (US 20120137075)
deCorral (US 20130080073)
Any inquiry concerning this communication or earlier communications from the examiner should be directed to KENNETH TANG whose telephone number is (571)272-3772. The examiner can normally be reached Monday-Friday 7AM-3PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Lewis Bullock can be reached on 571-272-3759. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: 





/KENNETH TANG/Primary Examiner, Art Unit 2199