DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Objections
Claims 6 and 7 are objected to because of the following informalities:
Claim 6: The phrase “wherein scheduling, based on the scheduling parameters in the metadata structure, the compute task for execution:” should be corrected to the phrase “wherein scheduling, based on the scheduling parameters in the metadata structure, the compute task for execution comprises:”; and
Claim 7: The phrase “wherein scheduling, based on the scheduling parameters in the metadata structure, the compute task for execution:” should be corrected to the phrase “wherein scheduling, based on the scheduling parameters in the metadata structure, the compute task for execution comprises:”. Appropriate correction is required.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under pre-AIA 35 U.S.C. 103(a) are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1, 3-4, and 8-11 are rejected under 35 U.S.C. 103 as being unpatentable over Duluk et al. (US 7697007) in view of Nordquist (US 7533236).
As per claim 1, Duluk teaches the invention substantially as claimed including a method for encapsulating and scheduling compute tasks in a streaming multiprocessor, the method comprising: 
	allocating memory for storing a metadata structure for a compute task (Column 1, Lines 41-43, a pushbuffer can be used as a mechanism to queue the launching of Cooperative Thread Arrays (CTAs). With the pushbuffer mechanism, many CTAs can be queued at once);
	storing initialization parameters in the metadata structure that configure the streaming multiprocessor to execute the compute task (Column 4, Lines 33-35, State information, as used herein, includes any information (other than input data) relevant to defining a CTA or grid of CTAs; Column 4, Lines 46-53, State information may also include size information for the number of threads per CTA (e.g., 256 threads per CTA) and the size of the CTA grid (a "grid" of CTAs typically includes multiple CTAs of same dimension that all execute the same program, often for an input data set that is too big for a single CTA to handle efficiently). The size of the CTA grid specifies how many CTAs are in the grid. In some embodiments, the total number (T) of threads is also provided; and Column 5, Lines 6-9, "Loading" a thread includes supplying, via front end 210 from pushbuffer 150 into GPU 122, state information, input data, and any other parameters required to execute the program); 
	storing scheduling parameters in the metadata structure that control the scheduling of the compute task (Column 4, Lines 33-35, State information, as used herein, includes any information (other than input data) relevant to defining a CTA or grid of CTAs; and Column 4, Lines 46-53, State information may also include size information for the number of threads per CTA (e.g., 256 threads per CTA) and the size of the CTA grid (a "grid" of CTAs typically includes multiple CTAs of same dimension that all execute the same program, often for an input data set that is too big for a single CTA to handle efficiently). The size of the CTA grid specifies how many CTAs are in the grid. In some embodiments, the total number (T) of threads is also provided); 
	storing execution parameters in the metadata structure that control execution of the compute task by the streaming multiprocessor (Column 5, Lines 6-15, "Loading" a thread includes supplying, via front end 210 from pushbuffer 150 into GPU 122, state information, input data, and any other parameters required to execute the program. For example, in the case of CTA processing, front end 210 loads the starting PC value for the CTA program into a slot in a program counter (PC) array (not shown) that is not currently in use. Depending on state information from pushbuffer 150, front end 210 can allocate space in one or more register files (not shown) for each processing engine of GPU 122 to execute one CTA thread); 
	storing a first pointer (Column 4, Lines 35-39, state information includes… a starting program counter (e.g., memory address) for a program to be executed by each thread) [and a counter overflow flag corresponding to the first pointer in the metadata structure], wherein the first pointer points to a queue that stores data associated with the compute task (Column 4, Lines 35-39, state information includes… a starting program counter (e.g., memory address) for a program to be executed by each thread) [and the counter overflow flag indicates when the first pointer is to wrap to a beginning of the queue]; and 
	scheduling, based on the scheduling parameters in the metadata structure, the compute task for execution in the streaming multiprocessor, wherein the compute task executes in the streaming multiprocessor based on the execution parameters in the metadata structure (Column 4, Lines 29-32, depending on state information loaded into pushbuffer 150, front end 210 of GPU 122 loads and launches threads, CTAs, or grids of CTAs until all threads have been launched).

	Duluk fails to specifically teach, storing … a counter overflow flag corresponding to the first pointer in the metadata structure, … and the counter overflow flag indicates when the first pointer is to wrap to a beginning of the queue.
	However, Nordquist teaches, storing … a counter overflow flag corresponding to the first pointer in the metadata structure (Column 11, Lines 58-62, When the head pointer is increased and overflows, the head pointer value wraps rather than saturates. The value of the head pointer stored in pointer registers 610 then corresponds to the next available allocation unit 410), … and the counter overflow flag indicates when the first pointer is to wrap to a beginning of the queue (Column 11, Lines 58-62, When the head pointer is increased and overflows, the head pointer value wraps rather than saturates. The value of the head pointer stored in pointer registers 610 then corresponds to the next available allocation unit 410).
	Duluk and Nordquist are analogous because they are each related to controlling work distribution among tasks. Duluk teaches a method for scheduling work to be completed by a set of CTAs based on parameters regarding how the work should be distributed (Column 6, Lines 47-50, Launching a thread or CTA includes supplying…state information, input data, and any other parameters required to execute the program; and Column 4, Lines 63-66, depending on state information loaded into pushbuffer 150, front end 210 of GPU 122 loads and launches threads, CTAs, or grids of CTAs until all threads have been launched in multithreaded core array 202). Nordquist teaches a method of dynamically allocating memory for parallel threads using head and tail pointers (Abstract, Systems and methods for dynamically allocating memory for thread processing may reduce memory requirements while maintaining thread processing parallelism; Column 11, Lines 7-9, Pointer registers 610 store a head pointer and a tail pointer for a local memory pool that are used to configure the local memory pool as a ring buffer; and Column 11, Lines 58-62, When the head pointer is increased and overflows, the head pointer value wraps rather than saturates. The value of the head pointer stored in pointer registers 610 then corresponds to the next available allocation unit 410). It would have been obvious to one having ordinary skill in the art, before the effective filing date of the claimed invention, to modify Duluk to include Nordquist’s memory allocation and ring buffer mechanism in order to effectively execute tasks. Therefore, it would have been obvious to one of ordinary skill in the art to combine the teachings of Duluk and Nordquist in order to control the processing of tasks.

As per claim 3, Duluk teaches, further comprising: 
	accessing, via the first pointer, the data associated with the compute task (Column 5, Lines 9-19, in the case of CTA processing, front end 210 loads the starting PC value for the CTA program into a slot in a program counter (PC) array (not shown) that is not currently in use. Depending on state information from pushbuffer 150, front end 210 can allocate space in one or more register files (not shown) for each processing engine of GPU 122 to execute one CTA thread, and load the input data into shared memory. Once the input data for threads have been loaded, front end 210 launches the group by signaling an instruction unit in GPU 122 to begin fetching and issuing instructions); and
	executing the compute task based on the data associated with the compute task (Column 5, Lines 9-19, in the case of CTA processing, front end 210 loads the starting PC value for the CTA program into a slot in a program counter (PC) array (not shown) that is not currently in use. Depending on state information from pushbuffer 150, front end 210 can allocate space in one or more register files (not shown) for each processing engine of GPU 122 to execute one CTA thread, and load the input data into shared memory. Once the input data for threads have been loaded, front end 210 launches the group by signaling an instruction unit in GPU 122 to begin fetching and issuing instructions).

	Duluk fails to specifically teach, determining that the counter overflow flag is not set; and incrementing the first pointer to point to data in the queue other than the data associated with the compute task.
	However, Nordquist teaches, determining that the counter overflow flag is not set (Column 11, Lines 50-56, An allocation is available when the tail pointer and head pointer stored in pointer registers 610 are not equal and the ring buffer is not full. If, in step 654 memory allocation controller 600 determines that an allocation unit 410 is available, then in step 656 memory allocation controller 600 writes the head pointer into an available entry in a thread table 330); and 
	incrementing the first pointer to point to data in the queue other than the data associated with the compute task (Column 9, Lines 7-9, Memory offset counter 510 is initialized to zero and is incremented by the value in allocation size register 530 whenever an allocation unit 410 is allocated to a thread).
	The same motivation used in the rejection of claim 1 is applicable to the instant claim.

As per claim 4, Duluk teaches, further comprising: 
	accessing, via the first pointer, the data associated with the compute task (Column 5, Lines 9-19, in the case of CTA processing, front end 210 loads the starting PC value for the CTA program into a slot in a program counter (PC) array (not shown) that is not currently in use. Depending on state information from pushbuffer 150, front end 210 can allocate space in one or more register files (not shown) for each processing engine of GPU 122 to execute one CTA thread, and load the input data into shared memory. Once the input data for threads have been loaded, front end 210 launches the group by signaling an instruction unit in GPU 122 to begin fetching and issuing instructions); 
	executing the compute task based on the data associated with the compute task (Column 5, Lines 9-19, in the case of CTA processing, front end 210 loads the starting PC value for the CTA program into a slot in a program counter (PC) array (not shown) that is not currently in use. Depending on state information from pushbuffer 150, front end 210 can allocate space in one or more register files (not shown) for each processing engine of GPU 122 to execute one CTA thread, and load the input data into shared memory. Once the input data for threads have been loaded, front end 210 launches the group by signaling an instruction unit in GPU 122 to begin fetching and issuing instructions).

	Duluk fails to specifically teach, determining that the counter overflow flag is set; and wrapping the first pointer to the beginning of the queue.
	However, Nordquist teaches, determining that the counter overflow flag is set (Column 11, Lines 56-60, In step 658 memory allocation controller 600 updates the head pointer, increasing it by the value of allocation size register 630. When the head pointer is increased and overflows, the head pointer value wraps rather than saturates); and 
	wrapping the first pointer to the beginning of the queue (Column 11, Lines 58-60, When the head pointer is increased and overflows, the head pointer value wraps rather than saturates).
	The same motivation used in the rejection of claim 1 is applicable to the instant claim.
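
For illustration only, the pointer behavior recited in claims 1, 3, and 4 can be sketched as follows. This is a minimal Python sketch under stated assumptions, not code from Duluk, Nordquist, or the present application: the names `QueuePointer`, `new_pointer`, `advance`, and `overflow_flag` are hypothetical, as is the rule that the flag is raised when the pointer reaches the last entry of the queue.

```python
from dataclasses import dataclass


@dataclass
class QueuePointer:
    """Hypothetical pointer into a fixed-size queue, paired with an overflow flag."""
    index: int = 0               # position of the current entry in the queue
    overflow_flag: bool = False  # True when the next advance must wrap to the start


def new_pointer(queue_len: int) -> QueuePointer:
    """Create a pointer at the first entry of a queue holding queue_len entries."""
    if queue_len < 1:
        raise ValueError("queue_len must be at least 1")
    # Assumption: a one-entry queue is already at its last entry, so the flag is set.
    return QueuePointer(0, queue_len == 1)


def advance(ptr: QueuePointer, queue_len: int) -> QueuePointer:
    """Move past the current entry, wrapping rather than saturating."""
    if ptr.overflow_flag:
        # Flag set: wrap to the beginning of the queue (the claim 4 scenario).
        next_index = 0
    else:
        # Flag not set: point to the next entry in the queue (the claim 3 scenario).
        next_index = ptr.index + 1
    # Assumption: the flag is raised once the pointer sits on the last entry.
    return QueuePointer(next_index, next_index == queue_len - 1)
```

In this sketch the flag is checked before every update, so the pointer never moves past the end of the queue; Nordquist's head pointer, by contrast, is described only as wrapping when it overflows, without an explicit flag.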

As per claim 8, Duluk teaches, wherein the execution parameters include a plurality of semaphore releases that includes at least one of a type of memory barrier (Column 6, Lines 52-57,  semaphore release mechanisms are used to determine whether a CTA has completed its processing, and whether to launch subsequent CTAs. In some embodiments, semaphore data structure 230 is written by controlling CTA 310 to enable/disable launching of predicated CTA 320 and/or other predicated CTAs; and Column 6, Line 64-Column 7, Line 5, although only two semaphore data structures 230 and 235 are depicted in FIG. 3 for the sake of clarity, there may be a large number of semaphore data structures in system memory 104 or other memories of system 100. In some embodiments with large numbers of semaphore data structures, a particular semaphore data structure (e.g., semaphore data structure 230) may be accessed by an offset within system memory 104 (e.g., a 40-bit address that indicates a starting address for semaphore data structure 230)), an address of a semaphore data structure in memory (Column 7, Lines 18-19, The sequence 410 of pushbuffer 150 commands include the following: a semaphore acquire command 410(1); and Column 8, Lines 17-21, For a semaphore acquire, GPU 122 reads the semaphore data structure (e.g., semaphore data structure 230) specified by an offset within system memory 104 (e.g., the 40-bit starting address of semaphore data structure 230)), or a size of the semaphore data structure.

As per claim 9, Duluk teaches, wherein the execution parameters include a starting address of the compute task to be executed (Column 7, Lines 35-39, state information includes …a starting program counter (e.g., memory address) for a program to be executed by each thread) or a type of memory barrier operation that is performed when execution of the compute task completes.

As per claim 10, Duluk teaches, wherein the execution parameters include a serial execution flag (Column 7, Lines 18-22, The sequence 410 of pushbuffer 150 commands include the following:…a launch enable command 410(2) that reads the result generated by the controlling process and either enables or disables subsequent CTA launches; and Column 7, Lines 27-33,  the controlling process (e.g., controlling CTA 310 of FIG. 3) is launched. At act 455, controlling CTA 310 executes and writes report 350 to semaphore data structure 230. At act 460, semaphore acquire command 410(1) executes to wait for the report 350 in semaphore data structure 230 written by controlling CTA 310) that indicates whether a first set of threads associated with the compute task is permitted to execute concurrently with a second set of threads associated with the compute task (Column 9, Lines 32-48, Because GPU 122 is highly parallel, with large numbers of queues, hardware at the bottom of the pipeline (e.g. back end 240 of FIG. 2) should have write queues empty before the semaphore release is written to memory. Therefore, to prevent race conditions, in some embodiments, a semaphore release may occur only after a result is written to memory. In some embodiments, hardware in GPU 122 is interlocked so that a semaphore release is performed only after queues are empty. In some embodiments, such synchronization can be done in software, since CTAs are cooperative. For example, once all CTAs are done processing, one thread can be designated to write a report to memory. Without a hardware interlock to ensure that a report will not be written to memory until all CTAs are finished processing, each thread that writes can perform a read that causes a flush out to memory. Once all threads have performed a read, then all of the threads must have finished processing, and then the designated reporting thread can write the report to memory).

As per claim 11, Duluk teaches, wherein the execution parameters include a throttle enable flag that controls whether a number of sets of threads executing concurrently is permitted based on memory limitations specified by parameters stored in the metadata structure (Column 4, Lines 33-40, State information, as used herein, includes any information (other than input data) relevant to defining a CTA or grid of CTAs. For example, in one embodiment, state information includes parameters that define the size of the CTA, the amount of register file space required for each thread…and selection of hardware resource allocation algorithms; and Column 5, Lines 2-3, one core may execute more than one CTA at a time depending on the resources required per CTA).

	Claims 5-7 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Duluk-Nordquist as applied to claim 1 and in further view of Lauer et al. (US 7765549).
As per claim 5, the combination of Duluk-Nordquist fails to specifically teach wherein the scheduling parameters include a launch parameter that indicates a minimum number of entries needed in the queue for the compute task to be launched, and wherein scheduling, based on the scheduling parameters in the metadata structure, the compute task for execution, comprises: determining that a first number of entries in the queue is greater than or equal to a value of the launch parameter; and scheduling the compute task for execution.
	However, Lauer teaches, wherein the scheduling parameters include a launch parameter that indicates a minimum number of entries needed in the queue for the compute task to be launched (Column 6, Lines 12-27, the first predetermined threshold establishes a minimum number of items to be included in a batch. The first predetermined threshold may be specified in the request (e.g., by automatically inserting a parameter defining the threshold in accordance with the type of processing to be performed) or may otherwise be predefined in the system that controls distribution of items. The first predetermined threshold may be equal to a requested batch size or may be some value less than a requested batch size. The first predetermined threshold can be a fixed value or can be determined in accordance with an algorithm such that the threshold itself is not preset, but the algorithm for determining the threshold is. In addition, different thresholds can be used depending on various parameters), and wherein scheduling, based on the scheduling parameters in the metadata structure, the compute task for execution, comprises: 
	determining that a first number of entries in the queue is greater than or equal to a value of the launch parameter (Column 6, Lines 12-27, the first predetermined threshold establishes a minimum number of items to be included in a batch. The first predetermined threshold may be specified in the request (e.g., by automatically inserting a parameter defining the threshold in accordance with the type of processing to be performed) or may otherwise be predefined in the system that controls distribution of items. The first predetermined threshold may be equal to a requested batch size or may be some value less than a requested batch size. The first predetermined threshold can be a fixed value or can be determined in accordance with an algorithm such that the threshold itself is not preset, but the algorithm for determining the threshold is. In addition, different thresholds can be used depending on various parameters); and 
	scheduling the compute task for execution (Column 6, Lines 39-41, If the number of available items exceeds the first predetermined threshold, a batch of items is assembled and sent in response to the request (step 215)).

	The combination of Duluk-Nordquist and Lauer are analogous because they are each related to controlling work distribution among tasks. Duluk teaches a method for scheduling work to be completed by a set of CTAs based on parameters regarding how the work should be distributed. Nordquist teaches a method of dynamically allocating memory for parallel threads. Lauer teaches a scheduling method that balances maintaining efficient batch sizes with avoiding unacceptable delays (Column 3, Lines 35-44, To decrease the possible efficiency problems associated with having batch sizes that are too small and to avoid unacceptable processing delays, another option is to use a time out process in which, if less than a minimal number of documents are available, a batch of documents is not sent unless the oldest of the available documents has been available for greater than a threshold amount of time. Thus, if there are not enough documents available, a batch of documents is only sent if a time out criterion is fulfilled). It would have been obvious to one having ordinary skill in the art, before the effective filing date of the claimed invention, to modify the combination of Duluk-Nordquist to include Lauer’s timeout threshold and techniques regarding maintaining batch sizes in order to launch CTAs when the amount of work in the processing queue is less than a specified parameter or threshold. Therefore, it would have been obvious to one of ordinary skill in the art to combine the teachings of the combination of Duluk-Nordquist and Lauer in order to control the processing of tasks.

As per claim 6, the combination of Duluk-Nordquist fails to specifically teach wherein the scheduling parameters include a coalesce waiting time parameter that indicates a length of time after which the compute task is to be launched even when the queue has fewer than the minimum number of entries, and wherein scheduling, based on the scheduling parameters in the metadata structure, the compute task for execution: determining that the length of time indicated by a value of the coalesce waiting time parameter has passed; and scheduling the compute task for execution.
	However, Lauer teaches, wherein the scheduling parameters include a coalesce waiting time parameter that indicates a length of time after which the compute task is to be launched even when the queue has fewer than the minimum number of entries (Column 6, Line 51-Column 7, Line 1, If the number of available items does not exceed the first predetermined threshold, a determination is made as to whether a time period associated with one or more of the available items exceeds a second predetermined threshold (step 220). The time period can be a period that an oldest of the available items has been available, as determined by the amount of time an item has been available for a particular processing step, the amount of time an item has been in the workflow, or using some other criteria. The time period can alternatively be an average amount of time for multiple different available items. In some implementations, different items may have different time period thresholds depending on differing level of priority among the various items. The determination made in step 220 is generally used to support a time out procedure that ensures that items do not age beyond some acceptable limit, as defined for a particular workflow, while favoring batches that are as close as possible to the first predetermined threshold), and wherein scheduling, based on the scheduling parameters in the metadata structure, the compute task for execution: 
	determining that the length of time indicated by a value of the coalesce waiting time parameter has passed (Column 7, Lines 4-9, If the time period does exceed the second predetermined threshold, a batch of items is assembled and sent in response to the request (step 225). Generally, the batch of items sent at step 225 will include all of the available items but will include fewer items than specified by the first predetermined threshold); and 
	scheduling the compute task for execution (Column 7, Lines 4-6, If the time period does exceed the second predetermined threshold, a batch of items is assembled and sent in response to the request (step 225)).
	The same motivation used in the rejection of claim 5 is applicable to the instant claim.

As per claim 7, the combination of Duluk-Nordquist fails to specifically teach wherein the scheduling parameters include a launch parameter that indicates a minimum number of entries needed in the queue for the compute task to be launched and a coalesce waiting time parameter that indicates a length of time after which the compute task is to be launched even when the queue has fewer than the minimum number of entries, and wherein scheduling, based on the scheduling parameters in the metadata structure, the compute task for execution: determining that a first number of entries in the queue is less than a value of the launch parameter; determining that the length of time indicated by a value of the coalesce waiting time parameter has not passed; and delaying scheduling of the compute task.
	However, Lauer teaches, wherein the scheduling parameters include a launch parameter that indicates a minimum number of entries needed in the queue for the compute task to be launched (Column 6, Lines 12-27, the first predetermined threshold establishes a minimum number of items to be included in a batch) and a coalesce waiting time parameter that indicates a length of time after which the compute task is to be launched even when the queue has fewer than the minimum number of entries (Column 6, Line 51-Column 7, Line 3, If the number of available items does not exceed the first predetermined threshold, a determination is made as to whether a time period associated with one or more of the available items exceeds a second predetermined threshold (step 220). The time period can be a period that an oldest of the available items has been available, as determined by the amount of time an item has been available for a particular processing step, the amount of time an item has been in the workflow, or using some other criteria. The time period can alternatively be an average amount of time for multiple different available items. In some implementations, different items may have different time period thresholds depending on differing level of priority among the various items. The determination made in step 220 is generally used to support a time out procedure that ensures that items do not age beyond some acceptable limit, as defined for a particular workflow, while favoring batches that are as close as possible to the first predetermined threshold, which typically sets a lower limit on the number of items considered to constitute an acceptable batch size), and wherein scheduling, based on the scheduling parameters in the metadata structure, the compute task for execution: 
	determining that a first number of entries in the queue is less than a value of the launch parameter (Column 5, Line 64-Column 6, Line 1,  A determination is made as to whether a number of available items exceeds a first predetermined threshold (step 210). The available items are items that comply with the parameters and that are ready for the particular type of processing to which the request relates); 
	determining that the length of time indicated by a value of the coalesce waiting time parameter has not passed (Column 6, Lines 51-55, If the number of available items does not exceed the first predetermined threshold, a determination is made as to whether a time period associated with one or more of the available items exceeds a second predetermined threshold (step 220); and Column 7, Lines 9-11,  If the number of available items does not exceed the first predetermined threshold and the time period does not exceed the second predetermined threshold, the request is rejected); and 
	delaying scheduling of the compute task (Column 7, Lines 9-11,  If the number of available items does not exceed the first predetermined threshold and the time period does not exceed the second predetermined threshold, the request is rejected).
	The same motivation used in the rejection of claim 5 is applicable to the instant claim.
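
For illustration only, the three scheduling outcomes recited in claims 5-7 can be sketched as a single decision function. This is a minimal Python sketch under stated assumptions, not code from Lauer or the present application: the names `SchedulingParams`, `min_entries`, `coalesce_wait`, and `should_launch` are hypothetical, and treating the waiting time as "passed" once the elapsed time is greater than or equal to the parameter is an assumption (Lauer describes the time period as exceeding its threshold).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SchedulingParams:
    """Hypothetical scheduling parameters held in a task metadata structure."""
    min_entries: int      # launch parameter: minimum queue entries needed to launch
    coalesce_wait: float  # coalesce waiting time, in seconds


def should_launch(params: SchedulingParams, queued_entries: int, waited: float) -> bool:
    """Return True to schedule the compute task now, False to delay scheduling."""
    if queued_entries >= params.min_entries:
        # Claim 5 scenario: enough entries are queued, so the task is scheduled.
        return True
    if waited >= params.coalesce_wait:
        # Claim 6 scenario: too few entries, but the waiting time has passed.
        return True
    # Claim 7 scenario: too few entries and the waiting time has not passed.
    return False
```

With `min_entries=4` and `coalesce_wait=0.5`, a queue holding two entries would be delayed after 0.1 seconds and launched after 0.5 seconds, while a queue holding four entries would be launched immediately.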

	Claims 2 and 12-20 are rejected under 35 U.S.C. 103 as being unpatentable over Duluk et al. (US 7697007) in view of Lauer et al. (US 7765549).
As per claim 2, Duluk teaches a method for encapsulating and scheduling compute tasks in a streaming multiprocessor, the method comprising: 
	allocating memory for storing a metadata structure for a compute task (Column 1, Lines 41-43, a pushbuffer can be used as a mechanism to queue the launching of Cooperative Thread Arrays (CTAs). With the pushbuffer mechanism, many CTAs can be queued at once); 
	storing initialization parameters in the metadata structure that configure the streaming multiprocessor to execute the compute task (Column 8,  Lines 51-61, The following is an example sequence of operations that use a semaphore data structure written by a controlling CTA to enable or disable a subsequent first set of CTAs: (A) Set LaunchEnableMode=True, to enable all CTA launches. (B) Initialize semaphore data structure 230 with the following data: payload=0x0FF; report_value=0x00FF;timestamp="don't care"); 
	storing execution parameters in the metadata structure that control execution of the compute task by the streaming multiprocessor (Column 5, Lines 6-15, "Loading" a thread includes supplying, via front end 210 from pushbuffer 150 into GPU 122, state information, input data, and any other parameters required to execute the program. For example, in the case of CTA processing, front end 210 loads the starting PC value for the CTA program into a slot in a program counter (PC) array (not shown) that is not currently in use. Depending on state information from pushbuffer 150, front end 210 can allocate space in one or more register files (not shown) for each processing engine of GPU 122 to execute one CTA thread); 
	storing a first pointer in the metadata structure, wherein the first pointer points to a queue that stores data associated with the compute task (Column 4, Lines 35-39,  state information includes… a starting program counter (e.g., memory address) for a program to be executed by each thread); 
	storing scheduling parameters in the metadata structure that control the scheduling of the compute task (Column 5, Lines 6-15, "Loading" a thread includes supplying, via front end 210 from pushbuffer 150 into GPU 122, state information, input data, and any other parameters required to execute the program. For example, in the case of CTA processing, front end 210 loads the starting PC value for the CTA program into a slot in a program counter (PC) array (not shown) that is not currently in use. Depending on state information from pushbuffer 150, front end 210 can allocate space in one or more register files (not shown) for each processing engine of GPU 122 to execute one CTA thread), wherein the scheduling parameters include a launch parameter that indicates a minimum number of entries needed in the queue for the compute task to be launched (Column 6, Lines 37-46, State information may also include size information for the number of threads per CTA …and the size of a CTA grid … In some embodiments, the total number (T) of threads is also provided) [and a coalesce waiting time parameter that indicates a length of time after which the compute task is to be launched even when the queue has fewer than the minimum number of entries]; and 	
	scheduling, based on the scheduling parameters in the metadata structure, the compute task for execution in the streaming multiprocessor, wherein the compute task executes in the streaming multiprocessor based on the execution parameters in the metadata structure (Column 4, Lines 29-32, depending on state information loaded into pushbuffer 150, front end 210 of GPU 122 loads and launches threads, CTAs, or grids of CTAs until all threads have been launched).

	Duluk fails to specifically teach, wherein the scheduling parameters include a launch parameter that indicates … a coalesce waiting time parameter that indicates a length of time after which the compute task is to be launched even when the queue has fewer than the minimum number of entries.
	However, Lauer teaches, wherein the scheduling parameters include a launch parameter that indicates … a coalesce waiting time parameter that indicates a length of time after which the compute task is to be launched even when the queue has fewer than the minimum number of entries (Column 6, Line 51-Column 7, Line 1, If the number of available items does not exceed the first predetermined threshold, a determination is made as to whether a time period associated with one or more of the available items exceeds a second predetermined threshold (step 220). The time period can be a period that an oldest of the available items has been available, as determined by the amount of time an item has been available for a particular processing step, the amount of time an item has been in the workflow, or using some other criteria. The time period can alternatively be an average amount of time for multiple different available items. In some implementations, different items may have different time period thresholds depending on differing level of priority among the various items. The determination made in step 220 is generally used to support a time out procedure that ensures that items do not age beyond some acceptable limit, as defined for a particular workflow, while favoring batches that are as close as possible to the first predetermined threshold).

	Duluk and Lauer are analogous because they are each related to controlling work distribution among tasks. Duluk teaches a method for scheduling work to be completed by a set of CTAs based on parameters regarding how the work should be distributed (Column 6, Lines 47-50, Launching a thread or CTA includes supplying…state information, input data, and any other parameters required to execute the program; and Column 4, Lines 63-66, depending on state information loaded into pushbuffer 150, front end 210 of GPU 122 loads and launches threads, CTAs, or grids of CTAs until all threads have been launched in multithreaded core array 202). Lauer teaches a scheduling method that balances maintaining efficient batch sizes with avoiding unacceptable delays (Column 3, Lines 35-44, To decrease the possible efficiency problems associated with having batch sizes that are too small and to avoid unacceptable processing delays, another option is to use a time out process in which, if less than a minimal number of documents are available, a batch of documents is not sent unless the oldest of the available documents has been available for greater than a threshold amount of time. Thus, if there are not enough documents available, a batch of documents is only sent if a time out criterion is fulfilled). It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to modify Duluk to include Lauer’s timeout threshold and techniques regarding maintaining batch sizes in order to launch CTAs when the amount of work in the processing queue is less than a specified parameter or threshold. Therefore, it would have been obvious to one of ordinary skill in the art to combine the teachings of Duluk and Lauer in order to control the processing of tasks.

As per claim 12, Lauer teaches, wherein a total number of entries in the queue over a course of execution is not evenly divisible by a minimum number of entries in the queue, and, in response, a first number of entries in the queue is less than a value of the launch parameter (Column 5, Lines 26-41, the threshold number may be less than a requested number of documents. If the processing queue 125 includes a sufficient number of available documents, the central server 110 sends a batch of documents, typically numbering at least as many as the threshold number but not more than the requested number, to the requesting client device 135. If the processing queue 125 does not include a sufficient number of documents that satisfy the parameters, a timer or a time period associated with the oldest available document that satisfies the parameters is checked to determine whether it surpasses a certain threshold value. The time period can be a difference between a current time and a time stored in the processing queue 125 in connection with each document. If the time period surpasses the threshold value, the central server 110 sends all of the documents that satisfy the parameters to the requesting client device 135).

As per claim 13, Duluk teaches, further comprising transmitting, to the compute task, a first number of entries in the queue as a parameter (Column 6, Lines 37-46, State information may also include size information for the number of threads per CTA …and the size of a CTA grid … In some embodiments, the total number (T) of threads is also provided).

As per claim 14, Lauer teaches, wherein scheduling, based on the scheduling parameters in the metadata structure, the compute task for execution, comprises: 
	determining that a first number of entries in the queue is greater than or equal to a value of the launch parameter (Column 6, Lines 12-27, the first predetermined threshold establishes a minimum number of items to be included in a batch. The first predetermined threshold may be specified in the request (e.g., by automatically inserting a parameter defining the threshold in accordance with the type of processing to be performed) or may otherwise be predefined in the system that controls distribution of items. The first predetermined threshold may be equal to a requested batch size or may be some value less than a requested batch size. The first predetermined threshold can be a fixed value or can be determined in accordance with an algorithm such that the threshold itself is not preset, but the algorithm for determining the threshold is. In addition, different thresholds can be used depending on various parameters); and 
	scheduling the compute task for execution (Column 6, Lines 39-41, If the number of available items exceeds the first predetermined threshold, a batch of items is assembled and sent in response to the request (step 215)).

As per claim 15, Lauer teaches, wherein scheduling, based on the scheduling parameters in the metadata structure, the compute task for execution, comprises: 
	determining that the length of time indicated by a value of the coalesce waiting time parameter has passed (Column 6, Lines 52-55, a determination is made as to whether a time period associated with one or more of the available items exceeds a second predetermined threshold (step 220)); and 
	scheduling the compute task for execution (Column 7, Lines 4-9, If the time period does exceed the second predetermined threshold, a batch of items is assembled and sent in response to the request (step 225). Generally, the batch of items sent at step 225 will include all of the available items but will include fewer items than specified by the first predetermined threshold).

As per claim 16, Lauer teaches, wherein scheduling, based on the scheduling parameters in the metadata structure, the compute task for execution, comprises: 
	determining that a first number of entries in the queue is less than a value of the launch parameter (Column 5, Line 64-Column 6, Line 1,  A determination is made as to whether a number of available items exceeds a first predetermined threshold (step 210). The available items are items that comply with the parameters and that are ready for the particular type of processing to which the request relates); 
	determining that the length of time indicated by a value of the coalesce waiting time parameter has not passed (Column 6, Lines 51-55, If the number of available items does not exceed the first predetermined threshold, a determination is made as to whether a time period associated with one or more of the available items exceeds a second predetermined threshold (step 220); and Column 7, Lines 9-11,  If the number of available items does not exceed the first predetermined threshold and the time period does not exceed the second predetermined threshold, the request is rejected); and 
	delaying scheduling of the compute task (Column 7, Lines 9-11,  If the number of available items does not exceed the first predetermined threshold and the time period does not exceed the second predetermined threshold, the request is rejected).
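For illustration only, the three scheduling outcomes recited in claims 14-16 (launch when the queue holds at least the launch parameter's number of entries, launch when the coalesce waiting time has passed, and otherwise delay) can be sketched as follows. This is a minimal sketch, not the implementation of the application, Duluk, or Lauer. The names SchedulingParams, min_entries, coalesce_wait, and scheduling_decision are hypothetical, as is the choice to require at least one queued entry before a timeout launch.

```python
from dataclasses import dataclass


@dataclass
class SchedulingParams:
    """Hypothetical stand-in for the scheduling parameters in the metadata structure."""
    min_entries: int      # launch parameter: minimum queue entries needed to launch
    coalesce_wait: float  # coalesce waiting time, in seconds (unit is an assumption)


def scheduling_decision(params: SchedulingParams, queue_len: int, waited: float) -> str:
    """Return which of the three claimed outcomes applies.

    queue_len: current number of entries in the queue.
    waited:    time elapsed since the oldest entry became available.
    """
    # Claim 14: enough entries are queued, so schedule the task.
    if queue_len >= params.min_entries:
        return "launch"
    # Claim 15: too few entries, but the waiting time has passed.
    # Requiring queue_len > 0 is an assumption made for this sketch.
    if queue_len > 0 and waited >= params.coalesce_wait:
        return "launch_partial"
    # Claim 16: too few entries and the waiting time has not passed.
    return "delay"
```

Under these assumptions, a queue holding two entries against a launch parameter of four is delayed until the coalesce waiting time elapses, at which point the task launches with the two available entries.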

As per claim 17, this claim is similar to claim 8 and is rejected for the same reasons.
As per claim 18, this claim is similar to claim 9 and is rejected for the same reasons.
As per claim 19, this claim is similar to claim 10 and is rejected for the same reasons.
As per claim 20, this claim is similar to claim 11 and is rejected for the same reasons.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MELISSA A HEADLY whose telephone number is (571)272-1972. The examiner can normally be reached Monday-Friday, 9:00 am-5:30 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Lewis Bullock, can be reached at 571-272-3759. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/LEWIS A BULLOCK JR/
Supervisory Patent Examiner, Art Unit 2199
MELISSA A. HEADLY
Examiner
Art Unit 2199