DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

This action is responsive to Applicant’s Amendment filed on 6/21/2022.
Claims 1-20 are presented for examination. Claims 1-2, 4, 8-9, 11, 15-16 and 18 have been amended. 
Applicant’s amendments to the specification and claims have overcome the 35 U.S.C. 112(a) rejections set forth in the non-Final Office Action mailed 3/21/2022.

Examiner Notes
Examiner cites particular columns, paragraphs, figures and line numbers in the references as applied to the claims below for the convenience of the applicant. Although the specified citations are representative of the teachings in the art and are applied to the specific limitations within the individual claim, other passages and figures may apply as well. It is respectfully requested that, in preparing responses, the applicant fully consider the references in their entirety as potentially teaching all or part of the claimed invention, as well as the context of the passage as taught by the prior art or disclosed by the examiner.

Information Disclosure Statement
The information disclosure statements (IDS) submitted on 5/11/2022, 5/11/2022 and 7/26/2022 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.

Specification
The disclosure is objected to because of the following informalities:
The meaning of “scheduling queue” in this invention is not clear. Specifically, it is not clear whether a scheduling group residing in the scheduling queue has wavefront(s) that are currently executing. According to Applicant’s statement at page 10 of the Remarks, i.e., “provide further support that the scheduling groups are in a scheduling queue, and thus, the determining features occur ‘prior to the first scheduling group beginning execute’”, one with ordinary skill in the art would understand Applicant to mean that the wavefront(s) of a scheduling group residing in the scheduling queue have not yet started executing, or at least are not currently executing. However, the scheduling queue related features described by the specification (see Fig. 7, [0024]-[0025], [0027] and [0037]-[0039]) concern resolving resource contention by moving scheduling groups from the scheduling queue to a descheduled queue. If so, one with ordinary skill in the art would not understand the purpose of disallowing the scheduling of wavefronts that are not currently executing during resource contention, or how disallowing the scheduling of such wavefronts would relieve the resource contention. To one with ordinary skill in the art, since the wavefronts in the scheduling queue are not currently executing, those wavefronts do not occupy any resources other than the memory/storage resources that hold them (and those memory/storage resources remain occupied even if the wavefronts are prevented from being scheduled).
In addition, the resource contention described by the specification can be identified by monitoring resource parameters including “compute unit stall cycles, cache miss rates, memory access latency, link utilization”. Such resource parameters are known to change (increase or decrease) when tasks are executed or stopped; disallowing the scheduling of wavefronts that have not started executing will not change such resource parameters. Therefore, the meaning or purpose of moving wavefronts that are not currently executing to a queue whose members are prevented from being scheduled, in response to resource contention, is not clear.
Note: if Applicant means that moving a scheduling group from the scheduling queue to the descheduled queue only reduces the number of wavefronts to be scheduled in the future, then the move does not reduce or improve the resource contention immediately; the contention would instead be relieved as the system finishes the currently executing wavefronts while having fewer pending wavefronts to schedule or execute. In that case, the purpose of moving the lowest priority scheduling group would be unclear to one with ordinary skill in the art, since the system could move any pending or not-yet-executing scheduling group to the descheduled queue to achieve the same result. (If the system relieves resource contention by reducing the number of pending or not-yet-executing scheduling groups, it would be more reasonable to one with ordinary skill in the art to move the highest priority such scheduling group to the descheduled queue, since that group has the highest chance of being scheduled.)
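The concern raised in this objection can be illustrated with a minimal sketch. It assumes a hypothetical model in which only executing wavefronts contribute to the contention metric; the class names, fields and the metric itself are illustrative assumptions and are not taken from Applicant's specification.

```python
from dataclasses import dataclass, field


@dataclass
class SchedulingGroup:
    priority: int       # assumption: larger value = higher priority
    executing: int = 0  # wavefronts currently executing
    pending: int = 0    # wavefronts that have not started executing


@dataclass
class ComputeUnit:
    scheduling_queue: list = field(default_factory=list)
    descheduled_queue: list = field(default_factory=list)

    def contention(self) -> int:
        # Hypothetical metric: only executing wavefronts consume
        # execution resources (stall cycles, cache, links).
        return sum(g.executing for g in self.scheduling_queue)

    def deschedule_lowest_priority(self) -> SchedulingGroup:
        # Move the lowest priority group to the descheduled queue.
        lowest = min(self.scheduling_queue, key=lambda g: g.priority)
        self.scheduling_queue.remove(lowest)
        self.descheduled_queue.append(lowest)
        return lowest


cu = ComputeUnit([
    SchedulingGroup(priority=2, executing=4),  # running group
    SchedulingGroup(priority=1, pending=4),    # not yet started
])
before = cu.contention()
cu.deschedule_lowest_priority()
after = cu.contention()
print(before, after)
```

Under these assumptions the metric is unchanged by the move, because the descheduled group had no executing wavefronts; this is the situation for which explanation is requested.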

Appropriate explanation is required.

Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a)  IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same,  and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

The following is a quotation of the first paragraph of pre-AIA  35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.


Claims 2, 9 and 16 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA ), first paragraph, as failing to comply with the written description requirement.  The claim(s) contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for pre-AIA  the inventor(s), at the time the application was filed, had possession of the claimed invention.

Regarding Claim 2, the claim limitation “in response to determining, prior to the first scheduling group beginning execution, that the first scheduling group does not have one or more wavefronts ready for execution” at lines 1-3 lacks support in the specification. In the Remarks submitted on 6/21/2022, Applicant stated at least “In addition to figure 4 and paragraphs 35-36, figure 3 and paragraph 27, and figure 8 and paragraph 40 provide further support that the scheduling groups are in a scheduling queue and thus, the determining feature occur ‘prior to the first scheduling group beginning execution’” (see 2nd paragraph of page 10 of the Remarks).
However, figure 4 as cited by Applicant actually supports a feature that conflicts with Applicant’s intention. Fig. 4 and [0030]-[0031] of Applicant’s specification relate to selecting a lower priority scheduling group in response to determining that a higher priority scheduling group does not have any wavefronts ready for execution after the higher priority scheduling group began to execute (based on Fig. 4 and [0030]-[0031], at time slot t1, the group of kernels B and D having higher priority is scheduled and executed; later at time slot t2, “Wavefronts from kernel D” having lower priority than the group of kernels B and D,  “is now able to be scheduled in time slot t2 since there are no higher priority kernels available in the same cycle”. In this example, at the time of determining that the higher priority group containing kernels B and D does not have any wavefront ready for execution, so as to select the lower priority group containing kernel D, the higher priority group containing kernels B and D has already begun to execute). The descriptions at [0035]-[0036] of the specification cited by Applicant at most support the action of determining that the first scheduling group does not have one or more wavefronts ready for execution; however, they are silent about whether such action is performed before or after the first scheduling group begins to execute. Although Figs. 3, 8, [0027] and [0040] of the specification do support the feature that “the scheduling groups are in a scheduling queue”, the meaning of “the scheduling groups are in a scheduling queue” is not clear in view of the intended features/purpose related to resource contention described by Figs. 3, 8, [0027] and [0040]; see the corresponding specification objection section above.
Therefore, the limitation mentioned above fails to comply with the written description requirement.

Regarding Claim 9, Claim 9 is rejected for the same reason set forth in the rejection of Claim 2 above.

Regarding Claim 16, Claim 16 is rejected for the same reason set forth in the rejection of Claim 2 above.


The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 4-6, 11-13 and 18-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor, or for pre-AIA  the applicant regards as the invention.

Regarding Claim 4, the meaning of “a lowest priority scheduling group from a scheduling queue” at line 8 is not clear. It is not clear whether the wavefront(s) of such lowest priority scheduling group in the claimed scheduling queue is/are currently executing (before such lowest priority scheduling group is moved to the claimed descheduled queue). As explained in the specification objection section, if the wavefronts of such lowest priority scheduling group in the claimed scheduling queue are not currently executing, or have not yet started executing, then the purpose of moving such lowest priority scheduling group to the claimed descheduled queue in response to resource contention is not clear, since such movement does not solve or improve any resource contention issue. However, according to Applicant’s statements in the Remarks (see 2nd paragraph of page 10 and the second-to-last paragraph of page 13 of the Remarks), Applicant interprets the wavefronts of such lowest priority scheduling group in the claimed scheduling queue as not currently executing or not yet started. Therefore, the intended meaning of the limitation mentioned above is not clear. For the purpose of examination, examiner interprets the claimed limitation mentioned above as: a lowest priority scheduling group.
Claims 5-6 are rejected for failing to cure the deficiency from their respective parent claim by dependency.

Regarding Claim 11, Claim 11 is rejected for the same reason set forth in the rejection of Claim 4 above.
Claims 12-13 are rejected for failing to cure the deficiency from their respective parent claim by dependency.

Regarding Claim 18, Claim 18 is rejected for the same reason set forth in the rejection of Claim 4 above.
Claims 19-20 are rejected for failing to cure the deficiency from their respective parent claim by dependency.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.  Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 3, 7-8, 10 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Lee et al. (title: Improving GPGPU resource utilization through alternative thread block scheduling, recorded on the IDS submitted 9/19/2019, hereafter Lee) in view of Boudier (US 20130332702 A1), Hsu et al. (US 20160055005 A1, hereafter Hsu), Yudanov et al. (US 20160371082 A1, hereafter Yudanov) and Yoaz et al. (US 6697932 B1, hereafter Yoaz).
Boudier, Hsu and Yudanov were cited in the previous office action.

Regarding Claim 1, Lee discloses: a system comprising:
a compute unit configured to:
receive a plurality of wavefronts of a plurality of kernels (see lines 15-19, 21-23 of abstract; “the optimal number of thread blocks”, “the ‘block’ of CTAs allocated to a core” and “multiple kernels to be allocated to the same core”. In order to schedule or execute multiple thread blocks/CTAs and kernels on the same core, the method inherently requires receiving a plurality of wavefronts of a plurality of kernels. Note: see lines 7-11 of 1st paragraph of 1. Introduction for the relationship between wavefronts/warps and thread blocks/CTAs);
create a plurality of scheduling groups including one or more scheduling groups that each comprises wavefronts from at least one kernel of the plurality of kernels, wherein wavefronts selected for inclusion in a scheduling group of the one or more scheduling groups are selected based on an identified criteria of a corresponding kernel (see lines 7-11 of 1st paragraph of 1. Introduction, Fig. 9(a), lines 1-9 of 1st paragraph of 4.2 Block CTA Scheduling (BCS), “A collection of threads are grouped to form a warp or a wavefront and the warps are combined to create a CTA (cooperative thread array) or a thread block”, “a kernel with 16X16 CTA dimension”. Wavefronts of the same kernel are grouped into certain groups as CTAs);
select, for scheduling, a first scheduling group from the plurality of scheduling groups; and select for scheduling a second scheduling group from the plurality of scheduling groups (see Fig. 2 at page 2, lines 7-15 of 1st paragraph of 1. Introduction; “All threads within a CTA are executed on the same core and the threads within a warp are often executed together”, “a warp (or a wavefront) scheduler to determine which warp is executed” and “a thread block or CTA scheduler to assign CTAs to cores”. The CTAs, including at least a first CTA and a second CTA, i.e., the claimed first scheduling group and the claimed second scheduling group, are assigned by the CTA scheduler and are thereby scheduled for execution).

Lee does not disclose: 
a plurality of compute units; and
a command processor coupled to the plurality of compute units, wherein the command processor is configured to dispatch kernels to the plurality of compute units;
wherein each compute unit of the plurality of compute units is configured to: 
receive, from the command processor, a plurality of wavefronts of a plurality of kernels;
each scheduling group comprises wavefronts from at least two kernels of the plurality of kernels, the identified criteria of a corresponding kernel for creating the plurality of scheduling groups is an identified priority of a corresponding kernel;
in response to determining that the first scheduling group has one or more wavefronts ready for execution, each compute unit is further configured to:
scheduling wavefronts for execution from the first scheduling group; and
prevent wavefronts of scheduling groups other than the first scheduling group from being scheduled for execution until the first scheduling group has completed.    

However, Boudier discloses: a system comprising:
a plurality of compute units (see Figs. 1, 2 and [0022]; “The CUs 134 each may include many processing elements 212 (see FIG. 2) that perform as single instruction multiple data (SIMD) processing elements 212”); and
a command processor coupled to the plurality of compute units (see Figs. 1, 2 and [0022]; “A command processor 140 may control a group of CUs 134”), wherein the command processor is configured to dispatch kernels to the plurality of compute units (see [0008], [0029]; “The GPU may determine a number of processing elements for the consumer kernels to execute on” and “The command processor 140 may control the processing elements 212 by determining a kernel 220 that should be executed on each of the processing elements 212”);
wherein each compute unit of the plurality of compute units is configured to: 
receive, from the command processor, a plurality of wavefronts of a plurality of kernels (see [0008], [0029]; “The GPU may determine a number of processing elements for the consumer kernels to execute on” and “The command processor 140 may control the processing elements 212 by determining a kernel 220 that should be executed on each of the processing elements 212”. Note: [0008] and [0029] of Boudier may only describe a compute unit receiving a plurality of kernels rather than receiving “a plurality of wavefronts of a plurality of kernels”. It is understood in the GPU technology field that a kernel includes at least one wavefront or warp, see lines 1-11 of 1st paragraph of 1. Introduction from Lee, [0002] from Applicant’s specification).
It would have been obvious to one with ordinary skill in the art, before the effective filing date of the claimed invention, to modify the system architecture of Lee, a GPU-type processing element having two levels of schedulers for scheduling received kernels/workloads, by including the GPU system architecture taught by Boudier, which contains a command processor that dispatches kernels/workloads to GPU-type processing elements for execution. Both Lee and Boudier discuss executing kernels/workloads in a GPU environment, and combining them would provide a complete general GPU system architecture (Lee focuses on the processing elements of a GPU that schedule wavefronts of kernels, while Boudier focuses on the command processor of a GPU that dispatches/schedules kernels containing wavefronts to processing elements; the two references thus relate to different parts of the same computing system, and combining them provides a complete system architecture).
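The combined architecture described above can be pictured with a short sketch, offered only as an illustration. The round-robin dispatch policy, the function name and the data shapes are assumptions made for this example; neither Lee nor Boudier is asserted to dispatch in this particular way.

```python
def dispatch(kernels, num_compute_units):
    """Hypothetical command processor: split each kernel into wavefronts
    and hand them to compute units in round-robin order (an assumption)."""
    compute_units = [[] for _ in range(num_compute_units)]
    slot = 0
    for kernel_name, wavefront_count in kernels.items():
        for wf_id in range(wavefront_count):
            compute_units[slot % num_compute_units].append((kernel_name, wf_id))
            slot += 1
    return compute_units


# Two kernels of three wavefronts each, dispatched to two compute units.
received = dispatch({"K0": 3, "K1": 3}, num_compute_units=2)
kernels_per_cu = [{kernel for kernel, _ in cu} for cu in received]
print(kernels_per_cu)
```

In this sketch each compute unit ends up holding wavefronts from both kernels, which corresponds to the "plurality of wavefronts of a plurality of kernels" received from the command processor.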

The combination of Lee and Boudier does not disclose:
each scheduling group comprises wavefronts from at least two kernels of the plurality of kernels, the identified criteria of a corresponding kernel for creating the plurality of scheduling groups is an identified priority of a corresponding kernel;
in response to determining that the first scheduling group has one or more wavefronts ready for execution, each compute unit is further configured to:
scheduling wavefronts for execution from the first scheduling group; and
prevent wavefronts of scheduling groups other than the first scheduling group from being scheduled for execution until the first scheduling group has completed.    
However, Hsu discloses: create a plurality of scheduling groups including one or more scheduling groups that each comprises wavefronts from at least two kernels of the plurality of kernels, wherein wavefronts selected for inclusion in a scheduling group of one or more scheduling groups are selected based on an identified criteria of a corresponding kernel (see Figs. 2-3, [0030]-[0032], [0043]-[0046]; “In this example, the application 201 launches kernels 201.1 and 201.2”, “an application may launch any number of kernels”, “the wavefront classifier 310 may assign wavefronts of the workgroups 201.1.2 and 201.2.1 to the active subset 316, while all other wavefronts are assigned to the pending subset 314”, “assign wavefronts of the kernels 201.2 and 202.1 to the active subset 316, while all other wavefronts are assigned to the pending subset 314”, “a wavefront is classified based on its application identifier. For example, the wavefront classifier 310 may assign wavefronts of the application 201 of FIG. 2 to the active subset 316, while all other wavefronts are assigned to the pending subset 314” and “wavefronts of the same application, kernel, or workgroup may be grouped together for processing”, emphasis added. In the particular example of Fig. 2, Application 202 launches only a single kernel 202.1; however, it is well understood that Application 202 could also launch more than one kernel, as Application 201 does in Fig. 2. In such an example, assigning wavefronts to the active subset and the pending subset based on the application identifier associated with the kernels/wavefronts would result in each of the active subset and the pending subset comprising wavefronts from at least two kernels, wherein wavefronts selected for inclusion in one of the subsets are selected based on an identified criterion of the corresponding kernel.
Also see [0039]-[0040] for detailed explanations of the active subset 316 and the pending subset 314 as scheduling groups).
It would have been obvious to one with ordinary skill in the art, before the effective filing date of the claimed invention, to modify the mechanism for grouping wavefronts into different scheduling groups from the combination of Lee and Boudier by including the grouping of wavefronts from different kernels into scheduling groups as taught by Hsu, since wavefronts from different kernels may still have an affinity to be executed as a group (see [0045] from Hsu).
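The classification by application identifier relied upon above can be sketched as follows. This is an illustrative sketch only: the tuple layout and function name are assumptions, and kernel 202.2 is hypothetical (Fig. 2 of Hsu, as discussed above, shows Application 202 launching only kernel 202.1).

```python
from collections import namedtuple

# Hypothetical record for a wavefront; field names are illustrative.
Wavefront = namedtuple("Wavefront", ["app_id", "kernel_id", "wf_id"])


def classify_by_application(wavefronts, active_app_id):
    """Assign wavefronts of one application to the active subset and
    all other wavefronts to the pending subset."""
    active, pending = [], []
    for wf in wavefronts:
        (active if wf.app_id == active_app_id else pending).append(wf)
    return active, pending


wavefronts = [
    Wavefront(201, "201.1", 0),
    Wavefront(201, "201.2", 1),
    Wavefront(202, "202.1", 2),
    Wavefront(202, "202.2", 3),  # hypothetical second kernel of application 202
]
active, pending = classify_by_application(wavefronts, active_app_id=201)
kernels_in_active = {wf.kernel_id for wf in active}
kernels_in_pending = {wf.kernel_id for wf in pending}
print(kernels_in_active, kernels_in_pending)
```

With two kernels per application, each subset contains wavefronts from at least two kernels, which is the mapping applied to the claimed scheduling groups.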

The combination of Lee, Boudier and Hsu does not disclose:
the identified criteria of a corresponding kernel for creating the plurality of scheduling groups is an identified priority of a corresponding kernel.
in response to determining that the first scheduling group has one or more wavefronts ready for execution, each compute unit is further configured to:
scheduling wavefronts for execution from the first scheduling group; and
prevent wavefronts of scheduling groups other than the first scheduling group from being scheduled for execution until the first scheduling group has completed.    
However, Yudanov discloses: create a plurality of scheduling groups including one or more scheduling groups that each comprises threads, wherein threads selected for inclusion in a scheduling group of the one or more scheduling groups are selected based on an identified priority of a corresponding kernel (see [0033]; “the scheduler 230 are configured to select groups of threads for execution”, “threads may be allocated to a group if the threads are accessing the same portion of the main memory 215” and “threads may be coalesced to provide preferential access to applications or kernels that are given higher priority at runtime”. There are at least two groups of threads/wavefronts being coalesced, i.e., the group of threads/wavefronts from different applications or kernels that are given higher priority at runtime and the group of threads/wavefronts from different applications or kernels that are given lower priority at runtime).
It would have been obvious to one with ordinary skill in the art, before the effective filing date of the claimed invention, to modify the method of grouping wavefronts from at least two kernels into different scheduling groups for execution from the combination of Lee, Boudier and Hsu by including the method of coalescing threads from different applications or kernels based on the priority of the application or kernel from Yudanov, and thus the new combination would teach the limitation wherein wavefronts selected for inclusion in a scheduling group of the one or more scheduling groups are selected based on an identified priority of a corresponding kernel, since scheduling a group of instructions/threads/jobs/tasks having the same priority together is a well-known computing task scheduling mechanism (note: Lee also discusses prioritizing threads/instructions for execution, see lines 11-16 of right side of page 5 from Lee; however, Lee does not explicitly discuss that threads/wavefronts from different kernels having the same priority can also be scheduled together as a group).
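The priority-based grouping attributed to the combination can be illustrated with a minimal sketch. The kernel names, priority values and function name are assumptions chosen for the example and do not come from Yudanov or from Applicant's specification.

```python
from collections import defaultdict


def group_by_kernel_priority(wavefronts, kernel_priority):
    """Place each wavefront in the scheduling group that matches the
    priority of the kernel it belongs to."""
    groups = defaultdict(list)
    for kernel, wf_id in wavefronts:
        groups[kernel_priority[kernel]].append((kernel, wf_id))
    return dict(groups)


# Assumption: larger number = higher priority.
kernel_priority = {"A": 1, "B": 2, "C": 1, "D": 2}
wavefronts = [("A", 0), ("B", 0), ("C", 0), ("D", 0), ("B", 1)]
groups = group_by_kernel_priority(wavefronts, kernel_priority)
print(groups)
```

Each resulting group holds wavefronts from two different kernels that share one priority level.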

The combination of Lee, Boudier, Hsu and Yudanov does not disclose: 
in response to determining that the first scheduling group has one or more wavefronts ready for execution, each compute unit is further configured to:
scheduling wavefronts for execution from the first scheduling group; and
prevent wavefronts of scheduling groups other than the first scheduling group from being scheduled for execution until the first scheduling group has completed.    
However, Yoaz discloses: in response to determining that the first scheduling group has one or more instructions ready for execution, scheduling instructions for execution from the first scheduling group; and prevent instructions of scheduling groups other than the first scheduling group from being scheduled for execution until the first scheduling group has completed (see lines 47-59 of col.6 and lines 22-30 of col. 10; “searches scheduling window 135 for a ready to execute instruction using a modified scheduling algorithm. Instead of dispatching the oldest ready instruction, a parallel search is also made to check whether a ready instruction exists which has its priority field set. Under this algorithm, scheduler 145 sends to execution unit 150 the oldest instruction having its priority field set but if none of the instructions has its priority field set then the oldest instruction is sent to execution unit 150” and “scheduler 145 picks the next instruction in the scoreboard that should be executed and sends this instruction to execution unit 150. When searching the scoreboard for a ready to execute instruction, scheduler 145 uses the following algorithm: Selects the oldest instruction with its priority field set to “1”, but if none of the instructions have its priority field set to “1” then the oldest ready instruction is selected”. The system determines whether instructions having the priority field set to 1, i.e., instructions from the first scheduling group, are ready for execution. If at least one instruction having the priority field set to 1 is ready for execution, the scheduler keeps scheduling such instructions until no such instruction is ready for execution, i.e., preventing instructions having other priority values from being scheduled until the first scheduling group has completed).
It would have been obvious to one with ordinary skill in the art, before the effective filing date of the claimed invention, to modify the scheduling algorithm from the combination of Lee, Boudier, Hsu and Yudanov by including the scheduling algorithm of Yoaz, which schedules ready-for-execution instructions in order of higher priority first, and thus the combination of Lee, Boudier, Hsu, Yudanov and Yoaz would disclose the limitation missing from the combination of Lee, Boudier, Hsu and Yudanov, since it would provide a mechanism that not only schedules the instructions having higher priority first but also schedules only the instructions that are ready for execution, so that instructions are executed as soon as they are ready.
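The selection rule quoted from Yoaz above (oldest ready instruction with its priority field set, otherwise the oldest ready instruction) can be sketched as follows. The class, field names and the use of a sequence number to represent age are assumptions for illustration, not Yoaz's implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instruction:
    seq: int        # assumption: lower sequence number = older instruction
    ready: bool     # operands available, can be dispatched
    priority: bool  # priority field set to 1


def select_next(window) -> Optional[Instruction]:
    """Pick the oldest ready instruction with its priority field set;
    if there is none, pick the oldest ready instruction."""
    ready = [i for i in window if i.ready]
    if not ready:
        return None
    prioritized = [i for i in ready if i.priority]
    pool = prioritized or ready
    return min(pool, key=lambda i: i.seq)


window = [
    Instruction(seq=0, ready=True, priority=False),
    Instruction(seq=1, ready=True, priority=True),
    Instruction(seq=2, ready=True, priority=True),
]
first = select_next(window)

# Once no priority instruction is ready, the oldest ready one is chosen.
fallback = select_next([
    Instruction(seq=0, ready=True, priority=False),
    Instruction(seq=1, ready=False, priority=True),
])
print(first.seq, fallback.seq)
```

While any ready instruction has its priority field set, non-priority instructions are never selected, which is the behavior mapped to the claimed "prevent" limitation.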

Regarding Claim 3, the rejection of Claim 1 is incorporated and further the combination of Lee, Boudier, Hsu, Yudanov and Yoaz discloses: each compute unit is configured to group wavefronts within a same priority together into a same scheduling group (see Fig. 9, lines 11-16 of right side of page 5 from Lee, Fig. 1 and [0033] from Yudanov).

Regarding Claim 7, the rejection of Claim 1 is incorporated and further the combination of Lee, Boudier, Hsu, Yudanov and Yoaz discloses: wherein, in further response to determining that the second scheduling group has one or more wavefronts ready for execution, each compute unit is further configured to: schedule wavefronts for execution from the second scheduling group; and prevent wavefronts from scheduling groups other than the second scheduling group from being scheduled for execution (see [0007] and [0054] from Hsu; “a plurality of subsets comprising an active subset and a pending subset”, “The plurality of subsets may further comprises a pre-fetch subset, and the wavefront scheduler may be further configured to schedule the pre-fetch subset for processing after scheduling the active subset for processing and before scheduling the pending subset for processing” and “one or more wavefronts assigned to the pre-fetch subset 318 may become eligible if all wavefronts assigned to the active subset 518 become stalled, and one or more wavefronts assigned to the pending subset 514 may become eligible if wavefronts assigned to the active subset 516 and pre-fetch subset 518 become stalled”. In one of the embodiments, the scheduling groups of the combined system include at least an active subset, a pre-fetch subset and a pending subset, wherein the active subset is mapped to the claimed first scheduling group and the pre-fetch subset is mapped to the claimed second scheduling group. The wavefronts from the active subset are scheduled first among the three subsets/groups, and the wavefronts from the pre-fetch subset are scheduled after all of the wavefronts from the active subset become stalled, i.e., when no wavefront from the active subset is ready to be scheduled.
While the wavefronts from the pre-fetch subset are being scheduled, the wavefronts from the active subset and the pending subset are prevented from being scheduled, since they are not eligible for scheduling at that time).
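The eligibility ordering among the three subsets described above can be illustrated with a brief sketch. The subset names follow the discussion of Hsu above; the data layout (a list of stalled flags per subset) and the function name are assumptions made for the example.

```python
# Order in which subsets become eligible, per the discussion above.
SUBSET_ORDER = ("active", "pre-fetch", "pending")


def eligible_subset(stalled):
    """Return the first subset, in scheduling order, that still has a
    wavefront that is not stalled. `stalled` maps a subset name to a
    list of booleans (True = wavefront stalled). An empty subset is
    skipped, because all([]) is True."""
    for name in SUBSET_ORDER:
        if not all(stalled[name]):
            return name
    return None


running = eligible_subset(
    {"active": [False, True], "pre-fetch": [False], "pending": [False]})
after_stall = eligible_subset(
    {"active": [True, True], "pre-fetch": [False], "pending": [False]})
print(running, after_stall)
```

Only one subset is eligible at a time in this sketch, so wavefronts of the other subsets are not scheduled while the eligible subset is being served.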

Regarding Claim 8, Claim 8 is a method claim that corresponds to system Claim 1 and is rejected for the same reason set forth in the rejection of Claim 1 above.

Regarding Claim 10, the rejection of Claim 8 is incorporated and further Claim 10 is a method claim that corresponds to system Claim 3 and is rejected for the same reason set forth in the rejection of Claim 3 above.

Regarding Claim 14, the rejection of Claim 8 is incorporated and further Claim 14 is a method claim that corresponds to system Claim 7 and is rejected for the same reason set forth in the rejection of Claim 7 above.

Claims 2 and 9 are rejected under 35 U.S.C. 103 as being unpatentable over Lee et al. (title: Improving GPGPU resource utilization through alternative thread block scheduling, recorded on the IDS submitted 9/19/2019, hereafter Lee) in view of Boudier (US 20130332702 A1), Hsu et al. (US 20160055005 A1, hereafter Hsu), Yudanov et al. (US 20160371082 A1, hereafter Yudanov) and Yoaz et al. (US 6697932 B1, hereafter Yoaz) and further in view of Kakadia et al. (US 20150208275 A1, hereafter Kakadia).
Boudier, Hsu, Yudanov and Kakadia were cited in the previous office action.

Regarding Claim 2, the rejection of Claim 1 is incorporated, the combination of Lee, Boudier, Hsu, Yudanov and Yoaz does not disclose: wherein, in response to determining, prior to the first scheduling group beginning execution, that the first scheduling group does not have one or more wavefronts ready for execution, select for scheduling a second scheduler group from the plurality of scheduling groups.

However, Kakadia discloses: a method of scheduling tasks for execution comprising: wherein, in response to determining, prior to the first scheduling task beginning execution, that the first scheduling task is not ready for execution, select for scheduling a second scheduling task from a plurality of tasks (see [0056]. Before the first task having high priority begins execution, it is determined whether this first task is ready for execution; if the first task is not ready for execution, i.e., no high priority task is ready for execution, then the scheduler would select a lower priority task for scheduling).
It would have been obvious to one with ordinary skill in the art, before the effective filing date of the claimed invention, to modify the scheduling algorithm from the combination of Lee, Boudier, Hsu, Yudanov and Yoaz by including the scheduling algorithm from Kakadia, which schedules a lower priority task if there are insufficient resources for executing higher priority tasks, and thus the combination of Lee, Boudier, Hsu, Yudanov, Yoaz and Kakadia would disclose the limitation missing from the combination of Lee, Boudier, Hsu, Yudanov and Yoaz (note: Kakadia may not disclose scheduling groups having multiple tasks/wavefronts; however, in view of the fact that the wavefronts of each scheduling group from the combination of Lee, Boudier, Hsu, Yudanov and Yoaz have the same priority level, and that both the combination of Lee, Boudier, Hsu, Yudanov and Yoaz and Kakadia relate to scheduling higher priority tasks first in the general situation, it is still reasonable to apply the scheduling mechanism from Kakadia for certain specific situations, such as when there are insufficient resources for executing a higher priority task at the time the scheduler tries to schedule it, to the combination of Lee, Boudier, Hsu, Yudanov and Yoaz), since it would provide a mechanism for handling the situation in which tasks are to be scheduled for execution but there are insufficient resources for executing them (see [0056] from Kakadia).

Regarding Claim 9, the rejection of Claim 8 is incorporated and further Claim 9 is a method claim corresponding to system Claim 2 and is rejected for the same reasons set forth in the rejection of Claim 2 above.

Claims 4-5 and 11-12 are rejected under 35 U.S.C. 103 as being unpatentable over Lee et al. (title: Improving GPGPU resource utilization through alternative thread block scheduling, recorded on the IDS submitted 9/19/2019, hereafter Lee) in view of Boudier (US 20130332702 A1), Hsu et al. (US 20160055005 A1, hereafter Hsu), Yudanov et al. (US 20160371082 A1, hereafter Yudanov) and Yoaz et al. (US 6697932 B1, hereafter Yoaz) and further in view of Fontenot et al. (US 20110302372 A1, hereafter Fontenot) and Sollars (US 5900025 A).
Boudier, Hsu, Yudanov, Fontenot and Sollars were cited in the previous Office Action.

Regarding Claim 4, the rejection of Claim 1 is incorporated, and the combination of Lee, Boudier, Hsu, Yudanov and Yoaz discloses: each of the one or more scheduling groups comprises wavefronts from at least two kernels of the plurality of kernels (see Figs. 2-3 of Hsu and [0033] from Yudanov. The combination system groups threads or wavefronts into different scheduling groups based on the corresponding applications and kernels having the same priority, and thus every scheduling group formed by such a method comprises threads/wavefronts from different kernels having the same priority).
The combination of Lee, Boudier, Hsu, Yudanov and Yoaz does not disclose: wherein each compute unit is further configured to:
monitor one or more conditions indicative of resource contention on the compute unit, the one or more conditions comprising at least one of compute unit stall cycles, cache miss rates, memory access latency, and link utilization; 
generate a first measure of resource contention based on the one or more conditions being monitored; and
move a lowest priority scheduling group into a descheduled queue responsive to determining that the first measure of resource contention is greater than a first threshold, wherein:
wavefronts from scheduling groups stored in the descheduled queue are prevented from being scheduled for execution on the compute unit; and
one or more scheduling groups stored in the descheduled queue comprise wavefronts from at least two kernels of the plurality of kernels.

However, Fontenot discloses: a method comprising:
monitor one or more conditions indicative of resource contention on the compute unit, the one or more conditions comprising at least one of compute unit stall cycles, cache miss rates, memory access latency, and link utilization (see [0097]; “Counter 840 is a counter that tracks a number of cache misses in cache 822 over a period of time”. Also see “When the cache miss rate for L2 cache 624 and L3 cache 626 is low, core 610 likely is able to effectively utilize additional levels of parallelism resulting in an increase in overall computes” from [0087] and “should the cache misses as counted by counter 840 exceed the lower count value of count threshold 850” from [0099], and thus the counter 840 is a measurement of resource contention); 
generate a first measure of resource contention based on the one or more conditions being monitored (see [0097]; “Counter 840 is a counter that tracks a number of cache misses in cache 822 over a period of time”); and
reduce number of parallel execution threads responsive to determining that the first measure of resource contention is greater than a first threshold, wherein: threads from the reduced number of parallel execution threads are prevented from being scheduled for execution on the compute unit (see [0100]; “the cache misses as counted by counter 840 exceed the upper count value of count threshold 850, core 810 can remove layers of parallelism by switching from a higher SMT mode to a lower SMT mode” and “overall computes may be increased by reducing parallelism and the contention for a contested resource. In this case, the number of parallel threads could be reduced”).
It would have been obvious to one with ordinary skill in the art, before the effective filing date of the claimed invention, to modify the warp scheduler in the compute unit of the GPU, which schedules execution of groups of wavefronts having threads, from the combination of Lee, Boudier, Hsu, Yudanov and Yoaz by including a method of increasing or decreasing the number of parallel thread executions in response to different resource contention situations as taught by Fontenot, and thus the new combination system would teach some portions of the missing limitations mentioned above (note: in the new combination system, when the cache misses as counted by counter 840 for one GPU compute unit exceed the upper count value of count threshold 850, the combination system would reduce the number of parallel execution threads, i.e., reduce the number of scheduling groups. Since such reducing reduces the parallel execution, the wavefronts from the removed scheduling groups would be prevented from being scheduled for execution on the GPU compute unit. In addition, since every scheduling group comprises wavefronts from different kernels having the same priority, a removed scheduling group would comprise wavefronts from different kernels), since it would provide a method of efficiently utilizing resources of the system by increasing the number of parallel thread executions when resource contention is low and decreasing the number of parallel thread executions when resource contention is high (see [0099]-[0100] from Fontenot).

The combination of Lee, Boudier, Hsu, Yudanov, Yoaz and Fontenot does not disclose that reducing the number of scheduling groups comprises moving a lowest priority scheduling group into a descheduled queue.
However, Sollars discloses: a method of reducing the number of allocated threads comprising: move a lowest priority allocated thread into a descheduled queue (see lines 15-29 of col. 12; “deallocates and queues the lowest priority allocated context”. In addition, “If all context level control register sets 104 have been allocated” from lines 15-29 of col. 12 also indicates that resource contention has reached a threshold level).
It would have been obvious to one with ordinary skill in the art, before the effective filing date of the claimed invention, to modify the method of reducing the number of parallel thread executions when the resource contention has reached a threshold level as taught by the combination of Lee, Boudier, Hsu, Yudanov, Yoaz and Fontenot by including a method of moving the lowest priority allocated thread/context into a queue which holds deallocated threads/contexts when resource contention has reached a threshold level as taught by Sollars, whereby the new combination system would teach the limitations missing from the combination of Lee, Boudier, Hsu, Yudanov and Yoaz, since it would provide a mechanism of deallocating or descheduling only the lowest priority threads/contexts, and thus avoid deallocating the threads/contexts having higher priority (see lines 15-29 of col. 12).
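
For illustration only, the behavior attributed above to the combination (a cache-miss count compared against an upper threshold, followed by moving the lowest priority scheduling group into a descheduled queue) can be sketched as follows. This is a minimal sketch under stated assumptions and is not code from Fontenot, Sollars, or the application; the class names, the field names, and the convention that a larger `priority` value means higher priority are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SchedulingGroup:
    priority: int        # assumption: larger value means higher priority
    kernels: List[str]   # kernels whose wavefronts belong to this group


@dataclass
class ComputeUnit:
    scheduling_queue: List[SchedulingGroup] = field(default_factory=list)
    descheduled_queue: List[SchedulingGroup] = field(default_factory=list)

    def deschedule_on_contention(self, cache_misses: int, upper_threshold: int) -> bool:
        """Move the lowest priority group to the descheduled queue when the
        measured contention (here, a cache-miss count) exceeds the threshold.
        Returns True if a group was moved."""
        if cache_misses <= upper_threshold or not self.scheduling_queue:
            return False
        lowest = min(self.scheduling_queue, key=lambda g: g.priority)
        self.scheduling_queue.remove(lowest)
        # Groups held in this queue are not considered for scheduling.
        self.descheduled_queue.append(lowest)
        return True
```

Calling `deschedule_on_contention` again after a later measurement would move the next lowest priority group, which corresponds to the repeated measurement discussed for Claim 5 below.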

Regarding to Claim 5, the rejection of Claim 4 is incorporated and further the combination of Lee, Boudier, Hsu, Yudanov, Yoaz, Fontenot and Sollars discloses: wherein each compute unit is configured to:
wait a given amount of time after moving the lowest priority scheduling group into the descheduled queue (see Fig. 13, [0097], [0145]-[0146] from Fontenot; “Counter 840 is a counter that tracks a number of cache misses in cache 822 over a period of time”. After performing one action of reducing the number of parallel thread executions, i.e., moving the lowest priority scheduling group into the descheduled queue in the combination system, the method would return to the step/action of monitoring the cache misses over a period of time to detect the next trigger point for performing step 1320 of Fig. 13);
generate a second measure of resource contention based on the one or more conditions being monitored (see Fig. 13, [0097], [0145]-[0146] from Fontenot and the analysis of previous limitation; “Counter 840 is a counter that tracks a number of cache misses in cache 822 over a period of time”); and
move a next lowest priority scheduling group into the descheduled queue responsive to determining the second measure of resource contention is greater than the first threshold (see lines 7-15 of 1st paragraph of 1. Introduction from Lee, [0100] from Fontenot and lines 16-29 of col. 12 from Sollars. In the combination system, when the cache misses as counted by counter 840 for one GPU compute unit exceed the upper count value of count threshold 850, the combination system would reduce the number of parallel execution threads, i.e., reduce the number of scheduling groups, by moving the current lowest priority scheduling group, i.e., the previous next lowest priority scheduling group, into the descheduled queue, and thus the wavefronts from the scheduling groups stored in the descheduled queue are prevented from being scheduled for execution on the GPU compute unit).

Regarding Claim 11, the rejection of Claim 8 is incorporated and further Claim 11 is a method claim corresponding to system Claim 4 and is rejected for the same reasons set forth in the rejection of Claim 4 above.

Regarding Claim 12, the rejection of Claim 11 is incorporated and further Claim 12 is a method claim corresponding to system Claim 5 and is rejected for the same reasons set forth in the rejection of Claim 5 above.

Claims 6 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Lee et al. (title: Improving GPGPU resource utilization through alternative thread block scheduling, recorded on the IDS submitted 9/19/2019, hereafter Lee) in view of Boudier (US 20130332702 A1), Hsu et al. (US 20160055005 A1, hereafter Hsu), Yudanov et al. (US 20160371082 A1, hereafter Yudanov), Yoaz et al. (US 6697932 B1, hereafter Yoaz), Fontenot et al. (US 20110302372 A1, hereafter Fontenot) and Sollars (US 5900025 A) and further in view of Otenko (US 20150212794 A1).
Boudier, Hsu, Yudanov, Fontenot, Sollars and Otenko were cited in the previous Office Action.

Regarding to Claim 6, the rejection of Claim 4 is incorporated and further the combination of Lee, Boudier, Hsu, Yudanov, Yoaz, Fontenot and Sollars discloses: wherein each compute unit is configured to:
wait a given amount of time after moving the lowest priority scheduling group into the descheduled queue (see Fig. 13, [0097], [0145]-[0146] from Fontenot; “Counter 840 is a counter that tracks a number of cache misses in cache 822 over a period of time”. After performing one action of reducing the number of parallel thread executions, i.e., moving the lowest priority scheduling group into the descheduled queue in the combination system, the method would return to the step/action of monitoring the cache misses over a period of time to detect the next trigger point for performing step 1320 of Fig. 13);
generate a second measure of resource contention based on the one or more conditions being monitored (see Fig. 13, [0097], [0145]-[0146] from Fontenot and the analysis of previous limitation; “Counter 840 is a counter that tracks a number of cache misses in cache 822 over a period of time”); and
increase number of scheduling groups responsive to determining the second measure of resource contention is less than a second threshold (see lines 7-15 of 1st paragraph of 1. Introduction from Lee and [0099] from Fontenot; “should the cache misses as counted by counter 840 exceed the lower count value of count threshold 850, core 810 can add additional layers of parallelism by switching from a lower SMT mode to a higher SMT mode” and “overall computes may be increased by increasing parallelism and the contention for a contested resource. In this case, the number of parallel threads could be increased”. At the combination system, when the cache misses as counted by counter 840 for one GPU compute unit exceed lower count value of count threshold 850, then the combination system would increase number of parallel execution threads, i.e., increase number of scheduling groups).

The combination of Lee, Boudier, Hsu, Yudanov, Yoaz, Fontenot and Sollars does not disclose that increasing the number of scheduling groups comprises moving a highest priority scheduling group out of the descheduled queue.
However, Otenko discloses: a method of increasing allocated tasks comprising: move a highest priority task out of the descheduled queue (see [0023]; “When the contention level is low in the system, the underlying priority queue 101 can sort the requests waiting in the priority queue 101 (with a logarithmic cost) in order to ensure that the request with the highest priority in the priority queue 101 can have the shortest waiting time. Thus, when the next consumer 120, or worker thread, is allowed to pick up a unit of work, it can pick up the unit with the highest priority”).
It would have been obvious to one with ordinary skill in the art, before the effective filing date of the claimed invention, to modify the method of increasing the number of parallel thread executions when the resource contention is low as taught by the combination of Lee, Boudier, Hsu, Yudanov, Yoaz, Fontenot and Sollars by including a method of scheduling the highest priority task in a wait queue of tasks to be scheduled as soon as possible as taught by Otenko, since it is understood that a highest priority task is to be scheduled first instead of a lower priority task (see [0023] from Otenko. Also see lines 11-16 of right side of page 5 from Lee).
Thereby, the combination of Lee, Boudier, Hsu, Yudanov, Yoaz, Fontenot, Sollars and Otenko discloses: move a highest priority scheduling group out of the descheduled queue responsive to determining the second measure of resource contention is less than a second threshold (see lines 7-15 of 1st paragraph of 1. Introduction from Lee, [0100] from Fontenot, lines 16-29 of col. 12 from Sollars and [0023] from Otenko. In the combination system, when the cache misses as counted by counter 840 for one GPU compute unit exceed the lower count value of count threshold 850, the combination system would increase the number of parallel execution threads, i.e., increase the number of scheduling groups, by moving the highest priority scheduling group out of the descheduled queue).
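
For illustration only, the claimed step of moving a highest priority scheduling group out of the descheduled queue when the second measure is less than a second threshold can be sketched as follows. This is a minimal sketch and not code from Fontenot, Otenko, or the application; the function name, the tuple representation of a group, and the convention that a larger priority value means higher priority are assumptions introduced here.

```python
from typing import List, Tuple

# Hypothetical representation of a scheduling group: (priority, label).
# Assumption: a larger priority value means higher priority.
Group = Tuple[int, str]


def reschedule_on_low_contention(
    scheduling_queue: List[Group],
    descheduled_queue: List[Group],
    second_measure: int,
    second_threshold: int,
) -> bool:
    """Move the highest priority descheduled group back to the scheduling
    queue when the second contention measure is below the second threshold.
    Returns True if a group was moved."""
    if second_measure >= second_threshold or not descheduled_queue:
        return False
    highest = max(descheduled_queue, key=lambda g: g[0])
    descheduled_queue.remove(highest)
    scheduling_queue.append(highest)
    return True
```

The sketch follows the comparison recited in the claim language (measure less than threshold) and makes no assumption about how the measure itself is derived from the monitored conditions.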

Regarding Claim 13, the rejection of Claim 11 is incorporated and further Claim 13 is a method claim corresponding to system Claim 6 and is rejected for the same reasons set forth in the rejection of Claim 6 above.

Claims 15 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Lee et al. (title: Improving GPGPU resource utilization through alternative thread block scheduling, recorded on the IDS submitted 9/19/2019, hereafter Lee) in view of Hsu et al. (US 20160055005 A1, hereafter Hsu), Yudanov et al. (US 20160371082 A1, hereafter Yudanov) and Yoaz et al. (US 6697932 B1, hereafter Yoaz).
Hsu and Yudanov were cited in the previous Office Action.

Regarding to Claim 15, Lee discloses: an apparatus comprising:
a memory, and a processor coupled to the memory, wherein the processor is configured to (see 2.1. Methodology of page 2; “The GPGPU that we model consists of 28 cores (or streaming multiprocessors (SMs)) connected to 8 memory controllers, as shown in Fig. 2. Each core has its own private L1 data cache, texture cache, and shared memory”):
receive a plurality of wavefronts of a plurality of kernels (see lines 15-19, 21-23 of abstract; “the optimal number of thread blocks”, “the ‘block’ of CTAs allocated to a core” and “multiple kernels to be allocated to the same core”. In order to schedule or execute multiple thread blocks/CTAs and kernels on a same core, the method inherently requires receiving a plurality of wavefronts of a plurality of kernels. Note: see lines 7-11 of 1st paragraph of 1. Introduction for the relationship between wavefronts/warps and thread blocks/CTAs);
create a plurality of scheduling groups including one or more scheduling groups that each comprises wavefronts from at least one kernel of the plurality of kernels, wherein wavefronts selected for inclusion in a scheduling group of the one or more scheduling groups are selected based on an identified criteria of a corresponding kernel (see lines 7-11 of 1st paragraph of 1. Introduction, Fig. 9(a), lines 1-9 of 1st paragraph of 4.2 Block CTA Scheduling (BCS); “A collection of threads are grouped to form a warp or a wavefront and the warps are combined to create a CTA (cooperative thread array) or a thread block”, “a kernel with 16X16 CTA dimension”. Wavefronts of a same/common kernel are grouped into certain groups as CTAs); and
select, for scheduling, a first scheduling group from the plurality of scheduling groups; and select for scheduling a second scheduling group from the plurality of scheduling groups (see Fig. 2 at page 2, lines 7-15 of 1st paragraph of 1. Introduction; “All threads within a CTA are executed on the same core and the threads within a warp are often executed together”, “a warp (or a wavefront) scheduler to determine which warp is executed” and “a thread block or CTA scheduler to assign CTAs to cores”. The CTAs including at least first CTA and second CTA, i.e., claimed first scheduling group and claimed second scheduling group, scheduled by the CTA scheduler will be scheduled for execution).

Lee does not disclose: 
each scheduling group comprises wavefronts from at least two kernels of the plurality of kernels, the identified criteria of a corresponding kernel for creating the plurality of scheduling groups is an identified priority of a corresponding kernel;
in response to a determination that the first scheduling group has one or more wavefronts ready for execution, the processor is further configured to:
schedule wavefronts for execution from the first scheduling group; and
prevent wavefronts of scheduling groups other than the first scheduling group from being scheduled for execution until the first scheduling group has completed.

However, Hsu discloses: create a plurality of scheduling groups including one or more scheduling groups that each comprises wavefronts from at least two kernels of the plurality of kernels, wherein wavefronts selected for inclusion in a scheduling group of one or more scheduling groups are selected based on an identified criteria of a corresponding kernel (see Figs. 2-3, [0030]-[0032], [0043]-[0046]; “In this example, the application 201 launches kernels 201.1 and 201.2”, “an application may launch any number of kernels”, “the wavefront classifier 310 may assign wavefronts of the workgroups 201.1.2 and 201.2.1 to the active subset 316, while all other wavefronts are assigned to the pending subset 314”, “assign wavefronts of the kernels 201.2 and 202.1 to the active subset 316, while all other wavefronts are assigned to the pending subset 314”, “a wavefront is classified based on its application identifier. For example, the wavefront classifier 310 may assign wavefronts of the application 201 of FIG. 2 to the active subset 316, while all other wavefronts are assigned to the pending subset 314” and “wavefronts of the same application, kernel, or workgroup may be grouped together for processing”, emphasis added. In the particular example of Fig. 2, Application 202 launches only a single kernel 202.1; however, it is understood that, in a well-known example, Application 202 may also launch more than one kernel, as Application 201 does in the particular example of Fig. 2. In such an example, assigning wavefronts into the active subset and the pending subset based on the same application identifier associated with the kernels/wavefronts would result in each of the active subset and the pending subset comprising wavefronts from at least two kernels, wherein wavefronts selected for inclusion in one of the subsets are selected based on an identified criterion of the corresponding kernel.
Also see [0039]-[0040] for the detail explanations on active subset 316 and pending subset 314 as scheduling groups);
It would have been obvious to one with ordinary skill in the art, before the effective filing date of the claimed invention, to modify the mechanism of grouping wavefronts to form different scheduling groups from Lee by including the grouping of wavefronts from different kernels to form different scheduling groups from Hsu, since the wavefronts from different kernels may still have affinity to be executed as a group (see [0045] from Hsu).

The combination of Lee and Hsu does not disclose:
the identified criteria of a corresponding kernel for creating the plurality of scheduling groups is an identified priority of a corresponding kernel;
in response to a determination that the first scheduling group has one or more wavefronts ready for execution, the processor is further configured to:
schedule wavefronts for execution from the first scheduling group; and
prevent wavefronts of scheduling groups other than the first scheduling group from being scheduled for execution until the first scheduling group has completed.
However, Yudanov discloses: create a plurality of scheduling groups including one or more scheduling groups each comprising threads, wherein threads selected for inclusion in a scheduling group of the one or more scheduling groups are selected based on an identified priority of a corresponding kernel (see [0033]; “the scheduler 230 are configured to select groups of threads for execution”, “threads may be allocated to a group if the threads are accessing the same portion of the main memory 215” and “threads may be coalesced to provide preferential access to applications or kernels that are given higher priority at runtime”. There are at least two groups of threads/wavefronts being coalesced, i.e., the group of the threads/wavefronts from different applications or kernels that are given higher priority at runtime and the group of the threads/wavefronts from different applications or kernels that are given lower priority at runtime).
It would have been obvious to one with ordinary skill in the art, before the effective filing date of the claimed invention, to modify the method of grouping different wavefronts from at least two kernels into different scheduling groups for execution from the combination of Lee and Hsu by including the method of coalescing threads from different applications or kernels based on the priority of the application or kernel from Yudanov, and thus the new combination would teach the limitation wherein wavefronts selected for inclusion in a scheduling group of the one or more scheduling groups are selected based on an identified priority of a corresponding kernel, since scheduling a group of instructions/threads/jobs/tasks having the same priority together is a well-known computing task scheduling mechanism (note: Lee also discusses prioritizing threads/instructions for execution, see lines 11-16 of right side of page 5 from Lee; however, Lee does not explicitly discuss that threads/wavefronts from different kernels having the same priority can also be scheduled together as a group).

The combination of Lee, Hsu and Yudanov does not disclose:
in response to a determination that the first scheduling group has one or more wavefronts ready for execution, the processor is further configured to:
schedule wavefronts for execution from the first scheduling group; and
prevent wavefronts of scheduling groups other than the first scheduling group from being scheduled for execution until the first scheduling group has completed.

However, Yoaz discloses: in response to a determination that the first scheduling group has one or more instructions ready for execution, the processor is further configured to: schedule instructions for execution from the first scheduling group; and prevent instructions of scheduling groups other than the first scheduling group from being scheduled for execution until the first scheduling group has completed (see lines 47-59 of col.6 and lines 22-30 of col. 10; “searches scheduling window 135 for a ready to execute instruction using a modified scheduling algorithm. Instead of dispatching the oldest ready instruction, a parallel search is also made to check whether a ready instruction exists which has its priority field set. Under this algorithm, scheduler 145 sends to execution unit 150 the oldest instruction having its priority field set but if none of the instructions has its priority field set then the oldest instruction is sent to execution unit 150” and “scheduler 145 picks the next instruction in the scoreboard that should be executed and sends this instruction to execution unit 150. When searching the scoreboard for a ready to execute instruction, scheduler 145 uses the following algorithm: Selects the oldest instruction with its priority field set to “1”, but if none of the instructions have its priority field set to “1” then the oldest ready instruction is selected”. The system would determine whether the instructions having the priority field set to 1, i.e., instructions from the first scheduling group, are ready for execution; if at least one of the instructions having the priority field set to 1 is ready for execution, then the scheduler would continue to schedule such instructions until no such instruction is ready for execution, i.e., preventing other instructions having other priority values from being scheduled until the first scheduling group has completed).
It would have been obvious to one with ordinary skill in the art, before the effective filing date of the claimed invention, to modify the scheduling algorithm from the combination of Lee, Hsu and Yudanov by including the scheduling algorithm from Yoaz, which schedules ready-for-execution instructions in order of higher priority first, and thus the combination of Lee, Hsu, Yudanov and Yoaz would disclose the limitation missing from the combination of Lee, Hsu and Yudanov, since it would provide a mechanism of not only scheduling the instructions having higher priority first but also scheduling only the instructions which are ready for execution, so that instructions are executed as soon as possible once they are ready.
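
For illustration only, the selection rule quoted from Yoaz (the oldest ready instruction with its priority field set, otherwise the oldest ready instruction) can be sketched as follows. This is a minimal sketch and not code from Yoaz; the `Instruction` fields, the function name `select_next`, and the convention that a larger `age` value means an older instruction are assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instruction:
    age: int        # assumption: larger value means older
    ready: bool     # ready to execute
    priority: bool  # priority field set to "1"


def select_next(window: List[Instruction]) -> Optional[Instruction]:
    """Pick the oldest ready instruction whose priority field is set;
    if no ready instruction has it set, pick the oldest ready instruction."""
    ready = [i for i in window if i.ready]
    if not ready:
        return None
    prioritized = [i for i in ready if i.priority]
    pool = prioritized or ready  # fall back only when no prioritized one is ready
    return max(pool, key=lambda i: i.age)
```

Under these assumptions, a non-prioritized instruction is selected only when no prioritized instruction is ready, which is the behavior the mapping above relies on.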

Regarding to Claim 17, the rejection of Claim 15 is incorporated and further the combination of Lee, Hsu, Yudanov and Yoaz discloses: each compute unit is configured to group wavefronts within a same priority together into a same scheduling group (see Fig. 9, lines 11-16 of right side of page 5 from Lee, Fig. 1 and [0033] from Yudanov).

Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over Lee et al. (title: Improving GPGPU resource utilization through alternative thread block scheduling, recorded on the IDS submitted 9/19/2019, hereafter Lee) in view of Hsu et al. (US 20160055005 A1, hereafter Hsu), Yudanov et al. (US 20160371082 A1, hereafter Yudanov) and Yoaz et al. (US 6697932 B1, hereafter Yoaz) and further in view of Kakadia et al. (US 20150208275 A1, hereafter Kakadia).
Hsu, Yudanov and Kakadia were cited in the previous Office Action.

Regarding to Claim 16, the rejection of Claim 15 is incorporated, the combination of Lee, Hsu, Yudanov and Yoaz does not disclose: wherein, in response to determining, prior to the first scheduling group beginning execution, that the first scheduling group does not have one or more wavefronts ready for execution, select for scheduling a second scheduling group from the plurality of scheduling groups.

However, Kakadia discloses: a method of scheduling tasks for execution comprising: wherein, in response to determining, prior to the first scheduling task beginning execution, that the first scheduling task is not ready for execution, select for scheduling a second scheduling task from a plurality of tasks (see [0056]. Before the first task having high priority begins execution, it is determined whether this first task is ready for execution; if the first task is not ready for execution, i.e., no high priority task is ready for execution, then the scheduler would select a lower priority task for scheduling).
It would have been obvious to one with ordinary skill in the art, before the effective filing date of the claimed invention, to modify the scheduling algorithm from the combination of Lee, Hsu, Yudanov and Yoaz by including the scheduling algorithm from Kakadia, which schedules a lower priority task if there are insufficient resources for executing higher priority tasks, and thus the combination of Lee, Hsu, Yudanov, Yoaz and Kakadia would disclose the limitation missing from the combination of Lee, Hsu, Yudanov and Yoaz (note: Kakadia may not disclose scheduling groups having multiple tasks/wavefronts; however, in view of the fact that the wavefronts of each scheduling group from the combination of Lee, Hsu, Yudanov and Yoaz have the same priority level, and that both the combination of Lee, Hsu, Yudanov and Yoaz and Kakadia relate to scheduling higher priority tasks first in the general situation, it is still reasonable to apply the scheduling mechanism from Kakadia for certain specific situations, such as when there are insufficient resources for executing a higher priority task at the time the scheduler tries to schedule it, to the combination of Lee, Hsu, Yudanov and Yoaz), since it would provide a mechanism for handling the situation in which tasks are to be scheduled for execution but there are insufficient resources for executing them (see [0056] from Kakadia).

Claims 18-19 are rejected under 35 U.S.C. 103 as being unpatentable over Lee et al. (title: Improving GPGPU resource utilization through alternative thread block scheduling, recorded on the IDS submitted 9/19/2019, hereafter Lee) in view of Hsu et al. (US 20160055005 A1, hereafter Hsu), Yudanov et al. (US 20160371082 A1, hereafter Yudanov) and Yoaz et al. (US 6697932 B1, hereafter Yoaz) and further in view of Fontenot et al. (US 20110302372 A1, hereafter Fontenot) and Sollars (US 5900025 A).
Hsu, Yudanov, Fontenot and Sollars were cited in the previous Office Action.

Regarding Claim 18, the rejection of Claim 15 is incorporated, and the combination of Lee, Hsu, Yudanov and Yoaz discloses: each of the one or more scheduling groups comprises wavefronts from at least two kernels of the plurality of kernels (see Figs. 2-3 of Hsu and [0033] from Yudanov. The combination system groups threads or wavefronts into different scheduling groups based on the corresponding applications and kernels having the same priority, and thus every scheduling group formed by such a method comprises threads/wavefronts from different kernels having the same priority).

The combination of Lee, Hsu, Yudanov and Yoaz does not disclose:
monitor one or more conditions indicative of resource contention on the compute unit, the one or more conditions comprising at least one of compute unit stall cycles, cache miss rates, memory access latency, and link utilization;
generate a first measure of resource contention based on the one or more conditions being monitored; and
move a lowest priority scheduling group into a descheduled queue responsive to determining that the first measure of resource contention is greater than a first threshold, wherein:
wavefronts from scheduling groups stored in the descheduled queue are prevented from being scheduled for execution on the compute unit; and
one or more scheduling groups stored in the descheduled queue comprise wavefronts from at least two kernels of the plurality of kernels.

However, Fontenot discloses: a method comprising:
monitor one or more conditions indicative of resource contention on the compute unit, the one or more conditions comprising at least one of compute unit stall cycles, cache miss rates, memory access latency, and link utilization (see [0097]; “Counter 840 is a counter that tracks a number of cache misses in cache 822 over a period of time”. Also see “When the cache miss rate for L2 cache 624 and L3 cache 626 is low, core 610 likely is able to effectively utilize additional levels of parallelism resulting in an increase in overall computes” from [0087] and “should the cache misses as counted by counter 840 exceed the lower count value of count threshold 850” from [0099], and thus the counter 840 is a measurement of resource contention); 
generate a first measure of resource contention based on the one or more conditions being monitored (see [0097]; “Counter 840 is a counter that tracks a number of cache misses in cache 822 over a period of time”); and
reduce number of parallel execution threads responsive to determining that the first measure of resource contention is greater than a first threshold, wherein: threads from the reduced number of parallel execution threads are prevented from being scheduled for execution on the compute unit (see [0100]; “the cache misses as counted by counter 840 exceed the upper count value of count threshold 850, core 810 can remove layers of parallelism by switching from a higher SMT mode to a lower SMT mode” and “overall computes may be increased by reducing parallelism and the contention for a contested resource. In this case, the number of parallel threads could be reduced”).
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the warp scheduler in the GPU compute unit from the combination of Lee, Hsu, Yudanov and Yoaz, which schedules execution of groups of wavefronts having threads, by including a method of increasing or decreasing the number of parallel thread executions in response to different resource contention situations as taught by Fontenot, and thus the new combination system would teach some portions of the missing limitations mentioned above (note: in the new combination system, when the cache misses as counted by counter 840 for one GPU compute unit exceed the upper count value of count threshold 850, the combination system would reduce the number of parallel execution threads, i.e., reduce the number of scheduling groups. Since such a reduction reduces the parallel execution, the wavefronts from the removed scheduling groups would be prevented from being scheduled for execution on the GPU compute unit. In addition, since every scheduling group comprises wavefronts from different kernels having the same priority, the removed scheduling group would comprise wavefronts from different kernels), since it would provide a method of efficiently utilizing resources of the system by increasing the number of parallel thread executions when resource contention is low and decreasing the number of parallel thread executions when resource contention is high (see [0099]-[0100] from Fontenot).

The combination of Lee, Hsu, Yudanov, Yoaz and Fontenot does not disclose that reducing the number of scheduling groups is performed by moving a lowest priority scheduling group into a descheduled queue.
However, Sollars discloses: a method of reducing the number of allocated threads comprising: move a lowest priority allocated thread into a descheduled queue (see lines 15-29 of col. 12; “deallocates and queues the lowest priority allocated context”. In addition, “If all context level control register sets 104 have been allocated” from lines 15-29 of col. 12 also indicates that resource contention has reached a threshold level).
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the method of reducing the number of parallel thread executions when resource contention has reached a threshold level as taught by the combination of Lee, Hsu, Yudanov, Yoaz and Fontenot by including a method of moving the lowest priority allocated thread/context into a queue which holds deallocated threads/contexts when resource contention has reached a threshold level as taught by Sollars, whereby the new combination system would teach the limitations missing from the combination of Lee, Hsu, Yudanov and Yoaz, since it would provide a mechanism that deallocates or deschedules only the lowest priority threads/contexts, and thus avoids deallocating threads/contexts having higher priority (see lines 15-29 of col. 12 from Sollars).
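For illustration only, the behavior attributed to the combination system above (a contention measure compared against an upper threshold, followed by moving the lowest priority scheduling group into a descheduled queue) can be sketched as follows. This is a minimal sketch and is not taken from any cited reference: the names SchedulingGroup and deschedule_on_contention, the use of a cache miss count as the measure, and the convention that a larger number means a higher priority are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class SchedulingGroup:
    # Hypothetical record; a larger value means a higher priority (assumption).
    priority: int
    wavefronts: list = field(default_factory=list)


def deschedule_on_contention(scheduling_queue, descheduled_queue,
                             cache_misses, upper_threshold):
    """Move the lowest priority group out of the scheduling queue when the
    contention measure (here, a cache miss count) exceeds the threshold.

    Returns the group that was moved, or None if nothing was moved.
    """
    if cache_misses > upper_threshold and scheduling_queue:
        lowest = min(scheduling_queue, key=lambda g: g.priority)
        scheduling_queue.remove(lowest)
        # Groups held here are not offered to the compute unit for scheduling.
        descheduled_queue.append(lowest)
        return lowest
    return None


# Example: two groups, each holding wavefronts from two different kernels.
scheduling_queue = [
    SchedulingGroup(3, ["kernelA.wf0", "kernelB.wf0"]),
    SchedulingGroup(1, ["kernelC.wf0", "kernelD.wf0"]),
]
descheduled_queue = []
moved = deschedule_on_contention(scheduling_queue, descheduled_queue,
                                 cache_misses=120, upper_threshold=100)
```

In this example the group with priority 1 is moved, and it still carries wavefronts from two kernels while it sits in the descheduled queue.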

Regarding Claim 19, the rejection of Claim 15 is incorporated and further the combination of Lee, Hsu, Yudanov, Yoaz, Fontenot and Sollars discloses: 
wait a given amount of time after moving the lowest priority scheduling group into the descheduled queue (see Fig. 13, [0097], [0145]-[0146] from Fontenot; “Counter 840 is a counter that tracks a number of cache misses in cache 822 over a period of time”. In the combination system, after performing one action of reducing the number of parallel thread executions, i.e., moving the lowest priority scheduling group into the descheduled queue, the method would return to the step/action of monitoring the cache misses over a period of time to detect the next trigger point for performing step 1320 of Fig. 13);
generate a second measure of resource contention based on the one or more conditions being monitored (see Fig. 13, [0097], [0145]-[0146] from Fontenot and the analysis of previous limitation; “Counter 840 is a counter that tracks a number of cache misses in cache 822 over a period of time”); and
move a next lowest priority scheduling group into the descheduled queue responsive to determining the second measure of resource contention is greater than the first threshold (see lines 7-15 of 1st paragraph of 1. Introduction from Lee, [0100] from Fontenot and lines 16-29 of col. 12 from Sollars. In the combination system, when the cache misses as counted by counter 840 for one GPU compute unit exceed the upper count value of count threshold 850, the combination system would reduce the number of parallel execution threads, i.e., reduce the number of scheduling groups, by moving the current lowest priority scheduling group, i.e., the previous next lowest priority scheduling group, into the descheduled queue, and thus the wavefronts from the scheduling groups stored in the descheduled queue are prevented from being scheduled for execution on the GPU compute unit).

Claim 20 is rejected under 35 U.S.C. 103 as being unpatentable over Lee et al. (title: Improving GPGPU resource utilization through alternative thread block scheduling-recorded on IDS submitted 9/19/2019, hereafter Lee) in view of Hsu et al. (US 20160055005 A1, hereafter Hsu), Yudanov et al. (US 20160371082 A1, hereafter Yudanov), Yoaz et al. (US 6697932 B1, hereafter Yoaz), Fontenot et al. (US 20110302372 A1, hereafter Fontenot) and Sollars (US 5900025 A) and further in view of Otenko (US 20150212794 A1).
Hsu, Yudanov, Fontenot, Sollars and Otenko were cited in the previous Office action.

Regarding Claim 20, the rejection of Claim 18 is incorporated and further the combination of Lee, Yudanov, Hsu, Yoaz, Fontenot and Sollars discloses:
wait a given amount of time after moving the lowest priority scheduling group into the descheduled queue (see Fig. 13, [0097], [0145]-[0146] from Fontenot; “Counter 840 is a counter that tracks a number of cache misses in cache 822 over a period of time”. In the combination system, after performing one action of reducing the number of parallel thread executions, i.e., moving the lowest priority scheduling group into the descheduled queue, the method would return to the step/action of monitoring the cache misses over a period of time to detect the next trigger point for performing step 1320 of Fig. 13);
generate a second measure of resource contention based on the one or more conditions being monitored (see Fig. 13, [0097], [0145]-[0146] from Fontenot and the analysis of previous limitation; “Counter 840 is a counter that tracks a number of cache misses in cache 822 over a period of time”); and
increasing the number of scheduling groups responsive to determining the second measure of resource contention is less than a second threshold (see lines 7-15 of 1st paragraph of 1. Introduction from Lee and [0099] from Fontenot; “should the cache misses as counted by counter 840 exceed the lower count value of count threshold 850, core 810 can add additional layers of parallelism by switching from a lower SMT mode to a higher SMT mode” and “overall computes may be increased by increasing parallelism and the contention for a contested resource. In this case, the number of parallel threads could be increased”. In the combination system, when the cache misses as counted by counter 840 for one GPU compute unit exceed the lower count value of count threshold 850, the combination system would increase the number of parallel execution threads, i.e., increase the number of scheduling groups).

The combination of Lee, Hsu, Yudanov, Yoaz, Fontenot and Sollars does not disclose that increasing the number of scheduling groups is performed by moving a highest priority scheduling group out of the descheduled queue.
However, Otenko discloses: a method of increasing allocated tasks comprising: move a highest priority task out of the descheduled queue (see [0023]; “When the contention level is low in the system, the underlying priority queue 101 can sort the requests waiting in the priority queue 101 (with a logarithmic cost) in order to ensure that the request with the highest priority in the priority queue 101 can have the shortest waiting time. Thus, when the next consumer 120, or worker thread, is allowed to pick up a unit of work, it can pick up the unit with the highest priority”).
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the method of increasing the number of parallel thread executions when resource contention is low as taught by the combination of Lee, Hsu, Yudanov, Yoaz, Fontenot and Sollars by including a method of scheduling, as soon as possible, the highest priority task in a wait queue holding tasks to be scheduled as taught by Otenko, since it is well understood to schedule a highest priority task first instead of a lower priority task (see [0023] from Otenko. Also see lines 11-16 of right side of page 5 from Lee).
Thereby, the combination of Lee, Hsu, Yudanov, Yoaz, Fontenot, Sollars and Otenko discloses: move a highest priority scheduling group out of the descheduled queue responsive to determining the second measure of resource contention is less than a second threshold (see lines 7-15 of 1st paragraph of 1. Introduction from Lee, [0100] from Fontenot, lines 16-29 of col. 12 from Sollars and [0023] from Otenko. In the combination system, when the cache misses as counted by counter 840 for one GPU compute unit exceed the lower count value of count threshold 850, the combination system would increase the number of parallel execution threads, i.e., increase the number of scheduling groups, by moving the highest priority scheduling group out of the descheduled queue).
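For illustration only, the claimed step as recited (moving a highest priority scheduling group out of the descheduled queue when a second measure of resource contention is less than a second threshold) can be sketched as follows. This is a minimal sketch and is not taken from any cited reference: the function name reschedule_on_low_contention, the dictionary layout of a group, and the convention that a larger number means a higher priority are all assumptions.

```python
def reschedule_on_low_contention(scheduling_queue, descheduled_queue,
                                 second_measure, second_threshold):
    """Move the highest priority group back into the scheduling queue when
    the second contention measure is below the second threshold.

    Returns the group that was moved, or None if nothing was moved.
    """
    if second_measure < second_threshold and descheduled_queue:
        # Larger "priority" value means higher priority (assumption).
        highest = max(descheduled_queue, key=lambda g: g["priority"])
        descheduled_queue.remove(highest)
        scheduling_queue.append(highest)
        return highest
    return None


# Example: two groups were descheduled earlier; contention has since dropped.
scheduling_queue = [{"priority": 5, "wavefronts": ["kernelA.wf0", "kernelB.wf0"]}]
descheduled_queue = [
    {"priority": 1, "wavefronts": ["kernelC.wf0", "kernelD.wf0"]},
    {"priority": 2, "wavefronts": ["kernelE.wf0", "kernelF.wf0"]},
]
restored = reschedule_on_low_contention(scheduling_queue, descheduled_queue,
                                        second_measure=10, second_threshold=20)
```

Here the group with priority 2 is restored first, and the group with priority 1 remains descheduled until a later measurement again falls below the second threshold.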

Response to Arguments
Applicant’s arguments, filed 6/21/2022, with respect to the rejections of Claims 1-20 under 35 U.S.C. 103 have been fully considered. New grounds of rejection were made based on the amended limitations of the independent claims. Applicant’s arguments for dependent Claims 4, 11 and 18 are not moot due to the 112 (b) rejections explained above. Note: the main issue with Applicant’s arguments against the prior art rejections of Claims 4, 11 and 18 is Applicant’s statement that “A lowest priority context/thread is a thread that is already executing is no longer in a scheduling queue” in the second-to-last paragraph of page 13 of the Remarks. However, as explained in the specification objection and the 112 rejections above, it is not clear in Applicant’s invention whether the wavefront(s) from a scheduling group residing in the claimed scheduling queue are currently executing or not.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to ZHI CHEN whose telephone number is (571)272-0805.  The examiner can normally be reached on Monday-Friday 9:30AM-5PM.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Emerson Puente can be reached on (571)272-3652.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.


/Zhi Chen/
Patent Examiner, AU2196

/EMERSON C PUENTE/Supervisory Patent Examiner, Art Unit 2196