DETAILED ACTION
Claims 1-20 are pending in this application.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

The following is a quotation of the first paragraph of pre-AIA 35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.

Claims 1-20 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement. The claims contain subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention.
The current claim amendment includes “...prior to dispatching the individual wavefronts for execution...” in claims 1, 8 and 15.
Figures 5, 8, 9, 10 and 11 of the disclosure are the closest description of these claims. Although this claim limitation may be inherent, it is not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor had possession of the claimed invention. In the next response to this Office action, Applicants are advised to direct the Examiner to the paragraphs or line numbers that disclose the independent claims in general and the current claim amendment specifically.


Claims 1, 8 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. Pub. No. 2014/0152675 A1 to Martin et al. in view of U.S. Pub. No. 2012/0297163 A1 to Breternitz et al.

As to claim 1, Martin teaches a processor comprising: 
a plurality of compute units comprising circuitry configured to execute instructions (SEs 106/CU 202); and 
a dispatch unit comprising circuitry (WD 102) configured to dispatch workgroups to the plurality of compute units (SEs 106/CU 202), wherein each of one or more compute units is configured to execute instructions of an entire workgroup (“...In an embodiment, WD 102 distributes the work to other components in a graphics pipeline for parallel processing. WD 102 receives patches from a driver that include instructions for rendering primitives on a display screen. The driver receives patches from a graphics application. Once the driver receives patches from the graphics application, it uses a communication interface, such as a communication bus, to transmit patches to a graphics pipeline that begins with WD 102. In an embodiment, WD 102 divides patches into multiple work groups that are processed in parallel using multiple SEs 106...In an embodiment, to transmit work groups to SEs 106, WD 102 passes work groups to IAs 104. In an embodiment, there may be multiple IAs 104 connected to WD 102. IAs 104 divide workgroups into primitive group (also referred to as "prim groups"). IA 104 then passes the prim groups to SEs 106. In an embodiment, each IA 104 is coupled to two SEs 106. IAs 104 may also retrieve data that is manipulated using instructions in the patches, and performs other functions that prepare patches for processing using SEs 106...In another embodiment, WD 102 may distribute prim groups directly to SEs 106. In this embodiment, the functionality of IA 104 may be included in WD 102 or in SE 106. In this case, WD 102 divides a draw call into multiple prim groups and passes a prim group to SE 106 for processing. This configuration allows WD 102 to scale the number of prim groups to the number of SEs 106 that are included in the graphics pipeline...[0027] In an embodiment, SEs 106 process prim groups. For example, SEs 106 use multiple compute units to manipulate the data in each prim group so that it is displayed as objects on a display screen...” paragraphs 0024-0027);
wherein the processor is configured to: 
divide a first workgroup into individual wavefronts (“...In an embodiment, VGTs 108 begin processing each thread group from a prim group that it receives from IA 104. VGTs 108 divide thread groups into wave fronts (also referred to as "waves"), where each wave front includes a number of threads that are processed in parallel. VGT 108 then launches the waves to other components in SEs 106, such as SPI 110 and compute units, as described in detail in FIG. 2. SPI 110 associates waves or waves with different shader programs. A person skilled in the art will appreciate that a shader program is written by an application developer, in, for example, OpenGL or D3D. The shader program provides instructions to a compute unit for processing waves on a per element basis. Example shader programs are a local shader, a hull shader, and a domain shader. A local shader manipulates a position, texture coordinates, and color of each vertex in a triangle. A hull shader computes color and attributes, such as light, shadows, specular highlights, translucency, etc., for each output control point of the patch. A domain shader manipulates the surface geometry of the objects that are comprised of multiple triangles on the display screen. SPI 110 is coupled to compute units that process the wave using the associated shader. Compute units include arithmetic logic units (ALU's) that manipulate waves based on instructions provided in the shader programs...In an embodiment, VGT 108 generates waves for each thread group and launches the waves to SPI 110. For example, VGT 108 generates an LS wave for each thread group. LS waves are components in a thread group that are processed by CU 202. SPI 100 associates the LS wave with LS 204 for processing on CU 202...” paragraphs 0036/0040).
Martin is silent with reference to prior to dispatching the individual wavefronts for execution, and responsive to determining that the individual wavefronts of the first workgroup do not fit within a single compute unit based on currently available resources of the plurality of compute units determine a process for dispatching the individual wavefronts of the first workgroup to separate compute units of the plurality of compute units based on reducing resource contention among the currently available resources of the plurality of compute units.  
Breternitz teaches prior to dispatching the individual wavefronts for execution, and responsive to determining that the individual wavefronts of the first workgroup do not fit within a single compute unit based on currently available resources of the plurality of compute units determine a process for dispatching the individual wavefronts of the first workgroup to separate compute units of the plurality of compute units based on reducing resource contention among the currently available resources of the plurality of compute units (block 1008/migration/reaching a given threshold) (“...In block 1008, a given tagged migration point is reached. In one embodiment, a measurement of the utilization of a currently used OpenCL device may be performed. If the measurement indicates the utilization or performance is below a given threshold, then the associated compute kernel or compute sub-kernel may be migrated to another OpenCL device, such as a heterogeneous core with a different micro-architecture. In one embodiment, this measurement is a count of a number of currently executing work units on a SIMD core that reached an exit or return within an associated compute kernel or compute sub-kernel. Alternatively, a count of a number of disabled computation units in a wavefront may provide the same number. If this count is above a given threshold, then the work units that have not yet reached an exit point may be migrated to another heterogeneous core, such as a general-purpose core. Then the wavefront on the SIMD core may be released and is available for other scheduled work units...In other embodiments, the above technique may be extended to initiate migrations at any situation in which it is determined that a large fraction of the parallel executing work units in a wavefront on a SIMD core are idle and the remaining work units are expected to continue substantial execution. 
For example, the generated...If execution efficiency is not determined to be below a given threshold (conditional block 1010), then control flow of method 1000 returns to block 1006 and execution continues. If execution efficiency is determined to be below a given threshold (conditional block 1010), then in block 1012, one or more work units are identified to migrate to a second processor core with a micro-architecture different from a micro-architecture of the first processor core. The identified work units may have caused the above measurement to be below the given threshold. In block 1014, the associated local data produced by the first processor core is promoted to global data. In block 1016, the compiled versions of the migrated work units are scheduled to be executed on the second processor core beginning at the migration tagged point...” paragraphs 0076-0078).
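For illustration only, the limitation at issue in claim 1 (a fit check performed before dispatch, followed by a split across compute units) can be sketched as below. This is a minimal sketch under stated assumptions, not the implementation of Martin, Breternitz, or the application: the names `ComputeUnit`, `free_slots`, and `plan_dispatch` are hypothetical, and "currently available resources" is reduced to a single assumed count of free wavefront slots per compute unit.

```python
from dataclasses import dataclass


@dataclass
class ComputeUnit:
    cu_id: int
    free_slots: int  # hypothetical resource: free wavefront slots on this CU


def plan_dispatch(num_wavefronts, compute_units):
    """Decide, before any wavefront is dispatched, where each wavefront goes.

    Returns a list of (wavefront_index, cu_id) pairs, or None when the
    workgroup cannot be placed at all with the currently free slots.
    """
    # Case 1: the entire workgroup fits within a single compute unit.
    for cu in compute_units:
        if cu.free_slots >= num_wavefronts:
            return [(w, cu.cu_id) for w in range(num_wavefronts)]

    # Case 2: no single CU can hold the workgroup, so split it.
    if sum(cu.free_slots for cu in compute_units) < num_wavefronts:
        return None

    remaining = {cu.cu_id: cu.free_slots for cu in compute_units}
    plan = []
    for w in range(num_wavefronts):
        # Assumed contention heuristic: prefer the CU with the most free
        # slots; ties go to the lowest CU id.
        target = max(remaining, key=lambda cid: (remaining[cid], -cid))
        plan.append((w, target))
        remaining[target] -= 1
    return plan
```

The point of the sketch is the ordering: the whole plan is computed before dispatch, which is the timing the amended limitation recites. Any real design would weigh more resources than a single slot count.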


As to claim 8, see the rejection of claims 1 and 15, except for a processor and a memory.
Martin teaches a processor and a memory (“...The embodiments are also directed to computer program products comprising software stored on any computer-usable medium. Such software, when executed in one or more data processing devices, causes a data processing device(s) to operate as described herein or, as noted above, allows for the synthesis and/or manufacture of computing devices (e.g., ASICs, or processors) to perform embodiments described herein. Embodiments employ any computer-usable or -readable medium, known now or in the future. Examples of computer-usable mediums include, but are not limited to, primary storage devices (e.g., any type of random access memory), secondary storage devices (e.g., hard drives, floppy disks, CD ROMS, ZIP...”).

Claims 2, 9 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. Pub. No. 2014/0152675 A1 to Martin et al. in view of U.S. Pub. No. 2012/0297163 A1 to Breternitz et al. as applied to claims 1, 8 and 15 above, and further in view of U.S. Pub. No. 2018/0082470 A1 to Nijasure et al.

As to claim 2, Martin as modified by Breternitz teaches the processor as recited in claim 1, however it is silent with reference to wherein dividing the first workgroup into individual wavefronts for dispatch from the dispatch unit to separate compute units comprises: dispatching a first wavefront of the first workgroup to a first compute unit; and dispatching a second wavefront of the first workgroup to a second compute unit, wherein: the second wavefront is different from the first wavefront; the second compute unit is different from the first compute unit; and at least one of the first compute unit and the second compute unit concurrently executes a wavefront of a second workgroup different from the first workgroup (Application Serial No. 15/965,231, filed April 27, 2018).

Nijasure teaches wherein dividing the first workgroup into individual wavefronts for dispatch from the dispatch unit to separate compute units comprises: dispatching a first wavefront of the first workgroup to a first compute unit; and dispatching a second wavefront of the first workgroup to a second compute unit, wherein: the second wavefront is different from the first wavefront; the second compute unit is different from the first compute unit; and
at least one of the first compute unit and the second compute unit concurrently executes a wavefront of a second workgroup different from the first workgroup (“...The basic unit of execution in shader engines 132 is a work-item. Each work-item represents a single instantiation of a shader program that is to be executed in parallel in a particular lane. Work-items can be executed simultaneously as a "wavefront" on a single SIMD unit 138. Multiple wavefronts may be included in a "work group," which includes a collection of work-items designated to execute the same program. A work group can be executed by executing each of the wavefronts that make up the work group. The wavefronts may be executed sequentially on a single SIMD unit 138 or partially or fully in parallel on different SIMD units 138. Wavefronts can be thought of as instances of parallel execution of a shader program, where each wavefront includes multiple work-items that execute simultaneously on a single SIMD unit 138 in line with the SIMD paradigm (e.g., one instruction control unit executing the same stream of instructions with multiple data)...A scheduler 136 is configured to perform operations related to scheduling various wavefronts on different shader engines 132 and SIMD units 138. Wavefront bookkeeping 204 inside scheduler 136 stores data for pending wavefronts, which are wavefronts that have launched and are either executing or "asleep" (e.g., waiting to execute or not currently executing for some other reason). In addition to identifiers identifying pending wavefronts, wavefront bookkeeping 204 also stores indications of resources used by each wavefront, including registers such as vector registers 206 and/or scalar registers 208, portions of a local data store memory 212 assigned to a wavefront, portions of a memory 210 not local to any particular shader engine 132, or other resources assigned to the wavefront...” paragraphs 0022/0023).


As to claims 9 and 16, see rejection of claim 2 above.

Claims 3, 10 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. Pub. No. 2014/0152675 A1 to Martin et al. in view of U.S. Pub. No. 2012/0297163 A1 to Breternitz et al. as applied to claims 1, 8 and 15 above, and further in view of U.S. Pub. No. 2013/0215117 A1 to Glaister et al.

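As to claim 3, Martin as modified by Breternitz teaches the processor as recited in claim 1, however it is silent with reference to wherein the processor further comprises a scoreboard, and wherein the processor is further configured to: allocate an entry in the scoreboard to track wavefronts of the first workgroup; track, in the entry, a number of wavefronts of the first workgroup which have reached a given barrier; and send a signal to two or more compute units to allow wavefronts of the first workgroup to proceed when the number of wavefronts of the first workgroup which have reached the given barrier is equal to a total number of wavefronts in the first workgroup.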
Glaister teaches wherein the processor further comprises a scoreboard, and wherein the processor is further configured to: allocate an entry in the scoreboard to track wavefronts of the first workgroup; track, in the entry, a number of wavefronts of the first workgroup which have reached a given barrier; and send a signal to two or more compute units to allow wavefronts of the first workgroup to proceed when the number of wavefronts of the first workgroup which have reached the given barrier is equal to a total number of wavefronts in the first workgroup (“...While the efficiency of scalar shader code is important, discussion herein relates to efficiently mapping onto CPUs (as opposed to GPUs) the parallelism found in compute shaders. Compute shaders may expose parallelism in different ways. For example, the Direct Compute.TM. Dispatch call defines a grid of thread blocks to expose parallelism on a coarse level, which is trivial to map onto CPU threads. Each thread block is an instance of a compute shader program that is executed by multiple shader threads (a shader is...The threads of each thread block may be synchronized via barriers to enable accesses to shared memory without concern for data-race conditions arising. GPUs typically execute compute shaders via hardware thread-contexts, in groups of threads (warps or wave-fronts), and each context may legally execute the program until it encounters a barrier, at which point the context must wait for all other contexts to reach the same barrier. Hardware context switching in GPUs is fast and heavily pipelined. However, CPUs do not have such hardware support, which makes it difficult to efficiently execute compute shaders on CPUs...Notice that any barrier must execute in uniform control flow (UCF) (all threads execute the statement). In other words, all threads of a thread block must reach the barrier in a correct program. Therefore, "if(c1)" in the example above must be a uniform transfer, and it is sufficient to check only one instance, e.g., c1 instance of thread 0--c1[0]...” paragraphs 0003/0018).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Martin

As to claims 10 and 17, see the rejection of claim 3 above.
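For illustration only, the scoreboard limitation of claim 3 can be sketched as a per-workgroup entry that counts wavefronts arriving at a barrier and releases them when the count equals the workgroup's total. This is a minimal sketch under stated assumptions, not the implementation of Glaister or of the application: the class `BarrierScoreboard`, its field names, and the convention of returning the compute unit ids to signal are all hypothetical.

```python
class BarrierScoreboard:
    """Hypothetical scoreboard with one entry per tracked workgroup."""

    def __init__(self):
        self.entries = {}

    def allocate(self, workgroup_id, total_wavefronts, cu_ids):
        # Allocate an entry to track the wavefronts of one workgroup.
        self.entries[workgroup_id] = {
            "total": total_wavefronts,  # wavefronts in the workgroup
            "arrived": 0,               # wavefronts that reached the barrier
            "cus": set(cu_ids),         # compute units holding those wavefronts
        }

    def arrive(self, workgroup_id):
        """Record one wavefront reaching the barrier.

        Returns the sorted compute unit ids to signal once every wavefront
        of the workgroup has arrived; otherwise returns an empty list.
        """
        entry = self.entries[workgroup_id]
        entry["arrived"] += 1
        if entry["arrived"] == entry["total"]:
            entry["arrived"] = 0  # reset so the entry can serve the next barrier
            return sorted(entry["cus"])
        return []
```

A workgroup of three wavefronts spread over two compute units would return an empty list on the first two arrivals and both compute unit ids on the third.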

Claims 4, 11 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. Pub. No. 2014/0152675 A1 to Martin et al. in view of U.S. Pub. No. 2012/0297163 A1 to Breternitz et al. and further in view of U.S. Pub. No. 2013/0215117 A1 to Glaister et al. as applied to claims 3, 10 and 17 above, and further in view of U.S. Pub. No. 2014/0181467 A1 to Rogers et al.

As to claim 4, Martin as modified by Breternitz and Glaister teaches the processor as recited in claim 3, however it is silent with reference to wherein the two or more compute units are identified by a compute unit mask field in the entry.
Rogers teaches wherein the two or more compute units are identified by a compute unit mask field in the entry (“...The execution mask of the SIMD array 121 is overridden at block 216. For example, software code may be generated to override the execution mask. Overriding the execution mask enables certain lanes 123 of the SIMD array 121. For example, an instruction may be included to set or clear a bit of the execution mask that indicates whether the lane associated with the bit will execute the current instruction. When the override portion of the code has completed, the execution mask may revert back to the status of the execution mask when the override portion was entered. Accordingly, a programmer may effectively take control of all of the execution resources of the machine when the programmer knows that the parallel nature of the hardware would improve execution of the software....” paragraph 0052).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Martin, Breternitz and Glaister with the teaching of Rogers because the teaching of Rogers would improve the system of Martin, Breternitz and Glaister by providing a technique to set or clear a bit of the execution mask that indicates whether the lane associated with the bit will execute the current instruction (Rogers paragraph 0052).

As to claims 11 and 18, see the rejection of claim 4 above.
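For illustration only, a "compute unit mask field" as recited in claim 4 can be read as a bit field in which bit i set means compute unit i holds a wavefront of the tracked workgroup. This is a minimal sketch under that assumed encoding; the helper names `cu_mask_from_ids` and `cu_ids_from_mask` are hypothetical and are not taken from Rogers or from the application.

```python
def cu_mask_from_ids(cu_ids):
    """Build a mask with one bit set per compute unit id (assumed encoding)."""
    mask = 0
    for cu_id in cu_ids:
        mask |= 1 << cu_id
    return mask


def cu_ids_from_mask(mask):
    """Recover the compute unit ids identified by the mask, in ascending order."""
    ids = []
    bit = 0
    while mask:
        if mask & 1:
            ids.append(bit)
        mask >>= 1
        bit += 1
    return ids
```

Under this encoding, compute units 0 and 3 give the mask 0b1001, and decoding that mask yields the same two ids.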
Claims 5, 6, 12, 13, 19 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. Pub. No. 2014/0152675 A1 to Martin et al. in view of U.S. Pub. No. 2012/0297163 A1 to Breternitz et al. as applied to claims 1, 8 and 15 above, and further in view of U.S. Pat. No. 9,189,282 B2 issued to Conte et al.

As to claim 5, Martin as modified by Breternitz teaches the processor as recited in claim 1, however it is silent with reference to monitor a plurality of performance counters to track resource contention among the plurality of compute units; calculate a load-rating for each compute unit and each resource based on the plurality of performance counters; and determine how to allocate wavefronts of the first workgroup to the plurality of compute units based on calculated load-ratings.  
Conte teaches monitor a plurality of performance counters to track resource contention among the plurality of compute units; calculate a load-rating for each compute unit and each resource based on the plurality of performance counters; and determine how to allocate wavefronts of the first workgroup to the plurality of compute units based on calculated load-ratings (“...FIG. 6a is a schematic illustration of a system for performing methods for multi-core thread mapping in accordance with the present disclosure. As shown in FIG. 6a, a computer system 600 may include a processor 605 configured for performing an example of a method for mapping threads to execution to processor cores. In other examples, various operations or portions of various operations of the method may be performed outside of the processor 605. In operation 602, the method may include executing at least one software application program resulting in at least one thread of execution. In operation 603, the method may include collecting data relating to the performance of the multi-core processor using a performance counter. In operation 604, the method may include using a core controller to map the thread of execution to a processor core based at least in part on the data collected by the performance counter...FIG. 6b is a schematic illustration of a computer accessible medium having stored thereon computer executable instructions for performing a procedure for mapping threads of execution to processor cores in a multi-core processing system. As shown in FIG. 6b, a computer accessible medium 600 may have stored thereon computer accessible instructions 605 configured for performing an example procedure for mapping threads to execution to processor cores. 
In operation 602, the procedure may include executing at least one software application program resulting in at...In operation 603, the procedure may include collecting data relating to the performance of the multi-core processor using a performance counter. In operation 604, the procedure may include using a core controller to map the thread of execution to a processor core based at least in part on the data collected by the performance counter...” Col. 5 Ln. 57 – 67, Col. 6 Ln. 1 – 21).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Martin and Breternitz with the teaching of Conte because the teaching of Conte would improve the system of Martin and Breternitz by providing a technique of mapping threads of execution to processor cores.

As to claims 12 and 19, see the rejection of claim 5 above.

As to claim 6, Martin as modified by Breternitz teaches the processor as recited in claim 5, however it is silent with reference to wherein the processor is further configured to select a first compute unit as a candidate for dispatch responsive to determining the first compute unit has a lowest load-rating among the plurality of compute units for a first resource.
Conte teaches wherein the processor is further configured to select a first compute unit as a candidate for dispatch responsive to determining the first compute unit has a lowest load-rating among the plurality of compute units for a first resource (“...In operation 603, the method may include collecting data relating to the performance of the multi-core processor using a performance counter. In operation 604, the method may include using a core controller to map the thread of execution to a processor core based at least in part on the data collected by the performance counter...FIG. 6b is a schematic illustration of a computer accessible medium having stored thereon computer executable instructions for performing a procedure for mapping threads of execution to processor cores in a multi-core processing system. As shown in FIG. 6b, a computer...In operation 603, the procedure may include collecting data relating to the performance of the multi-core processor using a performance counter. In operation 604, the procedure may include using a core controller to map the thread of execution to a processor core based at least in part on the data collected by the performance counter...” Col. 5 Ln. 57 – 67, Col. 6 Ln. 1 – 21).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Martin and Breternitz with the teaching of Conte because the teaching of Conte would improve the system of Martin and Breternitz by providing a technique of mapping threads of execution to processor cores.

As to claims 13 and 20, see the rejection of claim 6 above.
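For illustration only, the load-rating limitations of claims 5 and 6 can be sketched as a per-compute-unit, per-resource ratio derived from performance counters, with the lowest-rated compute unit chosen as the dispatch candidate for a given resource. This is a minimal sketch under stated assumptions, not the implementation of Conte or of the application: the functions `load_ratings` and `select_candidate`, the resource names, and the definition of a load-rating as counter value divided by an assumed capacity are all hypothetical.

```python
def load_ratings(counters, capacities):
    """Compute a load-rating for each compute unit and each resource.

    counters:   {cu_id: {resource: observed counter value}}
    capacities: {resource: assumed maximum counter value}
    """
    return {
        cu: {res: counters[cu][res] / capacities[res] for res in capacities}
        for cu in counters
    }


def select_candidate(ratings, resource):
    """Pick the compute unit with the lowest load-rating for one resource.

    Ties are broken in favor of the lowest compute unit id.
    """
    return min(ratings, key=lambda cu: (ratings[cu][resource], cu))
```

With hypothetical counters in which compute unit 0 is busy on one resource and compute unit 1 is busy on another, the selected candidate differs by resource, which is the behavior claim 6 recites for "a first resource".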


Claims 7 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. Pub. No. 2014/0152675 A1 to Martin et al. in view of U.S. Pub. No. 2012/0297163 A1 to Breternitz et al. and further in view of U.S. Pat. No. 9,189,282 B2 issued to Conte et al. as applied to claim 5 above, and further in view of U.S. Pub. No. 2017/0031719 A1 to Clark et al.

As to claim 7, Martin as modified by Breternitz and Conte teaches the processor as recited in claim 5, however it is silent with reference to wherein the plurality of performance counters track two or more of vector arithmetic logic unit (VALU) execution bandwidth, scalar ALU (SALU) execution bandwidth, local data share (LDS) bandwidth, Load Store Bus bandwidth, Vector Register File (VRF) bandwidth, Scalar Register File (SRF) bandwidth, cache subsystem capacity, cache bandwidth, and translation lookaside buffer (TLB) bandwidth.  
Clark teaches wherein the plurality of performance counters track two or more of vector arithmetic logic unit (VALU) execution bandwidth, scalar ALU (SALU) execution bandwidth, local data share (LDS) bandwidth, Load Store Bus bandwidth, Vector Register File (VRF) bandwidth, Scalar Register File (SRF) bandwidth, cache subsystem capacity, cache bandwidth, and translation lookaside buffer (TLB) bandwidth (Cache Bandwidth 450D, Cache Capacity 450E) (“...Performance counter values 450 may include a plurality of values associated with a given guest VM or a given vCPU of a guest VM, depending on the embodiment. A guest VM may include a plurality of performance counter values 450, and the guest VM may include a different set of performance counter values for a plurality of vCPUs in use by the guest VM. In various embodiments, the performance counter values 450 may include one or more of CPU time 450A, instructions retired 450B, floating point operations (FLOPs) 450C, cache bandwidth 450D, cache capacity 450E, memory bandwidth 450F, memory capacity 450G, I/O bandwidth 450H, and/or fixed function IP usage 450J...” paragraph 0052).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Martin, Breternitz and Conte with the teaching of Clark because the teaching of Clark would improve the system of Martin, Breternitz and Conte by providing a set of special-purpose registers built into modern microprocessors to store the counts of hardware-related activities within computer systems.

As to claim 14, see the rejection of claim 7 above.

Response to Arguments
Applicant's arguments filed 01/25/22 have been fully considered but they are not persuasive. 
Applicants argued in substance that the Breternitz prior art does not teach “...prior to dispatching the individual wavefronts for execution...”.
The Breternitz prior art discloses a system and method for automatically migrating the execution of work units between multiple heterogeneous cores. The multiple heterogeneous cores include a first processor core with a single instruction multiple data micro-architecture and a second processor core with a general-purpose micro-architecture. It includes an OS scheduler for scheduling work units/wavefronts on the multiple heterogeneous cores. In response to receiving an indication that a condition for migration is satisfied, the OS scheduler moves the live values to a location indicated by the data structure for access by the second processor core and schedules code after the given location to the second processor core. When a migration point or condition is reached, work-item execution transfers to a different heterogeneous core of the multiple heterogeneous cores.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
U.S. Pub. No. 2019/0034151 A1 to Dutu et al. is directed to scheduling and executing workgroups on a plurality of compute units.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to CHARLES E ANYA whose telephone number is (571)272-3757. The examiner can normally be reached on Mon-Fri, 9-6pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/CHARLES E ANYA/Primary Examiner, Art Unit 2194