DETAILED ACTION
Claims 1-20 are pending in this application.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 8 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. Pub. No. 2014/0152675 A1 to Martin et al. in view of U.S. Pub. No. 2012/0297163 A1 to Breternitz et al.

As to claim 1, Martin teaches a processor comprising: 
a plurality of compute units comprising circuitry configured to execute instructions (SEs 106/CU 202); and 
a dispatch unit comprising circuitry (WD 102) configured to dispatch workgroups to the plurality of compute units (SEs 106/CU 202), wherein each of one or more compute units is configured to execute instructions of an entire workgroup (“...In an embodiment, WD 102 distributes the work to other components in a graphics pipeline for parallel processing. WD 102 receives patches from a driver that include instructions for rendering primitives on a display screen. The driver receives patches from a graphics application. Once the driver receives patches from the graphics application, it uses a communication interface, such as a communication bus, to transmit patches to a graphics pipeline that begins with WD 102. In an embodiment, WD 102 divides patches into multiple work groups that are processed in parallel using multiple SEs 106...In an embodiment, to transmit work groups to SEs 106, WD 102 passes work groups to IAs 104. In an embodiment, there may be multiple IAs 104 connected to WD 102. IAs 104 divide workgroups into primitive group (also referred to as "prim groups"). IA 104 then passes the prim groups to SEs 106. In an embodiment, each IA 104 is coupled to two SEs 106. IAs 104 may also retrieve data that is manipulated using instructions in the patches, and performs other functions that prepare patches for processing using SEs 106...In another embodiment, WD 102 may distribute prim groups directly to SEs 106. In this embodiment, the functionality of IA 104 may be included in WD 102 or in SE 106. In this case, WD 102 divides a draw call into multiple prim groups and passes a prim group to SE 106 for processing. This configuration allows WD 102 to scale the number of prim groups to the number of SEs 106 that are included in the graphics pipeline...[0027] In an embodiment, SEs 106 process prim groups. For example, SEs 106 use multiple compute units to manipulate the data in each prim group so that it is displayed as objects on a display screen...” paragraphs 0024-0027);
wherein the processor is configured to: 
divide a first workgroup into individual wavefronts (“...In an embodiment, VGTs 108 begin processing each thread group from a prim group that it receives from IA 104. VGTs 108 divide thread groups into wave fronts (also referred to as "waves"), where each wave front includes a number of threads that are processed in parallel. VGT 108 then launches the waves to other components in SEs 106, such as SPI 110 and compute units, as described in detail in FIG. 2. SPI 110 associates waves or waves with different shader programs. A person skilled in the art will appreciate that a shader In an embodiment, VGT 108 generates waves for each thread group and launches the waves to SPI 110. For example, VGT 108 generates an LS wave for each thread group. LS waves are components in a thread group that are processed by CU 202. SPI 100 associates the LS wave with LS 204 for processing on CU 202...” paragraphs 0036/0040).
Martin is silent with reference to the following limitation: responsive to determining that the individual wavefronts of the first workgroup do not fit within a single compute unit based on currently available resources of the plurality of compute units, determine a process for dispatching the individual wavefronts of the first workgroup to separate compute units of the plurality of compute units based on reducing resource contention among the currently available resources of the plurality of compute units.
Breternitz teaches responsive to determining that the individual wavefronts of the first workgroup do not fit within a single compute unit based on currently available resources of the plurality of compute units determine a process for dispatching the individual wavefronts of the first workgroup to separate compute units of the plurality of compute units based on reducing resource contention among the currently available resources of the plurality of compute units (block 1008/migration) (“...In block 1008, a given tagged migration point is reached. In one embodiment, a measurement of the utilization of a currently used OpenCL device may be performed. If the measurement indicates the utilization or performance is below a given threshold, then the associated compute kernel or compute sub-kernel may be migrated to another OpenCL device, such as a heterogeneous core with a different micro-architecture. In one embodiment, this measurement is a count of a number of currently executing work units on a SIMD core that reached an exit or return within an associated compute kernel or compute sub-kernel. Alternatively, a count of a number of disabled computation units in a wavefront may provide the same number. If this count is above a given threshold, then the work units that have not yet reached an exit point may be migrated to another heterogeneous core, such as a general-purpose core. Then the wavefront on the SIMD core may be released and is available for other scheduled work units...In other embodiments, the above technique may be extended to initiate migrations at any situation in which it is determined that a large fraction of the parallel executing work units in a wavefront on a SIMD core are idle and the remaining work units are expected to continue substantial execution. For example, the generated data structures may be in shared memory and in one or more caches. 
In a system with virtual memory support, a subset of the work units may hit the cache whereas the remaining work units experience virtual memory misses, which are long latency events. In this case, overall computing performance may be better with continued execution on a general-purpose core since further execution may benefit from prefetching techniques enabled by the current execution...If execution efficiency is not determined to be below a given threshold (conditional block 1010), then control flow of method 1000 returns to block 1006 and execution continues. If execution efficiency is determined to be below a given threshold (conditional block 1010), then in block 1012, one or more work units are identified to migrate to a second processor core with a micro-architecture different from a micro-architecture of the first processor core. The identified work units may have caused the above measurement to be below the given threshold. In block 1014, the associated local data produced by the first processor core is promoted to global data. In block 1016, the compiled versions of the migrated work units are scheduled to be executed on the second processor core beginning at the migration tagged point...” paragraphs 0076-0078).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Martin with the teaching of Breternitz because the teaching of Breternitz would improve the system of Martin by providing a technique (load balancing) of distributing a set of tasks over a set of resources, with the aim of making their overall processing more efficient.
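
For illustration only, the dispatch limitation discussed above can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from the application, Martin, or Breternitz: the names ComputeUnit, free_slots and dispatch_workgroup are hypothetical, and "reducing resource contention" is approximated here by always choosing the compute unit with the most free wavefront slots.

```python
from dataclasses import dataclass


@dataclass
class ComputeUnit:
    """Hypothetical model of a compute unit with a count of free wavefront slots."""
    cu_id: int
    free_slots: int


def dispatch_workgroup(num_wavefronts, compute_units):
    """Return a list of (wavefront_index, cu_id) assignments.

    If one compute unit can hold the whole workgroup, keep it together.
    Otherwise place each wavefront on whichever unit currently has the most
    free slots (an assumed stand-in for "reducing resource contention").
    Raises RuntimeError if the units together cannot hold the workgroup.
    """
    # Case 1: the entire workgroup fits within a single compute unit.
    for cu in compute_units:
        if cu.free_slots >= num_wavefronts:
            cu.free_slots -= num_wavefronts
            return [(wf, cu.cu_id) for wf in range(num_wavefronts)]

    # Case 2: no single unit fits, so the workgroup is split across units.
    if sum(cu.free_slots for cu in compute_units) < num_wavefronts:
        raise RuntimeError("insufficient resources for workgroup")

    assignments = []
    for wf in range(num_wavefronts):
        # max() returns the first unit among ties, so results are deterministic.
        target = max(compute_units, key=lambda cu: cu.free_slots)
        target.free_slots -= 1
        assignments.append((wf, target.cu_id))
    return assignments


# Example: four wavefronts, and neither unit can hold all four.
cus = [ComputeUnit(0, 2), ComputeUnit(1, 3)]
plan = dispatch_workgroup(4, cus)
```

In this example the four wavefronts alternate between the two units, because neither unit alone has four free slots.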


Martin teaches a processor and a memory (The embodiments are also directed to computer program products comprising software stored on any computer-usable medium. Such software, when executed in one or more data processing devices, causes a data processing device(s) to operate as described herein or, as noted above, allows for the synthesis and/or manufacture of computing devices (e.g., ASICs, or processors) to perform embodiments described herein. Embodiments employ any computer-usable or -readable medium, known now or in the future. Examples of computer-usable mediums include, but are not limited to, primary storage devices (e.g., any type of random access memory), secondary storage devices (e.g., hard drives, floppy disks, CD ROMS, ZIP disks, tapes, magnetic storage devices, optical storage devices, MEMS, nano-technological storage devices, etc.), and communication mediums (e.g., wired and wireless communications networks, .

Claims 2, 9 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. Pub. No. 2014/0152675 A1 to Martin et al. in view of U.S. Pub. No. 2012/0297163 A1 to Breternitz et al. as applied to claims 1, 8 and 15 above, and further in view of U.S. Pub. No. 2018/0082470 A1 to Nijasure et al.

As to claim 2, Martin as modified by Breternitz teaches the processor as recited in claim 1, however it is silent with reference to wherein dividing the first workgroup into individual wavefronts for dispatch from the dispatch unit to separate compute units comprises: dispatching a first wavefront of the first workgroup to a first compute unit; and dispatching a second wavefront of the first workgroup to a second compute unit, wherein: the second wavefront is different from the first wavefront; the second compute unit is different from the first compute unit; and
at least one of the first compute unit and the second compute unit concurrently executes a wavefront of a second workgroup different from the first workgroup.
Nijasure teaches wherein dividing the first workgroup into individual wavefronts for dispatch from the dispatch unit to separate compute units comprises: dispatching a first wavefront of the first workgroup to a first compute unit; and dispatching a second wavefront of the first workgroup to a second compute unit, wherein: the second wavefront is different from the first wavefront; the second compute unit is different from the first compute unit; and 2/19Application Serial No. 15/965,23 1 Filed April 27, 2018 
at least one of the first compute unit and the second compute unit concurrently executes a wavefront of a second workgroup different from the first workgroup (“...The basic unit of execution in shader engines 132 is a work-item. Each work-item represents a single instantiation of a shader program that is to be Multiple wavefronts may be included in a "work group," which includes a collection of work-items designated to execute the same program. A work group can be executed by executing each of the wavefronts that make up the work group. The wavefronts may be executed sequentially on a single SIMD unit 138 or partially or fully in parallel on different SIMD units 138. Wavefronts can be thought of as instances of parallel execution of a shader program, where each wavefront includes multiple work-items that execute simultaneously on a single SIMD unit 138 in line with the SIMD paradigm (e.g., one instruction control unit executing the same stream of instructions with multiple data)...A scheduler 136 is configured to perform operations related to scheduling various wavefronts on different shader engines 132 and SIMD units 138. Wavefront bookkeeping 204 inside scheduler 136 stores data for pending wavefronts, which are wavefronts that have launched and are either executing or "asleep" (e.g., waiting to execute or not currently executing for some other reason). In addition to identifiers identifying pending wavefronts, wavefront bookkeeping 204 also stores indications of resources used by each wavefront, including registers such as vector registers 206 and/or scalar registers 208, portions of a local data store memory 212 assigned to a wavefront, portions of a memory 210 not local to any particular shader engine 132, or other resources assigned to the wavefront...” paragraphs 0022/0023).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Martin and Breternitz with the teaching of Nijasure because the teaching of Nijasure would improve the system of Martin and Breternitz by providing a technique of simultaneously executing tasks on computing units to allow for optimal use of computing resources.


Claims 3, 10 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. Pub. No. 2014/0152675 A1 to Martin et al. in view of U.S. Pub. No. 2012/0297163 A1 to Breternitz et al. as applied to claims 1, 8 and 15 above, and further in view of U.S. Pub. No. 2013/0215117 A1 to Glaister et al.

As to claim 3, Martin as modified by Breternitz teaches the processor as recited in claim 1, however it is silent with reference to wherein the processor further comprises a scoreboard, and wherein the processor is further configured to: allocate an entry in the scoreboard to track wavefronts of the first workgroup; track, in the entry, a number of wavefronts of the first workgroup which have reached a given barrier; and send a signal to two or more compute units to allow wavefronts of the first workgroup to proceed when the number of wavefronts of the first workgroup which have reached the given barrier is equal to a total number of wavefronts in the first workgroup.
Glaister teaches wherein the processor further comprises a scoreboard, and wherein the processor is further configured to: allocate an entry in the scoreboard to track wavefronts of the first workgroup; track, in the entry, a number of wavefronts of the first workgroup which have reached a given barrier; and send a signal to two or more compute units to allow wavefronts of the first workgroup to proceed when the number of wavefronts of the first workgroup which have reached the given barrier is equal to a total number of wavefronts in the first workgroup (“...While the efficiency of scalar shader code is important, discussion herein relates to efficiently mapping onto CPUs (as opposed to GPUs) the parallelism found in compute shaders. Compute shaders may expose parallelism in different ways. For example, the Direct Compute.TM. Dispatch call defines a grid of thread blocks to expose parallelism on a coarse level, which is trivial to map onto CPU threads. Each thread block is an instance of a compute The threads of each thread block may be synchronized via barriers to enable accesses to shared memory without concern for data-race conditions arising. GPUs typically execute compute shaders via hardware thread-contexts, in groups of threads (warps or wave-fronts), and each context may legally execute the program until it encounters a barrier, at which point the context must wait for all other contexts to reach the same barrier. Hardware context switching in GPUs is fast and heavily pipelined. However, CPUs do not have such hardware support, which makes it difficult to efficiently execute compute shaders on CPUs...Notice that any barrier must execute in uniform control flow (UCF) (all threads execute the statement). In other words, all threads of a thread block must reach the barrier in a correct program. Therefore, "if(c1)" in the example above must be a uniform transfer, and it is sufficient to check only one instance, e.g., c1 instance of thread 0--c1[0]...” paragraphs 0003/0018).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Martin and Breternitz with the teaching of Glaister because the teaching of Glaister would improve the system of Martin and Breternitz by providing a technique of executing tasks that complete seamlessly.
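
For illustration only, the scoreboard limitation of claim 3 can be sketched as follows. This is a minimal sketch under stated assumptions and does not reproduce any structure disclosed in the application or in the cited references: the class BarrierScoreboard, its entry layout, and the method names allocate and arrive are hypothetical.

```python
class BarrierScoreboard:
    """Hypothetical scoreboard holding one entry per tracked workgroup."""

    def __init__(self):
        self.entries = {}

    def allocate(self, workgroup_id, total_wavefronts, compute_unit_ids):
        """Allocate an entry that tracks the wavefronts of one workgroup."""
        self.entries[workgroup_id] = {
            "total": total_wavefronts,       # wavefronts in the workgroup
            "arrived": 0,                    # wavefronts that reached the barrier
            "cus": set(compute_unit_ids),    # units hosting those wavefronts
        }

    def arrive(self, workgroup_id):
        """Record one wavefront reaching the barrier.

        Returns the set of compute units to signal once every wavefront has
        arrived, or an empty set while wavefronts are still outstanding.
        """
        entry = self.entries[workgroup_id]
        entry["arrived"] += 1
        if entry["arrived"] == entry["total"]:
            entry["arrived"] = 0  # reset the count for the next barrier
            return set(entry["cus"])
        return set()


# Example: a workgroup of three wavefronts spread over compute units 0 and 2.
sb = BarrierScoreboard()
sb.allocate(workgroup_id=7, total_wavefronts=3, compute_unit_ids=[0, 2])
signals = [sb.arrive(7) for _ in range(3)]
```

Only the third arrival returns a non-empty set, which corresponds to the point at which the count of arrived wavefronts equals the total number of wavefronts in the workgroup.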

As to claims 10 and 17, see the rejection of claim 3 above.

Claims 4, 11 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. Pub. No. 2014/0152675 A1 to Martin et al. in view of U.S. Pub. No. 2012/0297163 A1 to Breternitz et al. and further in view of U.S. Pub. No. 2013/0215117 A1 to Glaister et al. as applied to claims 3 and 17 above, and further in view of U.S. Pub. No. 2014/0181467 A1 to Rogers et al.

As to claim 4, Martin as modified by Breternitz teaches the processor as recited in claim 3, however it is silent with reference to wherein the two or more compute units are identified by a compute unit mask field in the entry.  
Rogers teaches wherein the two or more compute units are identified by a compute unit mask field in the entry (“...The execution mask of the SIMD array 121 is overridden at block 216. For example, software code may be generated to override the execution mask. Overriding the execution mask enables certain lanes 123 of the SIMD array 121. For example, an instruction may be included to set or clear a bit of the execution mask that indicates whether the lane associated with the bit will execute the current instruction. When the override portion of the code has completed, the execution mask may revert back to the status of the execution mask when the override portion was entered. Accordingly, a programmer may effectively take control of all of the execution resources of the machine when the programmer knows that the parallel nature of the hardware would improve execution of the software....” paragraph 0052).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Martin, Breternitz and Glaister with the teaching of Rogers because the teaching of Rogers would improve the system of Martin, Breternitz and Glaister by providing a technique to set or clear a bit of the execution mask that indicates whether the lane associated with the bit will execute the current instruction (Rogers paragraph 0052).
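
For illustration only, a mask field that identifies two or more compute units can be sketched as a bit field in which bit i stands for compute unit i. This is a minimal sketch under that assumed encoding; the function names cu_ids_to_mask and mask_to_cu_ids are hypothetical and are not drawn from the application or from Rogers.

```python
def cu_ids_to_mask(cu_ids):
    """Encode compute unit ids as a bit mask (assumed: bit i means unit i)."""
    mask = 0
    for cu in cu_ids:
        mask |= 1 << cu
    return mask


def mask_to_cu_ids(mask):
    """Decode a bit mask back into an ascending list of compute unit ids."""
    ids = []
    bit = 0
    while mask:
        if mask & 1:
            ids.append(bit)
        mask >>= 1
        bit += 1
    return ids


# Example: compute units 0, 2 and 5 host wavefronts of the workgroup.
mask = cu_ids_to_mask([0, 2, 5])
units = mask_to_cu_ids(mask)
```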

As to claims 11 and 18, see the rejection of claim 4 above.

Claims 5, 6, 12, 13, 19 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. Pub. No. 2014/0152675 A1 to Martin et al. in view of U.S. Pub. No. 2012/0297163 A1 to Breternitz et al. as applied to claims 1, 8 and 15 above, and further in view of U.S. Pat. No. 9,189,282 B2 issued to Conte et al.

As to claim 5, Martin as modified by Breternitz teaches the processor as recited in claim 1, however it is silent with reference to monitor a plurality of performance counters to track resource contention among the plurality of compute units; calculate a load-rating for each compute unit and each resource based on the plurality of performance counters; and determine how to allocate wavefronts of the first workgroup to the plurality of compute units based on calculated load-ratings.
Conte teaches monitor a plurality of performance counters to track resource contention among the plurality of compute units; calculate a load-rating for each compute unit and each resource based on the plurality of performance counters; and determine how to allocate wavefronts of the first workgroup to the plurality of compute units based on calculated load-ratings (“...FIG. 6a is a schematic illustration of a system for performing methods for multi-core thread mapping in accordance with the present disclosure. As shown in FIG. 6a, a computer system 600 may include a processor 605 configured for performing an example of a method for mapping threads to execution to processor cores. In other examples, various operations or portions of various operations of the method may be performed outside of the processor 605. In operation 602, the method may include executing at least one software application program resulting in at least one thread of execution. In operation 603, the method may include collecting data relating to the performance of the multi-core processor using a performance counter. In operation 604, the method may include using a core controller to map the thread of execution to a processor core based at least in part on the data collected by the performance counter...FIG. 6b is a schematic illustration of a computer accessible medium having In operation 603, the procedure may include collecting data relating to the performance of the multi-core processor using a performance counter. In operation 604, the procedure may include using a core controller to map the thread of execution to a processor core based at least in part on the data collected by the performance counter...” Col. 5 Ln. 57 – 67, Col. 6 Ln. 1 – 21).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Martin and Breternitz with the teaching of Conte 

As to claims 12 and 19, see the rejection of claim 5 above.

As to claim 6, Martin as modified by Breternitz teaches the processor as recited in claim 5, however it is silent with reference to wherein the processor is further configured to select a first compute unit as a candidate for dispatch responsive to determining the first compute unit has a lowest load-rating among the plurality of compute units for a first resource.
Conte teaches wherein the processor is further configured to select a first compute unit as a candidate for dispatch responsive to determining the first compute unit has a lowest load-rating among the plurality of compute units for a first resource (“...FIG. 6a is a schematic illustration of a system for performing methods for multi-core thread mapping in accordance with the present In operation 603, the method may include collecting data relating to the performance of the multi-core processor using a performance counter. In operation 604, the method may include using a core controller to map the thread of execution to a processor core based at least in part on the data collected by the performance counter...FIG. 6b is a schematic illustration of a computer accessible medium having stored thereon computer executable instructions for performing a procedure for mapping threads of execution to processor cores in a multi-core processing system. As shown in FIG. 6b, a computer accessible medium 600 In operation 603, the procedure may include collecting data relating to the performance of the multi-core processor using a performance counter. In operation 604, the procedure may include using a core controller to map the thread of execution to a processor core based at least in part on the data collected by the performance counter...” Col. 5 Ln. 57 – 67, Col. 6 Ln. 1 – 21).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Martin and Breternitz with the teaching of Conte because the teaching of Conte would improve the system of Martin and Breternitz by providing a technique of mapping threads of execution to processor cores.
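
For illustration only, the load-rating limitations of claims 5 and 6 can be sketched as follows. This is a minimal sketch under stated assumptions and is not the method of the application or of Conte: a load-rating is assumed here to be a performance counter value divided by an assumed per-resource capacity, and the names load_ratings, pick_candidate, "valu" and "lds" are hypothetical.

```python
def load_ratings(counters, capacities):
    """Compute a rating for each compute unit and each resource.

    counters:   {cu_id: {resource: counter_value}}
    capacities: {resource: assumed_maximum_counter_value}
    The rating is counter / capacity, so a higher rating means a busier resource.
    """
    return {
        cu: {res: value / capacities[res] for res, value in per_res.items()}
        for cu, per_res in counters.items()
    }


def pick_candidate(ratings, resource):
    """Return the compute unit id with the lowest rating for one resource."""
    return min(ratings, key=lambda cu: ratings[cu][resource])


# Example: two compute units, two tracked resources.
counters = {0: {"valu": 80, "lds": 10}, 1: {"valu": 20, "lds": 30}}
capacities = {"valu": 100, "lds": 40}
ratings = load_ratings(counters, capacities)
valu_candidate = pick_candidate(ratings, "valu")
lds_candidate = pick_candidate(ratings, "lds")
```

In this example compute unit 1 is the candidate for a wavefront limited by the "valu" resource, while compute unit 0 is the candidate for a wavefront limited by the "lds" resource, which shows why a rating per resource, and not only per compute unit, can change the selection.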


Claims 7 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. Pub. No. 2014/0152675 A1 to Martin et al. in view of U.S. Pub. No. 2012/0297163 A1 to Breternitz et al. and further in view of U.S. Pat. No. 9,189,282 B2 issued to Conte et al. as applied to claim 5 above, and further in view of U.S. Pub. No. 2017/0031719 A1 to Clark et al.

As to claim 7, Martin as modified by Breternitz and Conte teaches the processor as recited in claim 5, however it is silent with reference to wherein the plurality of performance counters track two or more of vector arithmetic logic unit (VALU) execution bandwidth, scalar ALU (SALU) execution bandwidth, local data share (LDS) bandwidth, Load Store Bus bandwidth, Vector Register File (VRF) bandwidth, Scalar Register File (SRF) bandwidth, cache subsystem capacity, cache bandwidth, and translation lookaside buffer (TLB) bandwidth.
Clark teaches wherein the plurality of performance counters track two or more of vector arithmetic logic unit (VALU) execution bandwidth, scalar ALU (SALU) execution bandwidth, local data share (LDS) bandwidth, Load Store Bus bandwidth, Vector Register File (VRF) bandwidth, Scalar Register File (SRF) bandwidth, cache subsystem capacity, cache bandwidth, and translation lookaside buffer (TLB) bandwidth (Cache Bandwidth 450D, Cache Capacity 450E) (“...Performance counter values 450 may include a plurality of values associated with a given guest VM or a given vCPU of a guest VM, depending on the embodiment. A guest VM may include a plurality of performance counter values 450, and the guest VM may include a different set of performance counter values for a plurality of vCPUs in use by the guest VM. In various embodiments, the performance counter values 450 may include one or more of CPU time 450A, instructions retired 450B, floating point operations (FLOPs) 450C, 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Martin and Breternitz with the teaching of Clark because the teaching of Clark would improve the system of Martin and Breternitz by providing a set of special-purpose registers built into modern microprocessors to store the counts of hardware-related activities within computer systems.

As to claim 14, see the rejection of claim 7 above.

Response to Arguments
Applicant’s arguments with respect to claims 1-20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of 
	NOTE: The Examiner additionally proposed a claim amendment that would put the claims in this application in condition for allowance, but Applicants declined the suggestion.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to CHARLES E ANYA whose telephone number is (571)272-3757.  The examiner can normally be reached on Mon-Fri, 9am-6pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Dennis Chow, can be reached on 571-272-7767.  The fax phone number for the 
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.