DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 12/21/2020 has been entered.
This communication is responsive to Amendments and Remarks filed 12/21/2020.
Claims 1-20 are pending in this application.  Claims 1, 8, and 15 are independent claims.  In the Amendment, no claims were cancelled and no claims were added.
Examiner Notes
The Examiner cites particular columns and line numbers in the references as applied to the claims below for the convenience of the Applicant(s).  Although the specified citations are representative of the teachings in the art and are applied to the specific limitations within the individual claim, other passages and figures may apply as well.  It is respectfully requested that, in preparing responses, the Applicant fully consider the references in their entirety as potentially teaching all or part of the claimed invention, as well as the context of the passage as taught by the prior art or disclosed by the Examiner.
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

Summary of Rejections
The following rejections of the claims are set forth below in this Office Action:
Ground 1:  Claims 1-2, 8-9, and 15-16 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Oro Garcia et al. (US 2013/0243329; hereinafter Garcia).
Ground 2:  Claims 3-4, 10-11, and 17-18 are rejected under 35 U.S.C. 103 as being unpatentable over Garcia in view of Abdallah (US 2014/0181475).
Ground 3:  Claims 5-7, 12-14, and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Garcia in view of Johnson (US 2018/0217836).

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale or otherwise available to the public before the effective filing date of the claimed invention.


(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1-2, 8-9, and 15-16 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Garcia.

Claim 1

A processor comprising: 

a plurality of compute units comprising circuitry configured to execute instructions; and 

a dispatch unit comprising circuitry configured to dispatch workgroups to the plurality of compute units, wherein one or more compute units are configured execute instructions of an entire workgroup; 

Garcia teaches “the parallel processing unit (PPU) 102… The PPU 102 is a massively parallel engine that is specifically conceived for performing computations of workloads that benefit from data level parallelism (DLP). These workloads may range from graphics computations such as geometric transformations (usually performed in shaders) to fluid dynamic simulations or computational finance. The PPU 102 could be implemented in hardware as a programmable accelerator that features stream computing capabilities or a graphics processing unit (GPU)” (e.g., “A processor” as claimed) (see para. 24), and “the 

Garcia also teaches “All thread instructions that constitute processing tasks are scheduled to the PPU cores 203 via the work distribution unit 202” (e.g., “a dispatch unit comprising circuitry configured to dispatch workgroups to the plurality of compute units, wherein one or more compute units are configured execute instructions of an entire workgroup” as claimed) (see para. 26), and “Each PPU core 203 implements a computation engine by combining P functional units 303 which are specifically designed for executing data parallel SIMD instructions” (see para. 27).



divide a workgroup into individual wavefronts for dispatch from the dispatch unit to separate compute units, responsive to determining that the workgroup does not fit within a single compute unit based on currently available resources of the plurality of compute units; and 

determine a process for dispatching individual wavefronts of the workgroup to the plurality of compute units based on reducing resource contention among the plurality of compute units.
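For illustration only, the two limitations quoted above can be read as the dispatch procedure sketched below. The sketch is not taken from the application or from Garcia. The names ComputeUnit, free_slots, contention, and dispatch_workgroup are hypothetical, and treating wavefront slots as the only resource and a single number as the contention measure are simplifying assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ComputeUnit:
    cu_id: int
    free_slots: int      # hypothetical: wavefront slots currently available
    contention: float    # hypothetical: higher means more contended


def dispatch_workgroup(num_wavefronts: int,
                       cus: List[ComputeUnit]) -> Optional[Dict[int, int]]:
    """Return {wavefront index: cu_id}, or None if the workgroup cannot be placed."""
    # Case 1: the whole workgroup fits within a single compute unit.
    fitting = [cu for cu in cus if cu.free_slots >= num_wavefronts]
    if fitting:
        target = min(fitting, key=lambda cu: cu.contention)
        target.free_slots -= num_wavefronts
        return {wf: target.cu_id for wf in range(num_wavefronts)}

    # Case 2: no single unit fits, so divide into individual wavefronts.
    if sum(cu.free_slots for cu in cus) < num_wavefronts:
        return None  # not enough resources overall; the caller may retry later
    placement: Dict[int, int] = {}
    for wf in range(num_wavefronts):
        candidates = [cu for cu in cus if cu.free_slots > 0]
        # Pick the least contended unit that still has a free slot.
        target = min(candidates, key=lambda cu: cu.contention)
        target.free_slots -= 1
        target.contention += 1.0  # assumed cost of one more resident wavefront
        placement[wf] = target.cu_id
    return placement
```

Under these assumptions, the division into individual wavefronts happens only when the fit check fails, which is the conditional relationship the claim language recites.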

Garcia teaches “proper balance of thread computations among the plurality of cores and SIMD processing units available” (see para. 6; and see also paras. 26, 31).

Garcia also teaches “When kernels are serially executed 1101, PPU resources are underutilized. Therefore, idle PPU cores 203 could be executing instructions from other kernels instead of ineffectively consuming computing cycles. In order to address this issue, the PPU 102 can issue and execute instructions from different kernels concurrently 1102. As FIG. 11 shows, concurrent execution 1102 may reduce the execution time in situations where computations are unbalanced. These kernel functions are coded using language constructs that explicitly express data-level parallelism by using a set of threads 903 or work-item groups” (see para. 30; and see also paras. 29, 31-35, 38), and “a kernel function that relies on a divide and conquer approach where each input image is split into equally-sized image chunks of w×h elements. Additionally, for each image chunk a block of w×h threads 903 is created. These image chunks, which respectively 

Garcia also teaches “the executed code will achieve a higher throughput derived from the reduced latency and increased bandwidth of the underlying memory subsystem. Each PPU core 203 fetches thread instructions from its instruction unit 301, the contents of which are managed by the work distribution unit 202” (e.g., “determine a process for dispatching individual wavefronts of the workgroup to the plurality of compute units based on reducing resource contention among the plurality of compute units” as claimed) (see para. 27).



Claim 2

The processor as recited in claim 1, wherein dividing the workgroup into individual wavefronts for dispatch from the dispatch unit to separate compute units comprises: 

dispatching a first wavefront of the workgroup to a first compute unit; and 

dispatching a second wavefront of the workgroup to a second compute unit, wherein the second wavefront is different from the first wavefront, and wherein the second compute unit is different from the first compute unit.

Garcia teaches “proper balance of thread computations among the plurality of cores and SIMD processing units available” (see para. 6; and see also paras. 26, 31).

Garcia also teaches “When kernels are serially executed 1101, PPU resources are underutilized. Therefore, idle PPU cores 203 could be executing instructions from other kernels instead of ineffectively consuming computing cycles. In order to address this issue, the PPU 102 can issue and execute instructions from different kernels concurrently 1102. As FIG. 11 shows, concurrent execution 1102 may reduce the execution time in situations where computations are unbalanced. 

Garcia also teaches “the executed code will achieve a higher throughput derived from the reduced latency and increased bandwidth of the underlying memory subsystem. Each PPU core 203 fetches thread instructions from its instruction unit 301, the contents of which are managed by the work distribution unit 202” (see para. 27).



Regarding claims 8-9, they are method claims having similar limitations as cited in claims 1-2.  Thus, claims 8-9 are also rejected under the same rationale as cited in the rejection of claims 1-2.

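Regarding claims 15-16, they are system claims having similar limitations as cited in claims 1-2.  Thus, claims 15-16 are also rejected under the same rationale as cited in the rejection of claims 1-2.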

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.  Patentability shall not be negated by the manner in which the invention was made.


The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 3-4, 10-11, and 17-18 are rejected under 35 U.S.C. 103 as being unpatentable over Garcia in view of Abdallah.

Claim 3

The processor as recited in claim 1, wherein the processor further comprises a scoreboard, and wherein the processor is further configured to: 



track, in the entry, a number of wavefronts which have reached a given barrier; and 

send a signal to two or more compute units to allow wavefronts to proceed when the number of wavefronts which have reached the given barrier is equal to a total number of wavefronts in the workgroup.
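As an illustration only, the barrier tracking recited above can be sketched as a small table keyed by workgroup. The sketch is not taken from the application or from the cited references. The names Scoreboard, ScoreboardEntry, allocate, and barrier_reached are hypothetical, and resetting the count after the barrier releases is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ScoreboardEntry:
    total_wavefronts: int   # wavefronts in the workgroup
    cu_ids: List[int]       # hypothetical: units hosting those wavefronts
    arrived: int = 0        # wavefronts that have reached the barrier so far


class Scoreboard:
    def __init__(self) -> None:
        self.entries: Dict[int, ScoreboardEntry] = {}

    def allocate(self, wg_id: int, total_wavefronts: int,
                 cu_ids: List[int]) -> None:
        """Create the entry that tracks one workgroup split across units."""
        self.entries[wg_id] = ScoreboardEntry(total_wavefronts, list(cu_ids))

    def barrier_reached(self, wg_id: int) -> List[int]:
        """Record one arrival; return the units to signal, or [] if still waiting."""
        entry = self.entries[wg_id]
        entry.arrived += 1
        if entry.arrived == entry.total_wavefronts:
            entry.arrived = 0          # assumed: reset so a later barrier can reuse it
            return list(entry.cu_ids)  # these units may let their wavefronts proceed
        return []
```

In this reading, no unit is signaled until the arrival count equals the total number of wavefronts in the workgroup, and the signal then goes to every unit hosting part of that workgroup.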



Garcia does not explicitly teach “a scoreboard”, “allocate an entry in the scoreboard”, and “track, in the entry, a number”.

However, in an analogous art, Abdallah teaches “the concepts of a virtual register file and a register cache are introduced as possible implementation components. The virtual register file or register cache also provide support for virtually larger number of threads or contexts in the hardware than possible using traditional hardware thread support. A multi hierarchal register file support provides larger bandwidth to the register file” (see para. 29; and see also paras. 31, 33-35), and “passing the location of the register as part of the score board mechanism from the producer instruction of the register to its consumer instructions. Consumer instructions need to read the register so that they know which copy of that architecture register they need to access, but in the location/position based schemes, no tags are needed because the register is accessed by its register number and the location where the particular register copy is physically located among the multiple copies of that register in the multi-hierarchy register file” (see para. 35). 

Abdallah also teaches “resolving false dependency on the same-name registers during the dynamic execution of those threads. This 

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Garcia and Abdallah with predictable results to achieve the claimed invention because one of ordinary skill in the art would have been prompted to implement the teachings of Garcia such that passing the location of the register as part of the score board mechanism from the producer instruction of the register to its consumer instructions allows for synchronization of threads to be resolved through register/bucket cross referencing and control flags as suggested by Abdallah (see Abdallah paras. 35, 125).



Claim 4

The processor as recited in claim 3, wherein the two or more compute units are identified by a compute unit mask field in the entry.
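For illustration only, one common way to identify a set of units with a mask field is a bitmask in which bit i stands for unit i. The encoding below is an assumption, not something stated in the claim or the cited references, and the function names cus_from_mask and mask_from_cus are hypothetical.

```python
from typing import List


def cus_from_mask(cu_mask: int) -> List[int]:
    """Decode a bitmask into compute unit indices (bit i set means unit i)."""
    if cu_mask < 0:
        raise ValueError("mask must be non-negative")
    ids: List[int] = []
    index = 0
    while cu_mask:
        if cu_mask & 1:
            ids.append(index)
        cu_mask >>= 1
        index += 1
    return ids


def mask_from_cus(cu_ids: List[int]) -> int:
    """Encode compute unit indices as a bitmask."""
    mask = 0
    for cu in cu_ids:
        mask |= 1 << cu
    return mask
```

With this encoding, a mask of 0b1010 would identify units 1 and 3 as the recipients of the barrier-release signal.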

Garcia teaches “memory consistency among all threads 903 within a block is ensured by executing a barrier instruction in each PPU core 203. Then the scaling computations of each image subregion take place by performing the selected smoothing filtering method with fine-grain parallelism in each thread block 902. In this process, each thread 903 within the block computes the new smoothed value of its 

Garcia does not explicitly teach “the two or more compute units are identified by a compute unit mask field in the entry”.

However, in an analogous art, Abdallah teaches “a coherency scheme for the memory architecture among those engines /cores/processors. This scheme starts by an address request from one of the address calculation units in one segment /core/processor. For example, assume the address is requested by segment 1 (1211). It can obtain and calculate its address using address registers that belong to its own segment and or from registers across other segments using the address interconnect bus 1200. After calculating the address it creates the reference address of either 32-bit address or 64-bit address that is used to access caches and memory. This address is usually fragmented into a tag field and a set and line fields. This particular segment/engine /core will store the address into its load store buffer and/or L1 and/or L2 address arrays 1202, at the same time it will create a compressed version of the tag (with smaller number of bits than the original tag field of the address) by using a compression technique. More the different segments/engines /cores/processors will use the set field or a subset of the set field as an index to identify which segment /core/processor the address is maintained in. This indexing of the segments by the address set field bits ensures exclusiveness of ownership of the address in a particular segment /core/engine even though the memory data that corresponds to that address can live in another or multiple other segments/engines /cores/processors… After the compressed address tag is formed, the set's field bits are used to identify the particular 

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Garcia and Abdallah with predictable results to achieve the claimed invention because one of ordinary skill in the art would have been prompted to implement the teachings of Garcia such that exclusiveness of ownership of the address in a particular segment /core/engine is ensured even though the memory data that corresponds to that address can live in another or multiple other segments/engines /cores/processors as suggested by Abdallah (see Abdallah para. 134).



Regarding claims 10-11, they are method claims having similar limitations as cited in claims 3-4.  Thus, claims 10-11 are also rejected under the same rationale as cited in the rejection of claims 3-4.

Regarding claims 17-18, they are system claims having similar limitations as cited in claims 3-4.  Thus, claims 17-18 are also rejected under the same rationale as cited in the rejection of claims 3-4.

Claims 5-7, 12-14, and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Garcia in view of Johnson.

Claim 5

The processor as recited in claim 1, wherein the processor is further configured to: 

monitor a plurality of performance counters to track resource contention among the plurality of compute units; 

calculate a load-rating for each compute unit and each resource based on the plurality of performance counters; and 

determine how to allocate wavefronts of the workgroup to the plurality of compute units based on calculated load-ratings.
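As an illustration only, the three steps recited above can be sketched as follows. The sketch is not taken from the application or the cited references. It assumes each counter reports busy cycles per resource over a fixed sampling window, defines a load-rating as busy cycles divided by the window length, and places each wavefront on the unit whose most contended resource would be lowest afterward. All names (load_ratings, allocate_wavefronts, demand) are hypothetical.

```python
from typing import Dict, List

Ratings = Dict[int, Dict[str, float]]


def load_ratings(counters: Dict[int, Dict[str, int]],
                 window_cycles: int) -> Ratings:
    """One rating per compute unit and per resource: busy cycles / window."""
    return {cu: {res: busy / window_cycles for res, busy in per_res.items()}
            for cu, per_res in counters.items()}


def allocate_wavefronts(num_wavefronts: int, ratings: Ratings,
                        demand: Dict[str, float]) -> List[int]:
    """Return the chosen compute unit for each wavefront, in dispatch order."""
    projected = {cu: dict(r) for cu, r in ratings.items()}  # leave input untouched
    chosen: List[int] = []
    for _ in range(num_wavefronts):
        # Score a unit by its most contended resource after adding one wavefront.
        def worst_case(cu: int) -> float:
            return max((projected[cu].get(res, 0.0) + need
                        for res, need in demand.items()), default=0.0)
        target = min(sorted(projected), key=worst_case)  # ties: lowest unit id
        for res, need in demand.items():
            projected[target][res] = projected[target].get(res, 0.0) + need
        chosen.append(target)
    return chosen
```

Other rating formulas and placement policies would fit the claim language equally well; this one is chosen only because it is short.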

Garcia teaches “proper balance of thread computations among the plurality of cores and SIMD processing units available” (see para. 6; and see also paras. 26-27, 31).

Garcia also teaches “When kernels are serially executed 1101, PPU resources are underutilized. Therefore, idle PPU cores 203 could be executing instructions from other kernels instead of ineffectively consuming computing cycles. In order to address this issue, the PPU 102 can issue and execute instructions from different kernels concurrently 1102. As FIG. 11 shows, concurrent execution 1102 may reduce the execution time in situations where computations are unbalanced. These kernel functions are coded using language constructs that explicitly express data-level parallelism by using a set of threads 903 or work-item groups” (see para. 30; and see also paras. 29, 31-35, 38), and “a kernel function that relies on a divide and conquer approach where each input image is split into equally-sized image chunks of w×h elements. Additionally, for each image chunk a block of w×h threads 903 is created. These image chunks, which respectively correspond to different image subregions… If the number of thread blocks 902 is greater than the number of cores available in the PPU 102, the remaining thread blocks 902 are dynamically enqueued and dequeued as the PPU 102 resources of each core become available” (e.g., “track resource contention among the plurality of compute units”, “for each compute unit and each resource”, “determine how to allocate wavefronts of the workgroup to the plurality of compute units” as claimed) (see para. 39; and see also paras. 40-43, 46-49).

Garcia does not explicitly teach “monitor a plurality of performance counters”, “calculate a load-rating”, “based on the plurality of performance counters” and “based on calculated load-ratings”.

However, in an analogous art, Johnson teaches “A similarity between two vectors may be defined based on a distance metric (e.g., an L² distance, a cosine similarity, etc.) between the two vectors. In particular embodiments, a similarity search may be a k-nearest neighbor (k-NN) search, which may identify the k most similar objects or object vectors to a query or query vector. In particular embodiments, a k-NN search may be an exact nearest neighbor search. In particular embodiments, a k-NN search may be an approximate nearest neighbor (ANN) search. In particular embodiments, a similarity search may comprise accessing input comprising the distances values and performing k-selection. The distances values may be exact distance values or approximated distance values (e.g., distances between quantized vectors generated by a quantizer or product quantizer). In particular embodiments, k-selection may comprise identifying the k least distances values or the objects corresponding to the k least distance values. In particular embodiments, k-selection may comprise identifying the k greatest distances values or the objects corresponding to the k greatest distance values. In particular embodiments, k-selection may be performed using parallel processing on a graphics processing unit (GPU) or any other suitable. In particular embodiments, a method for k-selection may use in-register sorting. Each thread of a GPU may maintain a local queue of smallest values called a thread queue, which may be stored in register memory. A warp of a GPU may maintain a queue of distance values called a warp queue. In particular embodiments, a warp of a GPU may refer to a wavefront of a GPU and a warp queue may be a wavefront queue. 

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Garcia and Johnson with predictable results to achieve the claimed invention because one of ordinary skill in the art would have been prompted to implement the teachings of Garcia such that identifying object vectors in a collection can be performed similar to a query vector using parallel processing as suggested by Johnson (see Johnson para. 6).



Claim 6

The processor as recited in claim 5, wherein the processor is further configured to select a first compute unit as a candidate for dispatch responsive to determining the first compute unit has a lowest load-rating among the plurality of compute units for a first resource.
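For illustration only, the selection recited above reduces to a minimum search over the per-unit ratings for one resource. The sketch is not taken from the application or the cited references. The function name candidate_for_resource, the ratings layout, and the tie-break toward the lowest-numbered unit are assumptions.

```python
from typing import Dict


def candidate_for_resource(ratings: Dict[int, Dict[str, float]],
                           resource: str) -> int:
    """Return the compute unit with the lowest load-rating for `resource`."""
    if not ratings:
        raise ValueError("no compute units to choose from")
    # Ties go to the lowest-numbered unit (an arbitrary, assumed tie-break).
    return min(sorted(ratings), key=lambda cu: ratings[cu][resource])
```

A unit that is the candidate for one resource need not be the candidate for another, which is why the claim ties the selection to "a first resource".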

Garcia teaches “proper balance of thread computations among the plurality of cores and SIMD processing units available” (see para. 6; and see also paras. 26-27, 31).

Garcia also teaches “When kernels are serially executed 1101, PPU resources are underutilized. Therefore, idle PPU cores 203 could be executing instructions from other kernels instead of ineffectively consuming computing cycles. In order to address this issue, the PPU 102 can issue and execute instructions from different kernels concurrently 1102. As FIG. 11 shows, concurrent execution 1102 may reduce the execution time in 

Garcia does not explicitly teach “a lowest load-rating”.

However, in an analogous art, Johnson teaches “A similarity between two vectors may be defined based on a distance metric (e.g., an L² distance, a cosine similarity, etc.) between the two vectors. In particular embodiments, a similarity search may be a k-nearest neighbor (k-NN) search, which may identify the k most similar objects or object vectors to a query or query vector. In particular embodiments, a k-NN search may be an exact nearest neighbor search. In particular embodiments, a k-NN search may be an approximate nearest neighbor (ANN) search. In particular embodiments, a similarity search may comprise accessing input comprising the distances values and performing k-selection. The distances values may be exact distance values or approximated distance values (e.g., distances between quantized vectors generated 

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Garcia and Johnson with predictable results to achieve the claimed invention because one of ordinary skill in the art would have been prompted to implement the teachings of Garcia such that identifying object vectors in a collection can be performed similar to a query 



Claim 7

The processor as recited in claim 5, wherein the plurality of performance counters track two or more of vector arithmetic logic unit (VALU) execution bandwidth, scalar ALU (SALU) execution bandwidth, local data share (LDS) bandwidth, Load Store Bus bandwidth, Vector Register File (VRF) bandwidth, Scalar Register File (SRF) bandwidth, cache subsystem capacity, cache bandwidth, and translation lookaside buffer (TLB) bandwidth.

Garcia teaches “the summed area table is split into a plurality of table chunks, and then stored in the shared memory of each core in order to improve the data locality, and thus increase the memory bandwidth during the cascade evaluation process” (e.g., “bandwidth” as claimed) (see para. 10), and “reducing the latency of memory operations through an efficient usage of the underlying cache hierarchy” (see para. 11; and see also paras. 27, 29-35, 38-43, 45-49).

Garcia does not explicitly teach “wherein the plurality of performance counters track two or more of vector arithmetic logic unit (VALU) execution, scalar ALU (SALU) execution, local data share (LDS), Load Store Bus, Vector Register File (VRF), Scalar Register File (SRF), cache subsystem capacity, cache, and translation lookaside buffer (TLB)”.

However, in an analogous art, Johnson teaches “A similarity between two vectors may be defined based on a distance metric (e.g., an L² distance, a cosine similarity, etc.) between the two vectors. In particular embodiments, a similarity search may be a k-nearest neighbor (k-NN) search, which may identify the k most similar objects or object vectors to a query or query vector. In particular embodiments, a k-NN search may be an exact nearest neighbor search. In particular embodiments, a k-NN search may be an approximate nearest neighbor (ANN) search. In particular embodiments, a similarity search may comprise accessing input comprising the distances values and performing k-selection. The distances values may be exact distance values or approximated distance values (e.g., distances between quantized vectors generated by a quantizer or product quantizer). In 

Johnson also teaches “a source of information may be images and videos, with some meta-data. In particular embodiments, users may not provide extensive metadata to their pictures. In particular embodiments, automatic media analysis algorithms may produce vector data for information. As an example and not by way of limitation, vector data may be the outputs of a set of classifiers for random objects, applied to an image, text embeddings like word2vec, 

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Garcia and Johnson with predictable results to achieve the claimed invention because one of ordinary skill in the art would have been prompted to implement the teachings of Garcia such that identifying object vectors in a collection can be performed similar to a query vector using parallel processing as suggested by Johnson (see Johnson para. 6).



Regarding claims 12-14, they are method claims having similar limitations as cited in claims 5-7.  Thus, claims 12-14 are also rejected under the same rationale as cited in the rejection of claims 5-7.
Regarding claims 19-20, they are system claims having similar limitations as cited in claims 5-7.  Thus, claims 19-20 are also rejected under the same rationale as cited in the rejection of claims 5-7.

Response to Arguments
Applicant's arguments filed 12/21/2020 have been fully considered but they are not persuasive. 
Applicant argues, with respect to claims 1, 8, and 15 on pages 8-15, that Garcia does not teach “divide a workgroup into individual wavefronts for dispatch from the dispatch unit to separate compute units, responsive to determining that the workgroup does not fit within a single compute unit based on currently available resources of the plurality of compute units”.  Specifically, Applicant argues Garcia discloses no condition for decomposing either the input image or the kernel function because the input image is unconditionally decomposed into multiple equally-sized image chunks, and the kernel function is unconditionally decomposed into multiple equally-sized thread blocks.
The Examiner submits that Applicant’s arguments demonstrate an incorrect understanding of the claim language and Garcia, and fail to consider the reference in its entirety.  Specifically, the Examiner submits that given the broadest reasonable interpretation in light of the specification, one of ordinary skill in the art would understand that the claim language can refer to “unconditionally decomposing” as part of the “determining that the workgroup does not fit within a single compute unit based on currently available resources of the plurality of compute units”.  In other words, in order to determine that the workgroup does not fit within a single compute unit, one of ordinary skill in the art would understand that the workgroup would need to be decomposed to determine how many items are in the workgroup to know that it does not fit within a single compute unit.  In fact, the Examiner submits that Applicant’s arguments contradict the original specification by suggesting that “decomposing” as taught by Garcia does not teach the claimed invention.  For example, the original specification discloses “Performance monitor module 620 conveys a fit or no-fit indication to WG allocation request buffer 615 based on the number of wavefronts of the given workgroup and the currently available CU resources” (see para 41 of the original specification), which appears to suggest that the workgroup has already 

Applicant argues, with respect to claims 1, 8, and 15 on pages 15-16, that Garcia does not disclose “based on reducing resource contention among the plurality of compute units”.  Specifically, Applicant argues that Garcia fails to disclose selecting a first PPU core over a second PPU core for dispatching an equally-sized thread block based on whether the data to be processed may be stored in the local shared memory of the first PPU core when the data to be processed are not able to be stored in the local shared memory of the second PPU core.  Furthermore, Applicant asserts that the disclosure of Garcia describes that access latency is reduced and bandwidth of the memory subsystem is increased when data to be processed is stored locally rather than externally.
In response to applicant's argument that the references fail to show certain features of applicant’s invention, it is noted that the features upon which applicant relies (i.e., “selecting a first PPU core over a second PPU core for dispatching an equally-sized thread block based on whether the data to be processed may be stored in the local shared memory of the first PPU core when the data to be processed are not able to be stored in the local shared memory of the second PPU core”) are not recited in the rejected claim(s).  Although the claims are interpreted in light of the specification, limitations from the specification are not read into the claims.  See In re Van Geuns, 988 F.2d 1181, 26 USPQ2d 1057 (Fed. Cir. 1993).  The Examiner notes that the claim 

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Shih-Wei Kraft whose telephone number is (571) 270-3388.  The examiner can normally be reached on Monday to Friday 6:30 AM to 3:30 PM.
If attempts to reach the above noted Examiner by telephone are unsuccessful, the Examiner’s supervisor, Dennis Chow, can be reached at the following telephone number: (571) 272-7767. 
The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).
/SHIH-WEI KRAFT/
Primary Examiner, Art Unit 2194