DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.


Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1-5, 15-17, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Chen (patent application No. 2018/0121386) in view of Mantor (patent application No. 2018/0239606).

Chen taught the invention substantially as claimed, including (as to claim 1) A method comprising: storing, at a cache (250, destination operand cache #1 in figs. 1B, 2, 3), a set of wavefronts, each wavefront comprising a number of work items (e.g., see paragraph 0029)[note the results of ALU execution provide the work items], for execution at an execution unit comprising a first arithmetic unit (ALU) pipeline and a second ALU pipeline [note the results of execution by the ALUs are fed back to the inputs of the ALUs for further execution, where the pipelines for the various ALUs provide the first and second pipelines (e.g., see figs. 1B, 2, 3 and paragraphs 0028-0029)]. Chen did not expressly detail selectively executing either a single instruction or a dual instruction on the set of wavefronts in a first execution cycle both at the first ALU pipeline and at the second ALU pipeline. Mantor, however, taught this limitation (e.g., see fig. 7, steps 710, 715, 720 and paragraphs 0021-0022 and 0028)[note as to the pipeline being an ALU pipeline, Chen taught ALU pipelines as discussed above, and Mantor taught the processing units being representative of any number and type of processing units, including a CPU (e.g., see paragraph 0017 of Mantor)].

It would have been obvious to one of ordinary skill in the art to combine the teachings of Chen and Mantor. Both references were directed toward the problems of scheduling instructions and providing data for performing parallel execution of multiple data items using multiple ALUs in a SIMD manner. One of ordinary skill would have been motivated to incorporate the Mantor teaching of providing a mode detection for selectively scheduling/executing one instruction of plural instructions for a wavefront, at least to provide an efficient way to implement the processing of single or multiple parallel instructions of Chen, reducing the time spent changing between single and concurrent multiple instruction execution and thereby improving throughput (e.g., see paragraph 0051 of Chen and see fig. 7, steps 710, 715, 720 and paragraphs 0021-0022 and 0028 of Mantor).

Due to the similarities between claims 1 and 15, claim 15 is rejected for the same reasons as claim 1. As to the plural ALUs in each pipeline, Chen taught this limitation (e.g., see fig. 2)[pipelines of ALUs 362a, 365 and 362 each include plural ALUs].

As to claims 2 and 16, Chen and Mantor taught The method of claim 1. Mantor taught further comprising: transferring the set of wavefronts from a set of vector general purpose register (VGPR) banks to the cache (e.g., see fig. 2, where the coupling of the VGPRs and the cache in a bidirectional manner provides this limitation, as the VGPRs store wavefronts; e.g., see fig. 3 of Mantor).

As to claims 3 and 20, Chen and Mantor taught The method of claim 2. Chen taught wherein the number of work items of a wavefront equals a number of ALUs of the first ALU pipeline plus a number of ALUs of the second ALU pipeline; and selectively executing comprises executing a single instruction both at the first ALU pipeline and at the second ALU pipeline in a first execution cycle (e.g., see paragraph 0052).

As to claim 4, Chen and Mantor taught The method of claim 3. Chen taught further comprising: distributing the work items of a wavefront of the set of wavefronts evenly among the set of VGPR banks (e.g., see paragraphs 0027-0028)[the wavefronts include a proper number of work items, based on the dimension of the SIMD, grouped for efficient processing][note since Chen taught SIMD work items grouped for efficient processing, which would be for parallel processing (SIMD), one of ordinary skill would have been motivated to store the wavefronts evenly in the VGPR banks so that, each cycle, the work item(s) could easily be addressed/accessed from each respective bank in parallel, without having to wait to access multiple work items from one bank in a serial manner when another bank did not have a work item to be processed for a particular cycle].

As to claims 5 and 17, Chen and Mantor taught The method of claim 1. Chen taught further comprising: storing results of the single instruction or the dual instruction at a buffer (RAMs 320, cache 250, destination cache #1)(e.g., see figs. 1B, 2 and paragraphs 0033 and 0062).

Claims 6, 7, 18, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Chen and Mantor as applied to claims 5, 1, and 15 above, and further in view of Ubal, R. et al. (ACM paper entitled Multi2Sim: A Simulation Framework for CPU-GPU Computing).

As to claim 6, Chen and Mantor taught The method of claim 5. Ubal taught further comprising: transferring the results of the single instruction or the dual instruction from the buffer to the cache in response to an instruction depending on the results (e.g., see section 3.2.2 on page 338)[note for work items labeled as inactive, the result in the associated stream core is ignored, preventing the work-item from changing the kernel state; the inactive state of the result(s) therefore provides transfer of the results depending on the results].

It would have been obvious to one of ordinary skill in the art to combine the teachings of Chen and Ubal. Both references were directed toward the problems of processing work items in parallel on multiple cores or ALUs of a data processor. One of ordinary skill in the art would have been motivated to incorporate the Ubal teachings of storing the results in cache based on the results, at least to ensure that invalid results did not overwrite valid results, so that further processing was done on valid data.

As to claims 7 and 19, Chen and Mantor taught The method of claim 1. Ubal taught wherein the number of work items of a wavefront (subwavefront) equals a number of ALUs of the first ALU pipeline (e.g., see section 3.3 on page 338); and Mantor taught selectively executing comprises executing a dual instruction comprising a first instruction to execute on a first wavefront at the first ALU pipeline and a second instruction to execute on a second wavefront at the second ALU pipeline in the first execution cycle (e.g., see fig. 7, steps 710, 715, 720 and paragraphs 0021-0022 and 0028).

It would have been obvious to one of ordinary skill in the art to combine the teachings of Chen and Ubal. Both references were directed toward the problems of processing work items in parallel on multiple cores or ALUs of a data processor. One of ordinary skill in the art would have been motivated to incorporate the Ubal teachings of the number of work items equaling the number of ALUs, at least to fully utilize each ALU (i.e., no idle ALUs) each cycle to ensure the best possible throughput.

As to claim 18, Mantor and Chen taught The device of claim 17. Ubal taught further comprising: a controller to transfer results from the buffer to the cache in response to an instruction depending on the results (e.g., see section 3.2.2 on page 338)[note for work items labeled as inactive, the result in the associated stream core is ignored, preventing the work-item from changing the kernel state; the inactive state of the result(s) therefore provides transfer of the results depending on the results][note at least the local data share unit or data write selector provides the controller].

Claims 8-14 are rejected under 35 U.S.C. 103 as being unpatentable over Mantor (patent application No. 2018/0239606) in view of Chen (patent application No. 2018/0121386) and Ubal et al. (ACM article Multi2Sim: A Simulation Framework for CPU-GPU Computing).

As to claim 8, Mantor taught A method, comprising: selectively executing either a single instruction or a dual instruction (e.g., see fig. 7, steps 710, 715, 720 and paragraphs 0021-0022 and 0028), but did not expressly detail both at a first arithmetic logic unit (ALU) pipeline comprising a plurality of ALUs and at a second ALU pipeline comprising a plurality of ALUs. Chen taught this limitation (e.g., see paragraphs 0020 and 0031), and Chen taught execution in a first execution cycle based on a set of wavefronts stored at a cache (e.g., see fig. 2, where the coupling of the VGPRs and the cache in a bidirectional manner provides this limitation, as the VGPRs store wavefronts; e.g., see fig. 3 of Mantor). Mantor and Chen did not expressly detail wherein a first wavefront of the set of wavefronts comprises a number of work items equal to a number of ALUs in the first ALU pipeline plus a number of ALUs in the second ALU pipeline. Ubal, however, taught this limitation (e.g., see figs. 3 and 5, section 3.3 on page 338, and section 3.1, where Ubal teaches the compute units comprise ALU with multiple stream cores)[note the number of subwavefronts is the same as the number of work items].

It would have been obvious to one of ordinary skill in the art to combine the teachings of Mantor and Chen. Both references were directed toward the problems of scheduling instructions and providing data for performing parallel execution of multiple data items using multiple ALUs in a SIMD manner. One of ordinary skill would have been motivated to incorporate the Chen teaching of both at a first arithmetic logic unit (ALU) pipeline comprising a plurality of ALUs and at a second ALU pipeline comprising a plurality of ALUs, at least to provide increased parallel execution of each group of work items of a wavefront for increased throughput.

It would have been obvious to one of ordinary skill in the art to combine the teachings of Mantor and Ubal. Both references were directed toward the problems of processing work items in parallel on multiple cores, CPUs, or ALUs of a data processor. One of ordinary skill in the art would have been motivated to incorporate the Ubal teachings of wherein a first wavefront of the set of wavefronts comprises a number of work items equal to a number of ALUs in the first ALU pipeline plus a number of ALUs in the second ALU pipeline, at least to provide optimum utilization of the ALUs and thereby optimum throughput.

As to claim 9 Mantor and Chen and Ubal taught The method of claim 8, Mantor taught further comprising: transferring the set of wavefronts from a set of vector general purpose register (VGPR) banks to the cache (e.g., see fig 2) where the coupling of the VGPRs and the cache in a bidirectional manner provide this limitation as the VGPRs store wavefronts (e.g., see fig. 3 of Mantor).

As to claim 10 Mantor and Chen and Ubal taught The method of claim 9, Mantor taught further comprising: storing at the cache read values from the set of VGPR banks(e.g., see fig 2) where the coupling of the VGPRs and the cache in a bidirectional manner provide this limitation as the VGPRs store wavefronts (e.g., see fig. 3 of Mantor).

As to claim 11, Mantor and Chen and Ubal taught The method of claim 8, Chen taught further comprising: storing results of the single instruction or the dual instruction at a buffer (cache 250, destination cache #1) (e.g., see figs. 1B, 2 and paragraph 0062).

As to claim 12, Mantor, Chen, and Ubal taught The method of claim 11. Ubal taught further comprising: transferring the results from the buffer to the cache in response to an instruction depending on the results (e.g., see section 3.2.2 on page 338)[note for work items labeled as inactive, the result in the associated stream core is ignored, preventing the work-item from changing the kernel state; the inactive state of the result(s) therefore provides transfer of the results depending on the results].

As to claim 13, Mantor, Chen, and Ubal taught The method of claim 8. Chen taught wherein the dual instruction comprises a first instruction to execute on a second wavefront at the first ALU pipeline and a second instruction to execute on a third wavefront at the second ALU pipeline in the first execution cycle (e.g., see fig. 2, where Chen shows ALU pipelines (the pipelines including 362a, 365, 362b respectively) each comprising multiple ALUs, and see paragraph 0054). Also, Mantor taught plural compute units that each contain multiple execution units (e.g., see paragraph 0019), and Mantor taught executing plural instructions on portions of a wavefront (e.g., see paragraph 0028). On the other hand, Ubal taught subwavefronts that contain as many work-items as there are stream cores (e.g., see section 3.3 on page 338). As to execution of the second and third wavefronts in the first execution cycle, as understood this would include simultaneous multithreading or hyper-threading, which the Examiner takes official notice was well known in the art at the time of the claimed invention. One of ordinary skill would have been motivated to implement the parallel processing of Mantor, Chen, and Ubal using simultaneous multithreading to optimize the use of the ALUs so as to reduce the idle time of the ALUs and therefore increase throughput.

As to claim 14, Mantor, Chen, and Ubal taught The method of claim 13. Ubal taught wherein the number of work items of the second wavefront of the set of wavefronts equals a number of ALUs of the first ALU pipeline and the number of work items of the third wavefront equals a number of ALUs of the second ALU pipeline (e.g., see section 3.3 on page 338 and fig. 3 of Ubal)[the plural subwavefronts each have the same number of work-items as stream cores in the compute unit, where there are three compute units].
Response to Arguments
Applicant's arguments filed 08/09/2022 have been fully considered but they are not persuasive.
The rejections are hereby maintained as set forth in the last Office action (and repeated above).
The Applicant argues in substance that the cited prior art did not teach (as to claim 1) storing, at a cache, a set of wavefronts, each wavefront comprising a number of work items, for execution at an execution unit comprising a first arithmetic unit (ALU) pipeline and a second ALU pipeline.
As to this argument, Applicant specifically alleges that Chen’s Do$ 250 is used to store results, not a set of wavefronts for execution at an execution unit comprising a first arithmetic unit (ALU) pipeline and a second ALU pipeline.
Note: the instant application in paragraph [0011] (1st and 2nd lines) states “In some embodiments wavefronts include either N work items or 2N work items”.
Note: paragraph [0010] (3rd and 4th lines) of the instant application recites “ALU pipelines each include a number of ALUs (also referred to as lanes) that execute on wavefronts (operands)…”; and paragraph [0025] (3rd and 4th lines) of the instant application recites “the processing cores 122 include a cache to expand the number of operands (wavefronts) received…”.
Therefore, as the claim is understood, the ALUs are referred to as lanes and the wavefronts are operands or work items. Chen taught pipelined lanes of threads in fig. 2 comprising ALUs 362a-362b and, as to the wavefronts (operands), the results as detailed in the rejection above provide the wavefronts. Also note that paragraph [0062] (1st through 5th lines) of Chen states: “Do$ 250 stores the most recent ALU results which might be re-used as source operands of the next instruction….Waves can share the same Do$ 250”. Mantor taught the single or dual instruction execution limitations (see the outstanding rejection above). Therefore the Examiner contends the combination of Chen and Mantor taught the claimed limitations.


Applicant argues that the cited prior art does not teach (as to claim 15) a cache to store a first set of wavefronts, each wavefront comprising a number of work items, with an execution unit comprising a first arithmetic unit (ALU) pipeline and a second ALU pipeline to selectively execute either a single instruction or a dual instruction on the first set of wavefronts. As to this argument, Applicant specifically alleges that Chen’s Do$ 250 is used to store results, not a set of wavefronts for execution at an execution unit comprising a first arithmetic unit (ALU) pipeline and a second ALU pipeline. The Examiner contends that Chen taught the wavefronts and lanes and Mantor taught the single and dual instruction limitations, as discussed in the response to the argument for claim 1 above, which similarly provides the limitations of claim 15.


Applicant argues that the cited prior art does not teach (as to claim 8) selectively executing either a single instruction or a dual instruction, both at a first arithmetic logic unit (ALU) pipeline comprising a plurality of ALUs and at a second ALU pipeline comprising a plurality of ALUs, in a first execution cycle based on a set of wavefronts stored at a cache. As to this argument, Applicant specifically alleges that Chen’s Do$ 250 is used to store results, not a set of wavefronts for execution at an execution unit comprising a first arithmetic unit (ALU) pipeline and a second ALU pipeline. The Examiner contends that Chen taught the wavefronts and lanes and Mantor taught the instruction limitations, as discussed in the response to the argument for claim 1 above. This similarly provides the limitations of claim 8.

Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to ERIC COLEMAN whose telephone number is (571)272-4163. The examiner can normally be reached M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jyoti Mehta can be reached on 0-3995. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

ERIC COLEMAN
Primary Examiner
Art Unit 2183