DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

The following is a quotation of the first paragraph of pre-AIA  35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.

Claims 1, 12 and 16 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement. The claims contain subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention. Claims 1, 12 and 16 have been amended to recite, “including to provide a greater number of entries for a first compute kernel than for a second compute kernel, wherein the first compute kernel is indicated as less complex than the second compute kernel.” The specification discloses in [0040], “… there are four entries available in buffer circuitry 500.  In this example, buffer size adjustment circuitry 310 has determined to reduce the buffering entries by 2.  Therefore, in this example, there are only two credits available to workload parser circuitry to send additional work to buffer circuitry 500.  This reduction in buffering may advantageously reduce workload imbalances, in some embodiments, for more complex kernels.” The specification thus discloses that the number of entries in the buffer circuitry may be reduced (in this case from 4 to 2) for more complex kernels. However, the specification does not disclose how this compares to the number of entries available in the buffer circuitry for less complex kernels. The specification does not disclose providing a greater number of entries for a first compute kernel than for a second compute kernel when the first compute kernel is less complex than the second compute kernel.
The specification discloses in [0067] and [0068], “… the graphics processor adjusts a limit on the number of entries used in the buffer circuitry based on information indicating complexity of the compute kernel… the graphics processor is configured to adjust limits for entries used in first and second buffer circuitry by different amounts.”  The specification discloses that the limit on the number of entries is adjusted based on the complexity of the compute kernel.  Then, the limits for entries are adjusted, for the first and second buffers, by different amounts.  The specification does not disclose whether the less complex kernel or the more complex kernel has more entries.  Nowhere in the Specification is it disclosed that a greater number of entries is required for a first compute kernel, when the first compute kernel is less complex than the second kernel.  This limitation is considered to be New Matter.  Dependent claims 2-11, 13-15 and 17-20 depend from independent claims 1, 12 and 16 respectively.  Claims 2-11, 13-15 and 17-20 are rejected for depending from a rejected independent claim.  For examination purposes, Examiner interprets this claim according to the specification, such that the number of entries is adjusted by different amounts, based on the complexity of the compute kernel.  
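As a minimal illustration of the paragraph [0040] example quoted above, the following sketch computes the credits left after a reduction in buffer entries. The function name `available_credits`, its parameters, and the assumption that one credit corresponds to one usable entry are hypothetical and do not appear in the specification; only the figures (four entries, a reduction of two, two credits remaining) come from the quoted passage.

```python
def available_credits(total_entries: int, reduction: int, in_use: int = 0) -> int:
    """Credits the workload parser may still spend (hypothetical helper).

    Assumes one credit corresponds to one usable buffer entry.
    """
    if reduction < 0 or in_use < 0:
        raise ValueError("reduction and in_use must be non-negative")
    limit = max(total_entries - reduction, 0)  # adjusted entry limit
    return max(limit - in_use, 0)              # usable entries not yet occupied


# Figures from the quoted [0040] example: 4 entries, reduced by 2.
print(available_credits(total_entries=4, reduction=2))  # prints 2
```

The sketch covers only the reduction arithmetic. It does not indicate which kernels receive the reduction, nor how the resulting count compares between a less complex and a more complex kernel.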

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1, 2, 11, 12, 13 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Nijasure et al. U.S. Patent No. 11,010,862 in view of Fahs et al. U.S. Pub. No. 2014/0168245 and Howes et al. U.S. Pub. No. 2017/0053374.
Re:  claims 1, 12 and 16, Nijasure teaches 
1. An apparatus, comprising: a graphic processor that includes:  shader circuitry configured to process compute work from a compute kernel; (“Fig. 1 is a block diagram of a processing system 100… The processing system 100 includes … The processing cores 128 of the GPU 108 are also interchangeably referred to as shader cores or streaming multi-processors (SMXs)… Each of the one or more processing cores 128 executes a respective instantiation of a particular work-item to process incoming data… Each work-item represents a single instantiation of, for example, a collection of parallel executions of a kernel invoked on a device by a command that is to be executed in parallel. ”; Nijasure, col. 5, lines 34-46)
The processing cores are shader cores (shader circuitry).  Each of the processing cores executes work items (compute work) from a kernel (compute kernel).  
(“Figs. 1-7 disclose systems and techniques to improve the efficiency and bandwidth of graphics processing pipelines…  The non-transitory computer readable storage medium can include, for example, a magnetic or optical disk storage device, solid state storage devices such as Flash memory, a cache, random access memory (RAM) or other non-volatile memory devices… The executable instructions stored on the non-transitory computer readable storage medium are in source code, assembly language code object code, or other instruction format that is interpreted or otherwise executable by one or more processors. ”; Nijasure, col. 2, lines 56-57)
The processing system includes techniques (methods) and a non-transitory computer readable storage medium.
Nijasure is silent, however, Fahs teaches multiple distributed workload parser circuits configured to send compute work to the shader circuitry; (“GPCs 208 receive processing tasks to be executed from a work distribution unit within a task/work unit 207… GPCs 208 can be programmed to execute processing tasks relating to a wide variety of applications, including… image rendering operations (e.g., tessellation shader, vertex shader, geometry shader, and/or pixel shader programs)…”; Fahs, [0032], [0035], Fig. 2)
Fig. 2 illustrates plural PPUs and that each PPU includes a task/work unit 207 (multiple distributed workload parser circuits).  The work distribution unit within the task/work unit (distributed workload parser circuits) sends processing tasks (compute work) to the GPCs (shader circuitry), which execute shader programs.  
primary workload parser circuitry configured to send, via a communications fabric, compute work from the compute kernel to the distributed workload parser circuits; (“… CPU 102 is the master processor of the computer system 100, controlling and coordinating operations of other system components… CPU 102 issues commands that control the operations of PPUs 202… each PPU 202 includes an I/O (input/output) unit 205 that communicates with the rest of computer system 100 via communication path 113, which connects… directly to CPU 102… communication path 113 is a PCI Express link, in which lanes are allocated to each PPU 202… An I/O unit 205… receives all incoming packets… from communication path 113, directing the incoming packets to appropriate components of PPU 202.  For example, commands related to processing tasks may be directed to a host interface 206… Host interface 206 reads each pushbuffer and outputs the command stream stored in the pushbuffer to a front end 212… GPCs 208 receive processing tasks to be executed from a work distribution unit within a task/work unit 207… The task work unit 207 receives tasks from the front end 212…”; Fahs, [0028], [0029], [0030], [0032], Figs. 1-2)
The CPU (primary workload parser circuitry) writes a stream of commands (compute work from a compute kernel) (Fahs, [0028]) that is sent to the I/O unit of each PPU via communication path 113 (communications fabric). The stream of commands is then sent from the I/O unit to the GPCs 208 via the host interface, the front end, and the task/work unit (distributed workload parser circuits) (Fahs, [0032], Figs. 1-2).
Therefore, it would have been obvious to one of ordinary skill in the art at the time of effective filing date to modify the system of Nijasure by adding the feature of multiple distributed workload parser circuits configured to send compute work to the shader circuitry; primary workload parser circuitry configured to send, via a communications fabric, compute work from the compute kernel to the distributed workload parser circuits, in order to generate program code that leverages the parallel architecture of the GPU without being required to implement texture oriented memory access operations, as taught by Fahs. ([0010])  
Nijasure is silent, however, Howes and Fahs teach and buffer circuitry configured to buffer compute work, including portions of the compute kernel, received by one or more of the distributed workload parser circuits from the primary workload parser circuitry; (“In the SIMD structure, GPU 14 executes a plurality of instances of the same program (sometimes also referred to as a kernel).  For instance, graphics processing, and some non-graphics related processing, require the same operations to be performed, but on different data… GPU 14 may execute shader programs… that perform graphics related tasks and execute kernels that perform non-graphics related tasks.  GPU 14 includes at least one core… and the shader programs or kernels execute on the core… GPU 14 is described as executing instructions of kernels… Each of the processing elements may store the resulting, final value of the operations performed by the processing element in a general purpose register (GPR) of the core.”; Howes, [0041], [0042], [0045])
The GPU performs non-graphics related processing for kernels (compute kernels, which would include portions of the compute kernels).  The results of performing non-graphics processing (compute work) are stored in the general purpose registers (buffer circuitry).  Nijasure and Howes are silent, however, Fahs teaches buffer circuitry configured to buffer compute work… received by one or more of the distributed workload parser circuits from the primary workload parser circuitry
(“Any one of GPCs 208 may process data to be written to any of the DRAMs 220 within parallel processing memory 204”; Fahs, [0034])
As discussed above, the stream of commands is sent from the CPU (primary workload parser circuitry) to the I/O unit of the PPU and then to the GPCs 208 via the host interface, the front end, and the task/work unit (distributed workload parser circuits) (Fahs, [0032], Figs. 1-2). Each GPC processes data to be written to any of the DRAMs 220 (buffer circuitry configured to buffer compute work, including portions of the compute kernel, received by one or more of the distributed workload parser circuits from the primary workload parser circuitry). The DRAMs are considered to include buffer circuitry. Howes and Fahs can be combined with Nijasure such that the stream of commands of Fahs is the compute kernel of Howes. Therefore, it would have been obvious to one of ordinary skill in the art at the time of the effective filing date to modify the system of Nijasure by adding the feature of buffer circuitry configured to buffer compute work, including portions of the compute kernel, received by one or more of the distributed workload parser circuits from the primary workload parser circuitry, in order to create sufficient storage for executing the high priority set of instructions, as taught by Howes ([0004]), and in order to generate program code that leverages the parallel architecture of the GPU without being required to implement texture oriented memory access operations, as taught by Fahs ([0010]).
Nijasure is silent, however, Howes teaches wherein the graphics processor is configured to dynamically adjust a limit on the number of entries used in the buffer circuitry based on information indicating complexity of the compute kernel including to provide a greater number of entries for a first compute kernel than for a second compute kernel, wherein the first compute kernel is indicated as less complex than the second compute kernel. (“GPU may spill data from the dynamic GPRs (e.g., dynamic memory locations of the one or more GPRs), but not from the static GPRs… for the wavefronts allocated to a kernel, and in some examples, only the amount of data needed to allow instructions of the higher priority kernel to execute… the number of memory locations… needed changes dynamically during execution.  For instance, more memory locations may be needed if a wavefront enters a more complex subroutine, and decreases on exit from those subroutines.  For example, if there is an if/then/else instruction in instructions of a wavefront, one of the instructions may go through the if-condition and another may go through the else-condition.  The if-condition may require fewer memory locations, and the else-condition may require more memory locations, but whether the if-condition or else-condition is met is not known until execution (i.e., known dynamically).”; Howes, [0055], [0081])
As discussed above in the 112(a) rejection, the amended portion of this limitation is considered to be New Matter. For examination purposes, Examiner interprets this claim according to the specification, such that the number of entries is adjusted by different amounts, based on the complexity of the compute kernel. The GPU adjusts the number of memory locations allocated during execution of the kernel based on the complexity of the kernel. More memory locations are needed when a wavefront enters a more complex subroutine. For an if/then/else instruction of a wavefront, the if-condition requires fewer memory locations and the else-condition requires more memory locations. For example, if a first compute kernel includes less complexity, such as an if-condition, then fewer memory locations are required. If a second compute kernel includes more complexity, such as an else-condition, then more memory locations are required. Therefore, it would have been obvious to one of ordinary skill in the art at the time of the effective filing date to modify the system of Nijasure as modified by adding the feature that the graphics processor is configured to dynamically adjust a limit on the number of entries used in the buffer circuitry based on information indicating complexity of the compute kernel including to provide a greater number of entries for a first compute kernel than for a second compute kernel, wherein the first compute kernel is indicated as less complex than the second compute kernel, in order to create sufficient storage for executing the high priority set of instructions, as taught by Howes ([0004]).
Re:  claims 2 and 13, Nijasure is silent, however, Howes teaches
2. The apparatus of claim 1, wherein the compute kernel is specified in a compute control stream and wherein the compute control stream includes the information indicating complexity of the compute kernel. (“GPU 14 may execute a plurality of threads of a kernel in parallel.  A set of threads may be referred to as a wavefront… the number of memory locations… needed changes dynamically during execution.  For instance, more memory locations may be needed if a wavefront enters a more complex subroutine, and decreases on exit from those subroutines.  For example, if there is an if/then/else instruction in instructions of a wavefront, one of the instructions may go through the if-condition and another may go through the else-condition.  The if-condition may require fewer memory locations, and the else-condition may require more memory locations, but whether the if-condition or else-condition is met is not known until execution (i.e., known dynamically).”; Howes, [0050], [0081], Fig. 2)
Threads of a kernel/wavefront (compute control stream) are executed in parallel.  The number of memory locations needed for the kernel/wavefront changes dynamically during execution based on the complexity of the kernel/wavefront.  If the kernel/wavefront includes an if/then/else instruction, the if-condition may be less complex requiring fewer memory locations and the else-condition may be more complex requiring more memory locations.  Therefore, it would have been obvious to one of ordinary skill in the art at the time of effective filing date to modify the system of Nijasure by adding the feature of the compute kernel is specified in a compute control stream and wherein the compute control stream includes the information indicating complexity of the compute kernel, in order to avoid over reserving or allocating memory locations, as taught by Howes. ([0082])  
Re:  claim 11, Nijasure teaches 
11. The apparatus of claim 1, wherein the apparatus is a computing device that further includes: a central processing unit; a display; and network interface circuitry. (“The processing system 100 includes a central processing unit (CPU) 102, a system memory 104… and a display device 110… The computer readable storage medium in some embodiments is… coupled to the computer system via a wired or wireless network (e.g., network accessible storage (NAS)).”; Nijasure, col. 4, lines 3-8, Fig. 1)
Fig. 1 illustrates that the processing system 100 (computing device) includes a central processing unit and a display device.  The processing system also includes a wired or wireless network (network interface circuitry).  
Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Nijasure in view of Fahs and Howes as applied to claim 1 above, and further in view of Surti et al. U.S. Pub. No. 2018/0300951.
Re:  claim 3, Nijasure is silent, however, Surti teaches 
3. The apparatus of claim 1, wherein the buffer circuitry includes a first entry configured to store a first batch of workgroups from the compute kernel and a second entry configured to store a second batch of workgroups from the compute kernel. (“… command streamer 1403 receives commands from the memory and sends the commands to a 3D pipeline 1312 and/or media pipeline 1316.  The commands are directives fetched from a ring buffer, which stores commands for the 3D media pipeline 1312 and media pipeline 1316… the ring buffer can additionally include batch command buffers storing batches of multiple commands.”; Surti, [0177], Fig. 14)
The ring buffer (buffer circuitry) includes batch command buffers, which store batches of multiple commands (which include a first batch and a second batch of workgroups from the compute kernel).  The batch buffers are considered to be entries in the ring buffer (which includes a first entry and a second entry).  Therefore, it would have been obvious to one of ordinary skill in the art at the time of effective filing date to modify the system of Nijasure as modified by adding the feature of the buffer circuitry includes a first entry configured to store a first batch of workgroups from the compute kernel and a second entry configured to store a second batch of workgroups from the compute kernel, in order to allow the command streamer to provide the command stream to the 3D pipeline, as taught by Surti. ([0177])  
Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Nijasure in view of Fahs and Howes as applied to claim 1 above, and further in view of Feeney U.S. Pub. No. 2019/0156528.
Re:  claim 4, Nijasure is silent, however, Feeney teaches 
4. The apparatus of claim 1, wherein the buffer circuitry includes a first entry configured to store a first workgroup from the compute kernel and a second entry configured to store a second workgroup from the compute kernel. (“The second pass may fill an output buffer 708 with a number of intersected blocks 710 and an entry 712 for each intersected block including an identifier of the intersected block (or threadgroup identifier) and the bit mask 714 for the intersected block… in the third pass, the compute shader 138 may spawn a thread group for each intersected block included in an entry 712 based on the number of intersected blocks 710”; Feeney, [0064], [0065], Fig. 8)
The compute shader (compute kernel) provides a threadgroup (workgroup) for each intersected block included in an entry.  Each entry in the buffer (which includes a first entry and a second entry) stores a threadgroup (which includes a first workgroup and a second workgroup from the compute kernel) for an intersected block that is included in the entry.  Therefore, it would have been obvious to one of ordinary skill in the art at the time of effective filing date to modify the system of Nijasure as modified by adding the feature of the buffer circuitry includes a first entry configured to store a first workgroup from the compute kernel and a second entry configured to store a second workgroup from the compute kernel, in order to allow the use of the dispatch indirect command to avoid latency in transferring information, as taught by Feeney. ([0065])  
Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Nijasure in view of Fahs and Howes as applied to claim 1 above, and further in view of Fujimaki JP 2001-289805A.
Re:  claim 9, Nijasure is silent, however, Fujimaki teaches 
9. The apparatus of claim 1, wherein the graphics processor includes:  one or more configurable registers whose value indicates an amount of the adjustment to the limit. (“The write request generation circuit 26 includes a reference value register 26_2 that stores a reference value that serves as a reference for increasing or decreasing the threshold value stored in the threshold value register 26_1… the write request signal generation circuit 26 compares the size of the empty area of the FIFO buffer 21 calculated by the subtractor 24 with the threshold value stored in the threshold value register 26_1, and sets the size of the empty area as the threshold value…”; Fujimaki, [0038], [0039])
 The threshold value stored in the threshold value register (configurable register) indicates the size of the empty area of the buffer (an amount of the adjustment to the limit).  Nijasure teaches that the graphics processor includes configurable registers. (“The processing system 100 includes… a graphics processing subsystem 106 including a graphics processing unit (GPU) 108… The graphics processing subsystem 106 includes a GPU data bus 1222 that communicably couples the GPU 108 to a graphics memory 124… one or more individual memory units of the graphics memory 210 is embodied as… one or more processor registers…”; Nijasure, col. 4, lines 3-8, col. 5, lines 7-9, Figs. 1-2)
Fig. 1 illustrates that the graphics processing subsystem includes a graphics memory. The graphics memory includes one or more processor registers. Fujimaki can be combined with Nijasure such that the registers of the graphics processing system of Nijasure include the threshold value register of Fujimaki. Therefore, it would have been obvious to one of ordinary skill in the art at the time of the effective filing date to modify the system of Nijasure as modified by adding the feature that the graphics processor includes one or more configurable registers whose value indicates an amount of the adjustment to the limit, in order to increase the efficiency of transmission processing while placing only a small load on the CPU, as taught by Fujimaki ([0010]).

Allowable Subject Matter
Claims 5, 6, 7, 8, 10, 14, 15, 17, 18, 19 and 20 would be allowable if rewritten to overcome the rejection(s) under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA ), 1st paragraph, set forth in this Office action and to include all of the limitations of the base claim and any intervening claims.
Claims 5, 6, 7, 8, 10, 14, 15, 17, 18, 19 and 20 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.  None of the prior art teaches or suggests:  
from claims 5 and 17 – second buffer circuitry configured to buffer compute work assigned to a portion of the shader circuitry by the first distributed workload parser circuit; wherein the dynamic adjustment limits the number of entries used in both the first and second buffer circuitry.
from claim 14 – second buffer circuitry configured to buffer compute work assigned to a portion of the shader circuitry by the first distributed workload parser circuit; wherein the adjusting limits the number of entries used in both the first and second buffer circuitry.
Claims 6, 7 and 8 depend from claim 5, and include all of the limitations of claim 5.  And, claims 18 and 19 depend from claim 17 and include all of the limitations of claim 17. 
Claims 10 and 20 – wherein the graphics processor is configured to use a credit system in which the primary workload parser circuit is allocated credits based on available entries in the buffer circuitry and wherein the dynamic adjustment reduces the number of credits.
Claim 15 – wherein the graphics processor uses a credit system in which the primary workload parser circuit is allocated credits based on available entries in the buffer circuitry and wherein the adjusting reduces the number of credits. 
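The credit-system limitation recited in claims 10, 15 and 20 can be pictured with the following sketch. It is an illustration under stated assumptions, not the applicant's implementation: the class `CreditedBuffer`, its method names, and the rule of one credit per usable entry are hypothetical.

```python
from collections import deque


class CreditedBuffer:
    """Hypothetical buffer whose sender holds one credit per usable entry."""

    def __init__(self, total_entries: int):
        self.total_entries = total_entries
        self.limit = total_entries  # adjusted limit on usable entries
        self.entries = deque()

    @property
    def credits(self) -> int:
        # Credits reflect usable entries that are not currently occupied.
        return max(self.limit - len(self.entries), 0)

    def reduce_limit(self, amount: int) -> None:
        # Dynamic adjustment: fewer usable entries means fewer credits.
        self.limit = max(self.limit - amount, 0)

    def send(self, work) -> bool:
        # The sender may only send work while it holds a credit.
        if self.credits == 0:
            return False
        self.entries.append(work)
        return True

    def drain(self):
        # Consuming an entry frees it, which returns a credit to the sender.
        return self.entries.popleft()


buf = CreditedBuffer(total_entries=4)
buf.reduce_limit(2)
print(buf.credits)          # prints 2
print(buf.send("batch-0"))  # prints True
print(buf.send("batch-1"))  # prints True
print(buf.send("batch-2"))  # prints False
```

In this sketch the sender never needs to inspect the buffer directly; the credit count alone tells it whether more work can be sent, which is why reducing the limit also throttles the sender.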
As allowable subject matter has been indicated, applicant's reply must either comply with all formal requirements or specifically traverse each requirement not complied with.  See 37 CFR 1.111(b) and MPEP § 707.07(a).

Response to Arguments
Applicant's arguments filed 2/22/2022 have been fully considered but they are not persuasive.  Applicant argues:  
“… the cited references, taken singly or in combination, do not teach or suggest to “buffer compute work, including portions of [a] compute kernel” at all, much less to “dynamically adjust a limit on the number of entries used in buffer circuitry” including “to provide a greater number of entries for a first compute kernel than for a second compute kernel, wherein the first compute kernel is indicated as less complex than the second compute kernel” as recited in claim 1.”
Examiner disagrees.  As discussed above in the 35 U.S.C. § 112(a) Rejection, this amended limitation is considered to be New Matter:  “dynamically adjust a limit on the number of entries used in buffer circuitry” including “to provide a greater number of entries for a first compute kernel than for a second compute kernel, wherein the first compute kernel is indicated as less complex than the second compute kernel.”  Regarding the limitation of “buffer compute work, including portions of a compute kernel”, Howes and Fahs teach this limitation.  Howes teaches, “In the SIMD structure, GPU 14 executes a plurality of instances of the same program (sometimes also referred to as a kernel).  For instance, graphics processing, and some non-graphics related processing, require the same operations to be performed, but on different data… GPU 14 may execute shader programs… that perform graphics related tasks and execute kernels that perform non-graphics related tasks.  GPU 14 includes at least one core… and the shader programs or kernels execute on the core… GPU 14 is described as executing instructions of kernels… Each of the processing elements may store the resulting, final value of the operations performed by the processing element in a general purpose register (GPR) of the core.” (Howes, [0041], [0042], [0045]).  The GPU performs non-graphics related processing for kernels (compute kernels, which would include portions of the compute kernels).  The results of performing non-graphics processing (compute work) are stored in the general purpose registers (buffer circuitry).  Fahs teaches buffer circuitry configured to buffer compute work… received by one or more of the distributed workload parser circuits from the primary workload parser circuitry.  “Any one of GPCs 208 may process data to be written to any of the DRAMs 220 within parallel processing memory 204” (Fahs, [0034]).  
The stream of commands is sent from the CPU (primary workload parser circuitry) to the I/O unit of the PPU and then to the GPCs 208 via the host interface, the front end, and the task/work unit (distributed workload parser circuits) (Fahs, [0032], Figs. 1-2). Each GPC processes data to be written to any of the DRAMs 220 (buffer circuitry configured to buffer compute work, including portions of the compute kernel, received by one or more of the distributed workload parser circuits from the primary workload parser circuitry). Howes and Fahs can be combined with Nijasure such that the stream of commands of Fahs is the compute kernel of Howes.
Applicant further argues:
“… the cited passages of the references relating to buffering compute work appear to store graphics data being operated on, rather than graphics work or portions of compute kernels.  For example, cited paragraphs 0034 and 0032 of Fahs discuss “DRAM 220” that may be used to store “[r]ender targets, such as frame buffers or texture maps” and that “GPCs may process data to be written to any of the DRAMs.”  Similarly, Howes’ “dynamic GPRs” are cited for “dynamically adjust[ing] a limit on the number of entries used in buffer circuitry.”  These dynamic general-purpose registers, however, appear to store data operated on by graphics programs rather than any graphics work.  Nijasure is not cited for these features and does not remedy the defects in Fahs and Howes discussed above.”
Examiner disagrees. It is the combination of references that teaches the limitations. Howes teaches, “In the SIMD structure, GPU 14 executes a plurality of instances of the same program (sometimes also referred to as a kernel).  For instance, graphics processing, and some non-graphics related processing, require the same operations to be performed, but on different data… GPU 14 may execute shader programs… that perform graphics related tasks and execute kernels that perform non-graphics related tasks.  GPU 14 includes at least one core… and the shader programs or kernels execute on the core… GPU 14 is described as executing instructions of kernels… Each of the processing elements may store the resulting, final value of the operations performed by the processing element in a general purpose register (GPR) of the core.” (Howes, [0041], [0042], [0045]). The GPU performs non-graphics related processing for kernels (compute kernels, which would include portions of the compute kernels). The results of performing non-graphics processing (compute work) are stored in the general purpose registers (buffer circuitry). Fahs teaches buffer circuitry configured to buffer compute work… received by one or more of the distributed workload parser circuits from the primary workload parser circuitry. “Any one of GPCs 208 may process data to be written to any of the DRAMs 220 within parallel processing memory 204” (Fahs, [0034]). The stream of commands is sent from the CPU (primary workload parser circuitry) to the I/O unit of the PPU and then to the GPCs 208 via the host interface, the front end, and the task/work unit (distributed workload parser circuits) (Fahs, [0032], Figs. 1-2). Each GPC processes data to be written to any of the DRAMs 220 (buffer circuitry configured to buffer compute work, including portions of the compute kernel, received by one or more of the distributed workload parser circuits from the primary workload parser circuitry).
The DRAMs are considered to include buffer circuitry.
Applicant further argues:
“If the Examiner maintains art-based rejections, Applicant respectfully requests explicit identification of the alleged “buffer circuitry” for “compute work, including portions of a compute kernel” as recited in claim 1.”
Examiner disagrees.  Howes teaches, “Each of the processing elements may store the resulting, final value of the operations performed by the processing element in a general purpose register (GPR) of the core.” (Howes, [0045]).  The results of performing non-graphics processing (compute work, which includes portions of the compute kernel) are stored in the general purpose registers (buffer circuitry).  Fahs teaches buffer circuitry configured to buffer compute work… received by one or more of the distributed workload parser circuits from the primary workload parser circuitry.  “Any one of GPCs 208 may process data to be written to any of the DRAMs 220 within parallel processing memory 204” (Fahs, [0034]).  The stream of commands is sent from the CPU (primary workload parser circuitry) to the I/O unit of the PPU to the GPCs 208 via the host interface, the front end, and the task/work unit (distributed workload parser circuits) (Fahs, [0032], Figs. 1-2).  Each GPC processes data to be written to any of the DRAMs 220 (buffer circuitry configured to buffer compute work, including portions of the compute kernel, received by one or more of the distributed workload parser circuits from the primary workload parser circuitry).  The DRAMs are considered to include buffer circuitry.  Howes and Fahs can be combined with Nijasure such that the stream of commands of Fahs is the compute kernel of Howes.  
Applicant's arguments filed 2/22/2022 have been fully considered but they are not persuasive.  Applicant argues:  
“… Applicant submits that any adjustments to buffer limits in the reference appear to occur in the opposite direction to that recited in amended claim 1.  In particular, the Howes reference states that “more memory locations may be needed if a wavefront enters a more complex subroutine” Howes at 0081… Thus, Howes appears to increase dynamic GPR availability for more complex work, while amended claim 1 recites to provide a “greater number of entries” for a “less complex” compute kernel.  As discussed in Applicant’s specification, lowering the buffer size for more complex work may advantageously maintain fast launch rates for fast-running work while reducing or avoiding workload imbalances for other types of work.  See specification at 0016.  Applicant submits that Fahs and Nijasure do not remedy these defects in Howes.”
Examiner disagrees.  As discussed above in the 35 U.S.C. § 112(a) rejection, this amended limitation is considered to be new matter.  For example, the specification discloses in [0040], “… there are four entries available in buffer circuitry 500.  In this example, buffer size adjustment circuitry 310 has determined to reduce the buffering entries by 2.  Therefore, in this example, there are only two credits available to workload parser circuitry to send additional work to buffer circuitry 500.  This reduction in buffering may advantageously reduce workload imbalances, in some embodiments, for more complex kernels.”  The specification discloses that the number of entries in the buffer circuitry may be reduced (in this case from 4 to 2) for more complex kernels.  But the specification does not disclose how this compares to the number of entries available in the buffer circuitry for less complex kernels.  The specification does not disclose providing a greater number of entries for a first compute kernel than for a second compute kernel, when the first compute kernel is less complex than the second compute kernel.  Howes teaches, “GPU may spill data from the dynamic GPRs (e.g., dynamic memory locations of the one or more GPRs), but not from the static GPRs… for the wavefronts allocated to a kernel, and in some examples, only the amount of data needed to allow instructions of the higher priority kernel to execute… the number of memory locations… needed changes dynamically during execution.  For instance, more memory locations may be needed if a wavefront enters a more complex subroutine, and decreases on exit from those subroutines.  For example, if there is an if/then/else instruction in instructions of a wavefront, one of the instructions may go through the if-condition and another may go through the else-condition.  The if-condition may require fewer memory locations, and the else-condition may require more memory locations, but whether the if-condition or else-condition is met is not known until execution (i.e., known dynamically).” (Howes, [0055], [0081]).  For examination purposes, Examiner interprets this claim according to the specification, such that the number of entries is adjusted by different amounts, based on the complexity of the compute kernel.  The GPU adjusts the number of memory locations allocated during execution of the kernel based on the complexity of the kernel.  More memory locations are needed when a wavefront enters a more complex subroutine.  For an if/then/else instruction of a wavefront, the if-condition requires fewer memory locations and the else-condition requires more memory locations.  For example, if a first compute kernel includes less complexity, such as an if-condition, then fewer memory locations are required.  If a second compute kernel includes more complexity, such as an else-condition, then more memory locations are required.  
Applicant's arguments filed 2/22/2022 have been fully considered but they are not persuasive.  Applicant argues:  
“For at least these reasons, Applicant submits that amended claim 1 is allowable, and that all of its dependent claims are likewise allowable.  Similar arguments apply to the other independent claims and their dependents.  Accordingly, Applicant submits that the other independent claims, as well as their dependents, are also allowable.”
Examiner disagrees.  Claims 1-4, 9, 11-13 and 16 are rejected.  

Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: Schzukin et al., U.S. Patent No. 6,694,388, and Qiu et al., U.S. Pub. No. 2018/0322078.  
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DONNA J RICKS whose telephone number is (571)270-7532.  The examiner can normally be reached on M-F 7:30am-5pm EST (alternate Fridays off).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jennifer Mehmood, can be reached at 571-272-2976.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. 



/Donna J. Ricks/Examiner, Art Unit 2612 




/JENNIFER MEHMOOD/Supervisory Patent Examiner, Art Unit 2612