DETAILED ACTION
Claims 1-20 are pending.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 4-6, 16-17, and 22-26 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
The following limitations are unclear:
As per claim 4, lines 2-4 recite “determine a benefit value for each resource of the set of resources based on a cost function; and select the resource from the set of resources based on the benefit value”. It is unclear what constitutes a benefit value of each resource in the context of the claim. Is the benefit related to the use of the global memory or the fast shared resource? Is the resource being selected by the different program versions based on the benefit value?
Claim 5 does not cure the deficiencies set forth above for claim 4. Therefore, it is rejected for the same reasons.
As per claim 6, lines 4-5 recite the step of selecting a second resource based on the benefit value. It is unclear from the context of the claim why the resource is selected. If a claim fails to interrelate essential elements of the invention as defined by applicant(s) in the specification, the claim may be rejected under 35 U.S.C. 112(b) or pre-AIA 35 U.S.C. 112, second paragraph, as indefinite. Then, in lines 6-7, a third version is generated which accesses the second resource from the fast shared resource, which suggests that the second resource is located in the fast shared resource. Further, in lines 8-9, a determination is made on whether the second resource should be stored on the global memory or the fast shared resource. As such, it is unclear how a third version is generated for accessing the second resource in the fast shared resource if it is not yet determined where the second resource will be stored.
Claims 7-9 do not cure the deficiencies set forth above for claim 6. Therefore, they are rejected for the same reasons.
Claims 10, 15, and 27 all include the term “uniformly”, which is a relative term that renders the claim indefinite. The term “uniformly” is not defined by the claim, the specification does not provide a standard for ascertaining the requisite degree, and one of ordinary skill in the art would not be reasonably apprised of the scope of the invention.
Claims 16-17 and 22-26 have limitations similar to those of claims 4-10 above and are therefore rejected for the same reasons.


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-3, 10-15, 18-21, 27, and 28 are rejected under 35 U.S.C. 103 as being unpatentable over Udayakumaran (US 2021/0011697 A1) in view of Gangani et al. (US 2021/0200608 A1).

Regarding claim 1, Udayakumaran teaches the invention substantially as claimed including an apparatus for graphics processing ([0001] graphics rendering; [0011]; [0013] The APD 116 is configured to accept compute commands and graphics rendering commands from processor 102, to process those compute and graphics rendering commands, and to provide pixel output to display device 118 for display), comprising: 
a memory ([0069] The methods or flow charts provided herein can be implemented in a computer program, software, or firmware incorporated in a non-transitory computer-readable storage medium for execution); and 
at least one processor coupled to the memory ([0069] The methods or flow charts provided herein can be implemented in a computer program, software, or firmware incorporated in a non-transitory computer-readable storage medium for execution by a general purpose computer or a processor.) and configured to: 
determine a set of resources, wherein each resource of the set of resources is accessed by a plurality of threads of a graphics processing unit (GPU) program through memory access instructions ([0055] The compiler 404 performs this compilation step without generating any multiversion shaders, in order to identify the resource usage for the shaders (i.e., GPU program). The initial compiled shaders 502 therefore do not include multiversion shaders. The compiler 404 determines, for each region 506, the number of resources that particular region needs. The compiler 404 may use any technically feasible technique to divide compiled shaders 502 into regions. A region is a sub-set or whole of a shader program (i.e., plurality of threads). In some examples, resources are registers, but may be any other type of execution resources.); 
generate a first version of the GPU program accessing a resource of the set of resources from a global memory (Abstract; [0057] One example resource is number of registers used. One example technique for reducing the number of registers used by a region or shader is register spilling. In register spilling, the compiler 404 begins with a first form of compiled instructions that uses a number of registers above a threshold. Then the compiler 404 identifies a subset of the number of registers to “spill” into memory (such as a local memory in the compute unit 132). The compiler 404 then converts instructions that use this subset of registers into instructions that read from and write to memory (i.e., first version of the GPU program). The result is that the remaining set of instructions uses a number of registers less than or equal to the threshold. It should be noted that the difference between “registers” and “memory” is that registers are low latency but low capacity memory elements while “memory” has a higher latency but higher capacity than registers (i.e., global memory). [0060] Then, shaders are compiled with versions based on the resource usage of all shaders that are compiled. [0061] It should be understood that throughout the present disclosure, the versions of instructions are different “versions” in the sense that the different versions have different resource utilization.); 
generate a second version of the GPU program accessing the resource from a fast shared resource (Abstract; [0057] One example resource is number of registers used. One example technique for reducing the number of registers used by a region or shader is register spilling. In register spilling, the compiler 404 begins with a first form of compiled instructions that uses a number of registers above a threshold. Then the compiler 404 identifies a subset of the number of registers to “spill” into memory (such as a local memory in the compute unit 132). The compiler 404 then converts instructions that use this subset of registers into instructions that read from and write to memory. The result is that the remaining set of instructions uses a number of registers less than or equal to the threshold (i.e., second version of the GPU program). It should be noted that the difference between “registers” and “memory” is that registers are low latency but low capacity memory elements (i.e., fast shared resource) while “memory” has a higher latency but higher capacity than registers. [0060] Then, shaders are compiled with versions based on the resource usage of all shaders that are compiled. [0061] It should be understood that throughout the present disclosure, the versions of instructions are different “versions” in the sense that the different versions have different resource utilization.); and 
transmit the first version of the GPU program and the second version of the GPU program to a second processor ([0013] The APD 116 is configured to accept compute commands and graphics rendering commands from processor 102, to process those compute and graphics rendering commands, and to provide pixel output to display device 118 for display…Thus, although various functionality is described herein as being performed by or in conjunction with the APD 116, in various alternatives, the functionality described as being performed by the APD 116 is additionally or alternatively performed by other computing devices having similar capabilities that are not driven by a host processor (e.g., processor 102) and configured to provide (graphical) output to a display device 118. [0018] An application or other entity (a “host”) requests that shader programs be executed by the accelerated processing device 116. [0021] Shader programs to be executed on the APD 116 are compiled from source code into machine instructions [0047]).

While Udayakumaran does teach a memory, Udayakumaran does not expressly teach a global memory.

However, Gangani teaches a global memory ([0016] the graphics processor starts executing the shader kernels of the command buffer. For example, the graphics processor may receive a shader kernel identifying one or more operations to perform. In some examples, executing a shader kernel may include performing one or more mathematical operations on input data. In some such examples, the graphics processor may access (e.g., receive and/or retrieve) the input data from a memory (e.g., a global memory) that is accessible to the application processor and to the graphics processor. One or more processing elements of the graphics processor may then perform the one or more mathematical operations on the input data and generate output data. The graphics processor may then store the generated output data at the memory (e.g., the global memory).).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Gangani with the teachings of Udayakumaran to further define the memory as a global memory from which shader instructions can access data. The modification would have been motivated by the desire to combine known elements to yield predictable results.

Regarding claim 2, Udayakumaran teaches wherein the second processor is configured to: 
determine whether to store the resource on the global memory or the fast shared resource based on hardware resources available for the GPU program and utilize the second version of the GPU program if the resource is to be stored on the fast shared resource ([0057] One example resource is number of registers used. One example technique for reducing the number of registers used by a region or shader is register spilling. In register spilling, the compiler 404 begins with a first form of compiled instructions that uses a number of registers above a threshold. Then the compiler 404 identifies a subset of the number of registers to “spill” into memory (such as a local memory in the compute unit 132). The compiler 404 then converts instructions that use this subset of registers into instructions that read from and write to memory. The result is that the remaining set of instructions uses a number of registers less than or equal to the threshold. It should be noted that the difference between “registers” and “memory” is that registers are low latency but low capacity memory elements while “memory” has a higher latency but higher capacity than registers.).

Regarding claim 3, Udayakumaran teaches wherein the second processor is configured to: 
utilize the first version of the GPU program if the resource is to be stored on the global memory ([0057] One example resource is number of registers used. One example technique for reducing the number of registers used by a region or shader is register spilling. In register spilling, the compiler 404 begins with a first form of compiled instructions that uses a number of registers above a threshold. Then the compiler 404 identifies a subset of the number of registers to “spill” into memory (such as a local memory in the compute unit 132). The compiler 404 then converts instructions that use this subset of registers into instructions that read from and write to memory.).

Regarding claim 10, Udayakumaran teaches wherein each resource of the set of resources is uniformly accessed by the plurality of threads of the GPU program ([0034] The rate at which a shader program executes is dependent on the number of resources consumed by each unit of execution of the shader program.).

Regarding claim 11, Udayakumaran teaches wherein the resource is a texture ([0027] a texture) or a buffer ([0034] More specifically, there is a fixed number of computing resources, such as registers, cache memory, or higher level memory local to a compute unit 132. A register is typically the lowest level storage space available to execution units of a processor and instructions of the processor typically refer to registers by name (e.g., “r1” for register 1), instead of address as is typically the case for memory.).

Regarding claim 12, Udayakumaran teaches the invention substantially as claimed including an apparatus for graphics processing ([0001] graphics rendering; [0011]; [0013] The APD 116 is configured to accept compute commands and graphics rendering commands from processor 102, to process those compute and graphics rendering commands, and to provide pixel output to display device 118 for display), comprising: 
a memory ([0069] The methods or flow charts provided herein can be implemented in a computer program, software, or firmware incorporated in a non-transitory computer-readable storage medium for execution); and 
at least one processor coupled to the memory ([0069] The methods or flow charts provided herein can be implemented in a computer program, software, or firmware incorporated in a non-transitory computer-readable storage medium for execution by a general purpose computer or a processor.) and configured to: 
receive a first version of a graphics processing unit (GPU) program and a second version of the GPU program, the first version of the GPU program being configured to access a resource from a global memory, the second version of the GPU program being configured to access the resource from a fast shared resource ([0018] An application or other entity (a “host”) requests that shader programs be executed by the accelerated processing device 116. [0021] Shader programs to be executed on the APD 116 are compiled from source code into machine instructions [0047]; [0057] One example resource is number of registers used. One example technique for reducing the number of registers used by a region or shader is register spilling. In register spilling, the compiler 404 begins with a first form of compiled instructions that uses a number of registers above a threshold. Then the compiler 404 identifies a subset of the number of registers to “spill” into memory (such as a local memory in the compute unit 132). The compiler 404 then converts instructions that use this subset of registers into instructions that read from and write to memory (i.e., first version of the GPU program). The result is that the remaining set of instructions uses a number of registers less than or equal to the threshold (i.e., second version of the GPU program). It should be noted that the difference between “registers” and “memory” is that registers are low latency but low capacity memory elements (i.e., fast shared resource) while “memory” has a higher latency but higher capacity than registers (i.e., global memory).);
determine whether to store the resource on the global memory or the fast shared resource based on hardware resources available for the GPU program ([0057] One example resource is number of registers used. One example technique for reducing the number of registers used by a region or shader is register spilling. In register spilling, the compiler 404 begins with a first form of compiled instructions that uses a number of registers above a threshold. Then the compiler 404 identifies a subset of the number of registers to “spill” into memory (such as a local memory in the compute unit 132).); and
utilize the second version of the GPU program if the resource is to be stored on the fast shared resource ([0057] instructions that use this subset of registers into instructions that read from and write to memory. The result is that the remaining set of instructions uses a number of registers less than or equal to the threshold (i.e., second version of the GPU program).).

While Udayakumaran does teach a memory, Udayakumaran does not expressly teach a global memory.

However, Gangani teaches a global memory ([0016] the graphics processor starts executing the shader kernels of the command buffer. For example, the graphics processor may receive a shader kernel identifying one or more operations to perform. In some examples, executing a shader kernel may include performing one or more mathematical operations on input data. In some such examples, the graphics processor may access (e.g., receive and/or retrieve) the input data from a memory (e.g., a global memory) that is accessible to the application processor and to the graphics processor. One or more processing elements of the graphics processor may then perform the one or more mathematical operations on the input data and generate output data. The graphics processor may then store the generated output data at the memory (e.g., the global memory).).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Gangani with the teachings of Udayakumaran to further define the memory as a global memory from which shader instructions can access data. The modification would have been motivated by the desire to combine known elements to yield predictable results.

Regarding claim 13, Udayakumaran teaches wherein the processor is further configured to: utilize the first version of the GPU program if the resource is to be stored on the global memory ([0057] The compiler 404 then converts instructions that use this subset of registers into instructions that read from and write to memory (i.e., first version of the GPU program).).

Regarding claim 14, Udayakumaran teaches wherein the first version of the GPU program and the second version of the GPU program are received from a second processor configured to: 
determine a set of resources comprising the resource, wherein each resource of the set of resources is accessed by a plurality of threads of the GPU program through memory access instructions, generate the first version of the GPU program accessing the resource from the global memory ([0055] The compiler 404 performs this compilation step without generating any multiversion shaders, in order to identify the resource usage for the shaders (i.e., GPU program). The initial compiled shaders 502 therefore do not include multiversion shaders. The compiler 404 determines, for each region 506, the number of resources that particular region needs. The compiler 404 may use any technically feasible technique to divide compiled shaders 502 into regions. A region is a sub-set or whole of a shader program (i.e., plurality of threads). In some examples, resources are registers, but may be any other type of execution resources.), and generate the second version of the GPU program accessing the resource from the fast shared resource ([0057] One example resource is number of registers used. One example technique for reducing the number of registers used by a region or shader is register spilling. In register spilling, the compiler 404 begins with a first form of compiled instructions that uses a number of registers above a threshold. Then the compiler 404 identifies a subset of the number of registers to “spill” into memory (such as a local memory in the compute unit 132). The compiler 404 then converts instructions that use this subset of registers into instructions that read from and write to memory (i.e., first version of the GPU program). The result is that the remaining set of instructions uses a number of registers less than or equal to the threshold (i.e., second version of the GPU program). It should be noted that the difference between “registers” and “memory” is that registers are low latency but low capacity memory elements (i.e., fast shared resource) while “memory” has a higher latency but higher capacity than registers (i.e., global memory).).

Regarding claim 15, Udayakumaran teaches wherein each resource of the set of resources is uniformly accessed by the plurality of threads of the GPU program ([0034] The rate at which a shader program executes is dependent on the number of resources consumed by each unit of execution of the shader program.).

Regarding claim 18, it is a system type claim having limitations similar to those of claim 12 above. Therefore, it is rejected for the same reasons.

Regarding claim 19, Udayakumaran teaches wherein the processor is further configured to: 
utilize the first version of the GPU program if the resource and the second resource are both to be stored on the global memory, wherein the first version of the GPU program accesses the resource and the second resource from the global memory (Abstract; [0057] One example resource is number of registers used. One example technique for reducing the number of registers used by a region or shader is register spilling. In register spilling, the compiler 404 begins with a first form of compiled instructions that uses a number of registers above a threshold. Then the compiler 404 identifies a subset of the number of registers to “spill” into memory (such as a local memory in the compute unit 132). The compiler 404 then converts instructions that use this subset of registers into instructions that read from and write to memory (i.e., first version of the GPU program). The result is that the remaining set of instructions uses a number of registers less than or equal to the threshold. It should be noted that the difference between “registers” and “memory” is that registers are low latency but low capacity memory elements while “memory” has a higher latency but higher capacity than registers (i.e., global memory).).

Regarding claim 20, Udayakumaran teaches wherein the resource is a texture ([0027] a texture) or a buffer ([0034] More specifically, there is a fixed number of computing resources, such as registers, cache memory, or higher level memory local to a compute unit 132. A register is typically the lowest level storage space available to execution units of a processor and instructions of the processor typically refer to registers by name (e.g., “r1” for register 1), instead of address as is typically the case for memory.).

Regarding claim 21, it is a method type claim having limitations similar to those of claim 1 above. Therefore, it is rejected under the same rationale.

Regarding claim 27, it is a method type claim having limitations similar to those of claim 10 above. Therefore, it is rejected under the same rationale.

Regarding claim 28, it is a method type claim having limitations similar to those of claim 11 above. Therefore, it is rejected under the same rationale.

Claims 4-9, 16-17, and 22-26 are rejected under 35 U.S.C. 103 as being unpatentable over Udayakumaran and Gangani, as applied to claim 1, and further in view of Kroft et al. (US 4,370,710).

Regarding claim 4, Udayakumaran teaches latency associated with the location of data (register vs. memory) and Gangani discusses a cost function based on memory access latency, but neither Udayakumaran nor Gangani expressly teaches wherein the processor is further configured to: determine a benefit value for each resource of the set of resources based on a cost function, and select the resource from the set of resources based on the benefit value of the resource.

	However, Kroft teaches wherein the processor is further configured to: determine a benefit value for each resource of the set of resources based on a cost function, and select the resource from the set of resources based on the benefit value of the resource (Col. 4, lines 52-65: a cache buffer memory is considered to be any small fast memory holding the most recently accessed data in a computer system and its immediately surrounding neighbors in a logical sense. Because the access time of this cache buffer memory is usually an order of magnitude faster than the main or central computer memory and the standard software practice is to localized data, the effective memory access time is considerably reduced in a computer system in which a cache buffer memory is included. The cost increment for the cache memory when compared with the cost of a central memory alone shows the cost effectiveness of the cache memory because of the reduced access time.).
	
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Kroft, of retaining frequently accessed data in cache registers rather than global/central memory in order to reduce access time, with the teachings of Udayakumaran, of allowing data on registers based on a threshold and spilling the rest to a memory.

Regarding claim 5, Kroft teaches wherein the cost function is based on a number of reduced memory access instructions associated with storing the resource on the fast shared resource (Col. 4, lines 52-65: Because the access time of this cache buffer memory is usually an order of magnitude faster than the main or central computer memory and the standard software practice is to localized data, the effective memory access time is considerably reduced in a computer system in which a cache buffer memory is included. The cost increment for the cache memory when compared with the cost of a central memory alone shows the cost effectiveness of the cache memory because of the reduced access time.), an amount of saved memory access bandwidth associated with storing the resource on the fast shared resource, or an amount of reduced register use associated with storing the resource on the fast shared resource.
	 
Regarding claim 6, Udayakumaran teaches wherein the processor is further configured to: 
select a second resource from the set of resources based on the benefit value of the second resource ([0056] The compiler 404 sets a resource usage threshold and determines whether each region 506 is below or above the set threshold. The compiler 404 recompiles the shader sources 402 based on the determination of which regions 506 are above the threshold. For a region 506 having resource usage above the threshold, the compiler 404 recompiles that region 506 as a compiled multiversion shader 406 that includes two or more versions for the region 506. One of the versions is a version unmodified by a resource use reduction technique that would reduce the resource usage as compared with the version The other version is the version that is modified by the compiler to reduce resource usage to less than or equal to the threshold. The compiler 404 includes each of those generated versions into the compiled multiversion shader 406. For a region 506 having resource usage below or equal to the threshold, the compiler 404 retains the version of that region 506 as compiled in an initial compiled shader 502. For shader programs where all regions 506 have resource usage below the threshold, the compiler 404 marks the initial compiled shader 502 as the compiled shader 408 to be included in the stitched shader program 422. For shader programs where at least one region 506 has resource usage above the threshold, the compiler 404 recompiles the shader program as a compiled multiversion shader 506 that includes at least two versions of the at least one region 506 that has resource usage above the threshold. At least one version has resource usage less than or equal to the threshold and at least one other version has resource usage above the threshold. At least one such version may be an unmodified version of the region 506, that is, the version of the region in the initial compiled shader 506.); and 
generate a third version of the GPU program accessing the second resource from the fast shared resource, wherein the second processor is configured to: 
determine whether to store the second resource on the global memory or the fast shared resource based on the hardware resources available for the GPU program and utilize the third version of the GPU program if the second resource is to be stored on the fast shared resource (Abstract; [0057] One example resource is number of registers used. One example technique for reducing the number of registers used by a region or shader is register spilling. In register spilling, the compiler 404 begins with a first form of compiled instructions that uses a number of registers above a threshold. Then the compiler 404 identifies a subset of the number of registers to “spill” into memory (such as a local memory in the compute unit 132). The compiler 404 then converts instructions that use this subset of registers into instructions that read from and write to memory. The result is that the remaining set of instructions uses a number of registers less than or equal to the threshold (i.e., third version of the GPU program). It should be noted that the difference between “registers” and “memory” is that registers are low latency but low capacity memory elements (i.e., fast shared resource) while “memory” has a higher latency but higher capacity than registers. [0060] Then, shaders are compiled with versions based on the resource usage of all shaders that are compiled. [0061] It should be understood that throughout the present disclosure, the versions of instructions are different “versions” in the sense that the different versions have different resource utilization.).

In addition, Kroft teaches wherein the processor is further configured to: 
determine a benefit value for each resource of the set of resources based on a cost function and select the resource from the set of resources based on the benefit value of the resource (Col. 4, lines 52-65: a cache buffer memory is considered to be any small fast memory holding the most recently accessed data in a computer system and its immediately surrounding neighbors in a logical sense. Because the access time of this cache buffer memory is usually an order of magnitude faster than the main or central computer memory and the standard software practice is to localized data, the effective memory access time is considerably reduced in a computer system in which a cache buffer memory is included. The cost increment for the cache memory when compared with the cost of a central memory alone shows the cost effectiveness of the cache memory because of the reduced access time.). 

Regarding claim 7, Udayakumaran teaches wherein the second processor is configured to: utilize the first version of the GPU program if the resource and the second resource are both to be stored on the global memory, wherein the first version of the GPU program accesses the resource and the second resource from the global memory (Abstract; [0057] One example resource is number of registers used. One example technique for reducing the number of registers used by a region or shader is register spilling. In register spilling, the compiler 404 begins with a first form of compiled instructions that uses a number of registers above a threshold. Then the compiler 404 identifies a subset of the number of registers to “spill” into memory (such as a local memory in the compute unit 132). The compiler 404 then converts instructions that use this subset of registers into instructions that read from and write to memory (i.e., first version of the GPU program). The result is that the remaining set of instructions uses a number of registers less than or equal to the threshold. It should be noted that the difference between “registers” and “memory” is that registers are low latency but low capacity memory elements while “memory” has a higher latency but higher capacity than registers (i.e., global memory).).

Regarding claim 8, Udayakumaran teaches wherein the processor is configured to select the resource from the set of resources and select the second resource from the set of resources by determining that the benefit value of the resource and the benefit value of the second resource both exceed a threshold value ([0056] The compiler 404 sets a resource usage threshold and determines whether each region 506 is below or above the set threshold. The compiler 404 recompiles the shader sources 402 based on the determination of which regions 506 are above the threshold. For a region 506 having resource usage above the threshold, the compiler 404 recompiles that region 506 as a compiled multiversion shader 406 that includes two or more versions for the region 506. One of the versions is a version unmodified by a resource use reduction technique that would reduce the resource usage as compared with the version. The other version is the version that is modified by the compiler to reduce resource usage to less than or equal to the threshold. The compiler 404 includes each of those generated versions into the compiled multiversion shader 406. For a region 506 having resource usage below or equal to the threshold, the compiler 404 retains the version of that region 506 as compiled in an initial compiled shader 502. For shader programs where all regions 506 have resource usage below the threshold, the compiler 404 marks the initial compiled shader 502 as the compiled shader 408 to be included in the stitched shader program 422. For shader programs where at least one region 506 has resource usage above the threshold, the compiler 404 recompiles the shader program as a compiled multiversion shader 506 that includes at least two versions of the at least one region 506 that has resource usage above the threshold. At least one version has resource usage less than or equal to the threshold and at least one other version has resource usage above the threshold. At least one such version may be an unmodified version of the region 506, that is, the version of the region in the initial compiled shader 506. [0057] One example resource is number of registers used. One example technique for reducing the number of registers used by a region or shader is register spilling. In register spilling, the compiler 404 begins with a first form of compiled instructions that uses a number of registers above a threshold. Then the compiler 404 identifies a subset of the number of registers to “spill” into memory (such as a local memory in the compute unit 132). The compiler 404 then converts instructions that use this subset of registers into instructions that read from and write to memory. The result is that the remaining set of instructions uses a number of registers less than or equal to the threshold. It should be noted that the difference between “registers” and “memory” is that registers are low latency but low capacity memory elements while “memory” has a higher latency but higher capacity than registers.).
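
For illustration only, the threshold-driven multiversioning and register spilling described in the cited paragraphs [0056]-[0057] can be sketched as follows. This is a minimal sketch and not code from Udayakumaran: the names RegionVersion, spill_to_threshold, and compile_region are hypothetical, and the choice of which registers to spill (simply those beyond the first `threshold` entries) is an assumption made to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionVersion:
    """One compiled version of a region (hypothetical structure)."""
    name: str
    registers_used: int
    spilled: tuple  # registers whose accesses were rewritten as memory reads/writes


def spill_to_threshold(name, registers, threshold):
    """Build a version that uses at most `threshold` registers.

    Assumption: the registers beyond the first `threshold` are the ones
    spilled; a real compiler would pick the subset with a heuristic.
    """
    kept = registers[:threshold]
    spilled = registers[threshold:]
    return RegionVersion(f"{name}-spilled", len(kept), tuple(spilled))


def compile_region(name, registers, threshold):
    """Return the versions to keep for one region.

    At or below the threshold, only the unmodified version is retained.
    Above the threshold, both the unmodified version and a spilled
    version are kept, giving a multiversion region.
    """
    unmodified = RegionVersion(name, len(registers), ())
    if len(registers) <= threshold:
        return [unmodified]
    return [unmodified, spill_to_threshold(name, registers, threshold)]


if __name__ == "__main__":
    for version in compile_region("region0", ["v0", "v1", "v2", "v3"], threshold=2):
        print(version)
```

In this sketch a region that fits within the threshold yields a single version, while a region that exceeds it yields two versions with different resource utilization, which mirrors the sense of “versions” quoted from [0061].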

Regarding claim 9, Udayakumaran teaches wherein the processor is configured to select the second resource from the set of resources by determining that the benefit value of the second resource is at least a threshold percentage of the benefit value of the first resource ([0056] The compiler 404 sets a resource usage threshold and determines whether each region 506 is below or above the set threshold. The compiler 404 recompiles the shader sources 402 based on the determination of which regions 506 are above the threshold. For a region 506 having resource usage above the threshold, the compiler 404 recompiles that region 506 as a compiled multiversion shader 406 that includes two or more versions for the region 506. One of the versions is a version unmodified by a resource use reduction technique that would reduce the resource usage as compared with the version. The other version is the version that is modified by the compiler to reduce resource usage to less than or equal to the threshold. The compiler 404 includes each of those generated versions into the compiled multiversion shader 406. For a region 506 having resource usage below or equal to the threshold, the compiler 404 retains the version of that region 506 as compiled in an initial compiled shader 502. For shader programs where all regions 506 have resource usage below the threshold, the compiler 404 marks the initial compiled shader 502 as the compiled shader 408 to be included in the stitched shader program 422. For shader programs where at least one region 506 has resource usage above the threshold, the compiler 404 recompiles the shader program as a compiled multiversion shader 506 that includes at least two versions of the at least one region 506 that has resource usage above the threshold. At least one version has resource usage less than or equal to the threshold and at least one other version has resource usage above the threshold. At least one such version may be an unmodified version of the region 506, that is, the version of the region in the initial compiled shader 506. [0057] One example resource is number of registers used. One example technique for reducing the number of registers used by a region or shader is register spilling. In register spilling, the compiler 404 begins with a first form of compiled instructions that uses a number of registers above a threshold. Then the compiler 404 identifies a subset of the number of registers to “spill” into memory (such as a local memory in the compute unit 132). The compiler 404 then converts instructions that use this subset of registers into instructions that read from and write to memory. The result is that the remaining set of instructions uses a number of registers less than or equal to the threshold. It should be noted that the difference between “registers” and “memory” is that registers are low latency but low capacity memory elements while “memory” has a higher latency but higher capacity than registers.).

Regarding claim 16, Udayakumaran teaches latency associated with the location of data (register vs. memory), and Gangani discusses a cost function based on memory access latency, but neither Udayakumaran nor Gangani expressly teaches wherein the second processor is configured to: determine a benefit value for each resource of the set of resources based on a cost function, and select the resource from the set of resources based on the benefit value of the resource.
	
However, Kroft teaches wherein the second processor is configured to: determine a benefit value for each resource of the set of resources based on a cost function, and select the resource from the set of resources based on the benefit value of the resource (Col. 4, lines 52-65: a cache buffer memory is considered to be any small fast memory holding the most recently accessed data in a computer system and its immediately surrounding neighbors in a logical sense. Because the access time of this cache buffer memory is usually an order of magnitude faster than the main or central computer memory and the standard software practice is to localize data, the effective memory access time is considerably reduced in a computer system in which a cache buffer memory is included. The cost increment for the cache memory when compared with the cost of a central memory alone shows the cost effectiveness of the cache memory because of the reduced access time.).
	
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Kroft's teaching of retaining frequently accessed data in cache registers rather than in global/central memory, in order to reduce access time, with Udayakumaran's teaching of keeping data in registers up to a threshold and spilling the rest to memory.
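
For illustration only, one way a benefit value could be derived from an access-time cost function and used to select resources is sketched below. This is a minimal sketch under stated assumptions and is not taken from Kroft, Udayakumaran, Gangani, or the claims: the functions benefit and select_resources, the latency figures, and the minimum-benefit cutoff are all hypothetical, and the benefit is assumed to be the total access time saved by serving a resource from the faster memory.

```python
def benefit(accesses, slow_latency, fast_latency):
    """Hypothetical benefit value: total access time saved when a resource
    is served from the fast memory instead of the slow memory."""
    return accesses * (slow_latency - fast_latency)


def select_resources(access_counts, slow_latency, fast_latency, min_benefit):
    """Compute a benefit value per resource and select those above a cutoff.

    access_counts maps a resource name to its number of accesses.
    Returns (benefits, chosen), where chosen lists the selected resource
    names ordered from highest to lowest benefit.
    """
    benefits = {
        name: benefit(count, slow_latency, fast_latency)
        for name, count in access_counts.items()
    }
    chosen = sorted(
        (name for name, value in benefits.items() if value > min_benefit),
        key=lambda name: benefits[name],
        reverse=True,
    )
    return benefits, chosen


if __name__ == "__main__":
    counts = {"buf_a": 100, "buf_b": 5, "buf_c": 40}
    benefits, chosen = select_resources(
        counts, slow_latency=10, fast_latency=1, min_benefit=300
    )
    print(benefits)  # {'buf_a': 900, 'buf_b': 45, 'buf_c': 360}
    print(chosen)    # ['buf_a', 'buf_c']
```

The frequently accessed resources receive the largest benefit values in this sketch, so they are the ones selected for the fast memory, while the rarely accessed resource falls below the assumed cutoff.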

Regarding claim 17, Kroft teaches wherein the cost function is based on a number of reduced memory access instructions associated with storing the resource on the fast shared resource (Col. 4, lines 52-65: Because the access time of this cache buffer memory is usually an order of magnitude faster than the main or central computer memory and the standard software practice is to localize data, the effective memory access time is considerably reduced in a computer system in which a cache buffer memory is included. The cost increment for the cache memory when compared with the cost of a central memory alone shows the cost effectiveness of the cache memory because of the reduced access time.), an amount of saved memory access bandwidth associated with storing the resource on the fast shared resource, or an amount of reduced register use associated with storing the resource on the fast shared resource.

Regarding claim 22, it is a method type claim having similar limitations as claim 4 above. Therefore, it is rejected under the same rationale above.

Regarding claim 23, it is a method type claim having similar limitations as claim 5 above. Therefore, it is rejected under the same rationale above.

Regarding claim 24, it is a method type claim having similar limitations as claim 6 above. Therefore, it is rejected under the same rationale above.

Regarding claim 25, it is a method type claim having similar limitations as claim 8 above. Therefore, it is rejected under the same rationale above.

Regarding claim 26, it is a method type claim having similar limitations as claim 9 above. Therefore, it is rejected under the same rationale above.



Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JORGE A CHU JOY-DAVILA whose telephone number is (571)270-0692. The examiner can normally be reached Monday-Friday, 9:00am-5:00pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Meng-Ai T An, can be reached on (571)-272-3756. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/JORGE A CHU JOY-DAVILA/
Primary Examiner, Art Unit 2195