DETAILED ACTION
This Office Action is in response to the Applicant's communication filed on April 7, 2022, which amends independent claims 1, 10, and 13, amends dependent claims 5-6 and 17, and presents arguments. The communication is hereby acknowledged. Claims 1-20 are currently pending and have been examined.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment
Applicant’s arguments filed on April 7, 2022, have been fully considered.
Applicant argues that, by this response, independent claims 1, 10, and 13 are amended to add a new limitation, “provide access to data stored in the memory circuitry according to multiple threadgroup spaces that are assigned to a given threadgroup and not accessible to other threadgroups”, in order to overcome the 35 U.S.C. §103 rejection.
Examiner replies that the amended claims with the new limitation may overcome the cited portions of the prior art. However, a newly found reference, Potter et al. (US 20180182058 A1), teaches the added limitation of providing access to data stored in the memory circuitry according to multiple threadgroup spaces that are assigned to a given threadgroup and not accessible to other threadgroups (See Potter: Figs. 1-2, and [0043], “Local memory 230, in some embodiments, is a physical memory implementation of an API-defined threadgroup memory space and is accessible to compute kernel and fragment processing tasks (e.g., from a compute data master scheduler and a pixel data master scheduler respectively) and may not be accessible for other tasks such as vertex processing. All work items within a thread group (a group of threads assigned to the same shader processing element(s) and scheduled to share a memory context) see the same allocation in local memory. In some embodiments, for fragment processing, local memory is tile-scoped such that all threads corresponding to a tile see the same local memory allocation. As discussed in further detail below, these threads may include fragment threads and mid-render compute threads. Note that threads may also be assigned to the same single-instruction multiple data (SIMD) group, which may include a portion of the threads in a thread group, where threads in the SIMD group execute the same instructions (other than instructions that are predicated off, in embodiments with predicated execution) and are scheduled to execute in parallel using parallel hardware. The number of threads assigned to SIMD groups may vary, in different embodiments”). The remaining arguments of the Applicant are moot in view of the newly found reference.
Examiner respectfully further replies that the Applicant's arguments have been fully considered and that new grounds of rejection are set forth below. Since the new grounds of rejection are necessitated by Applicant's amendments to the claims, the present action is made final.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Acharya et al. (US 20170221173 A1) in view of Clohset et al. (US 20120139926 A1), further in view of Potter et al. (US 20180182058 A1).
Regarding claim 1, Acharya teaches that an apparatus (See Acharya: Fig. 1, and [0023], "FIG. 1 is a block diagram illustrating an example computing device 2 that may be used to implement the techniques of this disclosure. Computing device 2 may comprise a personal computer, a desktop computer, a laptop computer, a computer workstation, a video game platform or console, a wireless communication device (such as, e.g., a mobile telephone, a cellular telephone, a satellite telephone, and/or a mobile telephone handset), a landline telephone, an Internet telephone, a handheld device such as a portable video game device or a personal digital assistant (PDA), a personal music player, a video player, a display device, a television, a television set-top box, a server, an intermediate network device, a mainframe computer, an advanced driver assistance system (ADAS), a virtual reality headset, a drone, or any other type of device that processes and/or displays graphical data"), comprising:
first and second graphics shader cores configured to execute instructions for multiple threadgroups, wherein the first graphics shader core is configured to execute (See Acharya: Fig. 1, and [0098], "In some examples, GPU 12 may preempt different parts of a graphics processing pipeline executing on GPU 12 at varied preemption granularities. To perform graphics operations, GPU 12 may implement a graphics processing pipeline. The graphics processing pipeline includes performing functions as defined by software or firmware executing on GPU 12 and performing functions by fixed-function units that are hardwired to perform very specific functions. The software or firmware executing on the GPU 12 may be referred to as shaders, and the shaders may execute on one or more shader cores of GPU 12. Shaders provide users with functional flexibility because a user can design the shaders to perform desired tasks in any conceivable manner. The fixed-function units, however, are hardwired for the manner in which the fixed-function units perform tasks. Accordingly, the fixed-function units may not provide much functional flexibility"):
a first threadgroup with multiple single-instruction multiple-data (SIMD) groups configured to execute a first shader program (See Acharya: Fig. 2, and [0088], "In one example, if processing units 34 comprise parallel processors (e.g., shader processors), command stream 36A may include, in its set of commands, data parallel code that may be executed by the parallel processors of processing units 34. For example, processing units 34 may execute the same set of commands (or sub-draw call level commands indicated by draw level commands) of command stream 36A to operate on multiple data values in parallel. For each instance of the set of commands, processing units 34 may spawn a thread, also referred to a kernel, for that instance of the set of commands, and may execute the thread on one of its parallel processors. The group of threads or kernels for a particular set of commands may be grouped into one or more workgroups, and processing units 34 may execute a workgroup of the kernels in parallel"); and
a second threadgroup with multiple SIMD groups configured to execute a second, different shader program (See Acharya: Fig. 1, and [0117], "GPU 12 may, in response to its one or more processing units reaching a point in the first set of commands indicated by the second preemption boundary, perform a context switch to the second set of commands by saving a state of the GPU associated with execution of the first set of commands and dispatch one or more of the second set of commands for execution on the one or more processing units of GPU 12");
memory circuitry (See Acharya: Fig. 1, and [0024], "As illustrated in the example of FIG. 1, computing device 2 includes a user input interface 4, a CPU 6, a memory controller 8, a system memory 10, a graphics processing unit (GPU) 12, a local memory 14, a display interface 16, a display 18 and bus 20. User input interface 4, CPU 6, memory controller 8, GPU 12 and display interface 16 may communicate with each other using bus 20. Bus 20 may be any of a variety of bus structures, such as a third generation bus (e.g., a HyperTransport bus or an lnfiniBand bus), a second generation bus (e.g., an Advanced Graphics Port bus, a Peripheral Component Interconnect (PCI) Express bus, or an Advanced eXentisible Interface (AXI) bus) or another type of bus or device interconnect. It should be noted that the specific configuration of buses and communication interfaces between the different components shown in FIG. 1 is merely exemplary, and other configurations of computing devices and/or other graphics processing systems with the same or different components may be used to implement the techniques of this disclosure"); and
control circuitry, wherein the control circuitry is configured to: 
provide access to data stored in the memory circuitry according to a shader memory space that is accessible to threadgroups executed by the first graphics shader core, including the first and second threadgroups (See Acharya: Fig. 1, and [0027], "Memory controller 8 facilitates the transfer of data going into and out of system memory 10. For example, memory controller 8 may receive memory read and write commands, and service such commands with respect to memory 10 in order to provide memory services for the components in computing device 2. Memory controller 8 is communicatively coupled to system memory 10. Although memory controller 8 is illustrated in the example computing device 2 of FIG. 1 as being a processing module that is separate from both CPU 6 and system memory 10, in other examples, some or all of the functionality of memory controller 8 may be implemented on one or both of CPU 6 and system memory 10"), but is not accessible to threadgroups executed by the second graphics shader core; and provide access to data stored in the memory circuitry according to multiple threadgroup spaces that are assigned to a given threadgroup and not accessible to other threadgroups. 
However, Acharya fails to explicitly disclose that the control circuitry is configured to provide access to data stored in the memory circuitry according to a shader memory space that is accessible to threadgroups executed by the first graphics shader core, including the first and second threadgroups, but is not accessible to threadgroups executed by the second graphics shader core; and provide access to data stored in the memory circuitry according to multiple threadgroup spaces that are assigned to a given threadgroup and not accessible to other threadgroups.
However, Clohset teaches that the control circuitry is configured to provide access to data stored in the memory circuitry according to a shader memory space that is accessible to threadgroups executed by the first graphics shader core, including the first and second threadgroups, but is not accessible to threadgroups executed by the second graphics shader core (See Clohset: Figs. 1-3, and [0064], "Each thread in the system can define a plurality of fibres, each with a respective scheduling key. Further, a plurality of fibre routines can be defined, such that a given thread can define a number of fibres, which are instances of different fibre routines. As explained above, in an exemplary aspect, fibres of the same fibre routine (sharing a code base), and from the same parent (e.g., a parent thread or fibre), do not overlap in execution in implementations that avoid mutexes or locks for accessing at least one variable or data source that is available only to those fibres").
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Acharya such that the control circuitry is configured to provide access to data stored in the memory circuitry according to a shader memory space that is accessible to threadgroups executed by the first graphics shader core, including the first and second threadgroups, but is not accessible to threadgroups executed by the second graphics shader core, as taught by Clohset, in order to allow better usage of parallel computation resources (See Clohset: Figs. 1-2, and [0051], "One area that remains a topic of consideration is how to subdivide a given computing task to take advantage of a parallelized computation resource. In some aspects, the following relates to methods, components and systems of computation that provide capabilities to subdivide a computing task in ways that can allow better usage of parallel computation resources"). Acharya teaches a method and system that may dispatch a first set of commands to one or more GPUs, generate a notification from the host device indicating that a second set of commands is ready to be executed, and select a preemption scheme that balances interrupting lower priority commands in time such that the GPU can be free to execute the higher priority set of commands by its scheduling deadline while minimizing the overhead to perform context switching, while Clohset teaches a system and method that may select processors and allocate memory associated with those processors in order to group the workloads according to compatibility of memory usage requirements. Therefore, it would have been obvious to one of ordinary skill in the art to modify Acharya by Clohset to control memory accessibility for each processing unit (shader core) in order to maintain data consistency. The motivation to modify Acharya by Clohset is "Use of known technique to improve similar devices (methods, or products) in the same way".
However, Acharya, as modified by Clohset, fails to explicitly disclose that the control circuitry is configured to provide access to data stored in the memory circuitry according to multiple threadgroup spaces that are assigned to a given threadgroup and not accessible to other threadgroups.
However, Potter teaches that the control circuitry is configured to provide access to data stored in the memory circuitry according to multiple threadgroup spaces that are assigned to a given threadgroup and not accessible to other threadgroups (See Potter: Figs. 1-2, and [0043], “Local memory 230, in some embodiments, is a physical memory implementation of an API-defined threadgroup memory space and is accessible to compute kernel and fragment processing tasks (e.g., from a compute data master scheduler and a pixel data master scheduler respectively) and may not be accessible for other tasks such as vertex processing. All work items within a thread group (a group of threads assigned to the same shader processing element(s) and scheduled to share a memory context) see the same allocation in local memory. In some embodiments, for fragment processing, local memory is tile-scoped such that all threads corresponding to a tile see the same local memory allocation. As discussed in further detail below, these threads may include fragment threads and mid-render compute threads. Note that threads may also be assigned to the same single-instruction multiple data (SIMD) group, which may include a portion of the threads in a thread group, where threads in the SIMD group execute the same instructions (other than instructions that are predicated off, in embodiments with predicated execution) and are scheduled to execute in parallel using parallel hardware. The number of threads assigned to SIMD groups may vary, in different embodiments”).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Acharya to provide access to data stored in the memory circuitry according to multiple threadgroup spaces that are assigned to a given threadgroup and not accessible to other threadgroups, as taught by Potter, in order to enforce consistency at the threadgroup level or the device level (See Potter: Figs. 1-2, and [0063], "In table 1, “gm” refers to global memory (where “lm” refers to local memory, e.g. for threadgroup-scoped variables). As shown in table 1, synchronization instructions are used at different scopes to enforce consistency at the threadgroup level or the device level. In the example of Table 1, device-scoped synchronization occurs at L2 cache 250 while threadgroup-scoped synchronization occurs at L1 data cache 220. Further, the device-scoped expansions use write-through cache controls for L1 data cache 220 while the threadgroup-scoped expansions do not. Note that threadgroup-scoped operations may not require synchronization operations but may be merely implemented as atomic.lm.OP, for example"). Acharya teaches a method and system that may dispatch a first set of commands to one or more GPUs, generate a notification from the host device indicating that a second set of commands is ready to be executed, and select a preemption scheme that balances interrupting lower priority commands in time such that the GPU can be free to execute the higher priority set of commands by its scheduling deadline while minimizing the overhead to perform context switching, while Potter teaches a system and method that may have a local memory that is accessible to a set of the shader processing elements and not accessible to other ones of the shader processing elements, in order to maintain memory consistency.
Therefore, it is obvious to one of ordinary skill in the art to modify Acharya by Potter to control memory accessibility for different thread groups in order to maintain the data consistency. The motivation to modify Acharya by Potter is "Use of known technique to improve similar devices (methods, or products) in the same way".
Regarding claim 2, Acharya, Clohset, and Potter teach all the features with respect to claim 1 as outlined above. Further, Clohset teaches that the apparatus of claim 1, wherein a first cache in the first graphics shader core is a coherence point for the shader memory space (See Clohset: Fig. 9, and [0137], "The example of FIG. 9 also shows that each ALU 471-473 maintains a port to cache 480. Cache 480 stores thread local data as exemplified by thread local memory 485-487; cache 480 also can store cache global variables 488. Cache 480 also includes a plurality of fibre memory locations 490-492. The example of FIG. 9 also comprises a broadcast input queue 495. In the example of FIG. 7, each ALU 471-473 can use cache 480 in a manner similar to a register set such that SIMD cluster controller 455 schedule instructions for different threads and different fibres on an instruction by instruction basis without incurring latency") and a second, higher-level cache in the apparatus is a coherence point for device memory space (See Clohset: Fig. 2, and [0099], "Exemplary system 10 also may comprise a cache hierarchy 15 that includes one or more levels of cache memory, and a system memory interface 16 that can interface with a main memory, which can be implemented as one or more of high speed graphics RAM, DRAM, and the like. Approaches to large scale memory capacity may be adapted as new technologies are developed, and usage of well-known acronyms, such as DRAM, is not intended to confine the applicability of disclosed aspects to a given process or memory technology").
Regarding claim 3, Acharya, Clohset, and Potter teach all the features with respect to claim 1 as outlined above. Further, Acharya and Clohset teach that the apparatus of claim 1, wherein the control circuitry is further configured to provide access to data stored in the memory circuitry according to the following memory spaces:
a threadgroup memory space for the first threadgroup that is accessible to the first threadgroup but not accessible to any other threadgroups (See Clohset: Fig. 3, and [0100], "FIG. 3 depicts another exemplary system 202 in which disclosed aspects can be practiced. System 202 comprises a packet unit 105, which includes an empty stack 108, a local storage allocator 208, a ready stack 210, a collection definition memory 107, and a packer 109. Packet unit 105 can communicate with coarse scheduler 222, which can include a thread memory status module 220 and a thread scheduler 221. In some aspects, threads can be allocated execution resources by thread scheduler 221, and status module 220 can track memory usage by such threads. Such memory status can be used in scheduling instances of fibre routines. Packet unit 105 collects groupings of fibres to be distributed among the plurality of compute clusters, which will perform work specified by the fibres, as described below. Coarse scheduler 222 tracks usage of computation resources in the plurality of computation clusters, such as memory allocation and usage. In some implementations, an allocation of a portion of a local memory in a particular computation cluster is static and assigned when setting up the thread on that computation cluster. Coarse scheduler 222 also can allocate fibres for execution in the clusters");
a thread memory space that is accessible to a single thread (See Clohset: Fig. 3, and [0100], "In some aspects, threads can be allocated execution resources by thread scheduler 221, and status module 220 can track memory usage by such threads. Such memory status can be used in scheduling instances of fibre routines. Packet unit 105 collects groupings of fibres to be distributed among the plurality of compute clusters, which will perform work specified by the fibres, as described below. Coarse scheduler 222 tracks usage of computation resources in the plurality of computation clusters, such as memory allocation and usage"); and
a device memory space that is accessible to threadgroups executed by both the first and second graphics shader cores (See Acharya: Fig. 1, and [0028], "System memory 10 may store program modules and/or instructions that are accessible for execution by CPU 6 and/or data for use by the programs executing on CPU 6. For example, system memory 10 may store user applications and graphics data associated with the applications. System memory 10 may additionally store information for use by and/or generated by other components of computing device 2. For example, system memory 10 may act as a device memory for GPU 12 and may store data to be operated on by GPU 12 as well as data resulting from operations performed by GPU 12. For example, system memory 10 may store any combination of texture buffers, depth buffers, stencil buffers, vertex buffers, frame buffers, or the like. In addition, system memory 10 may store sets of commands, such as command streams, for processing by GPU 12. System memory 10 may include one or more volatile or non-volatile memories or storage devices, such as, for example, random access memory (RAM), static RAM (SRAM), dynamic RAM (DRAM), read-only memory (ROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), Flash memory, a magnetic data media or an optical storage media").
Regarding claim 4, Acharya, Clohset, and Potter teach all the features with respect to claim 1 as outlined above. Further, Acharya teaches that the apparatus of claim 1, wherein the shader memory space is also accessible to one or more co-processors for the first graphics shader core (See Acharya: Fig. 1, and [0024], "As illustrated in the example of FIG. 1, computing device 2 includes a user input interface 4, a CPU 6, a memory controller 8, a system memory 10, a graphics processing unit (GPU) 12, a local memory 14, a display interface 16, a display 18 and bus 20. User input interface 4, CPU 6, memory controller 8, GPU 12 and display interface 16 may communicate with each other using bus 20. Bus 20 may be any of a variety of bus structures, such as a third generation bus (e.g., a HyperTransport bus or an lnfiniBand bus), a second generation bus (e.g., an Advanced Graphics Port bus, a Peripheral Component Interconnect (PCI) Express bus, or an Advanced eXentisible Interface (AXI) bus) or another type of bus or device interconnect. It should be noted that the specific configuration of buses and communication interfaces between the different components shown in FIG. 1 is merely exemplary, and other configurations of computing devices and/or other graphics processing systems with the same or different components may be used to implement the techniques of this disclosure").
Regarding claim 5, Acharya, Clohset, and Potter teach all the features with respect to claim 4 as outlined above. Further, Clohset teaches that the apparatus of claim 4, wherein the one or more co-processors include ray intersection circuitry (See Clohset: Fig. 1, and [0075], "The ray identifiers can be provided from a ready packet list 164 that is controlled (via control 166) from a packet unit 155. In the example of ray intersection testing, ready packet list 164 can contain a list of ray identifiers to be tested for intersection against one or more shapes identified in the packet (either by reference or by included data). Abstraction point 160 receives such a packet from ready packet list 164 and splits the ray identifiers among the buffers 168a-168n based on which of the processing elements 169a-169n are to process such rays. In one example, the ray identifiers are distributed according to which processing element has cache access to definition data for the identified ray") configured to, based on an instruction of the first threadgroup, traverse a spatially organized data structure to determine one or more primitives against which a ray is to be tested for intersection (See Clohset: Fig. 5, and [0113], "FIG. 5 is only exemplary in that a scheduler process can provide a variety of collection points, within shader modules, based on calls to provided modules, based on access to defined regions of memory that have been loaded with object data for certain objects or object portions, and so on. In one aspect, ray intersection processing can be collected at a start of diffuse lighting calculations, such that diffuse lighting calculations can proceed for a number of rays that intersected portions of the same object, and in other examples, rays that intersected portions of the same or different object, and whose shaders use a diffuse lighting call can be collected").
Regarding claim 6, Acharya, Clohset, and Potter teach all the features with respect to claim 5 as outlined above. Further, Clohset teaches that the apparatus of claim 5, wherein the ray intersection circuitry is configured to: 
initiate the second threadgroup to test the one or more primitives against the ray (See Clohset: Fig. 25, and [0189], "FIG. 25 depicts an example process by which local storage allocations can be determined. According to the example, a local storage allocation process may be initiated responsive to receiving 1152 a request for a new computation instance of a code module (or equivalent mechanism to identify a configuration or some other programmatic configuration to be executed, which will be referred to as a code module, for simplicity). In most cases, a data set or a portion thereof, or an initial portion of a data set will be specified as well. At 1154, a determination whether the code module instance has been profiled is made. If so, then at 1156 characteristics of the code module are identified, and based on these characteristics and the request, the new instance is prioritized at 1158. If the code module (or instance thereof) has not been profiled, then profiling (1155) can be performed. The code module can be parsed (1170) for flags or compiler or programmer provided allocation information and an entry in a profiled modules list or table can be created (1172)");
wherein both the first threadgroup and the second threadgroup operate on ray information stored in the shader memory space (See Clohset: Fig. 6, and [0122], "Although a buffered approach was described above, aspects of ray sorting and collection described herein do not require such buffering. For example, groupings of ray information for which intersections have been determined can be outputted immediately after intersection testing, without an intermediate buffering. For example, in some cases, intersection testing resources can concurrently test 32, 64 or more rays for intersection with selections of primitives that can be related to, or part of, the same scene object. Any rays found to intersect from that concurrently testing can be outputted as a group, without buffering, such as buffering to await more rays intersecting the same object. In other implementations, buffering can be used to aggregate hundreds or even thousands of rays for outputting to shading").
Regarding claim 7, Acharya, Clohset, and Potter teach all the features with respect to claim 1 as outlined above. Further, Acharya teaches that the apparatus of claim 1, wherein the first graphics shader core is configured to execute load, store, and atomics instructions that target the shader memory space (See Acharya: Figs. 1-3, and [0048], "After execution of the preempting command stream has completed, command engine 32 may restore the saved state of the preempted command stream that was executing prior to the preemption notification. Restoring the saved state may involve, for example, reloading state variables into the one or more registers of GPU 12 and/or reloading a saved memory state into local GPU memory. In examples where the GPU state is saved to memory 10, command engine 32 may reload the saved state stored in memory 10 onto GPU 12 for further execution of the preempted command stream"; and [0060], "In the techniques described in this disclosure, GPU 12 may issue multiple preemption commands at different preemption granularities to interrupt processing units 34's execution of a first set of commands in response to GPU 12 receiving an indication that a second, higher-priority set of commands is ready for execution. GPU 12 may select an appropriate granularity at which to preempt execution of a lower priority set of commands in lieu of executing a higher priority set of commands such that the higher priority set of commands may meet an associated scheduling deadline but while minimizing overhead needed for preemption").
Regarding claim 8, Acharya, Clohset, and Potter teach all the features with respect to claim 1 as outlined above. Further, Acharya teaches that the apparatus of claim 1, wherein the first graphics shader core is configured to execute a first SIMD group of the first threadgroup to use the shader memory space to store intermediate graphics work at thread granularity to be further processed by threads of a dynamically-formed SIMD group (See Acharya: Fig. 1, and [0028], "System memory 10 may store program modules and/or instructions that are accessible for execution by CPU 6 and/or data for use by the programs executing on CPU 6. For example, system memory 10 may store user applications and graphics data associated with the applications. System memory 10 may additionally store information for use by and/or generated by other components of computing device 2. For example, system memory 10 may act as a device memory for GPU 12 and may store data to be operated on by GPU 12 as well as data resulting from operations performed by GPU 12. For example, system memory 10 may store any combination of texture buffers, depth buffers, stencil buffers, vertex buffers, frame buffers, or the like. In addition, system memory 10 may store sets of commands, such as command streams, for processing by GPU 12. System memory 10 may include one or more volatile or non-volatile memories or storage devices, such as, for example, random access memory (RAM), static RAM (SRAM), dynamic RAM (DRAM), read-only memory (ROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), Flash memory, a magnetic data media or an optical storage media"; and [0111], "GPU 12 may, in response to GPU 12 failing to preempt execution of the first set of commands within an elapsed time period after issuing the first preemption command, dynamically issue a second preemption command at a second preemption granularity to the one or more processing units of the GPU, wherein the second preemption granularity is finer-grained than the first preemption granularity (108). For example, if the first preemption command was at a draw call level granularity, the second preemption command may be at a sub-draw call level granularity (e.g., primitive level preemption granularity or pixel tile level preemption granularity) or an instruction level granularity. Such an issuance of the second preemption command at the second preemption granularity may be transparent to the host device, such as CPU 6, such that CPU 6 does not receive any indications that GPU 12 has issued the second preemption command").
Regarding claim 9, Acharya, Clohset, and Potter teach all the features with respect to claim 8 as outlined above. Further, Clohset teaches that the apparatus of claim 8, wherein the dynamically-formed SIMD group includes a set of threads determined to have the same condition result for a conditional control transfer instruction (See Clohset: Fig. 4, and [0106], "FIG. 4 depicts an architecture with a unified datapath for distributing information describing heterogeneous workloads. As an example, one type of workload can be composed of workloads that may largely consist of a temporal stream of data elements that are processed by a fixed sequence of pipeline stages; in some cases, portions of these pipeline stages can be programmable. In some cases, the temporal stream of data elements may be processed by a small kernel of program code. Another type of workload is one that has many decision points, or branches, conditional operators and the like, such that a group of data elements being processed may not all need to have the same operations performed on them (e.g., one data element may evaluate to take one branch of a conditional, and another data element may evaluate to take a different branch). Such workloads may benefit from sophisticated approaches to storage of data elements, and approaches to handling accesses to memories in order to obtain data (e.g., main memory accesses)").
Regarding claim 10, Acharya, Clohset, and Potter teach all the features with respect to claim 1 as outlined above. Further, Acharya, Clohset, and Potter teach that a method (See Acharya: Fig. 1, and [0023], "FIG. 1 is a block diagram illustrating an example computing device 2 that may be used to implement the techniques of this disclosure. Computing device 2 may comprise a personal computer, a desktop computer, a laptop computer, a computer workstation, a video game platform or console, a wireless communication device (such as, e.g., a mobile telephone, a cellular telephone, a satellite telephone, and/or a mobile telephone handset), a landline telephone, an Internet telephone, a handheld device such as a portable video game device or a personal digital assistant (PDA), a personal music player, a video player, a display device, a television, a television set-top box, a server, an intermediate network device, a mainframe computer, an advanced driver assistance system (ADAS), a virtual reality headset, a drone, or any other type of device that processes and/or displays graphical data"), comprising:
executing, by first and second graphics shader cores, instructions of multiple threadgroups, including executing, by the first graphics shader core (See Acharya: Fig. 1, and [0098], "In some examples, GPU 12 may preempt different parts of a graphics processing pipeline executing on GPU 12 at varied preemption granularities. To perform graphics operations, GPU 12 may implement a graphics processing pipeline. The graphics processing pipeline includes performing functions as defined by software or firmware executing on GPU 12 and performing functions by fixed-function units that are hardwired to perform very specific functions. The software or firmware executing on the GPU 12 may be referred to as shaders, and the shaders may execute on one or more shader cores of GPU 12. Shaders provide users with functional flexibility because a user can design the shaders to perform desired tasks in any conceivable manner. The fixed-function units, however, are hardwired for the manner in which the fixed-function units perform tasks. Accordingly, the fixed-function units may not provide much functional flexibility"):
a first threadgroup with multiple single-instruction multiple-data (SIMD) groups configured to execute a first shader program (See Acharya: Fig. 2, and [0088], "In one example, if processing units 34 comprise parallel processors (e.g., shader processors), command stream 36A may include, in its set of commands, data parallel code that may be executed by the parallel processors of processing units 34. For example, processing units 34 may execute the same set of commands (or sub-draw call level commands indicated by draw level commands) of command stream 36A to operate on multiple data values in parallel. For each instance of the set of commands, processing units 34 may spawn a thread, also referred to a kernel, for that instance of the set of commands, and may execute the thread on one of its parallel processors. The group of threads or kernels for a particular set of commands may be grouped into one or more workgroups, and processing units 34 may execute a workgroup of the kernels in parallel"); and 
a second threadgroup with multiple SIMD groups configured to execute a second, different shader program (See Acharya: Fig. 1, and [0117], "GPU 12 may, in response to its one or more processing units reaching a point in the first set of commands indicated by the second preemption boundary, perform a context switch to the second set of commands by saving a state of the GPU associated with execution of the first set of commands and dispatch one or more of the second set of commands for execution on the one or more processing units of GPU 12"); and
providing, by control circuitry, access to data stored in memory circuitry according to a shader memory space (See Acharya: Fig. 1, and [0024], "As illustrated in the example of FIG. 1, computing device 2 includes a user input interface 4, a CPU 6, a memory controller 8, a system memory 10, a graphics processing unit (GPU) 12, a local memory 14, a display interface 16, a display 18 and bus 20. User input interface 4, CPU 6, memory controller 8, GPU 12 and display interface 16 may communicate with each other using bus 20. Bus 20 may be any of a variety of bus structures, such as a third generation bus (e.g., a HyperTransport bus or an InfiniBand bus), a second generation bus (e.g., an Advanced Graphics Port bus, a Peripheral Component Interconnect (PCI) Express bus, or an Advanced eXentisible Interface (AXI) bus) or another type of bus or device interconnect. It should be noted that the specific configuration of buses and communication interfaces between the different components shown in FIG. 1 is merely exemplary, and other configurations of computing devices and/or other graphics processing systems with the same or different components may be used to implement the techniques of this disclosure") that is accessible to threadgroups executed by the first graphics shader core, including the first and second threadgroups (See Acharya: Fig. 1, and [0027], "Memory controller 8 facilitates the transfer of data going into and out of system memory 10. For example, memory controller 8 may receive memory read and write commands, and service such commands with respect to memory 10 in order to provide memory services for the components in computing device 2. Memory controller 8 is communicatively coupled to system memory 10. Although memory controller 8 is illustrated in the example computing device 2 of FIG. 1 as being a processing module that is separate from both CPU 6 and system memory 10, in other examples, some or all of the functionality of memory controller 8 may be implemented on one or both of CPU 6 and system memory 10"), but is not accessible to threadgroups executed by the second graphics shader core (See Acharya: Fig. 1, and [0027], as quoted above; and Clohset: Figs. 1-3, and [0064], "Each thread in the system can define a plurality of fibres, each with a respective scheduling key. Further, a plurality of fibre routines can be defined, such that a given thread can define a number of fibres, which are instances of different fibre routines. As explained above, in an exemplary aspect, fibres of the same fibre routine (sharing a code base), and from the same parent (e.g., a parent thread or fibre), do not overlap in execution in implementations that avoid mutexes or locks for accessing at least one variable or data source that is available only to those fibres"); and
providing, by the control circuitry, access to data stored in the memory circuitry according to multiple threadgroup spaces that are dedicated to a given threadgroup and not accessible to other threadgroups (See Potter: Figs. 1-2, and [0043], “Local memory 230, in some embodiments, is a physical memory implementation of an API-defined threadgroup memory space and is accessible to compute kernel and fragment processing tasks (e.g., from a compute data master scheduler and a pixel data master scheduler respectively) and may not be accessible for other tasks such as vertex processing. All work items within a thread group (a group of threads assigned to the same shader processing element(s) and scheduled to share a memory context) see the same allocation in local memory. In some embodiments, for fragment processing, local memory is tile-scoped such that all threads corresponding to a tile see the same local memory allocation. As discussed in further detail below, these threads may include fragment threads and mid-render compute threads. Note that threads may also be assigned to the same single-instruction multiple data (SIMD) group, which may include a portion of the threads in a thread group, where threads in the SIMD group execute the same instructions (other than instructions that are predicated off, in embodiments with predicated execution) and are scheduled to execute in parallel using parallel hardware. The number of threads assigned to SIMD groups may vary, in different embodiments”).
Regarding claim 11, Acharya, Clohset, and Potter teach all the features with respect to claim 10 as outlined above. Further, Acharya teaches that the method of claim 10, further comprising: accessing, by one or more co-processors for the first graphics shader core, data stored in the shader memory space (See Acharya: Fig. 1, and [0024], "As illustrated in the example of FIG. 1, computing device 2 includes a user input interface 4, a CPU 6, a memory controller 8, a system memory 10, a graphics processing unit (GPU) 12, a local memory 14, a display interface 16, a display 18 and bus 20. User input interface 4, CPU 6, memory controller 8, GPU 12 and display interface 16 may communicate with each other using bus 20. Bus 20 may be any of a variety of bus structures, such as a third generation bus (e.g., a HyperTransport bus or an InfiniBand bus), a second generation bus (e.g., an Advanced Graphics Port bus, a Peripheral Component Interconnect (PCI) Express bus, or an Advanced eXentisible Interface (AXI) bus) or another type of bus or device interconnect. It should be noted that the specific configuration of buses and communication interfaces between the different components shown in FIG. 1 is merely exemplary, and other configurations of computing devices and/or other graphics processing systems with the same or different components may be used to implement the techniques of this disclosure").
Regarding claim 12, Acharya, Clohset, and Potter teach all the features with respect to claim 10 as outlined above. Further, Acharya and Clohset teach that the method of claim 10, wherein a first SIMD group of the first threadgroup uses the shader memory space to store intermediate graphics work at thread granularity to be further processed by threads of a dynamically-formed SIMD group (See Acharya: Fig. 1, and [0028], "System memory 10 may store program modules and/or instructions that are accessible for execution by CPU 6 and/or data for use by the programs executing on CPU 6. For example, system memory 10 may store user applications and graphics data associated with the applications. System memory 10 may additionally store information for use by and/or generated by other components of computing device 2. For example, system memory 10 may act as a device memory for GPU 12 and may store data to be operated on by GPU 12 as well as data resulting from operations performed by GPU 12. For example, system memory 10 may store any combination of texture buffers, depth buffers, stencil buffers, vertex buffers, frame buffers, or the like. In addition, system memory 10 may store sets of commands, such as command streams, for processing by GPU 12. System memory 10 may include one or more volatile or non-volatile memories or storage devices, such as, for example, random access memory (RAM), static RAM (SRAM), dynamic RAM (DRAM), read-only memory (ROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), Flash memory, a magnetic data media or an optical storage media"; and [0111], "GPU 12 may, in response to GPU 12 failing to preempt execution of the first set of commands within an elapsed time period after issuing the first preemption command, dynamically issue a second preemption command at a second preemption granularity to the one or more processing units of the GPU, wherein the second preemption granularity is finer-grained than the first preemption granularity (108). For example, if the first preemption command was at a draw call level granularity, the second preemption command may be at a sub-draw call level granularity (e.g., primitive level preemption granularity or pixel tile level preemption granularity) or an instruction level granularity. Such an issuance of the second preemption command at the second preemption granularity may be transparent to the host device, such as CPU 6, such that CPU 6 does not receive any indications that GPU 12 has issued the second preemption command"); and
wherein the dynamically-formed SIMD group includes a set of threads determined to have the same condition result for a conditional control transfer instruction (See Clohset: Fig. 4, and [0106], "FIG. 4 depicts an architecture with a unified datapath for distributing information describing heterogeneous workloads. As an example, one type of workload can be composed of workloads that may largely consist of a temporal stream of data elements that are processed by a fixed sequence of pipeline stages; in some cases, portions of these pipeline stages can be programmable. In some cases, the temporal stream of data elements may be processed by a small kernel of program code. Another type of workload is one that has many decision points, or branches, conditional operators and the like, such that a group of data elements being processed may not all need to have the same operations performed on them (e.g., one data element may evaluate to take one branch of a conditional, and another data element may evaluate to take a different branch). Such workloads may benefit from sophisticated approaches to storage of data elements, and approaches to handling accesses to memories in order to obtain data (e.g., main memory accesses)").
Regarding claim 13, Acharya, Clohset, and Potter teach all the features with respect to claim 1 as outlined above. Further, Acharya, Clohset, and Potter teach that a non-transitory computer readable storage medium having stored thereon design information that specifies a design of at least a portion of a hardware integrated circuit in a format recognized by a semiconductor fabrication system that is configured to use the design information to produce the circuit according to the design (See Acharya: Fig. 1, and [0023], "FIG. 1 is a block diagram illustrating an example computing device 2 that may be used to implement the techniques of this disclosure. Computing device 2 may comprise a personal computer, a desktop computer, a laptop computer, a computer workstation, a video game platform or console, a wireless communication device (such as, e.g., a mobile telephone, a cellular telephone, a satellite telephone, and/or a mobile telephone handset), a landline telephone, an Internet telephone, a handheld device such as a portable video game device or a personal digital assistant (PDA), a personal music player, a video player, a display device, a television, a television set-top box, a server, an intermediate network device, a mainframe computer, an advanced driver assistance system (ADAS), a virtual reality headset, a drone, or any other type of device that processes and/or displays graphical data"), wherein the design information specifies that the circuit includes:
first and second graphics shader cores configured to execute instructions for multiple threadgroups, wherein the first graphics shader core is configured to execute (See Acharya: Fig. 1, and [0098], "In some examples, GPU 12 may preempt different parts of a graphics processing pipeline executing on GPU 12 at varied preemption granularities. To perform graphics operations, GPU 12 may implement a graphics processing pipeline. The graphics processing pipeline includes performing functions as defined by software or firmware executing on GPU 12 and performing functions by fixed-function units that are hardwired to perform very specific functions. The software or firmware executing on the GPU 12 may be referred to as shaders, and the shaders may execute on one or more shader cores of GPU 12. Shaders provide users with functional flexibility because a user can design the shaders to perform desired tasks in any conceivable manner. The fixed-function units, however, are hardwired for the manner in which the fixed-function units perform tasks. Accordingly, the fixed-function units may not provide much functional flexibility"):
a first threadgroup with multiple single-instruction multiple-data (SIMD) groups configured to execute a first shader program (See Acharya: Fig. 2, and [0088], "In one example, if processing units 34 comprise parallel processors (e.g., shader processors), command stream 36A may include, in its set of commands, data parallel code that may be executed by the parallel processors of processing units 34. For example, processing units 34 may execute the same set of commands (or sub-draw call level commands indicated by draw level commands) of command stream 36A to operate on multiple data values in parallel. For each instance of the set of commands, processing units 34 may spawn a thread, also referred to a kernel, for that instance of the set of commands, and may execute the thread on one of its parallel processors. The group of threads or kernels for a particular set of commands may be grouped into one or more workgroups, and processing units 34 may execute a workgroup of the kernels in parallel"); and
a second threadgroup with multiple SIMD groups configured to execute a second, different shader program (See Acharya: Fig. 1, and [0117], "GPU 12 may, in response to its one or more processing units reaching a point in the first set of commands indicated by the second preemption boundary, perform a context switch to the second set of commands by saving a state of the GPU associated with execution of the first set of commands and dispatch one or more of the second set of commands for execution on the one or more processing units of GPU 12"); 
memory circuitry (See Acharya: Fig. 1, and [0024], "As illustrated in the example of FIG. 1, computing device 2 includes a user input interface 4, a CPU 6, a memory controller 8, a system memory 10, a graphics processing unit (GPU) 12, a local memory 14, a display interface 16, a display 18 and bus 20. User input interface 4, CPU 6, memory controller 8, GPU 12 and display interface 16 may communicate with each other using bus 20. Bus 20 may be any of a variety of bus structures, such as a third generation bus (e.g., a HyperTransport bus or an InfiniBand bus), a second generation bus (e.g., an Advanced Graphics Port bus, a Peripheral Component Interconnect (PCI) Express bus, or an Advanced eXentisible Interface (AXI) bus) or another type of bus or device interconnect. It should be noted that the specific configuration of buses and communication interfaces between the different components shown in FIG. 1 is merely exemplary, and other configurations of computing devices and/or other graphics processing systems with the same or different components may be used to implement the techniques of this disclosure"); and
control circuitry, wherein the control circuitry is configured to: 
provide access to data stored in the memory circuitry according to a shader memory space that is accessible to threadgroups executed by the first graphics shader core, including the first and second threadgroups (See Acharya: Fig. 1, and [0027], "Memory controller 8 facilitates the transfer of data going into and out of system memory 10. For example, memory controller 8 may receive memory read and write commands, and service such commands with respect to memory 10 in order to provide memory services for the components in computing device 2. Memory controller 8 is communicatively coupled to system memory 10. Although memory controller 8 is illustrated in the example computing device 2 of FIG. 1 as being a processing module that is separate from both CPU 6 and system memory 10, in other examples, some or all of the functionality of memory controller 8 may be implemented on one or both of CPU 6 and system memory 10"), but is not accessible to threadgroups executed by the second graphics shader core (See Acharya: Fig. 1, and [0027], as quoted above; and Clohset: Figs. 1-3, and [0064], "Each thread in the system can define a plurality of fibres, each with a respective scheduling key. Further, a plurality of fibre routines can be defined, such that a given thread can define a number of fibres, which are instances of different fibre routines. As explained above, in an exemplary aspect, fibres of the same fibre routine (sharing a code base), and from the same parent (e.g., a parent thread or fibre), do not overlap in execution in implementations that avoid mutexes or locks for accessing at least one variable or data source that is available only to those fibres"); and
provide access to data stored in the memory circuitry according to multiple threadgroup spaces that are dedicated to a given threadgroup and not accessible to other threadgroups (See Potter: Figs. 1-2, and [0043], “Local memory 230, in some embodiments, is a physical memory implementation of an API-defined threadgroup memory space and is accessible to compute kernel and fragment processing tasks (e.g., from a compute data master scheduler and a pixel data master scheduler respectively) and may not be accessible for other tasks such as vertex processing. All work items within a thread group (a group of threads assigned to the same shader processing element(s) and scheduled to share a memory context) see the same allocation in local memory. In some embodiments, for fragment processing, local memory is tile-scoped such that all threads corresponding to a tile see the same local memory allocation. As discussed in further detail below, these threads may include fragment threads and mid-render compute threads. Note that threads may also be assigned to the same single-instruction multiple data (SIMD) group, which may include a portion of the threads in a thread group, where threads in the SIMD group execute the same instructions (other than instructions that are predicated off, in embodiments with predicated execution) and are scheduled to execute in parallel using parallel hardware. The number of threads assigned to SIMD groups may vary, in different embodiments”).
Regarding claim 14, Acharya, Clohset, and Potter teach all the features with respect to claim 13 as outlined above. Further, Clohset teaches that the non-transitory computer readable storage medium of claim 13, wherein a first cache in the first graphics shader core is a coherence point for the shader memory space (See Clohset: Fig. 9, and [0137], "The example of FIG. 9 also shows that each ALU 471-473 maintains a port to cache 480. Cache 480 stores thread local data as exemplified by thread local memory 485-487; cache 480 also can store cache global variables 488. Cache 480 also includes a plurality of fibre memory locations 490-492. The example of FIG. 9 also comprises a broadcast input queue 495. In the example of FIG. 7, each ALU 471-473 can use cache 480 in a manner similar to a register set such that SIMD cluster controller 455 schedule instructions for different threads and different fibres on an instruction by instruction basis without incurring latency") and a second, higher-level cache shared by the second graphics shader core is a coherence point for device memory space (See Clohset: Fig. 2, and [0099], "Exemplary system 10 also may comprise a cache hierarchy 15 that includes one or more levels of cache memory, and a system memory interface 16 that can interface with a main memory, which can be implemented as one or more of high speed graphics RAM, DRAM, and the like. Approaches to large scale memory capacity may be adapted as new technologies are developed, and usage of well-known acronyms, such as DRAM, is not intended to confine the applicability of disclosed aspects to a given process or memory technology").
Regarding claim 15, Acharya, Clohset, and Potter teach all the features with respect to claim 13 as outlined above. Further, Acharya and Clohset teach that the non-transitory computer readable storage medium of claim 13, wherein the control circuitry is further configured to provide access to data stored in the memory circuitry according to the following memory spaces:
a threadgroup memory space for the first threadgroup that is accessible to the first threadgroup but not accessible to any other threadgroups (See Clohset: Fig. 3, and [0100], "FIG. 3 depicts another exemplary system 202 in which disclosed aspects can be practiced. System 202 comprises a packet unit 105, which includes an empty stack 108, a local storage allocator 208, a ready stack 210, a collection definition memory 107, and a packer 109. Packet unit 105 can communicate with coarse scheduler 222, which can include a thread memory status module 220 and a thread scheduler 221. In some aspects, threads can be allocated execution resources by thread scheduler 221, and status module 220 can track memory usage by such threads. Such memory status can be used in scheduling instances of fibre routines. Packet unit 105 collects groupings of fibres to be distributed among the plurality of compute clusters, which will perform work specified by the fibres, as described below. Coarse scheduler 222 tracks usage of computation resources in the plurality of computation clusters, such as memory allocation and usage. In some implementations, an allocation of a portion of a local memory in a particular computation cluster is static and assigned when setting up the thread on that computation cluster. Coarse scheduler 222 also can allocate fibres for execution in the clusters");
a thread memory space that is accessible to a single thread (See Clohset: Fig. 3, and [0100], "In some aspects, threads can be allocated execution resources by thread scheduler 221, and status module 220 can track memory usage by such threads. Such memory status can be used in scheduling instances of fibre routines. Packet unit 105 collects groupings of fibres to be distributed among the plurality of compute clusters, which will perform work specified by the fibres, as described below. Coarse scheduler 222 tracks usage of computation resources in the plurality of computation clusters, such as memory allocation and usage"); and 
a device memory space that is accessible to threadgroups executed by both the first and second graphics shader cores (See Acharya: Fig. 1, and [0028], "System memory 10 may store program modules and/or instructions that are accessible for execution by CPU 6 and/or data for use by the programs executing on CPU 6. For example, system memory 10 may store user applications and graphics data associated with the applications. System memory 10 may additionally store information for use by and/or generated by other components of computing device 2. For example, system memory 10 may act as a device memory for GPU 12 and may store data to be operated on by GPU 12 as well as data resulting from operations performed by GPU 12. For example, system memory 10 may store any combination of texture buffers, depth buffers, stencil buffers, vertex buffers, frame buffers, or the like. In addition, system memory 10 may store sets of commands, such as command streams, for processing by GPU 12. System memory 10 may include one or more volatile or non-volatile memories or storage devices, such as, for example, random access memory (RAM), static RAM (SRAM), dynamic RAM (DRAM), read-only memory (ROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), Flash memory, a magnetic data media or an optical storage media").
Regarding claim 16, Acharya, Clohset, and Potter teach all the features with respect to claim 13 as outlined above. Further, Acharya teaches that the non-transitory computer readable storage medium of claim 13, wherein the shader memory space is also accessible to one or more co-processors for the first graphics shader core (See Acharya: Fig. 1, and [0024], "As illustrated in the example of FIG. 1, computing device 2 includes a user input interface 4, a CPU 6, a memory controller 8, a system memory 10, a graphics processing unit (GPU) 12, a local memory 14, a display interface 16, a display 18 and bus 20. User input interface 4, CPU 6, memory controller 8, GPU 12 and display interface 16 may communicate with each other using bus 20. Bus 20 may be any of a variety of bus structures, such as a third generation bus (e.g., a HyperTransport bus or an InfiniBand bus), a second generation bus (e.g., an Advanced Graphics Port bus, a Peripheral Component Interconnect (PCI) Express bus, or an Advanced eXentisible Interface (AXI) bus) or another type of bus or device interconnect. It should be noted that the specific configuration of buses and communication interfaces between the different components shown in FIG. 1 is merely exemplary, and other configurations of computing devices and/or other graphics processing systems with the same or different components may be used to implement the techniques of this disclosure").
Regarding claim 17, Acharya, Clohset, and Potter teach all the features with respect to claim 16 as outlined above. Further, Clohset teaches that the non-transitory computer readable storage medium of claim 16, wherein the one or more co-processors include ray intersection circuitry, wherein the ray intersection circuitry is configured to:
based on an instruction of the first threadgroup, traverse a spatially organized data structure to determine one or more primitives against which a ray is to be tested for intersection (See Clohset: Fig. 5, and [0113], "FIG. 5 is only exemplary in that a scheduler process can provide a variety of collection points, within shader modules, based on calls to provided modules, based on access to defined regions of memory that have been loaded with object data for certain objects or object portions, and so on. In one aspect, ray intersection processing can be collected at a start of diffuse lighting calculations, such that diffuse lighting calculations can proceed for a number of rays that intersected portions of the same object, and in other examples, rays that intersected portions of the same or different object, and whose shaders use a diffuse lighting call can be collected"); and
initiate the second threadgroup to test the one or more primitives against the ray (See Clohset: Fig. 25, and [0189], "FIG. 25 depicts an example process by which local storage allocations can be determined. According to the example, a local storage allocation process may be initiated responsive to receiving 1152 a request for a new computation instance of a code module (or equivalent mechanism to identify a configuration or some other programmatic configuration to be executed, which will be referred to as a code module, for simplicity). In most cases, a data set or a portion thereof, or an initial portion of a data set will be specified as well. At 1154, a determination whether the code module instance has been profiled is made. If so, then at 1156 characteristics of the code module are identified, and based on these characteristics and the request, the new instance is prioritized at 1158. If the code module (or instance thereof) has not been profiled, then profiling (1155) can be performed. The code module can be parsed (1170) for flags or compiler or programmer provided allocation information and an entry in a profiled modules list or table can be created (1172)");
wherein both the first threadgroup and the second threadgroup operate on ray information stored in the shader memory space (See Clohset: Fig. 6, and [0122], "Although a buffered approach was described above, aspects of ray sorting and collection described herein do not require such buffering. For example, groupings of ray information for which intersections have been determined can be outputted immediately after intersection testing, without an intermediate buffering. For example, in some cases, intersection testing resources can concurrently test 32, 64 or more rays for intersection with selections of primitives that can be related to, or part of, the same scene object. Any rays found to intersect from that concurrently testing can be outputted as a group, without buffering, such as buffering to await more rays intersecting the same object. In other implementations, buffering can be used to aggregate hundreds or even thousands of rays for outputting to shading").
Regarding claim 18, Acharya, Clohset, and Potter teach all the features with respect to claim 13 as outlined above. Further, Acharya teaches that the non-transitory computer readable storage medium of claim 13, wherein the first graphics shader core is configured to execute load, store, and atomics instructions that target the shader memory space (See Acharya: Figs. 1-3, and [0048], "After execution of the preempting command stream has completed, command engine 32 may restore the saved state of the preempted command stream that was executing prior to the preemption notification. Restoring the saved state may involve, for example, reloading state variables into the one or more registers of GPU 12 and/or reloading a saved memory state into local GPU memory. In examples where the GPU state is saved to memory 10, command engine 32 may reload the saved state stored in memory 10 onto GPU 12 for further execution of the preempted command stream"; and [0060], "In the techniques described in this disclosure, GPU 12 may issue multiple preemption commands at different preemption granularities to interrupt processing units 34's execution of a first set of commands in response to GPU 12 receiving an indication that a second, higher-priority set of commands is ready for execution. GPU 12 may select an appropriate granularity at which to preempt execution of a lower priority set of commands in lieu of executing a higher priority set of commands such that the higher priority set of commands may meet an associated scheduling deadline but while minimizing overhead needed for preemption"). 
Regarding claim 19, Acharya, Clohset, and Potter teach all the features with respect to claim 13 as outlined above. Further, Acharya teaches that the non-transitory computer readable storage medium of claim 13, wherein the first graphics shader core is configured to execute a first SIMD group of the first threadgroup to use the shader memory space to store intermediate graphics work at thread granularity to be further processed by threads of a dynamically-formed SIMD group (See Acharya: Fig. 1, and [0028], "System memory 10 may store program modules and/or instructions that are accessible for execution by CPU 6 and/or data for use by the programs executing on CPU 6. For example, system memory 10 may store user applications and graphics data associated with the applications. System memory 10 may additionally store information for use by and/or generated by other components of computing device 2. For example, system memory 10 may act as a device memory for GPU 12 and may store data to be operated on by GPU 12 as well as data resulting from operations performed by GPU 12. For example, system memory 10 may store any combination of texture buffers, depth buffers, stencil buffers, vertex buffers, frame buffers, or the like. In addition, system memory 10 may store sets of commands, such as command streams, for processing by GPU 12. 
System memory 10 may include one or more volatile or non-volatile memories or storage devices, such as, for example, random access memory (RAM), static RAM (SRAM), dynamic RAM (DRAM), read-only memory (ROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), Flash memory, a magnetic data media or an optical storage media"; and [0111], "GPU 12 may, in response to GPU 12 failing to preempt execution of the first set of commands within an elapsed time period after issuing the first preemption command, dynamically issue a second preemption command at a second preemption granularity to the one or more processing units of the GPU, wherein the second preemption granularity is finer-grained than the first preemption granularity (108). For example, if the first preemption command was at a draw call level granularity, the second preemption command may be at a sub-draw call level granularity (e.g., primitive level preemption granularity or pixel tile level preemption granularity) or an instruction level granularity. Such an issuance of the second preemption command at the second preemption granularity may be transparent to the host device, such as CPU 6, such that CPU 6 does not receive any indications that GPU 12 has issued the second preemption command").
Regarding claim 20, Acharya, Clohset, and Potter teach all the features with respect to claim 19 as outlined above. Further, Clohset teaches that the non-transitory computer readable storage medium of claim 19, wherein the dynamically-formed SIMD group includes a set of threads determined to have the same condition result for a conditional control transfer instruction (See Clohset: Fig. 4, and [0106], "FIG. 4 depicts an architecture with a unified datapath for distributing information describing heterogeneous workloads. As an example, one type of workload can be composed of workloads that may largely consist of a temporal stream of data elements that are processed by a fixed sequence of pipeline stages; in some cases, portions of these pipeline stages can be programmable. In some cases, the temporal stream of data elements may be processed by a small kernel of program code. Another type of workload is one that has many decision points, or branches, conditional operators and the like, such that a group of data elements being processed may not all need to have the same operations performed on them (e.g., one data element may evaluate to take one branch of a conditional, and another data element may evaluate to take a different branch). Such workloads may benefit from sophisticated approaches to storage of data elements, and approaches to handling accesses to memories in order to obtain data (e.g., main memory accesses)").


Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 



Any inquiry concerning this communication or earlier communications from the examiner should be directed to GORDON G LIU whose telephone number is (571)270-0382. The examiner can normally be reached Monday - Friday 8:00-5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jennifer Mehmood, can be reached at 571-272-2976. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/GORDON G LIU/Primary Examiner, Art Unit 2612