DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The information disclosure statement (IDS) submitted is considered by the examiner.

Claim Objections
Claims 38-39 are objected to because of the following informalities: in line 1 of claims 38-39, “The machine-readable medium of claim 35” should be changed to “The non-transitory machine-readable medium of claim 35”.  Appropriate correction is required.

Response to Arguments
Applicant’s arguments with respect to claims 21, 24-25, 27-28, 31-32, 34-35 and 38-39 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 21, 24-25, 27-28, 31-32, 34-35 and 38-39 are rejected under 35 U.S.C. 103 as being unpatentable over Duluk, Jr. et al. (US Publication Number 2013/0268942 A1, hereinafter “Duluk”) in view of Ashbaugh (US Publication Number 2015/0161758 A1).

(1) regarding claim 21:
As shown in fig. 1, Duluk disclosed an apparatus (para. [0023], note that FIG. 1 is a block diagram illustrating a computer system 100) comprising: 
one or more processors including a graphics processor to process data (para. [0024], note that the parallel processing subsystem 112 incorporates circuitry optimized for graphics and video processing, including, for example, video output circuitry, and constitutes a graphics processing unit (GPU)), the one or more processors including one or more units to process a plurality of shader threads (para. [0035], note that GPCs 208 can be programmed to execute processing tasks relating to a wide variety of applications, including but not limited to image rendering operations (e.g., tessellation shader, vertex shader, geometry shader, and/or pixel shader programs, i.e., shader threads));
para. [0035], note that PPUs 202 may transfer data from system memory 104 and/or local parallel processing memories 204 into internal (on-chip) memory, process the data); and 
a shared local memory (SLM) (para. [0086], note that each allocated shader local memory (511, 512, etc.) includes three partitions divided between all of the threads scheduled on the corresponding SM 310--local memory high 521, local memory low 522, and a call return stack 523), the SLM including a local shader data cache (para. [0084], note that FIG. 5A illustrates a virtual address space 510 allocated by device driver 103 as shader local memory, according to one example embodiment of the present disclosure. Shader local memory is a per-thread private data storage stored in PP memory 204 and directly addressable by LSUs 303 via L1 cache 320); 
wherein the one or more processors are to: 
detect and prepare the local shader data cache in the SLM (para. [0088], note that LSUs 303 may address the virtual memory addresses in shader local memory directly via L1 cache 320. In the event of a cache miss, a memory page will be fetched from PP memory 204 to the L1 cache 320 via memory interface 214), fetch shader data for one or more of the plurality of shader threads (para. [0089], note that each TMD 322 may require a vastly varying amount of resources depending on the particular programs (i.e., thread groups or warps) stored in the CTAs associated with the TMD 322. One TMD 322 may include a warp that requires very little shader local memory per thread), and provide the cached shader data from the local shader data cache in shader thread (para. [0088], note that device driver 103 allocates memory in a virtual address space 510 to each active SM 310. The physical memory locations that correspond to the virtual memory addresses are located in a buffer 540 within PP memory 204. LSUs 303 may address the virtual memory addresses in shader local memory directly via L1 cache 320). 
Duluk disclosed most of the subject matter as described above except for specifically teaching the shader data to be fetched at the local shader data cache, and cache the fetched shader data in the local shader data cache, and wherein the cached shader data is available for processing of each of the plurality of shader threads.
However, it would have been obvious for Duluk to teach the shader data to be fetched at the local shader data cache, and cache the fetched shader data in the local shader data cache (para. [0047], note that each SM 310 also has access to level two (L2) caches that are shared among all GPCs 208 and may be used to transfer data between threads… Additionally, a level one-point-five (L1.5) cache 335 may be included within the GPC 208, configured to receive and hold data fetched from memory via memory interface 214 requested by SM 310, including instructions, uniform data, and constant data, and provide the requested data to SM 310. Also see para. [0085], note that each SM 310 that is activated within PPU 202 is allocated a certain amount of shader local memory (511, 512, etc.)). 
At the time of filing for the invention, it would have been obvious to a person of ordinary skill in the art for Duluk to teach the shader data to be fetched at the local shader data cache, and cache the fetched shader data in the local shader data cache. The suggestion/motivation for doing so would have been in order to efficiently allocate 
In addition, Ashbaugh teaches wherein the cached shader data is available for processing of each of the plurality of shader threads (para. [0017], note that the system memory 12 may include main memory, global memory, etc., when the shared local memory 14 includes a relatively lower-latency memory physically nearer to one or more of the processors (e.g., an L1 cache). In addition, data such as, for example, graphics data (e.g., a tile of an image, etc.) may be transferred from the system memory 12 to the shared local memory 14 to provide relatively faster access to the data. For example, a problem may be partitioned into work to be performed in parallel by two or more execution elements (e.g., work items, threads, etc.), wherein the two or more execution elements may be grouped together into one or more element blocks 16, 18, . . . X (e.g., sub-group, warp, etc.)).
At the time of filing for the invention, it would have been obvious to a person of ordinary skill in the art for Ashbaugh to teach wherein the cached shader data is available for processing of each of the plurality of shader threads. The suggestion/motivation for doing so would have been in order to provide relatively faster access to processing the data (para. [0017]). Therefore, it would have been obvious to combine Duluk with Ashbaugh to obtain the invention as specified in claim 21.

(2) regarding claim 24: 
para. [0047], note that a level one-point-five (L1.5) cache 335 may be included within the GPC 208, configured to receive and hold data fetched from memory via memory interface 214 requested by SM 310, including instructions, uniform data, and constant data, and provide the requested data to SM 310).

(3) regarding claim 25:
Duluk further disclosed the apparatus of claim 21, wherein the shader threads include three-dimensional (3D) shader threads (para. [0052], note that the thread ID, which can be defined as a one-dimensional or multi-dimensional numerical value controls various aspects of the thread's processing behavior). 

(4) regarding claim 27: 
Duluk further disclosed the apparatus of claim 21, wherein the graphics processor is co-located with an application processor on a common semiconductor package (para. [0106], note that non-writable storage media (e.g., read-only memory devices within a computer such as CD-ROM disks readable by a CD-ROM drive, flash memory, ROM chips or any type of solid-state non-volatile semiconductor memory) on which information is permanently stored).

The proposed rejection of Duluk and Ashbaugh, as explained for apparatus claims 21, 24-25 and 27, renders obvious the steps of the method (fig. 6) of claims 28, 31-32 and 34 and the non-transitory machine-readable medium (para. [0106]) of claims 35 and 38-39 because these steps occur in the operation of the proposed rejection as discussed above. Thus, arguments similar to those presented above for claims 21, 24-25 and 27 are equally applicable to claims 28, 31-32, 34-35 and 38-39.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.

Bakhoda et al. (NPL, “Analyzing CUDA Workloads Using a Detailed GPU Simulator”, 2009) disclosed Modern Graphic Processing Units (GPUs) provide sufficiently flexible programming models that understanding their performance can provide insight in designing tomorrow’s manycore processors, whether those are GPUs or otherwise. The combination of multiple, multithreaded, SIMD cores makes studying these GPUs useful in understanding tradeoffs among memory, data, and thread level parallelism. While modern GPUs offer orders of magnitude more raw computing power than contemporary CPUs, many important applications, even those with abundant data level parallelism, do not achieve peak performance.

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Hilina K Demeter whose telephone number is (571) 270-1676. 
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Benny Tieu, can be reached at (571) 272-7490. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have 
/HILINA K DEMETER/
Primary Examiner, Art Unit 2674