Notice of Pre-AIA or AIA Status
1.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION
2. 	This Office Action is taken in response to Applicants’ Amendments and Remarks filed on 10/22/2021 regarding application 16/986,643 filed on 8/6/2020.  
 	Claims 1-20 are pending for consideration.

3.				Response to Amendments and Remarks 
	Applicants’ amendments and remarks have been fully and carefully considered, with the Examiner’s response set forth below.
(1) In view of the amendments and remarks, double patenting rejections have been withdrawn.  
	(2) In response to the amendments and remarks, an updated claim analysis has been made. Refer to the corresponding sections of the following Office Action for details.

4.					Examiner’s Note
(1) In the case of amending the Claimed invention, Applicant is respectfully requested to indicate the portion(s) of the specification which dictate(s) the structure relied on for proper interpretation and also to verify and ascertain the metes and bounds of the claimed invention. This will assist in expediting compact prosecution.  MPEP Amendments not pointing to specific support in the disclosure may be deemed as not complying with provisions of 37 C.F.R.  1.131(b), (c), (d), and (h) and therefore held not fully responsive.  Generic statements such as “Applicants believe no new matter has been introduced” may be deemed insufficient.
(2) Examiner has cited particular column/paragraph and line numbers in the references applied to the claims for the convenience of the applicant. Although the specified citations are representative of the teachings of the art and are applied to specific limitations within the individual claim, other passages and figures may apply as well. Applicant is respectfully requested, in preparing responses, to fully consider the references in their entirety as potentially teaching all or part of the claimed invention, as well as the context of the passage as taught by the prior art or disclosed by the Examiner.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

5.	Claims 1-2, 8-9, and 15-16 are rejected under 35 U.S.C. 103 as being unpatentable over Mehrara et al. (US Patent Application Publication 2014/0164745, hereinafter Mehrara), in view of Lee et al. (US Patent Application Publication 2013/0238877, hereinafter Lee), and further in view of Rychlik (US Patent 7,962,731).
	As to claim 1, Mehrara teaches A method comprising: 
executing a first work-group on a processor [as shown in figure 2, PPU (202), and a plurality of GPCs (208); the Streaming Multiprocessors (SM) as shown in figure 3, 310; The series of instructions transmitted to a particular GPC 208 constitutes a thread, as previously defined herein, and the collection of a certain number of concurrently executing threads across the parallel processing engines (not shown) within an SM 310 is referred to herein as a "warp" or "thread group." As used herein, a "thread group" refers to a group of threads concurrently executing the same program on different input data, with one thread of the group being assigned to a different processing engine within an SM 310 … (¶ 0048-0049); Lee also teaches a processor – as shown in figure 1], wherein the first work-group comprises a plurality of work-items that are executed in parallel to perform a defined function [The series of instructions transmitted to a particular GPC 208 constitutes a thread, as previously defined herein, and the collection of a certain number of concurrently executing threads across the parallel processing engines (not shown) within an SM 310 is referred to herein as a "warp" or "thread group." As used herein, a "thread group" refers to a group of threads concurrently executing the same program on different input data, with one thread of the group being assigned to a different processing engine within an SM 310 … (¶ 0048-0049)], wherein the processor comprises internal register files [local register file, figure 3, 304; Lee also teaches this limitation – scalar register file, figure 1, 160; vector register file, figure 1, 130], wherein the processor uses a cache hierarchy including a lowest level cache and at least one other cache [L1 cache, figure 3, 320; L1.5 cache, figure 3, 208; … Each SM 310 includes an instruction L1 cache 370 that is configured to receive instructions and constants from memory via an L1.5 cache (not shown) within the GPC 208 … (¶ 0047)], determining a space requirement to store a local memory content for the first work-group [local memory as shown in figure 2, 204, the PP memory; also shown in figure 3, 306, the shared memory; Lee more expressly teaches determining a space requirement to store a local memory content – first memory, figure 1, 110; An interrupt may be detected by the core 140 sensing an interrupt occurrence event. The core 140 may determine whether or not the first memory 110 can store the vector register file data that is currently being executed by comparison between the amount of the vector register file data and the available capacity of the first memory 110. If it is determined that the first memory 110 can store the vector register file data therein and accordingly the first instruction is generated, the core 140 reads the vector register file data currently being executed from the vector register file 130 and stores the read vector register file data in the first memory 110 … (¶ 0044-0045)], wherein the local memory content for the first work-group is only accessible by the plurality of work-items included in the first work-group [as shown in figure 2, where each local PP memory (204(0)-204(U-1)) is designated to a corresponding PPU (202(0)-202(U-1)) and is accessed by the associated PPU exclusively; Local register file 304 is used by each thread as scratch space; each register is allocated for the exclusive use of one thread, and data in any of local register file 304 is accessible only to the thread to which the register is allocated … (¶ 0057)];
determining whether an available space in the internal register files is sufficient to store the local memory content for the first work-group, and in response to a determination that the available space in the internal register files is sufficient to store the local memory content for the first work-group, storing the local memory content in the available space in the internal register files [this limitation is taught by Rychlik – Physical Register File (PRF), figure 3, 28; figure 4, step 42, “sufficient unallocated registers in PRF?” Yes, then step 50; A method of operating a stacked register file architecture according to one embodiment is depicted in flow diagram form in FIG. 4. Initially, a stacked register controller, which may comprise the Register Save Engine (RSE) 30, receives a request to allocate one or more registers in the Physical Register File (PRF) 28 for exclusive use by the procedure to write and read data, such as operands for or results of arithmetic or logical instructions (block 40). The RSE 30 determines whether there are sufficient unallocated registers remaining in the PRF 28 (block 42). Initially, there are, and the requested number of PRF 28 registers are allocated to the new procedure (block 50). This process may repeat several times, as each procedure calls a successive procedure (block 40) … (c7 L5-31); A method of managing a register file system … and determining whether the register file includes sufficient unallocated registers to accommodate the request … (claim 1)].
	Regarding claim 1, Mehrara teaches a local memory [local memory as shown in figure 2, 204, the PP memory; also shown in figure 3, 306, the shared memory], but does not expressly teach determining a space requirement to store a local memory content.
	However, it is a well-known and commonly used practice in the art to determine a space requirement to store a local memory content, to ensure that there is sufficient space available to accommodate new data without overwriting the existing data.
	For example, Lee specifically teaches determining a space requirement to store a local memory content [first memory, figure 1, 110; An interrupt may be detected by the core 140 sensing an interrupt occurrence event. The core 140 may determine whether or not the first memory 110 can store the vector register file data that is currently being executed by comparison between the amount of the vector register file data and the available capacity of the first memory 110. If it is determined that the first memory 110 can store the vector register file data therein and accordingly the first instruction is generated, the core 140 reads the vector register file data currently being executed from the vector register file 130 and stores the read vector register file data in the first memory 110 … (¶ 0044-0045)].
	Therefore, it would have been obvious for one of ordinary skill in the art prior to Applicant’s invention to determine a space requirement to store a local memory content, as demonstrated by Lee, and to incorporate it into the existing scheme disclosed by Mehrara, to ensure that there is sufficient space available to accommodate new data without overwriting the existing data.
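For illustration only, the capacity comparison described in the Lee passage quoted above (amount of data to be saved versus available capacity of the memory) may be sketched as follows. This is a minimal sketch and not Lee's implementation; the function names, the list-based memory model, and the unit of size (one list element) are assumptions introduced here.

```python
# Hypothetical sketch: names and the list-based memory model are
# illustrative assumptions, not taken from Lee or Mehrara.

def fits_in_memory(data_size: int, capacity: int, used: int) -> bool:
    """Compare the amount of data to save against the free capacity."""
    return data_size <= capacity - used


def save_if_room(register_data: list, memory: list, capacity: int) -> bool:
    """Copy register data into memory only when the capacity check passes."""
    if not fits_in_memory(len(register_data), capacity, len(memory)):
        return False  # not enough room; existing contents are left untouched
    memory.extend(register_data)
    return True
```

In this sketch the existing contents of the memory are never overwritten: the data is stored only after the comparison succeeds.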
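Further regarding claim 1, Mehrara in view of Lee does not teach determining whether an available space in the internal register files is sufficient to store the local memory content for the first work-group, and in response to a determination that the available space in the internal register files is sufficient to store the local memory content for the first work-group, storing the local memory content in the available space in the internal register files.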
However, Rychlik specifically teaches the cited limitations [Physical Register File (PRF), figure 3, 28; figure 4, step 42, “sufficient unallocated registers in PRF?” Yes, then step 50; A method of operating a stacked register file architecture according to one embodiment is depicted in flow diagram form in FIG. 4. Initially, a stacked register controller, which may comprise the Register Save Engine (RSE) 30, receives a request to allocate one or more registers in the Physical Register File (PRF) 28 for exclusive use by the procedure to write and read data, such as operands for or results of arithmetic or logical instructions (block 40). The RSE 30 determines whether there are sufficient unallocated registers remaining in the PRF 28 (block 42). Initially, there are, and the requested number of PRF 28 registers are allocated to the new procedure (block 50). This process may repeat several times, as each procedure calls a successive procedure (block 40) … (c7 L5-31); A method of managing a register file system … and determining whether the register file includes sufficient unallocated registers to accommodate the request … (claim 1)].
Therefore, it would have been obvious for one of ordinary skill in the art prior to Applicant’s invention to determine whether an available space in the internal register files is sufficient to store the local memory content for the first work-group, and in response to a determination that the available space in the internal register files is sufficient to store the local memory content for the first work-group, storing the local memory content in the available space in the internal register files, as demonstrated by Rychlik, and to incorporate it into the existing scheme disclosed by Mehrara in view of Lee.
	As to claim 2, Mehrara in view of Lee & Rychlik teaches The method of claim 1 including: in response to a determination that the available space in the internal register files is not sufficient to store the local memory content for the first work-group [Rychlik -- Physical Register File (PRF), figure 3, 28; figure 4, step 42, “sufficient unallocated registers in PRF?” Yes, then step 50; A method of operating a stacked register file architecture according to one embodiment is depicted in flow diagram form in FIG. 4. Initially, a stacked register controller, which may comprise the Register Save Engine (RSE) 30, receives a request to allocate one or more registers in the Physical Register File (PRF) 28 for exclusive use by the procedure to write and read data, such as operands for or results of arithmetic or logical instructions (block 40). The RSE 30 determines whether there are sufficient unallocated registers remaining in the PRF 28 (block 42). Initially, there are, and the requested number of PRF 28 registers are allocated to the new procedure (block 50). This process may repeat several times, as each procedure calls a successive procedure (block 40) … (c7 L5-67); A method of managing a register file system … and determining whether the register file includes sufficient unallocated registers to accommodate the request … (claim 1)]: storing a first portion of the local memory content for the first work-group in the available space in the internal register files, and storing a second portion of the local memory content for the first work-group in the lowest level cache [Mehrara -- … Due to the low access time, and low access energy of the dedicated local register files 404, values which are used frequently are advantageously stored in the dedicated local register files 404.  Additional values that are used, though infrequently, may be stored in the master register file 406.  Values that are used even less frequently may be stored in memory that is external to the multi-level register file hierarchy 400 such as L1 cache 320 (¶ 0067)].
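For illustration only, the placement decision recited in claims 1 and 2 (store the local memory content in the register files when the available space is sufficient; otherwise store a first portion in the register files and a second portion in the lowest level cache) may be sketched as follows. This is a minimal sketch under stated assumptions and does not reproduce any cited reference; the function name, the list-based content model, and the count of free registers as the measure of available space are hypothetical.

```python
# Hypothetical sketch: the function name and the list-based model are
# illustrative assumptions, not taken from Mehrara, Lee, or Rychlik.

def place_local_memory(content: list, free_registers: int) -> tuple:
    """Return (register_portion, cache_portion) for a work-group's content."""
    if len(content) <= free_registers:
        # Sufficient space: everything is kept in the register files.
        return list(content), []
    # Insufficient space: the first portion fills the free registers and
    # the remaining second portion goes to the lowest level cache.
    return list(content[:free_registers]), list(content[free_registers:])
```

With four free registers, a three-element content is placed entirely in the register files; with two free registers, the first two elements are placed in the register files and the third in the cache.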
	As to claim 8, it recites substantially the same limitations as claim 1, and is rejected for the same reasons as claim 1. Refer to “As to claim 1” presented earlier in this Office Action for details.
	As to claim 9, it recites substantially the same limitations as claim 2, and is rejected for the same reasons as claim 2. Refer to “As to claim 2” presented earlier in this Office Action for details.
	As to claim 15, it recites substantially the same limitations as claim 1, and is rejected for the same reasons as claim 1. Refer to “As to claim 1” presented earlier in this Office Action for details.
	As to claim 16, it recites substantially the same limitations as claim 2, and is rejected for the same reasons as claim 2. Refer to “As to claim 2” presented earlier in this Office Action for details.
6.	Claims 3, 10, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Mehrara in view of Lee & Rychlik, and further in view of Doerr et al. (US Patent Application Publication 2014/0351551, hereinafter Doerr).
	As to claim 3, Mehrara in view of Lee & Rychlik teaches a work group [Mehrara -- The series of instructions transmitted to a particular GPC 208 constitutes a thread, as previously defined herein, and the collection of a certain number of concurrently executing threads across the parallel processing engines (not shown) within an SM 310 is referred to herein as a "warp" or "thread group." As used herein, a "thread group" refers to a group of threads concurrently executing the same program on different input data, with one thread of the group being assigned to a different processing engine within an SM 310 … (¶ 0048-0049)], but does not teach that the work group is an OpenCL work group.
	However, OpenCL work groups are well known and commonly used in the art.
	For example, Doerr specifically teaches an OpenCL work group [The Open Computing language (OpenCL) is a framework for writing programs with the objective to enable execution across heterogeneous platforms comprising central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs) and other processors. It is designed to support close-to-hardware interface with limited abstraction … The language is extended to support parallelism with vector types and operations, synchronization, and functions to work with work-items/groups. An application programming interface (API) is used to define and then control the platform. OpenCL, at a course-level, supports parallel computing using task-based and data-based parallelism (¶ 0022-0023)].
	Therefore, it would have been obvious for one of ordinary skill in the art prior to Applicant’s invention to use an OpenCL work group, as demonstrated by Doerr, and to incorporate it into the existing scheme disclosed by Mehrara in view of Lee & Rychlik, because Doerr teaches doing so would support parallel operations [… The language is extended to support parallelism with vector types and operations, synchronization, and functions to work with work-items/groups. An application programming interface (API) is used to define and then control the platform. OpenCL, at a course-level, supports parallel computing using task-based and data-based parallelism (¶ 0022-0023)].
	As to claim 10, it recites substantially the same limitations as claim 3, and is rejected for the same reasons as claim 3. Refer to “As to claim 3” presented earlier in this Office Action for details.
	As to claim 17, it recites substantially the same limitations as claim 3, and is rejected for the same reasons as claim 3. Refer to “As to claim 3” presented earlier in this Office Action for details.
7.	Claims 4, 6-7, 11, 13-14, 18, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Mehrara in view of Lee & Rychlik, and further in view of Vaidya et al. (US Patent Application Publication 2014/0181477, hereinafter Vaidya).
	As to claim 4, Mehrara in view of Lee & Rychlik does not teach detecting a non-aligned write as recited in the claim.
	However, Vaidya teaches the cited limitations. Specifically, Vaidya teaches detecting a non-aligned write from a first register location in a source register to a second register location in a destination register, wherein the first register location is not aligned with the second register location [as shown in figure 4, where register 50a is the first/source register and register 50b is the second/destination register; SCC in accordance with an embodiment of the present invention is shown in FIG. 4.  In an embodiment, rearranging channel positions is done through operand swizzling (permutation) hardware prior to being dispatched to the execution pipeline.  In turn, destination operand positions are correspondingly unswizzled prior to writeback to the register file or other portion of a memory hierarchy … (¶ 0034-0036)]; in response to the detected non-aligned write, generating a plurality of instructions based on permutation group theory; and transforming the non-aligned write using the generated plurality of instructions [as shown in figure 4, where a plurality of SIMD4 instructions are generated in a pipeline fashion during the following cycles; SCC in accordance with an embodiment of the present invention is shown in FIG. 4.  In an embodiment, rearranging channel positions is done through operand swizzling (permutation) hardware prior to being dispatched to the execution pipeline.  In turn, destination operand positions are correspondingly unswizzled prior to writeback to the register file or other portion of a memory hierarchy … (¶ 0034-0036)].
	Therefore, it would have been obvious for one of ordinary skill in the art prior to Applicant’s invention to detect and handle a non-aligned write as recited in claim 4, as demonstrated by Vaidya, and to incorporate it into the existing scheme disclosed by Mehrara in view of Lee & Rychlik, because Vaidya teaches doing so allows taking advantage of cycle compression [Note that some divergence patterns do not favor BCC. In particular, when disabled channels in an instruction are not contiguous, or are contiguous but not favorably aligned to the SIMD pipeline width, BCC cannot be used to take advantage of cycle compression opportunities … (¶ 0031-0036)].
	As to claim 6, Mehrara in view of Lee & Rychlik & Vaidya teaches The method of claim 4 wherein generating the plurality of instructions comprises generating log (k-1) instructions to transform the source register, wherein k is a value calculated using permutation group theory [Vaidya -- as shown in figure 4, where a plurality of SIMD4 instructions are generated in a pipeline fashion during the following cycles; SCC in accordance with an embodiment of the present invention is shown in FIG. 4.  In an embodiment, rearranging channel positions is done through operand swizzling (permutation) hardware prior to being dispatched to the execution pipeline.  In turn, destination operand positions are correspondingly unswizzled prior to writeback to the register file or other portion of a memory hierarchy … (¶ 0034-0036)].
	As to claim 7, Mehrara in view of Lee & Rychlik & Vaidya teaches The method of claim 6, wherein k is a value calculated using permutation group theory [Vaidya -- as shown in figure 4, where a plurality of SIMD4 instructions are generated in a pipeline fashion during the following cycles; SCC in accordance with an embodiment of the present invention is shown in FIG. 4.  In an embodiment, rearranging channel positions is done through operand swizzling (permutation) hardware prior to being dispatched to the execution pipeline.  In turn, destination operand positions are correspondingly unswizzled prior to writeback to the register file or other portion of a memory hierarchy … (¶ 0034-0036)].
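For illustration only, the swizzle/unswizzle relationship described in the Vaidya passage quoted above (operand channel positions are permuted before dispatch, and destination positions are correspondingly restored before writeback) may be sketched as follows. This is a minimal software sketch of a permutation and its inverse, not Vaidya's hardware; the function names and the index-list representation of the permutation are assumptions introduced here.

```python
# Hypothetical sketch: a permutation is modeled as a list of source
# indices; names are illustrative and not taken from Vaidya.

def swizzle(channels: list, perm: list) -> list:
    """Rearrange channels so that output[i] == channels[perm[i]]."""
    return [channels[p] for p in perm]


def unswizzle(channels: list, perm: list) -> list:
    """Apply the inverse permutation, restoring the original positions."""
    restored = [None] * len(channels)
    for i, p in enumerate(perm):
        restored[p] = channels[i]
    return restored
```

Applying `unswizzle` with the same index list to the output of `swizzle` returns the channels to their original positions.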
	As to claim 11, it recites substantially the same limitations as claim 4, and is rejected for the same reasons as claim 4. Refer to “As to claim 4” presented earlier in this Office Action for details.
	As to claim 13, it recites substantially the same limitations as claim 6, and is rejected for the same reasons as claim 6. Refer to “As to claim 6” presented earlier in this Office Action for details.

	As to claim 18, it recites substantially the same limitations as claim 4, and is rejected for the same reasons as claim 4. Refer to “As to claim 4” presented earlier in this Office Action for details.
	As to claim 20, it recites substantially the same limitations as claim 6, and is rejected for the same reasons as claim 6. Refer to “As to claim 6” presented earlier in this Office Action for details.
8.	Claims 5, 12, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Mehrara in view of Lee & Rychlik & Vaidya, and further in view of Doerr et al. (US Patent Application Publication 2014/0351551, hereinafter Doerr).
	As to claim 5, Mehrara in view of Lee & Rychlik & Vaidya teaches a work group [Mehrara -- The series of instructions transmitted to a particular GPC 208 constitutes a thread, as previously defined herein, and the collection of a certain number of concurrently executing threads across the parallel processing engines (not shown) within an SM 310 is referred to herein as a "warp" or "thread group." As used herein, a "thread group" refers to a group of threads concurrently executing the same program on different input data, with one thread of the group being assigned to a different processing engine within an SM 310 … (¶ 0048-0049)], but does not teach an OpenCL compiler.
	However, OpenCL compilers are well known and commonly used in the art.
	For example, Doerr specifically teaches an OpenCL compiler [The Open Computing language (OpenCL) is a framework for writing programs with the objective to enable execution across heterogeneous platforms comprising central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs) and other processors. It is designed to support close-to-hardware interface with limited abstraction … The language is extended to support parallelism with vector types and operations, synchronization, and functions to work with work-items/groups. An application programming interface (API) is used to define and then control the platform. OpenCL, at a course-level, supports parallel computing using task-based and data-based parallelism (¶ 0022-0023); The source code may be processed, e.g., by a compiler or other tool, including analyzing the information representing the multiple views specified or defined for the system. For example, in one embodiment, the compiler may be configured to recognize the information representing the multiple views in the application source code, and may extract and analyze the information. In other embodiments, the compiler may analyze the information in situ … (¶ 0045-0046)].
	Therefore, it would have been obvious for one of ordinary skill in the art prior to Applicant’s invention to use an OpenCL compiler, as demonstrated by Doerr, and to incorporate it into the existing scheme disclosed by Mehrara in view of Lee & Rychlik & Vaidya, because Doerr teaches doing so would support parallel operations [… The language is extended to support parallelism with vector types and operations, synchronization, and functions to work with work-items/groups. An application programming interface (API) is used to define and then control the platform. OpenCL, at a course-level, supports parallel computing using task-based and data-based parallelism (¶ 0022-0023)].
	As to claim 12, it recites substantially the same limitations as claim 5, and is rejected for the same reasons as claim 5. Refer to “As to claim 5” presented earlier in this Office Action for details.
	As to claim 19, it recites substantially the same limitations as claim 5, and is rejected for the same reasons as claim 5. Refer to “As to claim 5” presented earlier in this Office Action for details.

Conclusion
9.	Claims 1-20 are rejected as explained above. 
10. 	THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
11.	Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHENG JEN TSAI whose telephone number is 571-272-4244.  The examiner can normally be reached on Monday-Friday, 9-6.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Charles Rones can be reached on 571-272-4085. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).
/SHENG JEN TSAI/
Primary Examiner, Art Unit 2136
November 4, 2021