Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
 				Examiner's Note. 
The references related to the application that are cited in the specification (para 10-12) have not been considered by the examiner. If applicant wants the references to be considered, applicant should file an Information Disclosure Statement listing all the references cited in the specification and provide copies of the Non-Patent Literature.
 				Specification Objection

The disclosure (para 12 of the specification) is objected to because it contains an embedded hyperlink and/or other form of browser-executable code. Applicant is required to delete the embedded hyperlink and/or other form of browser-executable code; references to websites should be limited to the top-level domain name without any prefix such as http:// or other browser-executable code. See MPEP § 608.01.

Claim Objections
Claim 23 is objected to because of the following informalities:  
  
It appears that the recitation “claim 23” in claim 23 is a typographical error for --claim 21--, and claim 23 will be treated as if depending on claim 21 for the following rejection.
Appropriate correction is required.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 15-18 are rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter.

Claims 15-18 are directed to “A GPU instruction set architecture comprising: an ARRIVE operation …, a WAIT operation.” The claims do not recite that the “instruction set,” the “ARRIVE operation,” or the “WAIT operation” is executed, or that any of them is stored anywhere. The claimed GPU instruction set architecture therefore appears to consist of software alone.

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

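(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.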


Claims 1-4, 6, 8, 11-17 and 19-23 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Howes et al. (US 2013/0117750, Howes hereinafter).

As to claim 1, Howes teaches a synchronization barrier (See FIGs. 5-6, para 50, “A first barrier wait instruction in each workitem, causes it to synchronize at synchronization point 501. Synchronizing at synchronization point 501 involves the workitems T1, T2, T3, and T4 waiting for the last workitem among them to arrive at 501, and then resuming execution concurrently or substantially concurrently” and “a barrier wait, a barrier arrive, a barrier skip, and a barrier reset” in para 63) comprising: 
 	a data structure stored in memory, the data structure comprising a counter (e.g., para 60-61,  “A counting semaphore is an exemplary mechanism by which a barrier with the above semantics can be implemented” , “initializing one or more memory locations and/or registers in dynamic memory and/or hardware”, “counts can be maintained in dynamic memory with the appropriate concurrency control mechanism in writing and reading to those memory locations”); 
 	the counter being modified by an operation performed by either an execution thread (e.g., “workitem x continues its execution of the instruction stream”) or a hardware operator (e.g., para 65, “the visit count is updated. According to an embodiment, the visit count is incremented by one to indicate that workitem x reached the barrier” and “After updating the visit count, at operation 628, workitem x continues its execution of the instruction 

As to claim 2, Howes teaches wherein the data structure stored in memory comprises a phase flag, an arrival counter (see  FIG. 5), and a further value used to reinitialize the arrival counter upon reset of the barrier (see FIG. 6, para 60-74, “A barrier release threshold ("release threshold") is the number of workitems the barrier is waiting on. According to an embodiment, barrier b is initialized with a release threshold that is equal to the number of workitems in the group that was started at operation 602. According to another embodiment, the barrier b is created (at operation 604) with a defined size regardless of the size of the group started at operation 602. Therefore, the release threshold is initialized to the defined size”, “a barrier wait, a barrier arrive, a barrier skip, and a barrier reset”).  
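For illustration only, the counting behavior described in the passages of Howes quoted above (a visit count that is updated on each arrival, compared against a release threshold, and reset when the barrier is released) can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of Howes or of the applicant: the class name, field names, and the one-bit phase flag are hypothetical, the skip count of Howes is omitted, and Python threads stand in for workitems.

```python
import threading


class CountingBarrier:
    """Illustrative counting barrier; all names here are hypothetical."""

    def __init__(self, release_threshold):
        self.release_threshold = release_threshold  # number of workitems awaited
        self.visit_count = 0                        # arrivals in the current phase
        self.phase = 0                              # flips each time the barrier releases
        self._cond = threading.Condition()

    def wait(self):
        with self._cond:
            phase_at_entry = self.phase
            self.visit_count += 1
            if self.visit_count >= self.release_threshold:
                # Last arrival: reset the count and release the blocked workitems.
                self.visit_count = 0
                self.phase ^= 1
                self._cond.notify_all()
            else:
                # Block until the phase changes, i.e. until the barrier is released.
                self._cond.wait_for(lambda: self.phase != phase_at_entry)


def run_demo(n=4):
    barrier = CountingBarrier(n)
    log = []
    log_lock = threading.Lock()

    def worker(i):
        with log_lock:
            log.append(("before", i))
        barrier.wait()
        with log_lock:
            log.append(("after", i))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log, barrier
```

In this sketch no worker records its "after" entry until every worker has recorded its "before" entry, which mirrors the quoted description of workitems waiting for the last workitem to arrive and then resuming.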

As to claim 3, Howes teaches where the hardware operator comprises hardware that performs copying (e.g., para 40, “a library function "loadFunction" in pseudocode that allows a selected workitem to copy data to a shared space with other workitems”, see FIG. 2B).  

As to claim 4, Howes teaches wherein the operation comprises an ARRIVE that is distinct from a WAIT and/or a WAIT that is distinct from an ARRIVE (e.g., para 63, “a barrier wait, a barrier arrive, a barrier skip, and a barrier reset”, see FIG. 5 and 6).  


As to claim 6, Howes teaches a computing system  (FIG. 7) comprising:

 	a memory access circuit (e.g., “System Memory 703”, FIG. 7)  that resets the counter and changes the phase indicator in response to either (a) execution of an instruction by a software thread (e.g., para 65, “the visit count is updated. According to an embodiment, the visit count is incremented by one to indicate that workitem x reached the barrier” and “After updating the visit count, at operation 628, workitem x continues its execution of the instruction stream. Subsequently, processing proceeds to operation 608 when the next synchronization instruction is encountered” in para 70) , and (b) when the counter indicates that all threads in a collection of threads and at least one copy operation have reached a synchronization point and all operations in said collection have completed.  

As to claim 8, Howes teaches wherein the instruction consists of an ARRIVE operation that does not include a WAIT operation or a WAIT operation that does not include an ARRIVE operation (See FIG. 6).  


As to claim 11, Howes teaches wherein the synchronization barrier primitive is stored in shared memory of a GPU (e.g., para 80, 86, “GPU global cache memory 710 can be coupled to a system memory such as system memory 703, and/or graphics memory such as graphics memory 707”, “Barrier synchronizer 709 includes logic to synchronize functions and processing logic on either or both GPU 702 and CPU 701. Barrier synchronizer 709 may be configured to synchronize workitems globally across groups of processors in a computer, in

As to claim 12, Howes teaches wherein the synchronization barrier primitive is stored in a memory hierarchy which determines access by threads to the primitive (e.g., para 86, “barrier synchronizer 709 can be a computer program written in C or OpenCL, that when compiled and executing resides in system memory 703. In source code form and/or compiled executable form, barrier synchronizer 709 can be stored in persistent memory 704. In one embodiment, some or all of the functionality of barrier synchronizer 709 is specified in a hardware description language such as Verilog, RTL, netlists, to enable ultimately configuring a manufacturing process through the generation of maskworks/photomasks to generate a hardware device embodying aspects of the invention described herein” and “more complex cache memory hierarchies” in para 77).  

As to claim 13, Howes teaches a comparator that compares the count of the counter with a predetermined value (e.g., “614”, FIG. 6) and resets the primitive  (e.g., “620”, FIG. 6) based on results of the comparison.  

As to claim 14, Howes teaches wherein the primitive's phase indicator is structured to be read first by an ARRIVE command (e.g., “608”, FIG. 6)  and then by a WAIT command (e.g., “610”, FIG. 6) , so that a thread can determine whether the primitive's phase indicator has changed phase state ( See FIG. 5).  


As to claim 15, Howes teaches a GPU instruction set architecture comprising:
 	an ARRIVE operation (e.g., “608”, FIG. 6. Also, see FIG. 8) that reads at least a phase indicator portion of a synchronization barrier primitive stored in memory and causes the barrier primitive to advance a counter (e.g., “612”, FIG. 6); and
 	a WAIT operation (e.g., “610”, FIG. 6)  that reads at least the phase indicator portion of the primitive stored in memory and compares the phase indicator  portion read by the ARRIVE operation with the phase indicator portion of the primitive read by the WAIT operation to determine whether the phase state of the barrier has changed ( e.g., “At operation 610, a decision is made whether the synchronization instruction is a barrier wait instruction, and if so, method 600 proceeds to operation 612”  “At operation 612, the visit count is updated. According to an embodiment, the visit count is incremented by one to indicate that workitem x reached the barrier”  “614, the sum of the updated visit count and the skip count is compared to the release threshold. If the sum is equal to or greater than the release threshold, then workitem x is the last workitem to arrive at the barrier, and the barrier is released at operation 618. Releasing the barrier, according to an embodiment, causes one or more count values to be reset and the blocked workitems to resume execution”).  
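For illustration only, the claim language addressed above (an ARRIVE that reads the phase indicator and advances the counter, and a separate WAIT that re-reads the phase indicator and compares it with the value read by the ARRIVE) can be sketched as follows. This is a minimal sketch under stated assumptions: every name is hypothetical, the one-bit phase and the countdown counter are illustrative choices, and nothing here is taken from Howes or from the application as filed.

```python
import threading


class SplitBarrier:
    """Illustrative split arrive/wait barrier; all names are hypothetical."""

    def __init__(self, expected):
        self.expected = expected   # value used to reinitialize the counter
        self.pending = expected    # arrivals still outstanding in this phase
        self.phase = 0             # one-bit phase indicator
        self._cond = threading.Condition()

    def arrive(self):
        """Read the phase, count one arrival, and return the phase that was read."""
        with self._cond:
            token = self.phase
            self.pending -= 1
            if self.pending == 0:
                # Last arrival: reinitialize the counter and advance the phase.
                self.pending = self.expected
                self.phase ^= 1
                self._cond.notify_all()
            return token

    def try_wait(self, token):
        """Non-blocking check: has the phase changed since the arrive?"""
        with self._cond:
            return self.phase != token

    def wait(self, token):
        """Block until the phase differs from the value read by the arrive."""
        with self._cond:
            self._cond.wait_for(lambda: self.phase != token)
```

A thread in this sketch may do unrelated work between `arrive()` and `wait(token)`, which is the practical difference between a split barrier and a single combined barrier call.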

As to claim 16, Howes teaches an ADD operation that adds to a field stored with the synchronization barrier primitive (see FIG. 8), the field being used to reinitialize the primitive upon reset to a next phase state (e.g., see FIG. 3, para 45, “The barrier reset instruction resets the barrier to its original configuration” and “causes one or more count values to be reset” in para 66. Also, see “812”, FIG. 8).

and a barrier reset module 810”,  in para 88, see FIG. 7 and 8).  


As to claim 19, Howes teaches a synchronization method (See FIGs.  7 and 8) comprising: 
 	storing in memory synchronization barrier indicia including a phase indicator and a counter (para 76, “store instructions and/or parameter values during the execution of an application on CPU cores 741 and 742, respectively.”, see FIGs. 5 and 6); 
 	executing an arrive instruction (e.g., “a barrier arrive”)  with at least one thread, thereby causing the counter to count and enabling the thread to read the phase indicator (e.g., para 86, “Barrier synchronizer 709 includes logic to synchronize functions and processing logic on either or both GPU 702 and CPU 701. Barrier synchronizer 709 may be configured to synchronize workitems globally across groups of processors in a computer, in each individual processor, and/or within each processing element of a processor” for “barrier wait, a barrier arrive, a barrier skip, and a barrier reset”, see FIG. 6) ; 
 	completing a task with a hardware controller, thereby causing the counter to count (e.g., para 86, “Barrier synchronizer 709 includes logic to synchronize functions and processing logic on either or both GPU 702 and CPU 701. Barrier synchronizer 709 may be configured 
resetting the counter (e.g., “a barrier reset”)  when the counter count indicates that a set of threads have executed arrive instructions and the hardware controller has completed tasks (e.g., para 63, “barrier wait, a barrier arrive, a barrier skip, and a barrier reset”, see FIG. 6) ; and 
 	executing a wait instruction (e.g., “a barrier wait instruction”) with the thread, thereby enabling the thread to again read the phase indicator (e.g., para 63-64, “barrier wait, a barrier arrive, a barrier skip, and a barrier reset.”, “At operation 610, a decision is made whether the synchronization instruction is a barrier wait instruction, and if so, method 600 proceeds to operation 612”), the thread conditioning blocking on whether the phase indicator has changed values (e.g., para 66, “the blocked workitems to resume execution”, see FIG. 6).


T1 proceeds without having to synchronize at the subsequent instances of the barrier”).  


As to claim 21, Howes teaches a synchronization barrier comprising:
 	a counter (e.g., “709”, FIG. 7)  providing a synchronization barrier count (see FIG. 6, para 60, “A counting semaphore is an exemplary mechanism by which a barrier with the above semantics” and “Barrier synchronizer 709 includes logic to synchronize functions and processing logic on either or both GPU 702 and CPU 701. Barrier synchronizer 709 may be configured to synchronize workitems globally across groups of processors in a computer, in each individual processor, and/or within each processing element of a processor” in para 86, see FIG. 7); and 
 	circuitry  (e.g.,  “GPU”, “CPU”, FIG. 7) operatively connected to the counter that modifies the synchronization barrier count in response to completion of operations performed by execution threads and hardware operators (e.g., see FIG. 7 and 8, para 89-91, “A barrier can be implemented using a semaphore (e.g., counting semaphore) and registers. Workitem The semaphore may be implemented in hardware or software. Workitems may be blocked when a barrier wait instruction is encountered”, “using a semaphore, and releasing the barrier may include releasing the semaphore. Workitems can be released when a barrier wait instruction is encountered and it turns out to be the last workitem to complete the requirements for number of workitems to reach the barrier”).  

As to claim 22, Howes teaches wherein the counter resides in memory (e.g., para 61, “Creation of the barrier b object in memory includes initializing one or more memory locations and/or registers in dynamic memory and/or hardware. For example, in relation to barrier b, several counts are required”).  

As to claim 23, see rejection of claim 11 above. 

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 5, 18 and 24 are rejected under 35 U.S.C. 103 as being unpatentable over Howes et al. (US 2013/0117750, Howes hereinafter) in view of Ramesh et al. (US 2010/0250809, Ramesh hereinafter).

As to claim 5, Howes further teaches wherein the data structure is structured to be reset (see FIGs. 5 and 6). However, Howes does not teach that the reset is in response to a fused load/store atomic that can be initiated by either a hardware engine or a software thread. Ramesh teaches wherein the data structure is structured to be reset in response to a fused load/store atomic that can be initiated by either a hardware engine or a software thread (e.g., para 38, “Counters may be updated based on atomic instructions. When contending tasks (or threads, processes) block, kernel calls may be made carrying sequence counts of the counters as part of argument data for the calls. The kernel may not need to access or modify a lock associated with the counters to support a synchronization primitive. Thus, lock content may remain at user level (or in user space) in an operating environment” and “a synchronization library 117 includes an atomic operation module 119 to update counters for a lock. An atomic operation may be a set of operations that can be combined together appeared as one single operation with only two possible outcomes as either a success or a failure. For example, the atomic operation module 119 may implement an atomic operation using CAS (Compare And Swap) instructions provided by a processor. Other instructions which can be used to implement lock-free or wait-free algorithms for atomic operations may be included” in para 42). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Howes by adopting the teachings of Ramesh to have “implementation of the synchronizer into the kernel, so the scheduler always knows which thread or process holds the interlock and can prevent preemption for the duration” (See Ramesh, para 5).


As to claim 18, Howes does not teach a fused load/store instruction that permits a hardware-based engine to reset the barrier primitive when the hardware-based engine completes a task assigned to it. However, Ramesh teaches a fused load/store instruction that permits a hardware-based engine to reset the barrier primitive when the hardware-based engine completes a task assigned to it (see rejection of claim 5 above). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Howes by adopting the teachings of Ramesh to have “implementation of the synchronizer into the kernel, so the scheduler always knows which thread or process holds the interlock and can prevent preemption for the duration” (See Ramesh, para 5).


As to claim 24, Howes does not teach wherein the circuitry is structured to reset the synchronization barrier count in response to a fused load/store atomic that can be initiated by each of copy hardware and software thread execution. However, Ramesh teaches wherein the circuitry is structured to reset the synchronization barrier count in response to a fused load/store atomic that can be initiated by each of copy hardware and software thread execution (e.g., para 38, “Counters may be updated based on atomic instructions. When contending tasks (or threads, processes) block, kernel calls may be made carrying sequence counts of the counters as part of argument data for the calls. The kernel may not need to access or modify a lock associated with the counters to support a synchronization primitive. Thus, lock content may remain at user level (or in user space) in an operating environment” and “a synchronization library 117 includes an atomic operation module 119 to update counters for a lock. An atomic operation may be a set of operations that can be combined together appeared as one single operation with only two possible outcomes as either a success or a failure. For example, the atomic operation module 119 may implement an atomic operation using CAS (Compare And Swap) instructions provided by a processor. Other instructions which can be used to implement lock-free or wait-free algorithms for atomic operations may be included” in para 42). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Howes by adopting the teachings of Ramesh  to have   “ implementation of the synchronizer into the kernel, so the scheduler always knows which thread or process holds the interlock and can prevent preemption for the duration” (See Ramesh, para 5).


Claims 7 and 9-10 are rejected under 35 U.S.C. 103 as being unpatentable over Howes et al. (US 2013/0117750, Howes hereinafter) in view of Rashid et al. (US 2017/0161100, Rashid hereinafter).

As to claim 7, Howes further teaches wherein the counter counts an aggregate of the number of execution thread arrive calls (e.g., see “624”, “626”, FIG. 6). However, Howes does not teach that the counter counts an aggregate of the number of copy operation completions. Rashid teaches wherein the counter counts an aggregate of the number of copy operation completions



As to claim 9, Howes further teaches wherein the primitive stored in memory further comprises a predetermined value, and hardware resets the counter by loading the predetermined value when the counter indicates that all threads in a collection of threads have reached a synchronization point (See FIGs. 5 and 6). However, Howes does not teach that copy operations have reached a synchronization point and all copy operations in said collection have completed. Rashid teaches that copy operations have reached a synchronization point and all copy operations in said collection have completed (e.g., “monitor one or more counters that track outstanding barriers and/or received barrier signals. The LCE 402 may repeat step 1014 until all outstanding barrier signals are received, and then proceed to step 1016. At step 1016, the LCE 402 performs a "copy complete" action. In one embodiment, the copy complete action is the execution of a flush command. In another embodiment, the copy complete action involves notifying one or more PCEs 412 that a blocking barrier may be discarded and that subsequent processing of copy operations and/or barriers may commence” in para 93). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Howes by adopting the teachings of Rashid to “perform load balancing of copy operations across the different links.” (See Rashid, para 14).

Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Howes et al. (US 2013/0117750, Howes hereinafter) in view of Rashid et al. (US 2017/0161100, Rashid hereinafter), as applied to claim 9 above, and further in view of Ramesh et al. (US 2010/0250809, Ramesh hereinafter).

As to claim 10, Howes and Rashid do not teach wherein the system is structured to allow a thread to dynamically change the predetermined value.  However, Ramesh teaches wherein the system is structured to allow a thread to dynamically change the predetermined value (e.g., para 38, “Counters may be updated based on atomic instructions. When contending tasks (or threads, processes) block, kernel calls may be made carrying sequence counts of the counters as part of argument data for the calls. The kernel may not need to access or modify a lock associated with the counters to support a synchronization primitive. Thus, lock content may remain at user level (or in user space) in an operating environment” and “a synchronization library 117 includes an atomic operation module 119 to update counters for a lock. An atomic operation may be a set of operations that can be combined together appeared as one single operation with only two possible outcomes as either a success or a failure. For example, the atomic operation module 119 may implement an atomic operation using CAS (Compare And Swap) instructions provided by a processor. Other instructions which can be used to implement lock-free or wait-free algorithms for atomic operations may be included” in para 42). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to further  modify the method of Howes and Rashid by adopting the teachings of Ramesh  to have   “ implementation of the synchronizer into the kernel, so the scheduler always knows which thread or process holds the interlock and can prevent preemption for the duration” (See Ramesh, para 5).




Claim 1 is rejected under 35 U.S.C. 103 as being unpatentable over Howes et al. (US 2013/0117750, Howes hereinafter) in view of Ramesh et al. (US 2010/0250809, Ramesh hereinafter).

As to claim 1, Howes teaches a synchronization barrier (See FIGs. 5-6, para 50, “A first barrier wait instruction in each workitem, causes it to synchronize at synchronization point 501. Synchronizing at synchronization point 501 involves the workitems T1, T2, T3, and T4 waiting for the last workitem among them to arrive at 501, and then resuming execution concurrently or substantially concurrently” and “a barrier wait, a barrier arrive, a barrier skip, and a barrier reset” in para 63) comprising: 
 	a data structure stored in memory, the data structure comprising a counter (e.g., para 60-61,  “A counting semaphore is an exemplary mechanism by which a barrier with the above semantics can be implemented” , “initializing one or more memory locations and/or registers in dynamic memory and/or hardware”, “counts can be maintained in dynamic memory with the appropriate concurrency control mechanism in writing and reading to those memory locations”). 
 	Howes further teaches the counter being modified by an operation performed by an execution thread.

 	Ramesh teaches the counter being modified by an operation performed by either an execution thread or a hardware operator (e.g., para 42-44, “a synchronization library 117 includes an atomic operation module 119 to update counters for a lock. An atomic operation may be a set of operations that can be combined together appeared as one single operation with only two possible outcomes as either a success or a failure. For example, the atomic operation module 119 may implement an atomic operation using CAS (Compare And Swap) instructions provided by a processor. Other instructions which can be used to implement lock-free or wait-free algorithms for atomic operations may be included”, “synchronizing a group of tasks may be coordinated based on a data structure or a synchronization identifier such as a lock”, see FIG. 1). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Howes by adopting the teachings of Ramesh  to have   “ implementation of the synchronizer into the kernel, so the scheduler always knows which thread or process holds the interlock and can prevent preemption for the duration” (See Ramesh, para 5).


Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over Howes et al. (US 2013/0117750, Howes hereinafter) in view of Rashid et al. (US 2017/0161100, Rashid hereinafter).



As to claim 6, Howes teaches a computing system (FIG. 7) comprising:
 	a synchronization barrier primitive stored in memory (e.g., “709”, FIG. 7), the primitive including a counter and a phase indicator (see FIGs. 5 and 6); and
 	a memory access circuit (e.g., “System Memory 703”, FIG. 7) that resets the counter and changes the phase indicator.
 	However, Howes does not teach resets the counter and changes the phase indicator in response to at least one copy operation have reached a synchronization point and all operations in said collection have completed. Rashid teaches resets the counter and changes the phase indicator in response to at least one copy operation have reached a synchronization point and all operations in said collection have completed (e.g., para 93, “monitor one or more counters that track outstanding barriers and/or received barrier signals. The LCE 402 may repeat step 1014 until all outstanding barrier signals are received, and then proceed to step 1016. At step 1016, the LCE 402 performs a "copy complete" action. In one embodiment, the copy complete action is the execution of a flush command. In another embodiment, the copy complete action involves notifying one or more PCEs 412 that a blocking barrier may be discarded and that subsequent processing of copy operations and/or barriers may commence”). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Howes by adopting the teachings of Rashid to    “ perform load balancing of copy operations across the different links.” (See Rashid, para 14).
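For illustration only, the combination discussed above (a single counter that is advanced both by thread arrivals and by the completion of a copy operation, with the counter reset and the phase changed once all of them have been counted) can be sketched as follows. This is a minimal sketch under stated assumptions: all names are hypothetical, a Python thread stands in for a hardware copy engine, and nothing here reproduces the implementation of Howes, Rashid, or the applicant.

```python
import threading


class MixedBarrier:
    """Illustrative barrier counting thread arrivals and copy completions."""

    def __init__(self, n_threads, n_copies):
        self.expected = n_threads + n_copies  # aggregate count to wait for
        self.pending = self.expected
        self.phase = 0
        self._cond = threading.Condition()

    def _count_down(self):
        with self._cond:
            token = self.phase
            self.pending -= 1
            if self.pending == 0:
                # Everything counted: reset the counter and change the phase.
                self.pending = self.expected
                self.phase ^= 1
                self._cond.notify_all()
            return token

    def thread_arrive(self):
        return self._count_down()

    def copy_complete(self):
        self._count_down()

    def wait(self, token):
        with self._cond:
            self._cond.wait_for(lambda: self.phase != token)


def copy_engine(src, dst, barrier):
    dst[:] = src             # stand-in for a hardware copy operation
    barrier.copy_complete()  # the completion itself advances the counter


def run_demo():
    src, dst = [1, 2, 3, 4], [0, 0, 0, 0]
    barrier = MixedBarrier(n_threads=2, n_copies=1)
    sums, lock = [], threading.Lock()

    def worker():
        token = barrier.thread_arrive()
        barrier.wait(token)  # released only after the copy has completed
        with lock:
            sums.append(sum(dst))

    threads = [threading.Thread(target=worker) for _ in range(2)]
    threads.append(threading.Thread(target=copy_engine, args=(src, dst, barrier)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sums, barrier
```

In this sketch neither worker reads the destination buffer until the copy has been counted, so both observe the fully copied data.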



Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: Nikos Anastopoulos et al., “Facilitating Efficient Synchronization of Asymmetric Threads on Hyper-Threaded Processors”.


Any inquiry concerning this communication or earlier communications from the examiner should be directed to ABDOU K SEYE whose telephone number is (571)270-1062. The examiner can normally be reached M-F 9-5:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Dennis Chow, can be reached at (571)272-7767. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) 





/ABDOU K SEYE/Examiner, Art Unit 2194                                                                                                                                                                                                        

/DOON Y CHOW/Supervisory Patent Examiner, Art Unit 2194