Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Objections
Claims 2, 7-9, 11, 15-18, 20, 23 and 25 are objected to because of the following informalities:
(i) In claim 2, line 2; claim 9, line 1; claim 15, line 8; and claim 17, line 4, --the-- needs to be inserted before “memory”.
(ii) As to claim 7, lines 1-2, the term “the number” lacks proper antecedent basis.
(iii) In claim 8, line 2, “an” before “changes” appears to be a typographical error for --and--.
(iv) As to claim 15, line 10, the terms “the phase state” and “the barrier” lack proper antecedent basis.
(v) As to claim 20, line 4, the term “the synchronization barrier” lacks proper antecedent basis.
(vi) As to claim 25, line 3, the term “the state” lacks proper antecedent basis.
(vii) An acronym first recited in the claims should be spelled out, e.g., the first occurrence of “GPU” in claims 11, 15 and 23 should be spelled out.
(viii) Dependent claims 16-18 are affected by the objection to claim 15 above.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.



Claims 19-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor, or for pre-AIA the applicant, regards as the invention.

The following claim language is unclear and indefinite:

As per claim 19, line 9, it is not clear whether “the task” refers to “a task” in line 3 or in line 6 of claim 19.
Dependent claim 20 is affected by the rejection of claim 19 above.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1-4, 6, 8, 11-17, 19-23 and 25 are rejected under 35 U.S.C. 103 as being unpatentable over Howes et al. (US 2013/0117750, Howes hereinafter) in view of Asaad et al. (US 2011/0219208, Asaad hereinafter).

As to claim 1, Howes teaches a synchronization barrier (See FIGs. 5-6, para 50, “A first barrier wait instruction in each workitem, causes it to synchronize at synchronization point 501. Synchronizing at synchronization point 501 involves the workitems T1, T2, T3, and T4 waiting for the last workitem among them to arrive at 501, and then resuming execution concurrently or substantially concurrently” and “a barrier wait, a barrier arrive, a barrier skip, and a barrier reset” in para 63) comprising: 
 	a data structure stored in memory, the data structure comprising a counter (e.g., para 60-61,  “A counting semaphore is an exemplary mechanism by which a barrier with the above semantics can be implemented” , “initializing one or more memory locations and/or registers in dynamic memory and/or hardware”, “counts can be maintained in dynamic memory with the appropriate concurrency control mechanism in writing and reading to those memory locations”); 
 	the counter being advanced (e.g., “count is incremented”) by a first operation performed by an execution thread (e.g., para 65, “the visit count is updated. According to an embodiment, the visit count is incremented by one to indicate that workitem x reached the barrier” and “After updating the visit count, at operation 628, workitem x continues its execution of the instruction stream. Subsequently, processing proceeds to operation 608 when the next synchronization instruction is encountered” in para 70).
However, Howes does not teach the counter being advanced by a second operation performed by a hardware operator that can advance the counter independently of the first operation performed by the execution thread.
Asaad teaches a counter being advanced (e.g., “write a "1" to the register, and when the operating system wishes to initiate a copy of the hardware performance values from memory it may write a "2" to the register”, “counters operable to collect counts of selected hardware-related activities”)  by a second operation (e.g., one of “hardware-related activities and events”, “a DMA controller operable to copy data”, “hardware to copy all of the memory in the specified range”)  performed by a hardware operator (e.g., “A performance counter unit 102 may be built into a microprocessor and includes a plurality of hardware performance counters 104, which are registers used to store the counts of hardware-related activities within a computer.”)  that can advance the counter independently ( e.g., para 1032-1040, “a device and method for copying performance counter data are provided. The device, in one aspect, may include at least one processor core, a memory, and a plurality of hardware performance counters operable to collect counts of selected hardware-related activities “,  “A direct memory access (DMA) mechanism allows software to specify a range of memory to be copied from and to, and hardware to copy all of the memory in the specified range”, “A performance counter unit 102 may be built into a microprocessor and includes a plurality of hardware performance counters 104, which are registers used to store the counts of hardware-related activities within a computer. Examples of activities of which the counters 104 may store counts may include, but are not limited to, cache misses, translation lookaside buffer (TLB) misses, the number of instructions completed, number of floating point instructions executed, processor cycles, input/output (I/O) requests, and other hardware-related activities and events. 
A memory device 108, which may be an L2 cache or other memory, stores various data related to the running of the computer system and its applications” and “ the operating system wishes to initiate a copy of the hardware performance information to memory it may write a "1" to the register, and when the operating system wishes to initiate a copy of the hardware performance values from memory it may write a "2" to the register”, “the operation may be performed synchronously by setting a third register. For example, R3, 108 can be set to "1" indicating that the hardware should not return control to the operating system after the write to R2 until the copying operation has completed” , “Referring to FIG. 3, a performance counter unit 102 may be built into a microprocessor, or in a multiprocessor system, and includes a plurality of hardware performance counters 118, which are registers used to store the counts of hardware-related activities within a computer” in  para 1079 and 1146) of  a  first operation performed by the execution thread ( e.g., para 88-89, “the list prefetch engine resets a value of a counter device which counts the number of mismatches between the valid cache miss address and addresses in list(s) in the ListRead array 115”, “he list prefetch engine 100 increments the value of the counter device” for “a large number (e.g., 1 million) of active threads run in the parallel computing system” in para 113. Also,  see FIG. 9,  para 408-415“The memory synchronization interface unit 904 includes a control unit 906 that collects and aggregates requests from one or more clients 901 (e.g., 4 thread memory synchronization controls of the L1P via decoder 902) and requests generation increments from the global generation counter unit 905 illustrated in FIG. 6 and receives current counts from that unit as well. The control unit 906 includes a respective set of registers 907 for each hardware thread. 
These registers may store information such as [0409] configuration for a current memory synchronization instruction issued by a core 52, [0410] when the currently operating memory synchronization instruction started, [0411] whether data has been sent to the central unit, and [0412] whether a generation change has been received” and  para 362-376, see FIG. 6). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Howes by adopting the teachings of Asaad  to provide    “ a new class of massively-parallel, distributed-memory scalable computer architectures for achieving 100 peta-OPS scale computing and beyond, at decreased cost, power and footprint.” (See Asaad, para 14).
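
For illustration only, the arrangement recited in claim 1 (a single counter advanced both by an execution thread and, independently, by a hardware operator) can be sketched as follows. This sketch is not drawn from Howes, Asaad, or the application as filed; the class name `SketchBarrierCounter` and the methods `thread_arrive` and `hardware_complete` are hypothetical, and an ordinary Python thread stands in for the hardware operator.

```python
import threading


class SketchBarrierCounter:
    """Hypothetical counter advanced by two independent kinds of operation."""

    def __init__(self):
        self._lock = threading.Lock()
        self.count = 0

    def thread_arrive(self):
        # First operation: performed by an execution thread.
        with self._lock:
            self.count += 1

    def hardware_complete(self):
        # Second operation: stands in for a hardware operator (for example,
        # a copy engine) reporting completion without any thread arriving.
        with self._lock:
            self.count += 1


counter = SketchBarrierCounter()
workers = [threading.Thread(target=counter.thread_arrive) for _ in range(3)]
copy_engine = threading.Thread(target=counter.hardware_complete)

for t in workers + [copy_engine]:
    t.start()
for t in workers + [copy_engine]:
    t.join()

print(counter.count)  # 4: three thread arrivals plus one hardware completion
```

The point of the sketch is only that neither kind of operation depends on the other to advance the shared count.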

As to claim 2, Howes teaches wherein the data structure stored in memory comprises a phase flag, an arrival counter (see  FIG. 5), and a further value used to reinitialize the arrival counter upon reset of the barrier (see FIG. 6, para 60-74, “A barrier release threshold ("release threshold") is the number of workitems the barrier is waiting on. According to an embodiment, barrier b is initialized with a release threshold that is equal to the number of workitems in the group that was started at operation 602. According to another embodiment, the barrier b is created (at operation 604) with a defined size regardless of the size of the group started at operation 602. Therefore, the release threshold is initialized to the defined size”, “a barrier wait, a barrier arrive, a barrier skip, and a barrier reset”).  

As to claim 3, Howes teaches wherein the hardware operator comprises hardware that performs copying (e.g., para 40, “a library function "loadFunction" in pseudocode that allows a selected workitem to copy data to a shared space with other workitems”, see FIG. 2B).

As to claim 4, Howes teaches wherein the first operation comprises an ARRIVE that is distinct from a WAIT and/or a WAIT that is distinct from an ARRIVE (e.g., para 63, “a barrier wait, a barrier arrive, a barrier skip, and a barrier reset”, see FIG. 5 and 6).  


As to claim 6, Howes teaches a computing system  (FIG. 7) comprising:
 	 a synchronization barrier primitive stored in memory (e.g., “709”, FIG.7) , the primitive including a counter and a phase indicator (see FIG. 5 and 6);  wherein the counter is associated with a collection of threads (e.g., see FIG. 6 “para 54-74, “ At synchronization point 505, workitems T2, T3 and T4 synchronize. Note, that although T4 had previously issued a barrier arrive instruction (at 502), this previous issuance only exempted T4 from the next occurring barrier (i.e., at synchronization point 503). After synchronization at synchronization point 505, T2, T3, and T4 proceed to synchronize again at synchronization point 506.”, “ A counting semaphore is an exemplary mechanism by which a barrier with the above semantics can be implemented”, “various counts required to initialize barrier b are determined and their initial values are set. The number of visiting workitems ("visit count") defines the number of workitems that have reached barrier b. A workitem has "reached" a barrier, when it issues a barrier wait instruction or equivalent.”)  and at least one copy (e.g., “LoadFunction”, “to copy data”)  operation (e.g., e.g., para 40, “a library function "loadFunction" in pseudocode that allows a selected workitem to copy data to a shared space with other workitems”, see FIG. 2B);
 	a memory access circuit (e.g., see FIG. 7)  that resets the counter   and changes the phase indicator (e.g., para 37, “a "barrier reset." is issued. For example, the barrier can be reconfigured such that the number of workitems required to reach the barrier can be reduced to account for the exiting workitem. “ and “The reset causes T1 to synchronize with T2, T3 and T4 at synchronization point 507. The synchronization at point 507 may achieved by implementing of the reset as a self-synchronizing instruction or by user-specified synchronization instructions associated with the reset.” In para 55, see FIG. 5) , wherein the counter is associated with the collection (e.g., “The number of visiting workitems”) of threads  (e.g., para 62, “various counts required to initialize barrier b are determined and their initial values are set. The number of visiting workitems ("visit count") defines the number of workitems that have reached barrier b. A workitem has "reached" a barrier, when it issues a barrier wait instruction or equivalent. The visit count may be initialized to 0. The number of workitems that have executed a barrier skip instruction on barrier b may be tracked in a skipped count ("skip count")” and “the visit count is updated. According to an embodiment, the visit count is incremented by one to indicate that workitem x reached the barrier” in para 65).
 	Howes teaches further the counter indicating that all threads in the collection of threads have reached a synchronization point (e.g., para 54,  “ synchronization point 505, workitems T2, T3 and T4 synchronize. Note, that although T4 had previously issued a barrier arrive instruction (at 502), this previous issuance only exempted T4 from the next occurring barrier (i.e., at synchronization point 503). After synchronization at synchronization point 505, T2, T3, and T4 proceed to synchronize again at synchronization point 506”).
However, Howes does not explicitly teach that the at least one copy operation has reached a synchronization point and that all operations in said collection of threads have completed.
Asaad teaches wherein  a counter is associated with a collection of threads (see  “ThreadAD1”, “ThreadAD2”, FIG. 9) and at least one copy operation (e.g.,  see FIG. 9, para 407-413“Memory Synchronization Interface Unit”,  requests generation increments from the global generation counter unit 905 illustrated in FIG. 6 and receives current counts from that unit as well. The control unit 906 includes a respective set of registers 907 for each hardware thread. These registers may store information “ , “The core issuing the msync drains all loads and stored” and  “copying of performance counters directly to memory”, “ the software may write a value in a register that automatically triggers the state machine (hardware) to automatically perform direct copying of the hardware performance counter data to memory without further software intervention. In one aspect, the specifying of copying the performance counter data directly to memory and the hardware automatically performing the copying may occur while an operating system thread is in context.” in para 1163), and a memory access circuit (e.g., see  FIG. 9 and FIGs. 6, “905”)  that resets the counter (e.g., “start at zero after reset”)  and changes the phase indicator in response to  the counter indicating that all threads in  the collection of threads  and the at least one copy operation have reached a synchronization point  (e.g., para  374-375, “memory synchronization requests from other cores and process them all at once by broadcasting the identical grant to all of them, causing them all to wait for the same generations to clear. For instance, all requests for generation change from the hardware threads can be OR'd together to create a single generation change request. “ , “The generation counter (gen_cnt) 601 and the reclaim pointer (rcl_ptr) 602 both start at zero after reset. When a unit requests to advance to a new generation, it indicates the desired generation. 
There is no request explicit acknowledge sent back to the requestor, the requestor unit determines at whether its request has been processed based on the global current generation 601, 602. As the requested generation can be at most the gen_cnt+1, requests for any other generation at are assumed to have already been completed”) and all operations in said collection of threads have completed (e.g., para 1086 and 1092, “When the copying is finished, the state machine 110 sets the context switch register to a value (e.g., "0") that indicates that the copying is completed. In another embodiment, the performance counters may generate an interrupt to signal the completion of copying. The interrupt may be used to notify the operating system that the copying has completed. In one embodiment, the hardware clears the context switch register 104. In another embodiment, the operating system resets the context switch register value 104 (e.g., "0") to indicate no copying”, “The operating system or the like may proceed in performing other operations while the hardware copies that data from the hardware performance control and data registers. At 208, after the hardware finishes copying, the hardware resets the value at register R1, for example, to "0" to indicate that the copying is done. At 208, prior to completing the context switch, the operating system or the like checks the value of register R2 to make sure it is "0" or another value, which indicates that the hardware has finished the copy” for “A global synchronization signal refers to a signal that can be used to notify a plurality of processors to synchronize, for example, to perform instructions, operations and others” in para 1660. Also, see FIG. 9, para 408-415, “The memory synchronization interface unit 904 includes a control unit 906 that collects and aggregates requests from one or more clients 901 (e.g., 4 thread memory synchronization controls of the L1P via decoder 902) and requests generation increments from the global generation counter unit 905 illustrated in FIG. 6 and receives current counts from that unit as well. The control unit 906 includes a respective set of registers 907 for each hardware thread. These registers may store information such as [0409] configuration for a current memory synchronization instruction issued by a core 52, [0410] when the currently operating memory synchronization instruction started, [0411] whether data has been sent to the central unit, and [0412] whether a generation change has been received” and FIG. 6, para 362-376). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Howes by adopting the teachings of Asaad to provide “a new class of massively-parallel, distributed-memory scalable computer architectures for achieving 100 peta-OPS scale computing and beyond, at decreased cost, power and footprint.” (See Asaad, para 14).

As to claim 8, Howes teaches further wherein the instruction consisting of an ARRIVE operation that does not include a WAIT operation or a WAIT operation that does not include an ARRIVE operation (See FIG. 6).  However, Howes does not teach wherein the memory access circuit resets the counter and changes the phase indicator in response to executing of an instruction by a software thread . Asaad teaches wherein the memory access circuit resets the counter and changes the phase indicator in response to executing of an instruction by a software thread (e.g., para 1092, “The operating system or the like may proceed in performing other operations while the hardware copies that data from the hardware performance control and data registers. At 208, after the hardware finishes copying, the hardware resets the value at register R1, for example, to "0" to indicate that the copying is done. At 208, prior to completing the context switch, the operating system or the like checks the value of register R2 to make sure it is "0" or another value, which indicates that the hardware has finished the copy” for “A global synchronization signal refers to a signal that can be used to notify a plurality of processors to synchronize, for example, to perform instructions, operations and others” in para 1660. Also, see para 374-375, FIGs. 6-7 ). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Howes by adopting the teachings of Asaad  to provide    “ a new class of massively-parallel, distributed-memory scalable computer architectures for achieving 100 peta-OPS scale computing and beyond, at decreased cost, power and footprint.” (See Asaad, para 14).


As to claim 11, Howes teaches wherein the synchronization barrier primitive is stored in shared memory of a GPU (e.g., para 80,  86 , “GPU global cache memory 710 can be coupled to a system memory such as system memory 703, and/or graphics memory such as graphics memory 707, “Barrier synchronizer 709 includes logic to synchronize functions and processing logic on either or both GPU 702 and CPU 701. Barrier synchronizer 709 may be configured to synchronize workitems globally across groups of processors in a computer, in each individual processor, and/or within each processing element of a processor”, see FIG. 7 ).   

As to claim 12, Howes teaches wherein the synchronization barrier primitive is stored in a memory hierarchy which determines access by threads to the primitive (e.g., para 86, “barrier synchronizer 709 can be a computer program written in C or OpenCL, that when compiled and executing resides in system memory 703. In source code form and/or compiled executable form, barrier synchronizer 709 can be stored in persistent memory 704. In one embodiment, some or all of the functionality of barrier synchronizer 709 is specified in a hardware description language such as Verilog, RTL, netlists, to enable ultimately configuring a manufacturing process through the generation of maskworks/photomasks to generate a hardware device embodying aspects of the invention described herein” and “more complex cache memory hierarchies” in para 77).  

As to claim 13, Howes teaches a comparator that compares the count of the counter with a predetermined value (e.g., “614”, FIG. 6) and resets the primitive  (e.g., “620”, FIG. 6) based on results of the comparison.  

As to claim 14, Howes teaches wherein the primitive's phase indicator is structured to be read first by an ARRIVE command (e.g., “608”, FIG. 6)  and then by a WAIT command (e.g., “610”, FIG. 6) , so that a thread can determine whether the primitive's phase indicator has changed phase state ( See FIG. 5).  

As to claim 15, Howes teaches a non-transitory readable medium storing a GPU instruction set architecture (see FIG. 7) comprising:
 	an ARRIVE operation (e.g., “608”, FIG. 6. Also, see FIG. 8) that reads at least a phase indicator portion of a synchronization barrier primitive stored in memory and causes the barrier primitive to advance a counter by a first operation performed by an execution thread (e.g., “612”, FIG. 6) ; and 
 	a WAIT operation (e.g., “610”, FIG. 6)  that reads at least the phase indicator portion of the primitive stored in memory and compares the phase indicator  portion read by the ARRIVE operation with the phase indicator portion of the primitive read by the WAIT operation to determine whether the phase state of the barrier has changed ( e.g., “At operation 610, a decision is made whether the synchronization instruction is a barrier wait instruction, and if so, method 600 proceeds to operation 612”  “At operation 612, the visit count is updated. According to an embodiment, the visit count is incremented by one to indicate that workitem x reached the barrier”  “614, the sum of the updated visit count and the skip count is compared to the release threshold. If the sum is equal to or greater than the release threshold, then workitem x is the last workitem to arrive at the barrier, and the barrier is released at operation 618. Releasing the barrier, according to an embodiment, causes one or more count values to be reset and the blocked workitems to resume execution”).  
However, Howes does not explicitly teach advancing the counter by a second operation performed by a hardware operator.
Asaad teaches advancing a counter by a first operation performed by an execution thread (see rejection of claim 1 above) and advancing the counter by a second operation performed by a hardware operator (see rejection of claim 1 above). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Howes by adopting the teachings of Asaad to provide “a new class of massively-parallel, distributed-memory scalable computer architectures for achieving 100 peta-OPS scale computing and beyond, at decreased cost, power and footprint.” (See Asaad, para 14).
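
For illustration only, the ARRIVE and WAIT limitations of claim 15 (each reading the phase indicator, with WAIT comparing the two readings to determine whether the phase state has changed) can be sketched as follows. The sketch is a single-threaded model under stated assumptions and is not taken from Howes, Asaad, or the application as filed; the names `SketchPhaseBarrier`, `arrive`, and `test_wait` are hypothetical, and a real GPU instruction would perform these steps atomically.

```python
class SketchPhaseBarrier:
    """Hypothetical split arrive/wait barrier with a one-bit phase indicator."""

    def __init__(self, expected):
        self.expected = expected  # value used to reinitialize the counter on reset
        self.pending = expected   # arrivals still outstanding in this phase
        self.phase = 0            # phase indicator; flips when the phase completes

    def arrive(self):
        token = self.phase        # ARRIVE reads the phase indicator first
        self.pending -= 1         # then advances the counter
        if self.pending == 0:     # last arrival: reset counter and change phase
            self.pending = self.expected
            self.phase ^= 1
        return token

    def test_wait(self, token):
        # WAIT re-reads the phase indicator and compares it with the value
        # read by ARRIVE; a difference means the barrier has cleared.
        return self.phase != token


barrier = SketchPhaseBarrier(expected=2)
tok_a = barrier.arrive()
print(barrier.test_wait(tok_a))  # False: one arrival is still outstanding
tok_b = barrier.arrive()
print(barrier.test_wait(tok_a))  # True: the phase indicator has changed
```

A blocking WAIT would simply repeat the comparison in `test_wait` until it returns true.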

As to claim 16, Howes teaches an ADD operation that adds to a field stored with the synchronization barrier primitive (see FIG. 8), the field being used to reinitialize the primitive upon reset to a next phase state (e.g., see FIG. 3, para 45, “The barrier reset instruction resets the barrier to its original configuration” and “causes one or more count values to be reset” in para 66. Also, see “812”, FIG. 8).

As to claim 17, Howes teaches a CREATE instruction that initializes and stores the synchronization barrier primitive to memory (e.g., para 61, “Creation of the barrier b object in memory includes initializing one or more memory locations and/or registers in dynamic memory and/or hardware. For example, in relation to barrier b, several counts are required “ and “Barrier synchronizer 800 includes a workitem blocking module 802, a barrier release module 804, a barrier workitem group module 806, a barrier skip module 808, and a barrier reset module 810”,  in para 88, see FIG. 7 and 8).  


As to claim 19, Howes teaches a synchronization method (See FIGs.  7 and 8) comprising: 
 	storing in memory synchronization barrier indicia including a phase indicator and a counter associated with a set of threads (para 76, “store instructions and/or parameter values during the execution of an application on CPU cores 741 and 742, respectively.”, see FIGs. 5 and 6); 
 	executing an arrive instruction (e.g., “a barrier arrive”)  with at least one thread (e.g., one of “workitems”)  of the set of threads, thereby causing the counter to count and enabling the thread to read the phase indicator (e.g., para 86, “Barrier synchronizer 709 includes logic to synchronize functions and processing logic on either or both GPU 702 and CPU 701. Barrier synchronizer 709 may be configured to synchronize workitems globally across groups of processors in a computer, in each individual processor, and/or within each processing element of a processor” for “barrier wait, a barrier arrive, a barrier skip, and a barrier reset”, see FIG. 6) . 
 	Howes teaches further completing a task with a hardware controller, thereby causing the counter to count (e.g., para 86, “Barrier synchronizer 709 includes logic to synchronize functions and processing logic on either or both GPU 702 and CPU 701. Barrier synchronizer 709 may be configured to synchronize workitems globally across groups of processors in a computer, in each individual processor, and/or within each processing element of a processor”, “all of the functionality of barrier synchronizer 709 is specified in a hardware description language such as Verilog, RTL, netlists, to enable ultimately configuring a manufacturing process through the generation of maskworks/photomasks to generate a hardware device” and “counts required to initialize barrier b are determined and their initial values are set. The number of visiting workitems ("visit count") defines the number of workitems that have reached barrier b. A workitem has "reached" a barrier, when it issues a barrier wait instruction or equivalent. The visit count may be initialized to 0” in para 62, see FIG. 6); 
 	resetting the counter (e.g., “a barrier reset”) when the counter count indicates that the set of threads have executed arrive instructions and the hardware controller has completed the task (e.g., para 63, “barrier wait, a barrier arrive, a barrier skip, and a barrier reset”, see FIG. 6); and
 	executing a wait instruction (e.g., “a barrier wait instruction”) with the at least one thread, thereby enabling the at least one thread to again read the phase indicator (e.g., para 63-64, “barrier wait, a barrier arrive, a barrier skip, and a barrier reset.”, “At operation 610, a decision is made whether the synchronization instruction is a barrier wait instruction, and if so, method 600 proceeds to operation 612”), the at least one thread conditioning blocking on whether the phase indicator has changed values (e.g., para 66, “the blocked workitems to resume execution”, see FIG. 6).
 	However, Howes does not teach the counter associated with a task performed by a hardware controller.
 	Asaad teaches a counter associated with a set of threads and a task performed by a hardware controller (see FIGs. 6-7 and 9); causing the counter to count and enabling the thread to read the phase indicator; completing a task with the hardware controller, thereby causing the counter to count; and resetting the counter when the counter count indicates that the set of threads have executed arrive instructions and the hardware controller has completed the task. Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Howes by adopting the teachings of Asaad to provide “a new class of massively-parallel, distributed-memory scalable computer architectures for achieving 100 peta-OPS scale computing and beyond, at decreased cost, power and footprint.” (See Asaad, para 14).


As to claim 20, Howes teaches opening a window of execution (e.g., FIG. 5)  from when at least one thread executes the arrive instruction (e.g., “502”, FIG. 5) to when the at least one thread executes the wait instruction (e.g., “503”, FIG. 5),  the  at least one  thread performing work (e.g. “T1”, FIG. 5)  that is asynchronous with respect to the synchronization barrier within the window of execution (e.g., para 51-52, “At point 502, workitem T4 issues a barrier arrive instruction. The barrier arrive instruction notifies the next instance of the barrier to not wait on T4, and T4 proceeds without having to synchronize at the second instance of the barrier”, “The barrier skip instruction notifies any subsequent instances of barrier to not wait on T1, and T1 proceeds without having to synchronize at the subsequent instances of the barrier”).  


As to claim 21, Howes teaches a synchronization barrier comprising:
 	a counter (e.g., “709”, FIG. 7)  providing a synchronization barrier count (see FIG. 6, para 60, “A counting semaphore is an exemplary mechanism by which a barrier with the above semantics” and “Barrier synchronizer 709 includes logic to synchronize functions and processing logic on either or both GPU 702 and CPU 701. Barrier synchronizer 709 may be configured to synchronize workitems globally across groups of processors in a computer, in each individual processor, and/or within each processing element of a processor” in para 86, see FIG. 7); and 
 	circuitry  (e.g.,  “GPU”, “CPU”, FIG. 7) operatively connected to the counter that advances the synchronization barrier count in response to completion of operations performed by execution threads and hardware operators (e.g., see FIG. 7 and 8, para 89-91, “A barrier can be implemented using a semaphore (e.g., counting semaphore) and registers. Workitem blocking may be implemented by causing blocked workitems to wait upon the semaphore. The semaphore may be implemented in hardware or software. Workitems may be blocked when a barrier wait instruction is encountered”, “using a semaphore, and releasing the barrier may include releasing the semaphore. Workitems can be released when a barrier wait instruction is encountered and it turns out to be the last workitem to complete the requirements for number of workitems to reach the barrier”).  
However, Howes does not teach the counter that advances the synchronization barrier count in response to completion of operations performed by hardware operators. Asaad teaches circuitry (e.g., see FIG. 9) operatively connected to a counter that advances the synchronization barrier count (e.g., see FIG. 6, para 354-372, “Memory Synchronization Unit”, “The memory synchronization unit 905 shown in FIG. 6 allows grouping of memory accesses into generations and enables ordering by providing feedback when a generation of accesses has completed”) in response to completion of operations performed by execution threads (e.g., “901”, FIG. 9) and advances the synchronization barrier count (e.g., para 371, “For a synchronization operation, a unit can request an increment of the current generation and wait for previous generations to complete.”) in response to completion of operations (e.g., “determines whether a request of a client has completed”) performed by hardware operators (e.g., para 408-415, “The memory synchronization interface unit 904 includes a control unit 906 that collects and aggregates requests from one or more clients 901 (e.g., 4 thread memory synchronization controls of the L1P via decoder 902) and requests generation increments from the global generation counter unit 905 illustrated in FIG. 6 and receives current counts from that unit as well. The control unit 906 includes a respective set of registers 907 for each hardware thread. These registers may store information such as [0409] configuration for a current memory synchronization instruction issued by a core 52, [0410] when the currently operating memory synchronization instruction started, [0411] whether data has been sent to the central unit, and [0412] whether a generation change has been received”, “The control unit also tracks the changes of the global generation (gen_cnt) and determines whether a request of a client has completed. Generation completion is detected by using the reclaim pointer that is fed to observer latches in the L1P. The core waits for the L1P to handle the msyncs. Each hardware thread may be waiting for a different generation to complete. Therefore each one stores what the generation for that current memory synchronization instruction was. Each then waits individually for its respective generation to complete”). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Howes by adopting the teachings of Asaad to provide “a new class of massively-parallel, distributed-memory scalable computer architectures for achieving 100 peta-OPS scale computing and beyond, at decreased cost, power and footprint.” (See Asaad, para 14).

 
As to claim 22, Howes teaches wherein the counter resides in memory (e.g., para 61, “Creation of the barrier b object in memory includes initializing one or more memory locations and/or registers in dynamic memory and/or hardware. For example, in relation to barrier b, several counts are required”).  

As to claim 23, see rejection of claim 11 above. 

As to claim 25, Howes does not teach circuitry associated with the counter that enables gating (e.g., “using a counter device and a logic gate”) of further execution of the same or different execution thread based on the state of the counter. However, Asaad teaches circuitry associated with the counter that enables gating of further execution of the same or different execution thread based on the state of the counter (e.g., see para 1670 and 1672, “clock generation circuit (e.g., the circuit 100 shown in FIG. 1) may receive a clock signal, e.g., a from a clock synthesizer 110, and generate a pulse width modified clock signal, e.g., by using a counter device and a logic gate.” and “the hardware module 120 includes an incrementing counter device and a logical exclusive OR gate, the value of the counter device increments from 0 to 3 every rising edge of the first clock signal 220” for “advance to be called on one or more threads” in para 1536). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Howes by adopting the teachings of Asaad to provide “a new class of massively-parallel, distributed-memory scalable computer architectures for achieving 100 peta-OPS scale computing and beyond, at decreased cost, power and footprint.” (See Asaad, para 14).


Claims 5, 18 and 24 are rejected under 35 U.S.C. 103 as being unpatentable over Howes et al. (US 2013/0117750, Howes hereinafter) in view of Asaad et al. (US 2011/0219208, Asaad hereinafter), as applied to claims 1, 15 and 21 above, and further in view of Ramesh et al. (US 2010/0250809, Ramesh hereinafter).
As to claim 5, Howes further teaches wherein the data structure is structured to be reset (see FIG. 5 and 6). However, Howes and Asaad do not teach in response to a fused load/store atomic that can be initiated by either a hardware engine or a software thread. Ramesh teaches wherein the data structure is structured to be reset in response to a fused load/store atomic that can be initiated by either a hardware engine or a software thread (e.g., para 38, “Counters may be updated based on atomic instructions. When contending tasks (or threads, processes) block, kernel calls may be made carrying sequence counts of the counters as part of argument data for the calls. The kernel may not need to access or modify a lock associated with the counters to support a synchronization primitive. Thus, lock content may remain at user level (or in user space) in an operating environment” and “a synchronization library 117 includes an atomic operation module 119 to update counters for a lock. An atomic operation may be a set of operations that can be combined together appeared as one single operation with only two possible outcomes as either a success or a failure. For example, the atomic operation module 119 may implement an atomic operation using CAS (Compare And Swap) instructions provided by a processor. Other instructions which can be used to implement lock-free or wait-free algorithms for atomic operations may be included” in para 42). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to further modify the method of Howes and Asaad by adopting the teachings of Ramesh to have “implementation of the synchronizer into the kernel, so the scheduler always knows which thread or process holds the interlock and can prevent preemption for the duration” (See Ramesh, para 5).
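
For illustration only, a counter update built on a compare-and-swap (CAS) retry loop, with the final update reloading a reset value, might be sketched as follows. This is a sketch under stated assumptions and not the implementation of Ramesh or of the claims: AtomicCell, atomic_decrement_or_reset and run_reset_demo are hypothetical names, the lock inside AtomicCell merely emulates the atomicity a processor CAS instruction would provide, and every participant (whether it models a software thread or a hardware engine) is represented by a Python thread.

```python
import threading


class AtomicCell:
    """Emulated single-word atomic cell; the lock stands in for hardware CAS."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        """Store new only if the cell holds expected; report success."""
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False


def atomic_decrement_or_reset(cell, reset_value):
    """Decrement the counter; the update that would reach zero reloads
    reset_value instead. Returns True for the caller that did the reset."""
    while True:
        current = cell.load()
        last = current == 1
        new = reset_value if last else current - 1
        if cell.compare_and_swap(current, new):
            return last
        # Another participant changed the cell between load and swap: retry.


def run_reset_demo(participants=8):
    cell = AtomicCell(participants)
    results = []
    results_lock = threading.Lock()

    def participant():
        did_reset = atomic_decrement_or_reset(cell, participants)
        with results_lock:
            results.append(did_reset)

    threads = [threading.Thread(target=participant)
               for _ in range(participants)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results.count(True), cell.load()
```

In this sketch, exactly one participant observes the final count and performs the reset, so the counter is re-armed without a separate lock held by the callers.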


As to claim 18, Howes and Asaad do not teach a fused load/store instruction that permits a hardware-based engine to reset the barrier primitive when the hardware-based engine completes a task assigned to it. However, Ramesh teaches a fused load/store instruction that permits a hardware-based engine to reset the barrier primitive when the hardware-based engine completes a task assigned to it (see rejection of claim 5 above). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to further modify the method of Howes and Asaad by adopting the teachings of Ramesh to have “implementation of the synchronizer into the kernel, so the scheduler always knows which thread or process holds the interlock and can prevent preemption for the duration” (See Ramesh, para 5).


As to claim 24, Howes and Asaad do not teach wherein the circuitry is structured to reset the synchronization barrier count in response to a fused load/store atomic that can be initiated by each of copy hardware and software thread execution. However, Ramesh teaches wherein the circuitry is structured to reset the synchronization barrier count in response to a fused load/store atomic that can be initiated by each of copy hardware and software thread execution (e.g., para 38, “Counters may be updated based on atomic instructions. When contending tasks (or threads, processes) block, kernel calls may be made carrying sequence counts of the counters as part of argument data for the calls. The kernel may not need to access or modify a lock associated with the counters to support a synchronization primitive. Thus, lock content may remain at user level (or in user space) in an operating environment” and “a synchronization library 117 includes an atomic operation module 119 to update counters for a lock. An atomic operation may be a set of operations that can be combined together appeared as one single operation with only two possible outcomes as either a success or a failure. For example, the atomic operation module 119 may implement an atomic operation using CAS (Compare And Swap) instructions provided by a processor. Other instructions which can be used to implement lock-free or wait-free algorithms for atomic operations may be included” in para 42). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to further modify the method of Howes and Asaad by adopting the teachings of Ramesh to have “implementation of the synchronizer into the kernel, so the scheduler always knows which thread or process holds the interlock and can prevent preemption for the duration” (See Ramesh, para 5).


Claims 7 and 9 are rejected under 35 U.S.C. 103 as being unpatentable over Howes et al. (US 2013/0117750, Howes hereinafter) in view of Asaad et al. (US 2011/0219208, Asaad hereinafter), as applied to claim 6 above, and further in view of Rashid et al. (US 2017/0161100, Rashid hereinafter).

As to claim 7, Howes further teaches wherein the counter counts an aggregate of the number of execution thread arrive calls (e.g., see “624”, “626”, FIG. 6). However, Howes and Asaad do not teach the counter counts an aggregate of the number of copy operation completions. Rashid teaches wherein the counter counts an aggregate of the number of copy operation completions and the number of execution thread arrive calls (e.g., para 45, “Managing Copy Operations in Complex Processor Topologies” and “a logical copy engine in the copy subsystem of FIG. 4 coordinates a semaphore release operation by distributing barriers to physical copy engines”, “two copy commands, D→E and F→G, and a semaphore release command. The semaphore release command is dependent on the successful execution of copy commands D→E and F→G.”, “the semaphore release command in command queue 612-2. LCE 402 also distributes various barriers to PCEs 412-0 through 412-2 to be queued for processing”, “PCE 412-1 then transmits barrier signal 720 to indicate that copy command F→G has been executed. When LCE 402 receives barrier signals 716 and 720, LCE 402 then notifies PCE 412-2 that execution of the semaphore release command may commence” in para 77-79). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to further modify the method of Howes and Asaad by adopting the teachings of Rashid to “perform load balancing of copy operations across the different links.” (See Rashid, para 14).
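
For illustration only, a single counter that aggregates execution thread arrive calls and copy operation completions might be sketched as follows. This is a sketch under stated assumptions and not the implementation of Howes, Asaad, Rashid or the claims: AggregateCounter and run_copy_demo are hypothetical names, and Python threads stand in for the copy engines.

```python
import threading


class AggregateCounter:
    """Hypothetical counter that aggregates thread arrivals and copy
    completions into one outstanding count; all names are illustrative."""

    def __init__(self, expected_arrivals, expected_copies):
        # One count covers both kinds of events.
        self._remaining = expected_arrivals + expected_copies
        self._cond = threading.Condition()

    def _signal(self):
        with self._cond:
            self._remaining -= 1
            if self._remaining == 0:
                self._cond.notify_all()

    def thread_arrive(self):
        """Called by an execution thread when it reaches the barrier."""
        self._signal()

    def copy_complete(self):
        """Called on behalf of a copy engine when its copy has finished."""
        self._signal()

    def wait(self):
        """Block until all expected arrivals and copy completions are counted."""
        with self._cond:
            self._cond.wait_for(lambda: self._remaining == 0)


def run_copy_demo():
    src = list(range(8))
    dst = [None] * 8
    counter = AggregateCounter(expected_arrivals=2, expected_copies=2)
    seen = [None, None]

    def copy_engine(lo, hi):
        dst[lo:hi] = src[lo:hi]     # stand-in for a hardware copy operation
        counter.copy_complete()

    def compute_thread(i):
        counter.thread_arrive()
        counter.wait()              # released only after both copies are done
        seen[i] = sum(dst)

    workers = [
        threading.Thread(target=copy_engine, args=(0, 4)),
        threading.Thread(target=copy_engine, args=(4, 8)),
        threading.Thread(target=compute_thread, args=(0,)),
        threading.Thread(target=compute_thread, args=(1,)),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return seen
```

In this sketch, a waiting thread is released only when both the thread arrivals and the copy completions have been counted, so the copied data can be read safely after wait() returns.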


As to claim 9, Howes further teaches wherein the primitive stored in memory further comprises a predetermined value, and hardware resets the counter by loading the predetermined value when the counter indicates that all threads in a collection of threads (see FIG. 5 and 6). However, Howes and Asaad do not teach that copy operations have reached a synchronization point and all copy operations in said collection have completed. Rashid teaches this limitation (e.g., “... "copy complete" action. In one embodiment, the copy complete action is the execution of a flush command. In another embodiment, the copy complete action involves notifying one or more PCEs 412 that a blocking barrier may be discarded and that subsequent processing of copy operations and/or barriers may commence” in para 93). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to further modify the method of Howes and Asaad by adopting the teachings of Rashid to “perform load balancing of copy operations across the different links.” (See Rashid, para 14).

Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Howes et al. (US 2013/0117750, Howes hereinafter) in view of Asaad et al. (US 2011/0219208, Asaad hereinafter) and Rashid et al. (US 2017/0161100, Rashid hereinafter), as applied to claim 9 above, and further in view of Ramesh et al. (US 2010/0250809, Ramesh hereinafter).
As to claim 10, Howes, Asaad, and Rashid do not teach wherein the system is structured to allow a thread to dynamically change the predetermined value. However, Ramesh teaches wherein the system is structured to allow a thread to dynamically change the predetermined value (e.g., para 38, “Counters may be updated based on atomic instructions. When contending tasks (or threads, processes) block, kernel calls may be made carrying sequence counts of the counters as part of argument data for the calls. The kernel may not need to access or modify a lock associated with the counters to support a synchronization primitive. Thus, lock content may remain at user level (or in user space) in an operating environment” and “a synchronization library 117 includes an atomic operation module 119 to update counters for a lock. An atomic operation may be a set of operations that can be combined together appeared as one single operation with only two possible outcomes as either a success or a failure. For example, the atomic operation module 119 may implement an atomic operation using CAS (Compare And Swap) instructions provided by a processor. Other instructions which can be used to implement lock-free or wait-free algorithms for atomic operations may be included” in para 42). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to further modify the method of Howes, Asaad, and Rashid by adopting the teachings of Ramesh to have “implementation of the synchronizer into the kernel, so the scheduler always knows which thread or process holds the interlock and can prevent preemption for the duration” (See Ramesh, para 5).

Response to Arguments
Claim Rejections under 35 U.S.C. § 101: 
Applicant’s argument regarding the § 101 rejection is found to be persuasive. Accordingly, the rejection has been withdrawn.

Claim Rejections - 35 USC § 103
	Applicant argues that: 
“claim amendments appear to overcome the current prior art rejections and that a new search is needed.”
In response, Asaad et al. (US 2011/0219208) is added only as directly corresponding evidence to support the prior common knowledge finding as stated above.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ABDOU K SEYE whose telephone number is (571)270-1062. The examiner can normally be reached M-F 9-5:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Hyung SOUGH, can be reached at (571) 272-6799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/ABDOU K SEYE/Examiner, Art Unit 2194
/s. sough/spe, art unit 2192/2194