DETAILED ACTION

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This non-final office action is responsive to the RCE filed on 01/04/2021.
Claims 1-3, 5-6, 8-10, 12-13, 15-17, 19-23 are pending.

Response to Amendment

Applicant has amended independent claims 1, 8, 15 and dependent claims 2, 5, 12, 19 to include new/old limitations in a form not previously presented, necessitating a new search and further consideration.  Claims 4, 7, 11, 14, and 18 have been canceled by the Applicant.


Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b):

(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.



Claims 1-3, 5-6, 8-10, 12-13, 15-17, 19-23 are rejected under 35 U.S.C. 112 (b) as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor regards as the invention.
The following claim language is not clearly understood:
Claim 1 recites “replace the threads with threads of the second thread group.” It is unclear whether the threads of the second thread group are in the barrier stall state or in an executing state.
Claims 8 and 15 recite elements of claim 1 and have a deficiency similar to that of claim 1. Therefore, they are rejected for the same rationale. The remaining dependent claims are also rejected due to their dependency on the rejected independent claims.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.




Claims 1-3, 5-6, 8-10, 12-13, 15-17, 19-23 are rejected under 35 U.S.C. 103(a) as being unpatentable over Jin et al. (US Pub. No. 2016/0321777 A1, hereafter Jin) in view of Paltashev et al. (US Pub. No. 2010/0115249 A1, hereafter Paltashev) and further in view of Targowski (US Pub. No. 2014/0007111 A1, hereafter Targowski).
Jin and Targowski were cited in the last office action.


Highlighted claim elements are missing from the respective cited prior art.

As per claim 1, Jin teaches the invention substantially as claimed including an apparatus comprising: 
a thread scheduler to schedule thread groups across multiple dies representing multiple graphics processors ([0004] plurality of GPU cards (i.e. die) on a server [0030] thread scheduling fig 12 GPU0-GPU7, worker groups 0-3, workers, table 1 binding between GPU cards, worker groups, each GPU group is connected by PCIe; fig 2 104-108-GPU 110), the multiple graphics processors including separate sets of processing resources ([0071] each two GPU sockets are installed to a GPU-specific PCI slot, GPU sockets 0, 1, 2 and 3 are installed to a CPU through a PCIe switch, GPU sockets 4, 5, 6 and 7 are installed to another CPU, and the two CPUs are connected through IOH; fig 3 GPU groups), wherein the thread scheduler is to schedule a first thread group for execution on a first set of processing resources of a first graphics processor on a first die ([0030] thread scheduling [0009] worker threads, controlling worker groups, one or more GPUs [0010] binding each worker thread to a corresponding GPU [0012] controlling the plurality of GPUs to perform data processing in parallel through the worker threads fig 12 workers, CPU thread 1 worker group 0 GPU 0-1, table 1 binding between GPU data parallel CPU threads, GPU cards and worker groups worker group 0 PCIe switch) and a second set of processing resources of a second graphics processor on a second die ([0030] thread scheduling [0009] worker threads, controlling worker groups, one or more GPUs [0010] binding each worker thread to a corresponding GPU [0012] controlling the plurality of GPUs to perform data processing in parallel through the worker threads fig 12 workers CPU thread 2 worker group 2 GPU 4-5, table 1 binding between GPU data parallel CPU threads, GPU cards and worker groups worker groups 2-3, GPU 4-7), the second die different from the first die (fig 12 GPU0-1, GPU4-5, each GPU as well as the GPU groups are on separate dies as they are connected through a PCIe switch and I/O hub [0004] plurality of GPU cards (i.e. each card is a separate die)); and
a hardware barrier logic to facilitate barrier synchronization of the thread groups across the multiple dies via a multi-die barrier (fig 5 full thread barrier, global/local thread synchronization [0075] CPU threads bound to GPU worker groups are in full thread barrier state, each worker group is associated with separate GPU groups on separate dies i.e. multiple GPU cards indicate multi-die barrier [0076] CPU threads bound to GPU worker groups are in full-thread barrier state [0127] synchronization waiting control logic, workers, controlled, worker group engine [0075] all CPU threads bound to the GPU worker groups are in full thread barrier state i.e. multiple GPU cards indicate multi-die barrier [0076] all CPU threads, bound, GPU worker groups are in full thread barrier state i.e. multi-die barrier fig 23 S112 fig 30 311),
wherein the multi-die barrier includes a barrier instruction executed by the multiple graphics processors on the multiple dies to synchronize execution of the thread groups scheduled across the multiple dies (fig 23 make thread processing of all execution engines in a barrier state S112 fig 5 full thread barrier, global/local thread synchronization i.e. multiple GPU cards indicate multi-die barrier), 

wherein in response to a determination that threads of the first thread group have entered a barrier stall state caused by execution of the barrier instruction of the multi-die barrier ([0075] all CPU threads bound to the GPU worker groups are in full thread barrier state i.e. multiple GPU cards indicate multi-die barrier [0076] all CPU threads, bound, GPU worker groups are in full thread barrier state i.e. multi-die barrier fig 23 S112 fig 30 311), the thread scheduler is to replace the threads of the first thread group in the barrier stall state with threads of a second thread group.


Jin doesn’t specifically teach hardware logic, barrier instructions executed by multiple graphics processors, wherein in response to a determination that threads of the first thread group have entered a barrier stall state, the thread scheduler is to replace the threads of the first thread group in the barrier stall state with threads of a second thread group.

Paltashev however teaches hardware logic ([0055] logic, peripheral component interconnect exchange PCI-E interface fig 6 PCI-E memory redirection logic 612 CPU chip set) to facilitate barrier synchronization of the thread groups across the multiple dies via a multi-die barrier ([0058] fig 6 GPUs 602-608 synchronization in multiple GPU configuration [0057] inter-GPU synchronization, plurality of GPUs, barrier sync [0074] fig 9 sync barrier 910 multiple GPUs), 
barrier instructions executed by multiple graphics processors ([0057] inter-GPU synchronization, plurality of GPU, barrier sync [0058] GPUs, send fence/wait miss to CPU chipset [0033] GPU commands [0059] multiple producers/consumers; fig 9 [0055] plurality of GPU, connected, chipset interface fig 6 602-608 612), 
wherein in response to a determination that threads of the first thread group have entered a barrier stall state ( [0054] GPU, switch, stalled context [0074] multiple context GPUs, spinning wait [0068] first set of GPU, use data, generated by a second set of GPU [0069] [0071] synchronization, wait command, stall GPUC/D [0072] synchronization barrier, two context in GPU A 902 and GPU B 904 can reach a point where  GPU C906 and GPU D 908 may start processing [0063] ), the thread scheduler is to replace the threads of the first thread group in the barrier stall state with threads of a second thread group ([0054] switch stalled context and execute another one [0074] multiple context GPUs with a context switch and spinning wait - context is comparable to threads [0068] first set of GPU, use data, generated by a second set of GPU [0069] [0072] fig 11 inter-GPU barrier synchronization [0063]).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Jin with the teachings of Paltashev (PCI-E memory logic; inter-GPU synchronization in which a plurality of GPUs are configured to synchronize without CPU intervention using a plurality of GPU commands implementing barrier synchronization; and stalling during inter-GPU synchronization, with the stalled context switched out and another context executed in multi-context, multi-GPU synchronization) to improve efficiency and to provide, in the method of Jin, hardware logic and barrier instructions executed by multiple graphics processors, wherein in response to a determination that threads have entered a barrier stall state, the thread scheduler is to replace the threads in the barrier stall state with threads, as in the instant invention. The combination of the cited analogous art (Jin [0004] [0007]; Paltashev [0004] [0008]) would have been obvious because substituting the known methods of PCI-E memory redirection logic and inter-GPU synchronization by executing synchronization barrier GPU commands and switching the stalled context with another context, as taught by Paltashev, into the known method of parallel processing in which multiple worker threads control multiple worker groups including multiple GPUs would yield the predictable result of hardware barrier logic, barrier instructions executed by multiple graphics processors, and switching a stalled thread to another thread in response to a determination that threads have entered a stalled state, with a reasonable expectation of success, and is motivated by the improved efficiency (Jin [0007] [0030]; Paltashev [0005]).

Jin and Paltashev, in combination, replace the threads of the first thread group with threads of a second thread group.

Targowski, however, teaches in response to a determination that one or more threads of a first thread group have entered a barrier stall state ([0024] thread, group, if it is executing/halting and waiting at a synchronization barrier, threshold met [0025] thread, stopped, synchronization barrier, fig 1 130 [0028] fig 2 220), scheduler to replace the threads of the first thread group with threads of a second thread group ([0024] waiting EU threads may be preempted to allow executions of threads from another thread group fig 1 pre-empt EU threads waiting at synchronization barrier 140 begin execution of next thread group [0025] EU threads of first thread group, preempted, EU occupied by preempted threads may be made available for use by a second thread group fig 1 yes-130 pre-empt EU threads waiting at synchronization barrier 140 150 [0028] fig 2 220 [0026] some or all of the EU threads waiting at the synchronization barrier may be preempted, EU threads of the next thread group may begin execution at these newly available EUs).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Jin and Paltashev with the teachings of Targowski (scheduling logic pre-empting some or all of the EU threads and executing threads from the next thread group) to improve efficiency and to provide, in the method of Jin and Paltashev, replacement of the threads of the first thread group with threads of a second thread group, as in the instant invention. The combination of the cited analogous art (Jin [0004] [0007]; Paltashev [0004] [0008]; Targowski [0002] [0024]) would have been obvious because applying the known methods of pre-empting threads waiting at a synchronization barrier and beginning execution of the next thread group upon a determination that threads of the first thread group have reached the synchronization barrier, as taught by Targowski, to the synchronization barrier of threads associated with multiple graphics processor cards taught by Jin and Paltashev would yield the predictable result of replacing stalled threads of the first thread group with threads of a second thread group, with a reasonable expectation of success, and is motivated by improved efficiency (Jin [0007] [0030]; Paltashev [0005]; Targowski [0002] [0013]).
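For illustration only, the limitation discussed above (replacing threads that have stalled at a barrier with threads of a second thread group) can be modeled with the minimal Python sketch below. It is not taken from Jin, Paltashev, Targowski, or the application; the names `Thread`, `Scheduler`, `on_barrier`, and `replace_stalled` are hypothetical, and each "slot" is an assumed stand-in for one execution resource.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    READY = "ready"
    RUNNING = "running"
    BARRIER_STALL = "barrier_stall"


@dataclass
class Thread:
    group_id: int
    state: State = State.READY


@dataclass
class Scheduler:
    # Hypothetical model: one list entry per occupied execution resource.
    slots: list = field(default_factory=list)
    ready_queue: list = field(default_factory=list)

    def dispatch(self, threads):
        """Place threads on execution resources and mark them running."""
        for t in threads:
            t.state = State.RUNNING
            self.slots.append(t)

    def on_barrier(self, thread):
        """Record that a thread executed the barrier and is now stalled."""
        thread.state = State.BARRIER_STALL

    def replace_stalled(self):
        """Swap each barrier-stalled thread for a ready thread, if one exists."""
        for i, t in enumerate(self.slots):
            if t.state is State.BARRIER_STALL and self.ready_queue:
                incoming = self.ready_queue.pop(0)
                incoming.state = State.RUNNING
                self.slots[i] = incoming


def demo():
    sched = Scheduler()
    group1 = [Thread(group_id=1) for _ in range(2)]
    group2 = [Thread(group_id=2) for _ in range(2)]
    sched.dispatch(group1)            # first thread group occupies the slots
    sched.ready_queue.extend(group2)  # second thread group waits to run
    for t in group1:
        sched.on_barrier(t)           # first group reaches the barrier
    sched.replace_stalled()
    return [t.group_id for t in sched.slots], [t.state for t in group1]
```

In this sketch the stalled threads keep their `BARRIER_STALL` state after being swapped out, so a later step could resume them once the barrier is released; that resume step is omitted here.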


As per claim 2, Paltashev teaches the hardware barrier logic is to facilitate the barrier synchronization using a barrier command having thread group identifications (IDs) corresponding to the thread groups, wherein each thread group is assigned a thread group ID (fig 6 PCI-E memory redirection logic 612 CPU chipset 610 GPUs with local memory 602-608 Fence/Wait miss command [0033] In some GPUs, a synchronization mechanism may include a plurality of GPU commands, a fence command, and a wait command implementing internal GPU pipeline barrier type synchronization [0075] local GPU queue, multiple contexts, application run list A 1002 Application run list B 1006 - runlist name is equivalent to thread group id [0076] local run list and context execution control blocks 1106a, 1106t of the GPUs 1108a, 1108t can provide management of such type synchronization).
Targowski teaches remaining claim elements of command having thread group identification corresponding to the thread groups ([0033] determination, sufficient, number, of EU, in use, by the thread group, synchronization [0034], GPU, number of EU threads, EU thread 1, once all threads of a thread group have reached the synchronization barrier, group may continue executing).


As per claim 3, Jin teaches the thread groups are synchronized across the multiple graphics processors using one or more fabric crossbars (fig 5 grp worker group 0 -1 global/local synchronization [0091] synchronization module, synchronize parameters, different GPUs [0070] GPU connected to the processors, peripheral interface bus, PCIe bus fig 3 PCIe switches), and multi-die barrier hardware for the multiple graphics processors (fig 2 processors 104 GPUs 110 fig 5 full thread barrier, global/local thread synchronization).
Paltashev teaches remaining claim elements of multi-die barrier hardware (fig 6 GPUs 602-608 PCI-E memory redirection logic 612 CPU chipset 610 fig 7 sync barrier 708).


As per claim 5, Jin teaches thread scheduler to schedule one or more of the thread groups across the multiple graphics processors ([0030] scheduling of each GPU implemented by exclusive CPU thread fig 12 workers, worker groups, set of GPUs, set of CPU) is to map a shared local memory space of the one or more of the thread groups (fig 6 video memory, batch data [0151] load plurality of batches of training data from memory to GPU video memories in the plurality of worker group) to a memory space that is global to the multiple graphics processors (fig 6 main memory batch data; Paltashev GPU with local memory 602-608, CPU system memory 614).

As per claim 6, Paltashev teaches wherein the thread scheduler (fig 10 GPU context scheduler 1010) is to maintain a list of thread groups scheduled across the multiple dies  (fig 10 GPU 1028 application run list, multiple contexts fig 11 local runlist 1102 contexts 1103 GPU A-T 1108 ) and a status of threads within the list of thread groups ([0077] maintain and monitor each context status), preempt the threads of the first thread group based on the list of thread groups and the status of threads with the list of thread groups ([0008] stopping execution of a current context and switching to a new context [0054] A plurality of GPUs may be configured to execute a plurality of contexts and, if inter-GPU synchronization procedure stalls a particular context for a long time, the GPU can be configured to switch stalled context and execute another one fig 10-11 [0077]) and maintain a list of preempted thread groups and status of preempted threads with the list of preempted thread groups (fig 12 context, status, suspended  fig 13 context 0/1/L suspended context saved state [0086]  fig 16 context description and control registers: CTX status register 1602 suspended 1624 CTX switch definition register 1646 fig 18  1838).
Targowski also teaches similar claim elements of the thread scheduler ([0031] scheduling logic, thread group) is to maintain a list of thread groups (fig 6 thread management logic, EU threads 640… [0036] threads, send, status information, thread management logic 610), preempt the threads of the first thread group based on the list of thread groups and the status of the threads within the list of thread groups ([0036] threads, send, status information, thread management logic 610, reached synchronization barrier, value of thread counter, timer, preemption logic 660 EU thread preempted in favor of EU thread from another thread group), and maintain a list of preempted thread groups ([0025] preempted threads fig 6 thread management logic, preemption logic) and status for preempted threads within the list of preempted thread groups (fig 1 receive status information for EU threads 120 [0025] status information, EU thread has stopped , additional status information, preempted threads, fig 4 enough time elapsed since 1st EU thread reaches sync barrier - yes/no).
As per claim 22, Jin teaches preempting threads across the multiple dies based on the list of thread groups scheduled across the multiple dies (fig 12 worker threads, worker groups, plurality of GPUs [0004] plurality of GPU cards (i.e. die) on a server [0030] scheduling of each GPU implemented by exclusive CPU thread, table 1 - worker group# CPU thread# GPU#).
Targowski teaches remaining claim elements of preempting threads (fig 1 pre-empt EU threads 140 [0024] waiting EU threads may be preempted to allow execution of threads from another thread group [0036]) based on list of thread groups (fig 5 EUT 1 - multiple threads including thread-1…5…, EUT M - threads …N-1 N).
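For illustration only, the bookkeeping recited in claims 6 and 22 (a list of scheduled thread groups with per-thread status, and a separate list of preempted thread groups with status) can be sketched in Python as follows. This is an assumed data layout, not one disclosed by Jin, Paltashev, or Targowski; the function names, the status strings, and the dictionary structure are all hypothetical.

```python
def schedule(table, group_id, die, thread_ids):
    """Record a thread group scheduled on a given die, with all threads running."""
    table[group_id] = {
        "die": die,
        "threads": {tid: "running" for tid in thread_ids},
    }


def mark_waiting(table, group_id, thread_id):
    """Update the status of one thread that has reached the barrier."""
    table[group_id]["threads"][thread_id] = "waiting_at_barrier"


def preempt_waiting(table, preempted):
    """Preempt every waiting thread and record it in the preempted list."""
    for group_id, entry in table.items():
        for tid, status in list(entry["threads"].items()):
            if status == "waiting_at_barrier":
                preempted.setdefault(group_id, {})[tid] = "preempted"
                entry["threads"][tid] = "preempted"
    return preempted


# Usage: two thread groups on two dies; one thread of group 7 hits the barrier.
table = {}
schedule(table, group_id=7, die=0, thread_ids=[0, 1])
schedule(table, group_id=8, die=1, thread_ids=[0])
mark_waiting(table, group_id=7, thread_id=1)
preempted = preempt_waiting(table, {})
```

The preemption decision here reads only the group list and the per-thread status, which mirrors the "based on the list of thread groups and the status of threads" wording without implying any particular hardware structure.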

Claim 8 recites a method with limitations similar to those of claim 1. Therefore, it is rejected for the same rationale.
Claim 9 recites a method with limitations similar to those of claim 2. Therefore, it is rejected for the same rationale.
Claim 10 recites a method with limitations similar to those of claim 3. Therefore, it is rejected for the same rationale.
Claim 12 recites a method with limitations similar to those of claim 5. Therefore, it is rejected for the same rationale.
Claim 13 recites a method with limitations similar to those of claim 6. Therefore, it is rejected for the same rationale.
Claim 23 recites a method with limitations similar to those of claim 22. Therefore, it is rejected for the same rationale.

Claim 15 recites a non-transitory machine-readable medium comprising instructions that, when executed by a computing device, cause the computing device (Jin [0266]) to perform operations similar to those recited in claim 1. Therefore, it is rejected for the same rationale.

Claim 16 recites a non-transitory machine-readable medium with limitations similar to those of claim 2. Therefore, it is rejected for the same rationale.
Claim 17 recites a non-transitory machine-readable medium with limitations similar to those of claim 3. Therefore, it is rejected for the same rationale.
Claim 19 recites a non-transitory machine-readable medium with limitations similar to those of claim 5. Therefore, it is rejected for the same rationale.
Claim 20 recites a non-transitory machine-readable medium with limitations similar to those of claim 6. Therefore, it is rejected for the same rationale.
Claim 21 recites a non-transitory machine-readable medium with limitations similar to those of claim 22. Therefore, it is rejected for the same rationale.


Response to Arguments
The previous 35 USC 112(b) rejections have been withdrawn. 
Applicant's arguments filed on 01/04/2021 have been fully considered but they are moot in view of the new grounds of rejection.

Examiner's Note

Applicant is further reminded that the paragraphs and figures cited in the references as applied to the claims above are provided for the convenience of the applicant(s). Although the specified citations are representative of the teachings of the art and are applied to the specific limitations within the individual claims, other passages and figures may apply as well. In preparing responses, applicant is respectfully requested to fully consider each of the references in its entirety as potentially teaching all or part of the claimed invention, as well as the context of the passage as taught by the prior art or disclosed by the examiner.

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Asaro; Anthony et al. (US 20130263141 A1) teaches visibility ordering in a memory model for a unified computing system.
Gaster; Benedict et al. (US 20110022817 A1) teaches mapping processing logic having data-parallel threads across processors.
Grover; Vinod et al. (US 20090259828 A1) teaches execution of retargetted graphics processor accelerated code by a general purpose processor.
Houston; Michael C. et al. (US 20130326524 A1) teaches method and system for synchronization of workitems with divergent control flow.
Potter; Terence M. et al. (US 20180182154 A1) teaches resource synchronization for graphics processing.
Solinas; Angelo et al. (US 20110252264 A1) teaches physical manager of synchronization barrier between multiple processes.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to ABU ZAR GHAFFARI whose telephone number is (571)270-3799.  The examiner can normally be reached on Monday-Thursday 9:00 - 17:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.  
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Meng-Ai AN can be reached on 571-272-3756.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/ABU ZAR GHAFFARI/Primary Examiner, Art Unit 2195