DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This Office Action is in response to Applicant’s Amendment and Remarks filed on 05 April 2021. 
Claims 1-20 are pending in this application.


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-3, 5-6, 8-10, 12-13, 15-17 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Rossbach et al. (US. Pub. 2013/0232495 A1) in view of SHI et al. (US. Pub. 2010/0223591 A1) and further in view of Turner et al. (US Pub. 2018/0081804 A1) and Ellis et al. (US. Patent 9,244,652 B1).
Rossbach, SHI and Ellis were cited in the previous Office Action.

As per claim 1, Rossbach teaches the invention substantially as claimed including A system, comprising: 
a computing device comprising a processor and a memory (Rossbach, Fig. 10, 1000 (computing device), 1002 processing unit, 1004 system memory) ; and machine-readable instructions stored in the memory that, when executed by the processor, cause the computing device to at least (Rossbach, [0090] lines 1-2, Computer-executable instructions, such as program modules, being executed by a computer may be used): 
generate a directed acyclic graph (DAG) representing a workload assigned to a virtualized compute accelerator (Rossbach, Fig. 2, 200 (DAG); Fig. 3, 140 Accelerator interface (as virtualized compute accelerator); Abstract, lines 3-6, A graph is generated with a node corresponding to each of the accelerator tasks with edges that represent the data flow and data dependencies between the accelerator tasks; [0003] lines 2-3, encapsulate snippets of executable code of a program (as workload) into accelerator tasks; [0026] lines 1-4, the graph 200 provides information about dataflow and concurrency that may be used by the accelerator interface 140 to schedule the execution of the accelerator tasks on the accelerators 120a-c) wherein: 
the workload comprises a plurality of compute kernels and the DAG comprising a plurality of nodes and a plurality of edges (Rossbach, [0003] lines 2-3, encapsulate snippets of executable code of a program (as workload) into accelerator tasks (as plurality of compute kernels); Fig. 2, 207, 209, 211 (nodes), edges (between nodes); Abstract, lines 3-6, A graph is generated with a node corresponding to each of the accelerator tasks with edges that represent the data flow and data dependencies between the accelerator tasks), 
each of the nodes represents a respective compute kernel (Rossbach, [0003] lines 3-4, A graph is generated with a node corresponding to each of the accelerator tasks), 
each of the edges represents a dependency between a respective pair of the compute kernels (Rossbach, Abstract, lines 3-6, A graph is generated with a node corresponding to each of the accelerator tasks with edges that represent the data flow and data dependencies between the accelerator tasks), and 
the virtualized compute accelerator represents a logical interface for a plurality of compute accelerators (Rossbach, Fig. 3, 140 Accelerator interface (as virtualized compute accelerator), 350 available accelerators (as plurality of compute accelerators)); and
assign compute kernels to a respective one of the plurality of compute accelerators (Rossbach, Fig. 1, 120a-120c accelerator; [0026] lines 10-13, the accelerator interface 140 may execute the accelerator tasks associated with the nodes 207 and 209 in parallel on the accelerators 120a and 120b),
wherein the compute accelerator accesses a working set, the working set comprising initialization data to initialize execution of the compute kernels (Rossbach, Fig. 1, 120a-c accelerators, 130a-c memory; Fig. 4, 405 b-c buffer; [0029] lines 7-11, When executing an accelerator task at a particular accelerator 120a-c, the accelerator interface 140 may ensure that a current version of the data associated with a datablock used by the accelerator task is in a buffer at the particular accelerator 120a-c; [0036] lines 5-14, before the accelerator 120a begins executing the accelerator task, the datablock manager 310 may determine if current versions of the data (as initialization data) associated with the datablocks 201 and 203 are stored in buffers (as working set) in the memory 130a of the accelerator 120a, and if not, the datablock manager 310 may copy the current versions of the data to buffers in the memory 130a of the accelerator 120a. The datablock manager 310 may then update the pointers and/or indicators associated with the datablocks 201 and 203, and may allow the accelerator 120a to begin executing the accelerator task (as accelerator initiating the execution by accessing/using the data (as initialization data) stored in the buffer (as working set))).

Rossbach fails to specifically teach analyze the DAG to identify sets of dependent compute kernels, a respective set of dependent compute kernels being independent of other sets of dependent compute kernels and execution of at least one compute kernel in the respective set of dependent compute kernels depending on a previous execution of another compute kernel in the respective set of dependent compute kernels.

However, SHI teaches analyze the DAG to identify sets of dependent compute kernels, a respective set of dependent compute kernels being independent of other sets of dependent compute kernels and execution of at least one compute kernel in the respective set of dependent compute kernels depending on a previous execution of another compute kernel in the respective set of dependent compute kernels (SHI, Fig. 7, DAG; Fig. 8, Independent path groups; Abstract, lines 5-6, obtaining (as identify) an independent path group according to the DAG composition logic; [0062] lines 2-7, The composition logic obtained by the web service composition engine is represented in the DAG mode. Each node denotes operations of a web service. An edge denotes the output of the previous web service and the input of the next web service and depicts the execution dependency and data dependency between web services; [0128] Table 4, composition logic (DAG) expresses in a data sheet; node, data dependence; [0136] line 1, Group paths; [0138] line 1, Group={[1-2-3-8], [1-4-5-7-8], [1-4-6-7-8]} (as sets of dependent compute kernels); [0140] lines 3-4, obtain an independent path group: Group={[1-2-3-8], [1-4-5-7-8], [1-4-6-7-8]}; [Examiner noted: the execution of the node/web service task (as compute kernel, see Fig. 8, node 3) in a set of dependent compute kernels ([1-2-3-8]) is depending on a previous execution of another computer kernel (node 3 is depending on node 2 in the set of [1-2-3-8]); and the set [1-2-3-8] is being independent of other set [1-4-5-7-8] since both sets don’t have dependency relationship, see Fig. 8]).
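For illustration only, the grouping quoted from SHI above can be sketched as a path enumeration over a DAG. The edge list below is an assumption inferred from the three paths listed in SHI [0138]; the function name `path_groups` and the depth-first traversal are hypothetical and are not taken from SHI's disclosure.

```python
from collections import defaultdict

# Assumed edge list, inferred from the paths [1-2-3-8], [1-4-5-7-8], [1-4-6-7-8].
EDGES = [(1, 2), (2, 3), (3, 8), (1, 4), (4, 5), (4, 6), (5, 7), (6, 7), (7, 8)]


def path_groups(edges):
    """Return every source-to-sink path of the DAG as a list of node ids."""
    successors = defaultdict(list)
    nodes, targets = set(), set()
    for src, dst in edges:
        successors[src].append(dst)
        nodes.update((src, dst))
        targets.add(dst)

    groups = []

    def walk(node, prefix):
        path = prefix + [node]
        next_nodes = sorted(successors.get(node, []))
        if not next_nodes:  # sink node: the path is complete
            groups.append(path)
            return
        for nxt in next_nodes:
            walk(nxt, path)

    # Source nodes are those that never appear as the target of an edge.
    for source in sorted(nodes - targets):
        walk(source, [])
    return groups


GROUPS = path_groups(EDGES)
print(GROUPS)  # [[1, 2, 3, 8], [1, 4, 5, 7, 8], [1, 4, 6, 7, 8]]
```

Within each resulting path, every node after the first depends on the node before it, which is the intra-group dependency noted by the Examiner above.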

It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined the teaching of Rossbach with SHI because SHI’s teaching of determining and analyzing independent path groups according to the DAG would have provided Rossbach’s system with the advantage and capability to execute the different independent path groups concurrently, thereby improving overall system performance and efficiency.

Rossbach and SHI fail to specifically teach that, when accessing the working set, the plurality of compute accelerators access a single copy of the working set over a network.

However, Turner teaches the plurality of compute accelerators access a single copy of a working set over a network (Turner, Fig. 3, 306c Hardware accelerator, 312 Coherent interconnect (as network (see [0045] lines 1-5, coherent interconnect 312 may be communicatively connected to the processing devices 302, 306a, 306b, 306c, and…shared memory 304)), 304 shared memory; [0006] lines 15-17, executing a remaining portion of the offloaded workload by the hardware accelerator; [0049] lines 5-7, The data for the offloaded workload may be stored in the processing device cache (e.g., processing device cache 308 in FIG. 6); [0051] lines 1-4, To transmit the data for the offloaded workload to the hardware accelerator 306, the processing device 302 may implement a cache flush maintenance operation 400 to write the data to the shared memory (as copy the data (as working set) into a shared memory); [0052] lines 5-13, offloading a portion of the workload to the hardware accelerator 306 may include data reads and writes by the hardware accelerator 306 accessing the processing device cache and/or the shared memory…The hardware accelerator 306 may execute the offloaded workload using the data retrieved from the processing device cache and/or the shared memory without needing to cache the data locally; also see [0001] lines 1-2, Hardware accelerators can be used to help a central processing unit (CPU) process workloads; [Examiner noted: hardware accelerators accessing the data that is copied/transmitted from the processing device cache (as a single copy of the working set) over a coherent interconnect (as network) for executing the workload without need to cache the data locally]).

It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined the teaching of Rossbach and SHI with Turner because Turner’s teaching of providing a shared memory for storing the data (from the processing device cache) that is needed for processing the portion of the workload for the hardware accelerator would have provided Rossbach and SHI’s system with the advantage and capability to easily manage the workload data for hardware accelerators, thereby improving system efficiency and performance.

Although Rossbach, SHI and Turner teach assigning the compute kernels to a respective one of the plurality of compute accelerators for execution, Rossbach, SHI and Turner fail to specifically teach that what is assigned and executed is the respective set of dependent compute kernels.

However, Ellis teaches that what is assigned and executed is the respective set of dependent compute kernels (Ellis, Fig. 5B, Task order, task dependency, DAG; Fig. 7D, DAG, worker 1, worker 2, worker 3 (each worker device has been assigned set of dependent tasks); Col 6, lines 38-40, FIG. 3 is a diagram of example components of a device 300, which may correspond to…worker device 240; Col 6, lines 48-50, Processor 320 may include a processor (e.g., a central processing unit, a graphics processing unit, an accelerated processing unit, etc.) (Each worker device as compute accelerator); Col 15, lines 41-44, assume that three worker devices, identified as worker device 1, worker device 2, and worker device 3, are available to perform the group of tasks, and that each of the three worker devices will be performing tasks simultaneously [Examiner noted: each group of tasks (as respective set of dependent compute kernels) is assigned to a respective worker device (as compute accelerator) for execution]).

It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined the teaching of Rossbach, SHI and Turner with Ellis because Ellis’s teaching of assigning the set of dependent tasks to each worker device would have provided Rossbach, SHI and Turner’s system with the advantage and capability to perform the tasks simultaneously, thereby improving system efficiency.
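For illustration only, the combination described above (whole sets of dependent tasks handed to individual worker devices) can be sketched as follows. The round-robin policy and the names `assign_groups`, `GROUPS` and `WORKERS` are assumptions made for this sketch; neither Ellis nor SHI is relied upon as disclosing this particular policy.

```python
def assign_groups(groups, workers):
    """Give each whole group of dependent tasks to one worker (round-robin)."""
    assignment = {worker: [] for worker in workers}
    for index, group in enumerate(groups):
        # A group is never split, so its internal dependencies stay on one worker.
        assignment[workers[index % len(workers)]].append(group)
    return assignment


# Groups as quoted from SHI [0138]; worker names as quoted from Ellis, Col 15.
GROUPS = [[1, 2, 3, 8], [1, 4, 5, 7, 8], [1, 4, 6, 7, 8]]
WORKERS = ["worker 1", "worker 2", "worker 3"]

ASSIGNMENT = assign_groups(GROUPS, WORKERS)
for worker, assigned in ASSIGNMENT.items():
    print(worker, assigned)
```

With three groups and three workers, each worker receives exactly one group, so the three groups can proceed simultaneously.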

As per claim 2, Rossbach, SHI, Turner and Ellis teach the invention according to claim 1 above. Ellis further teaches perform static analysis on an object code or a source code representation of the workload to identify the plurality of compute kernels (Ellis, Col 2, lines 27-30, program code that identifies three tasks (e.g., Task A, Task B, and Task C) that are to be performed by a group of N worker devices; Col 2, lines 42-48, determine a directed acyclic graph (DAG) based on the program code (e.g., the client device may display the DAG to the user, the client device may store information associated with the DAG, etc.), may execute the program code (e.g., the client device may send the program code to be added to a task queue associated with the group of N worker devices) [Examiner noted: perform static analysis (since the program code has not been executed) on program code to identify the different tasks (as plurality of compute kernels)]); and perform static analysis on the object code or the source code representation of the workload to identify dependencies between pairs of the plurality of compute kernels (Ellis, Col 2, lines 30-33, the program code may indicate an order (e.g., Task A → Task B → Task C) associated with the three tasks based on the program code; Col 2, lines 54-58, determine (e.g., based on information stored by the client device) the program code that identifies the three tasks, the order associated with the three tasks, and the dependencies associated with the three tasks).


As per claim 3, Rossbach, SHI, Turner and Ellis teach the invention according to claim 1 above. Rossbach further teaches determine that the respective one of the compute accelerators complies with a predefined criterion (Rossbach, [0048] lines 1-5, the scheduler 320 may determine an accelerator 120a-c of the available accelerators 350 that can support the selected accelerator task as determined by the accelerator parameters associated with the selected accelerator task; [0043] lines 2-10, each accelerator identified in the available accelerators 350 may have what is referred to herein as an associated strength. The strength of an accelerator may be a measure of the performance capabilities of the accelerator…Other performance indicators may be used to calculate the strength of the accelerator; [0051] lines 1-3, select the available accelerator with the greatest strength, or other criteria); 
select the respective one of the compute accelerators from the plurality of compute accelerators based on a determination that the respective one of the compute accelerators complies with the predefined criterion (Rossbach, [0048] lines 5-9, If multiple accelerators 120a-c of the available accelerators 350 can support (as complies) the accelerator task, then the scheduler 320 may select from the available accelerators 350 using one of a variety of accelerator 120a-c selection techniques; [0051] lines 1-3, select the available accelerator with the greatest strength, or other criteria); and 
send a respective compute kernels to the respective one of the compute accelerators (Rossbach, Fig. 1, 120a-120c accelerator; [0026] lines 10-13, the accelerator interface 140 may execute the accelerator tasks associated with the nodes 207 and 209 in parallel on the accelerators 120a and 120b).
In addition, SHI teaches set of dependent compute kernels (SHI, Fig. 7, DAG; Fig. 8, Independent path groups; Abstract, lines 5-6, obtaining (as identify) an independent path group according to the DAG composition logic; [0062] lines 2-7, The composition logic obtained by the web service composition engine is represented in the DAG mode. Each node denotes operations of a web service. An edge denotes the output of the previous web service and the input of the next web service and depicts the execution dependency and data dependency between web services; [0128] Table 4, composition logic (DAG) expresses in a data sheet; node, data dependence; [0136] line 1, Group paths; [0138] line 1, Group={[1-2-3-8], [1-4-5-7-8], [1-4-6-7-8]} (as sets of dependent compute kernels); [0140] lines 3-4, obtain an independent path group: Group={[1-2-3-8], [1-4-5-7-8], [1-4-6-7-8]}). Further, Ellis teaches when sending, it is send respective set of dependent compute kernels (Ellis, Fig. 5B, Task order, task dependency, DAG; Fig. 7D, DAG, worker 1, worker 2, worker 3 (each worker device has been assigned set of dependent tasks); Col 6, lines 38-40, FIG. 3 is a diagram of example components of a device 300, which may correspond to…worker device 240; Col 6, lines 48-50, Processor 320 may include a processor (e.g., a central processing unit, a graphics processing unit, an accelerated processing unit, etc.) (Each worker device as computer accelerator); Col 15, lines 41-44, assume that three worker devices, identified as worker device 1, worker device 2, and worker device 3, are available to perform the group of tasks, and that each of the three worker devices will be performing tasks simultaneously).

As per claim 5, Rossbach, SHI, Turner and Ellis teach the invention according to claim 1 above. Rossbach further teaches determine that a dependent compute kernel is performing a predefined computation (Rossbach, Fig. 2, 207, 209, 211; [0052] lines 1-9, with respect to FIG. 2, the scheduler 320 may be selecting an available accelerator to execute the accelerator task associated with the node 211. The accelerator 120a and the accelerator 120b may both be identified in the available accelerators 350. The accelerator 120a may have just completed executing the accelerator task associated with the node 207. As shown, the accelerator task associated with the node 211 uses data from the execution of the accelerator tasks associated with the nodes 207 and 209 [Examiner noted: node 211 is using the data from previous node 207 and 209 (as predefined computation)]); 
select the respective one of the compute accelerators from the plurality of compute accelerators based on a determination that the set of the dependent compute kernels is performing the predefined computation (Rossbach, [0052] lines 10-16, the scheduler 320 may select the accelerator 120a to execute the selected accelerator task, because the data that was generated by the accelerator task associated with the node 207 is already at the accelerator 120a from the previous execution and only data from the execution of the accelerator task associated with the node 209 may be copied to the accelerator 120a by the datablock manager 310); and 
send the dependent compute kernel to the respective one of the compute accelerators (Rossbach, Fig. 1, 120a-120c accelerator; [0026] lines 10-13, the accelerator interface 140 may execute the accelerator tasks associated with the nodes 207 and 209 in parallel on the accelerators 120a and 120b).
In addition, SHI teaches set of dependent compute kernels (SHI, Fig. 7, DAG; Fig. 8, Independent path groups; Abstract, lines 5-6, obtaining (as identify) an independent path group according to the DAG composition logic; [0062] lines 2-7, The composition logic obtained by the web service composition engine is represented in the DAG mode. Each node denotes operations of a web service. An edge denotes the output of the previous web service and the input of the next web service and depicts the execution dependency and data dependency between web services; [0128] Table 4, composition logic (DAG) expresses in a data sheet; node, data dependence; [0136] line 1, Group paths; [0138] line 1, Group={[1-2-3-8], [1-4-5-7-8], [1-4-6-7-8]} (as sets of dependent compute kernels); [0140] lines 3-4, obtain an independent path group: Group={[1-2-3-8], [1-4-5-7-8], [1-4-6-7-8]}). Further, Ellis teaches when sending, it is send the set of dependent compute kernels to the respective one of the compute accelerators (Ellis, Fig. 5B, Task order, task dependency, DAG; Fig. 7D, DAG, worker 1, worker 2, worker 3 (each worker device has been assigned set of dependent tasks); Col 6, lines 38-40, FIG. 3 is a diagram of example components of a device 300, which may correspond to…worker device 240; Col 6, lines 48-50, Processor 320 may include a processor (e.g., a central processing unit, a graphics processing unit, an accelerated processing unit, etc.) (Each worker device as computer accelerator); Col 15, lines 41-44, assume that three worker devices, identified as worker device 1, worker device 2, and worker device 3, are available to perform the group of tasks, and that each of the three worker devices will be performing tasks simultaneously).
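For illustration only, the selection described in Rossbach [0052], as quoted above, can be sketched as choosing the accelerator that already holds the most input data for the task. The function name `select_accelerator`, the data labels, and the assumption that accelerator 120b holds none of the inputs are hypothetical and introduced solely for this sketch.

```python
def select_accelerator(inputs, resident):
    """Pick the accelerator that needs the fewest input copies.

    inputs:   set of data items the task consumes
    resident: mapping of accelerator name -> set of data items already in its memory
    Ties are broken by accelerator name so the result is deterministic.
    """
    return min(sorted(resident), key=lambda acc: len(set(inputs) - resident[acc]))


# Node 211 consumes the outputs of nodes 207 and 209 (Rossbach, Fig. 2).
INPUTS_211 = {"out_207", "out_209"}

# Assumed residency: 120a has just finished node 207; 120b holds no input of node 211.
RESIDENT = {"120a": {"out_207"}, "120b": set()}

CHOSEN = select_accelerator(INPUTS_211, RESIDENT)
print(CHOSEN)  # 120a
```

Under these assumptions only the output of node 209 has to be copied to accelerator 120a, which matches the outcome described in the quoted passage.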

As per claim 6, Rossbach, SHI, Turner and Ellis teach the invention according to claim 5 above. Rossbach further teaches wherein the predefined computation involves a modification to a predefined resource (Rossbach, [0052] lines 7-9, The accelerator 120a may have just completed executing the accelerator task associated with the node 207. As shown, the accelerator task associated with the node 211 uses data from the execution of the accelerator tasks associated with the nodes 207 and 209 [Examiner noted: the processing of the node 211 involves the modification/multiplication of data (as predefined resource) from the node 207 and 209 due to the tasks dependency]).

As per claims 8-10 and 12-13, they are method claims of claims 1-3 and 5-6 respectively above. Therefore, they are rejected for the same reason as claims 1-3 and 5-6 respectively above.

As per claims 15-17 and 19, they are non-transitory, computer-readable medium claims of claims 1-3 and 5 respectively above. Therefore, they are rejected for the same reason as claims 1-3 and 5 respectively above.


Claims 4, 11 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Rossbach, SHI, Turner and Ellis, as applied to claims 3, 10 and 17 respectively above, and further in view of Schumacher et al. (US Patent. 10,713,404 B1).
Schumacher was cited in the previous Office Action.

As per claim 4, Rossbach, SHI, Turner and Ellis teach the invention according to claim 3 above. Rossbach further teaches the respective compute kernels sent to the respective one of the compute accelerators (Rossbach, Fig. 1, 120a-120c accelerator; [0026] lines 10-13, the accelerator interface 140 may execute the accelerator tasks associated with the nodes 207 and 209 in parallel on the accelerators 120a and 120b). In addition, SHI teaches set of dependent compute kernels (SHI, Fig. 7, DAG; Fig. 8, Independent path groups; Abstract, lines 5-6, obtaining (as identify) an independent path group according to the DAG composition logic; [0062] lines 2-7, The composition logic obtained by the web service composition engine is represented in the DAG mode. Each node denotes operations of a web service. An edge denotes the output of the previous web service and the input of the next web service and depicts the execution dependency and data dependency between web services; [0128] Table 4, composition logic (DAG) expresses in a data sheet; node, data dependence; [0136] line 1, Group paths; [0138] line 1, Group={[1-2-3-8], [1-4-5-7-8], [1-4-6-7-8]} (as sets of dependent compute kernels); [0140] lines 3-4, obtain an independent path group: Group={[1-2-3-8], [1-4-5-7-8], [1-4-6-7-8]}). 

Rossbach, SHI, Turner and Ellis fail to specifically teach that the respective compute kernels are encrypted.

However, Schumacher teaches that the respective compute kernels are encrypted (Schumacher, Col 7, lines 55-57, If the accelerator is a crypto-accelerator, the application may transmit a batch of encrypted data).

It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined the teaching of Rossbach, SHI, Turner and Ellis with Schumacher because Schumacher’s teaching of sending the encrypted data/tasks to the accelerator would have provided Rossbach, SHI, Turner and Ellis’s system with the advantage and capability to prevent malicious access to the tasks/data, thereby improving data security.

As per claim 11, it is a method claim of claim 4 above. Therefore, it is rejected for the same reason as claim 4 above.

As per claim 18, it is a non-transitory, computer-readable medium claim of claim 4 above. Therefore, it is rejected for the same reason as claim 4 above.


Claims 7, 14 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Rossbach, SHI, Turner and Ellis, as applied to claims 3, 8 and 15 respectively above, and further in view of Sen et al. (US Pub. 2018/0219797 A1).

As per claim 7, Rossbach, SHI, Turner and Ellis teach the invention according to claim 1 above. Rossbach teaches wherein the predefined criterion comprises the respective one of the compute accelerators being configured to access the working set (Rossbach, Fig. 1, 120a-c accelerators, 130a-c memory; Fig. 4, 405 b-c buffer; [0048] lines 1-5, the scheduler 320 may determine an accelerator 120a-c of the available accelerators 350 that can support the selected accelerator task as determined by the accelerator parameters associated with the selected accelerator task; [0043] lines 2-10, each accelerator identified in the available accelerators 350 may have what is referred to herein as an associated strength. The strength of an accelerator may be a measure of the performance capabilities of the accelerator…Other performance indicators may be used to calculate the strength of the accelerator; [0051] lines 1-3, select the available accelerator with the greatest strength, or other criteria; [0036] lines 5-14, before the accelerator 120a begins executing the accelerator task, the datablock manager 310 may determine if current versions of the data associated with the datablocks 201 and 203 are stored in buffers (as working set) in the memory 130a of the accelerator 120a, and if not, the datablock manager 310 may copy the current versions of the data to buffers in the memory 130a of the accelerator 120a. The datablock manager 310 may then update the pointers and/or indicators associated with the datablocks 201 and 203, and may allow the accelerator 120a to begin executing the accelerator task). In addition, Turner teaches that the access is to the single copy of the working set (Turner, Fig. 3, 306c Hardware accelerator; [0006] lines 15-17, executing a remaining portion of the offloaded workload by the hardware accelerator; [0049] lines 5-7, The data for the offloaded workload may be stored in the processing device cache (e.g., processing device cache 308 in FIG. 6); [0051] lines 1-4, To transmit the data for the offloaded workload to the hardware accelerator 306, the processing device 302 may implement a cache flush maintenance operation 400 to write the data to the shared memory (as copy the data (as working set) into a shared memory); [0052] lines 5-13, offloading a portion of the workload to the hardware accelerator 306 may include data reads and writes by the hardware accelerator 306 accessing the processing device cache and/or the shared memory…The hardware accelerator 306 may execute the offloaded workload using the data retrieved from the processing device cache and/or the shared memory without needing to cache the data locally).

Rossbach, SHI, Turner and Ellis fail to specifically teach that the access uses a remote direct memory access (RDMA) protocol.

However, Sen teaches that the access uses a remote direct memory access (RDMA) protocol (Sen, [0022] lines 24-25, the Remote Direct Memory Access (RDMA); [0030] lines 2-33, provide an interface for an application executed by the compute device 102 to an accelerator device 308 on an accelerator sled 104. The remote accelerator manager 406 may communicate through the host fabric interface 210 of the compute device 102 with the host fabric interface 310 of the accelerator sled 104 using any suitable protocol or technique, such as TCP, RDMA, RoCE, RoCEv2, iWARP, etc. …The data portion may include the data to be written or data that has been read, a program to be loaded into the accelerator device 308, etc. In some embodiments, the data portion may be embodied as a scatter-gather list, which may be used, for example, with RDMA to transport RDMA keys and leverage RDMA read/write for direct data transfer).

It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined the teaching of Rossbach, SHI, Turner and Ellis with Sen because Sen’s teaching of using the RDMA protocol would have provided Rossbach, SHI, Turner and Ellis’s system with the advantage and capability to lower data transfer latency, thereby improving system performance and efficiency.

As per claim 14, it is a method claim of claim 7 above. Therefore, it is rejected for the same reason as claim 7 above.

As per claim 20, it is a non-transitory, computer-readable medium claim of claim 7 above. Therefore, it is rejected for the same reason as claim 7 above.


Response to Arguments  
In the Remarks, Applicant argues in substance:
(a) Ellis does not disclose or suggest that "the plurality of compute accelerators access a single copy of a working set over a network, the working set comprising initialization data to initialize execution of the respective set of dependent compute kernels." The additions of Rossbach and Shi do not cure the deficiencies of Ellis regarding these elements.

Examiner respectfully disagrees with Applicant’s argument for the following reasons:
As to point (a), Examiner would like to point out that Rossbach clearly teaches the compute accelerator accesses a working set, the working set comprising initialization data to initialize execution of the compute kernels. For example, Rossbach teaches an accelerator system in which each accelerator has its own buffer (as working set) for storing the data (as initialization data) for executing the accelerator tasks (see Rossbach, Fig. 1, 120a-c accelerators, 130a-c memory; Fig. 4, 405 b-c buffer; [0029] lines 7-11, When executing an accelerator task at a particular accelerator 120a-c, the accelerator interface 140 may ensure that a current version of the data associated with a datablock used by the accelerator task is in a buffer at the particular accelerator 120a-c; [0036] lines 5-14, before the accelerator 120a begins executing the accelerator task, the datablock manager 310 may determine if current versions of the data (as initialization data) associated with the datablocks 201 and 203 are stored in buffers (as working set) in the memory 130a of the accelerator 120a…and may allow the accelerator 120a to begin executing the accelerator task (as accelerator initiating the execution by accessing/using the data (as initialization data) stored in the buffer (as working set))).

As cited above, the accelerator system of Rossbach does not recite the plurality of compute accelerators accessing a single copy of a working set over a network, but rather accessing the buffer (working set) within each compute accelerator for executing the accelerator task. However, the newly cited Turner reference specifically teaches the plurality of compute accelerators access a single copy of a working set over a network (see Turner, Fig. 3, 306c Hardware accelerator, 312 Coherent interconnect (as network (see [0045] lines 1-5, coherent interconnect 312 may be communicatively connected to the processing devices 302, 306a, 306b, 306c, and…shared memory 304)), 304 shared memory; [0006] lines 15-17, executing a remaining portion of the offloaded workload by the hardware accelerator; [0049] lines 5-7, The data for the offloaded workload may be stored in the processing device cache (e.g., processing device cache 308 in FIG. 6); [0051] lines 1-4, To transmit the data for the offloaded workload to the hardware accelerator 306, the processing device 302 may implement a cache flush maintenance operation 400 to write the data to the shared memory (as copy the data (as working set) into a shared memory); [0052] lines 5-13, offloading a portion of the workload to the hardware accelerator 306 may include data reads and writes by the hardware accelerator 306 accessing the processing device cache and/or the shared memory…The hardware accelerator 306 may execute the offloaded workload using the data retrieved from the processing device cache and/or the shared memory without needing to cache the data locally; also see [0001] lines 1-2, Hardware accelerators can be used to help a central processing unit (CPU) process workloads; [Examiner noted: hardware accelerators accessing the data that is copied/transmitted from the processing device cache (as a single copy of the working set) over a coherent interconnect (as network) for executing the workload without need to cache the data locally]).

It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Rossbach and SHI with Turner because Turner’s teaching of providing a shared memory for storing the data (from the processing device cache) that is needed by the hardware accelerator for processing the portion of the workload would have provided Rossbach and SHI’s system with the advantage and capability to easily manage the workload data for hardware accelerators, thereby improving system efficiency and performance. 

Further, Ellis is relied upon for teaching that what is assigned and executed is the respective set of dependent compute kernels (see Ellis, Fig. 5B, Task order, task dependency, DAG; Fig. 7D, DAG, worker 1, worker 2, worker 3 (each worker device has been assigned a set of dependent tasks); Col 6, lines 38-40, FIG. 3 is a diagram of example components of a device 300, which may correspond to…worker device 240; Col 6, lines 48-50, Processor 320 may include a processor (e.g., a central processing unit, a graphics processing unit, an accelerated processing unit, etc.) (each worker device as compute accelerator); Col 15, lines 41-44, assume that three worker devices, identified as worker device 1, worker device 2, and worker device 3, are available to perform the group of tasks, and that each of the three worker devices will be performing tasks simultaneously). Please refer to the rejection under 35 U.S.C. 103 above. Therefore, Applicant’s argument is not persuasive.

For the reasons above, Applicant’s argument has not been found to be persuasive, and therefore the rejections are maintained. 


Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ZUJIA XU whose telephone number is (571)272-0954.  The examiner can normally be reached on M-F 9:00-5:30 EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Meng-Ai An can be reached on (571) 272-3756.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/MENG AI T AN/Supervisory Patent Examiner, Art Unit 2195




/Z.X./Examiner, Art Unit 2195