DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This Office Action is in response to the Request for Continued Examination and Applicant's Amendment and Arguments filed on 08 October 2021.
Claims 1-20 are pending in this application.


Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 08 October, 2021 has been entered.


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 3 and 5-6 are rejected under 35 U.S.C. 103 as being unpatentable over Rossbach et al. (US. Pub. 2013/0232495 A1) in view of SHI et al. (US. Pub. 2010/0223591 A1) and further in view of Shi et al. (US Patent. 10,104,187 B2; hereafter Shi ‘187’) and Druyan (US Patent. 9,015,724 B2).
Rossbach and SHI were cited in the previous Office Action.

As per claim 1, Rossbach teaches the invention substantially as claimed including A system, comprising: 
a computing device comprising a processor and a memory (Rossbach, Fig. 10, 1000 (computing device), 1002 processing unit, 1004 system memory) ; and machine-readable instructions stored in the memory that, when executed by the processor, cause the computing device to at least (Rossbach, [0090] lines 1-2, Computer-executable instructions, such as program modules, being executed by a computer may be used): 
generate a directed acyclic graph (DAG) representing a workload assigned to a virtualized compute accelerator (Rossbach, Fig. 2, 200 (DAG); Fig. 3, 140 Accelerator interface (as virtualized compute accelerator); Abstract, lines 3-6, A graph is generated with a node corresponding to each of the accelerator tasks with edges that represent the data flow and data dependencies between the accelerator tasks; [0003] lines 2-3, encapsulate snippets of executable code of a program (as workload) into accelerator tasks; [0026] lines 1-4, the graph 200 provides information about dataflow and concurrency that may be used by the accelerator interface 140 to schedule the execution of the accelerator tasks on the accelerators 120a-c) wherein: 
the workload comprises a plurality of compute kernels and the DAG comprising a plurality of nodes and a plurality of edges (Rossbach, [0003] lines 2-3, encapsulate snippets of executable code of a program (as workload) into accelerator tasks (as plurality of compute kernels); Fig. 2, 207, 209, 211 (nodes), edges (between nodes); Abstract, lines 3-6, A graph is generated with a node corresponding to each of the accelerator tasks with edges that represent the data flow and data dependencies between the accelerator tasks), 
each of the nodes represents a respective compute kernel (Rossbach, [0003] lines 3-4, A graph is generated with a node corresponding to each of the accelerator tasks), 
each of the edges represents a dependency between a respective pair of the compute kernels (Rossbach, Abstract, lines 3-6, A graph is generated with a node corresponding to each of the accelerator tasks with edges that represent the data flow and data dependencies between the accelerator tasks), and 
the virtualized compute accelerator represents a logical interface for a plurality of compute accelerators (Rossbach, Fig. 3, 140 Accelerator interface (as virtualized compute accelerator), 350 available accelerators (as plurality of compute accelerators)); and
assign the compute kernels to a compute accelerator (Rossbach, Fig. 1, 120a-120c accelerator; [0026] lines 10-13, the accelerator interface 140 may execute the accelerator tasks associated with the nodes 207 and 209 in parallel on the accelerators 120a and 120b).

Rossbach fails to specifically teach analyze the DAG to identify sets of dependent compute kernels, a respective set of dependent compute kernels being independent of other sets of dependent compute kernels and execution of at least one compute kernel in the respective set of dependent compute kernels depending on a previous execution of another compute kernel in the respective set of dependent compute kernels.

However, SHI teaches analyze the DAG to identify sets of dependent compute kernels, a respective set of dependent compute kernels being independent of other sets of dependent compute kernels and execution of at least one compute kernel in the respective set of dependent compute kernels depending on a previous execution of another compute kernel in the respective set of dependent compute kernels (SHI, Fig. 7, DAG; Fig. 8, Independent path groups; Abstract, lines 5-6, obtaining (as identify) an independent path group according to the DAG composition logic; [0062] lines 2-7, The composition logic obtained by the web service composition engine is represented in the DAG mode. Each node denotes operations of a web service. An edge denotes the output of the previous web service and the input of the next web service and depicts the execution dependency and data dependency between web services; [0128] Table 4, composition logic (DAG) expresses in a data sheet; node, data dependence; [0136] line 1, Group paths; [0138] line 1, Group={[1-2-3-8], [1-4-5-7-8], [1-4-6-7-8]} (as sets of dependent compute kernels); [0140] lines 3-4, obtain an independent path group: Group={[1-2-3-8], [1-4-5-7-8], [1-4-6-7-8]}).
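
For illustration only, the path group that SHI [0138] is cited for in this action, Group={[1-2-3-8], [1-4-5-7-8], [1-4-6-7-8]}, can be reproduced by enumerating every source-to-sink path of a DAG. The following is a minimal sketch and is not taken from SHI: the edge list is an assumption inferred from the quoted paths rather than from Fig. 7, and the names `enumerate_paths`, `EDGES` and `PATH_GROUP` are hypothetical.

```python
from typing import Dict, List

def enumerate_paths(edges: Dict[int, List[int]], source: int, sink: int) -> List[List[int]]:
    """Return every path from source to sink in a DAG, found by depth-first walk."""
    paths: List[List[int]] = []

    def walk(node: int, prefix: List[int]) -> None:
        prefix = prefix + [node]          # copy so sibling branches do not share state
        if node == sink:
            paths.append(prefix)
            return
        for successor in edges.get(node, []):
            walk(successor, prefix)

    walk(source, [])
    return paths

# Assumed adjacency list, inferred from the three quoted paths (not from SHI Fig. 7).
EDGES = {1: [2, 4], 2: [3], 3: [8], 4: [5, 6], 5: [7], 6: [7], 7: [8]}

PATH_GROUP = enumerate_paths(EDGES, source=1, sink=8)
print(PATH_GROUP)
```

Within each enumerated path, every node after the first depends on the node before it, which is the sense in which each path is mapped to a set of dependent compute kernels.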

It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined the teaching of Rossbach with SHI because SHI’s teaching of determining and analyzing independent path groups according to the DAG would have provided Rossbach’s system with the advantage and capability to execute the different independent path groups concurrently, which improves the overall performance and efficiency of the system.

Rossbach and SHI fail to specifically teach determine a set-specific computing resource profile for a particular set of dependent compute kernels by an analysis of resource requirements specified by a set of nodes corresponding to the particular set of dependent compute kernels, the set-specific computing resource profile comprising a total memory and a total processing power determined for the set of dependent compute kernels as a whole; and, when assigning the compute kernels to a compute accelerator, assign the particular set of dependent compute kernels to a particular compute accelerator comprising available resources comprising at least the total memory and the total processing power for the set of dependent compute kernels.

However, Shi ‘187’ teaches determine a set-specific computing resource profile for a particular set of dependent compute kernels by an analysis of resource requirements specified by a set of nodes corresponding to the particular set of dependent compute kernels (Shi ‘187’, Fig. 3, 300, B:15, D:30, E:25 (as set of nodes); Fig. 4, 402 (including B, D, E) (as a particular set of dependent compute kernels); Col 3, lines 7-11, a cost associated with the one or more composite subsets of the services may be identified. To this end, the one or more composite subsets of the services may be output to the at least service node, based on the cost. Optionally, the cost may include a cost in resources; Col 8, lines 20-23, identify a cost (in terms of resources, delay, efficiency and/or other factors) associated with each of the sub-graphs 218 (as determine a set-specific computing resource profile for a particular set of dependent compute kernels); Col 9, lines 20-26, the service graph 300 includes a plurality of services 302 each including an identifier 306 (e.g. A, B, C, D, etc.) as well as a numerical cost 308 (e.g. 20, 50, 15, etc.) associated therewith. As mentioned earlier, such numerical cost 308 may include a cost in terms of resources necessary for executing the associated service 302. Also included are a plurality of interdependencies 304 that each connect two services 302; Col 10, lines 18-21, the services 302 are divided into sub-graphs 402 (as a particular set of dependent compute kernels), in the manner shown. In FIG. 4, Service B, D and E are grouped together and the sub-graphs 402 thus represent a composite service that is created because of the sum of costs associated with multiple services; Fig. 5, B: 70 (cost: 70, (combined from 15, 30 and 25) as set-specific computing resource profile determined by an analysis of resource requirements specified by a set of nodes)), the set-specific computing resource profile comprising a total resource determined for the set of dependent compute kernels as a whole (Shi ‘187’, Col 3, lines 10-13, the cost may include a cost in resources (e.g. data processing, network bandwidth, storage capacity, and/or input/output (I/O) resources, etc.); Col 6, lines 10-11, the composite subset(s) of the services may include information on a sum of costs associated with multiple services), and
assign the particular set of dependent compute kernels to a particular compute accelerator comprising available resources comprising at least the total resource for the set of dependent compute kernels (Shi ‘187’, Fig. 2, 218A, 218B, 218C, sub-graph assigned to OASN 210A-C; Col 11, lines 6-8, all services inside one of the composite sub-graphs 502 may be assigned to a single OASN (compute accelerator was taught by Rossbach); Col 16, lines 35-44, determining a composite resource requirement for the composite service, the composite resource requirement being determined based on individual resource requirements of each service included in the composite service; determining a resource capacity of the single service node; and assigning the composite service to the single service node, based on the determined composite resource requirement and the determined resource capacity (as including at least the total resource for the set of dependent compute kernels)).
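
For illustration only, the cited arithmetic (costs 15, 30 and 25 for services B, D and E combining to a composite cost of 70) and the capacity-based assignment to a single service node can be sketched as follows. This is a minimal sketch and not Shi ‘187’’s implementation: the node names, the capacity values and the first-fit selection rule are assumptions, and `composite_cost` and `select_node` are hypothetical helper names.

```python
from typing import Dict, Iterable, Optional

def composite_cost(costs: Dict[str, int], members: Iterable[str]) -> int:
    """Sum the individual resource costs of the services in one sub-graph."""
    return sum(costs[m] for m in members)

def select_node(capacities: Dict[str, int], required: int) -> Optional[str]:
    """Return the first node whose capacity covers the composite cost, else None."""
    for node, capacity in capacities.items():
        if capacity >= required:
            return node
    return None

# Individual costs as cited from Shi '187, Fig. 3.
SERVICE_COSTS = {"B": 15, "D": 30, "E": 25}
COMPOSITE = composite_cost(SERVICE_COSTS, ["B", "D", "E"])

# Hypothetical node capacities, chosen only to exercise the comparison.
NODE_CAPACITY = {"OASN-1": 60, "OASN-2": 100}
TARGET = select_node(NODE_CAPACITY, COMPOSITE)
print(COMPOSITE, TARGET)
```

Under these assumed capacities the first node is skipped because 60 is less than the composite cost, and the whole sub-graph is placed on the second node.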

It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined the teaching of Rossbach and SHI with Shi ‘187’ because Shi ‘187’’s teaching of determining the total resource cost/requirement for the corresponding sub-graph (as set of dependent compute kernels) and assigning each sub-graph to a node that has enough resources for execution would have provided Rossbach and SHI’s system with the advantage and capability to easily determine the resource requirement for the set of tasks and to prevent potential system failure due to resource shortage, which improves system resource utilization and performance.

Rossbach, SHI and Shi ‘187’ fail to specifically teach the total resource including a total memory and a total processing power, and the particular compute accelerator comprising available resources comprising at least the total memory and the total processing power.

However, Druyan teaches the total resource including total memory and a total processing power (Druyan, Fig. 2, 206 Determine job characteristics; Col 4, lines 29-34, The computational resource requirements for each node are defined as the job characteristic. Such requirements may include the number of CPUs required (as total processing power) in a node, the amount of memory required (as total memory), special-purpose hardware required such as network adapters, and any other special resources required), and
a particular compute accelerator comprising available resources comprising at least the total memory and the total processing power (Druyan, Fig. 2, 208 determine nodes for job execution, 210 dispatch job to nodes for execution; Col 8, lines 36-44, selecting an available node and determining from the scheduler records further containing a combination of job characteristics that can be concurrently executed successfully by that node whether the node can be assigned to the job being currently dispatched wherein the combination of job characteristics that can be concurrently executed successfully by that node include the computer resources available such as number of CPUs, amount of node memory and/or memory restrictions for that node when other jobs are being concurrently executed on that node).
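
For illustration only, the combined limitation (a profile holding both a total memory and a total processing power, checked against what a candidate node has available) can be sketched as follows. This is a minimal sketch under stated assumptions and is not drawn from Druyan or from the application: all numeric values are hypothetical, processing power is approximated as a CPU count, and `Resources`, `total_profile` and `first_fit` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional

@dataclass(frozen=True)
class Resources:
    memory_mb: int   # memory requirement or availability
    cpus: int        # processing power, approximated here as a CPU count

def total_profile(requirements: Iterable[Resources]) -> Resources:
    """Aggregate per-kernel requirements into one profile for the whole set."""
    reqs = list(requirements)
    return Resources(sum(r.memory_mb for r in reqs), sum(r.cpus for r in reqs))

def first_fit(available: Dict[str, Resources], needed: Resources) -> Optional[str]:
    """Return the first accelerator that satisfies both memory and CPU totals."""
    for name, free in available.items():
        if free.memory_mb >= needed.memory_mb and free.cpus >= needed.cpus:
            return name
    return None

# Hypothetical per-kernel requirements for one set of dependent kernels.
KERNEL_SET = [Resources(512, 1), Resources(1024, 2), Resources(256, 1)]
PROFILE = total_profile(KERNEL_SET)

# Hypothetical accelerators: acc0 has enough memory but too few CPUs.
ACCELERATORS = {"acc0": Resources(2048, 2), "acc1": Resources(4096, 8)}
CHOSEN = first_fit(ACCELERATORS, PROFILE)
print(PROFILE, CHOSEN)
```

The example is arranged so that a memory-only check would accept the first accelerator, while the two-part check rejects it, which is the distinction the combined limitation turns on.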

It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined the teaching of Rossbach, SHI and Shi ‘187’ with Druyan because Druyan’s teaching of the resource including memory and processing power, and of assigning jobs to a node that has enough processing power and memory for execution, would have provided Rossbach, SHI and Shi ‘187’’s system with the advantage and capability to utilize system resources efficiently, which improves system performance.

As per claim 3, Rossbach, SHI, Shi ‘187’ and Druyan teach the invention according to claim 1 above. Rossbach further teaches determine that the respective one of the compute accelerators complies with a predefined criterion (Rossbach, [0048] lines 1-5, the scheduler 320 may determine an accelerator 120a-c of the available accelerators 350 that can support the selected accelerator task as determined by the accelerator parameters associated with the selected accelerator task; [0043] lines 2-10, each accelerator identified in the available accelerators 350 may have what is referred to herein as an associated strength. The strength of an accelerator may be a measure of the performance capabilities of the accelerator…Other performance indicators may be used to calculate the strength of the accelerator; [0051] lines 1-3, select the available accelerator with the greatest strength, or other criteria); 
select the respective one of the compute accelerators from the plurality of compute accelerators based on a determination that the respective one of the compute accelerators complies with the predefined criterion (Rossbach, [0048] lines 5-9, If multiple accelerators 120a-c of the available accelerators 350 can support (as complies) the accelerator task, then the scheduler 320 may select from the available accelerators 350 using one of a variety of accelerator 120a-c selection techniques; [0051] lines 1-3, select the available accelerator with the greatest strength, or other criteria); and 
send a respective compute kernels to the respective one of the compute accelerators (Rossbach, Fig. 1, 120a-120c accelerator; [0026] lines 10-13, the accelerator interface 140 may execute the accelerator tasks associated with the nodes 207 and 209 in parallel on the accelerators 120a and 120b).
In addition, SHI teaches set of dependent compute kernels (SHI, Fig. 7, DAG; Fig. 8, Independent path groups; Abstract, lines 5-6, obtaining (as identify) an independent path group according to the DAG composition logic; [0062] lines 2-7, The composition logic obtained by the web service composition engine is represented in the DAG mode. Each node denotes operations of a web service. An edge denotes the output of the previous web service and the input of the next web service and depicts the execution dependency and data dependency between web services; [0128] Table 4, composition logic (DAG) expresses in a data sheet; node, data dependence; [0136] line 1, Group paths; [0138] line 1, Group={[1-2-3-8], [1-4-5-7-8], [1-4-6-7-8]} (as sets of dependent compute kernels); [0140] lines 3-4, obtain an independent path group: Group={[1-2-3-8], [1-4-5-7-8], [1-4-6-7-8]}). 
Further, Shi ‘187’ teaches that what is sent is the respective set of dependent compute kernels (Shi ‘187’, Fig. 2, 218A, 218B, 218C, sub-graph assigned to OASN 210A-C; Col 11, lines 6-8, all services inside one of the composite sub-graphs 502 may be assigned to a single OASN (compute accelerator was taught by Rossbach); Col 16, lines 35-44, determining a composite resource requirement for the composite service, the composite resource requirement being determined based on individual resource requirements of each service included in the composite service; determining a resource capacity of the single service node; and assigning the composite service to the single service node, based on the determined composite resource requirement and the determined resource capacity).

As per claim 5, Rossbach, SHI, Shi ‘187’ and Druyan teach the invention according to claim 1 above. Rossbach further teaches determine that a dependent compute kernel is performing a predefined computation (Rossbach, Fig. 2, 207, 209, 211; [0052] lines 1-9, with respect to FIG. 2, the scheduler 320 may be selecting an available accelerator to execute the accelerator task associated with the node 211. As shown, the accelerator task associated with the node 211 uses data from the execution of the accelerator tasks associated with the nodes 207 and 209 [Examiner noted: node 211 is using the data from previous nodes 207 and 209 (as predefined computation)]); 
select the respective one of the compute accelerators from the plurality of compute accelerators based on a determination that the set of the dependent compute kernels is performing the predefined computation (Rossbach, [0052] lines 10-16, the scheduler 320 may select the accelerator 120a to execute the selected accelerator task, because the data that was generated by the accelerator task associated with the node 207 is already at the accelerator 120a from the previous execution and only data from the execution of the accelerator task associated with the node 209 may be copied to the accelerator 120a by the datablock manager 310); and 
send the dependent compute kernel to the respective one of the compute accelerators (Rossbach, Fig. 1, 120a-120c accelerator; [0026] lines 10-13, the accelerator interface 140 may execute the accelerator tasks associated with the nodes 207 and 209 in parallel on the accelerators 120a and 120b).
In addition, SHI teaches set of dependent compute kernels (SHI, Fig. 7, DAG; Fig. 8, Independent path groups; Abstract, lines 5-6, obtaining (as identify) an independent path group according to the DAG composition logic; [0062] lines 2-7, The composition logic obtained by the web service composition engine is represented in the DAG mode. Each node denotes operations of a web service. An edge denotes the output of the previous web service and the input of the next web service and depicts the execution dependency and data dependency between web services; [0128] Table 4, composition logic (DAG) expresses in a data sheet; node, data dependence; [0136] line 1, Group paths; [0138] line 1, Group={[1-2-3-8], [1-4-5-7-8], [1-4-6-7-8]} (as sets of dependent compute kernels); [0140] lines 3-4, obtain an independent path group: Group={[1-2-3-8], [1-4-5-7-8], [1-4-6-7-8]}). 
Further, Shi ‘187’ teaches that what is sent to the respective one of the compute accelerators is the set of dependent compute kernels (Shi ‘187’, Fig. 2, 218A, 218B, 218C, sub-graph assigned to OASN 210A-C; Col 11, lines 6-8, all services inside one of the composite sub-graphs 502 may be assigned to a single OASN (compute accelerator was taught by Rossbach); Col 16, lines 35-44, determining a composite resource requirement for the composite service, the composite resource requirement being determined based on individual resource requirements of each service included in the composite service; determining a resource capacity of the single service node; and assigning the composite service to the single service node, based on the determined composite resource requirement and the determined resource capacity).

As per claim 6, Rossbach, SHI, Shi ‘187’ and Druyan teach the invention according to claim 5 above. Rossbach further teaches wherein the predefined computation involves a modification to a predefined resource (Rossbach, [0052] lines 7-9, The accelerator 120a may have just completed executing the accelerator task associated with the node 207. As shown, the accelerator task associated with the node 211 uses data from the execution of the accelerator tasks associated with the nodes 207 and 209 [Examiner noted: the processing of the node 211 involves the modification/multiplication of data (as predefined resource) from the node 207 and 209 due to the tasks dependency]).


Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over Rossbach, SHI, Shi ‘187’ and Druyan, as applied to claim 1 above, and further in view of Ellis et al. (US. Patent 9,244,652 B1).
Ellis was cited in the previous Office Action.

As per claim 2, Rossbach, SHI, Shi ‘187’ and Druyan teach the invention according to claim 1 above. Rossbach, SHI, Shi ‘187’ and Druyan fail to specifically teach perform static analysis on an object code or a source code representation of the workload to identify the plurality of compute kernels; and perform static analysis on the object code or the source code representation of the workload to identify dependencies between pairs of the plurality of compute kernels.

However, Ellis teaches perform static analysis on an object code or a source code representation of the workload to identify the plurality of compute kernels (Ellis, Col 2, lines 27-30, program code that identifies three tasks (e.g., Task A, Task B, and Task C) that are to be performed by a group of N worker devices; Col 2, lines 42-48, determine a directed acyclic graph (DAG) based on the program code (e.g., the client device may display the DAG to the user, the client device may store information ; and perform static analysis on the object code or the source code representation of the workload to identify dependencies between pairs of the plurality of compute kernels (Ellis, Col 2, lines 30-33, the program code may indicate an order (e.g., Task A → Task B → Task C) associated with the three tasks based on the program code; Col 2, lines 54-58, determine (e.g., based on information stored by the client device) the program code that identifies the three tasks, the order associated with the three tasks, and the dependencies associated with the three tasks).

It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined the teaching of Rossbach, SHI, Shi ‘187’ and Druyan with Ellis because Ellis’s teaching of determining the directed acyclic graph (DAG) based on the program code to identify the task dependencies would have provided Rossbach, SHI, Shi ‘187’ and Druyan’s system with the advantage and capability to easily determine the task dependencies, which allows the system to perform tasks simultaneously and improves system efficiency.


Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Rossbach, SHI, Shi ‘187’ and Druyan, as applied to claim 3 above, and further in view of Schumacher et al. (US Patent. 10,713,404 B1).
Schumacher was cited in the previous Office Action.

As per claim 4, Rossbach, SHI, Shi ‘187’ and Druyan teach the invention according to claim 3 above. Rossbach further teaches the respective compute kernels sent to the respective one of the compute accelerators (Rossbach, Fig. 1, 120a-120c accelerator; [0026] lines 10-13, the accelerator interface 140 may execute the accelerator tasks associated with the nodes 207 and 209 in parallel on the accelerators 120a and 120b). In addition, SHI teaches set of dependent compute kernels (SHI, Fig. 7, DAG; Fig. 8, Independent path groups; Abstract, lines 5-6, obtaining (as identify) an independent path group according to the DAG composition logic; [0062] lines 2-7, The composition logic obtained by the web service composition engine is represented in the DAG mode. Each node denotes operations of a web service. An edge denotes the output of the previous web service and the input of the next web service and depicts the execution dependency and data dependency between web services; [0128] Table 4, composition logic (DAG) expresses in a data sheet; node, data dependence; [0136] line 1, Group paths; [0138] line 1, Group={[1-2-3-8], [1-4-5-7-8], [1-4-6-7-8]} (as sets of dependent compute kernels); [0140] lines 3-4, obtain an independent path group: Group={[1-2-3-8], [1-4-5-7-8], [1-4-6-7-8]}). 

Rossbach, SHI, Shi ‘187’ and Druyan fail to specifically teach the respective compute kernels are encrypted.

However, Schumacher teaches the respective compute kernels are encrypted (Schumacher, Col 7, lines 55-57, If the accelerator is a crypto-accelerator, the application may transmit a batch of encrypted data).

It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined the teaching of Rossbach, SHI, Shi ‘187’ and Druyan with Schumacher because Schumacher’s teaching of sending encrypted data/tasks to the accelerator would have provided Rossbach, SHI, Shi ‘187’ and Druyan’s system with the advantage and capability to prevent malicious access to the tasks/data, which improves data security. 


Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Rossbach, SHI, Shi ‘187’ and Druyan, as applied to claim 3 above, and further in view of Turner et al. (US Pub. 2018/0081804 A1) and Sen et al. (US Pub. 2018/0219797 A1).
Turner and Sen were cited in the previous Office Action.

As per claim 7, Rossbach, SHI, Shi ‘187’ and Druyan teach the invention according to claim 3 above. Rossbach teaches wherein the predefined criterion comprises the respective one of the compute accelerators being configured to access the working set (Rossbach, Fig. 1, 120a-c accelerators, 130a-c memory; Fig. 4, 405 b-c buffer; [0048] lines 1-5, the scheduler 320 may determine an accelerator 120a-c of the available accelerators 350 that can support the selected accelerator task as determined by the accelerator parameters associated with the selected accelerator task; [0043] lines 2-10, each accelerator identified in the available accelerators 350 may have what is referred to herein as an associated strength. The strength of an accelerator may be a measure of the performance capabilities of the accelerator…Other performance indicators may be used to calculate the strength of the accelerator; [0051] lines 1-3, select the available accelerator with the greatest strength, or other criteria; [0036] lines 5-14, before the accelerator 120a begins executing the accelerator task, the datablock manager 310 may determine if current versions of the data associated with the datablocks 201 and 203 are stored in buffers (as working set) in the memory 130a of the accelerator 120a, and if not, the datablock manager 310 may copy the current versions of the data to buffers in the memory 130a of the accelerator 120a.  The datablock manager 310 may then update the pointers and/or indicators associated with the datablocks 201 and 203, and may allow the accelerator 120a to begin executing the accelerator task). 

Rossbach, SHI, Shi ‘187’ and Druyan fail to specifically teach when accessing, it is to access a single copy of the working set.


However, Turner teaches when accessing, it is to access a single copy of the working set (Turner, Fig. 3, 306c Hardware accelerator; [0006] lines 15-17, executing a remaining portion of the offloaded workload by the hardware accelerator; [0049] lines 5-7, The data for the offloaded workload may be stored in the processing device cache (e.g., processing device cache 308 in FIG. 6); [0051] lines 1-4, To transmit the data for the offloaded workload to the hardware accelerator 306, the processing device 302 may implement a cache flush maintenance operation 400 to write the data to the shared memory (as copy the data (as working set) into a shared memory); [0052] lines 5-13, offloading a portion of the workload to the hardware accelerator 306 may include data reads and writes by the hardware accelerator 306 accessing the processing device cache and/or the shared memory…The hardware accelerator 306 may execute the offloaded workload using the data retrieved from the processing device cache and/or the shared memory without needing to cache the data locally).

It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined the teaching of Rossbach, SHI, Shi ‘187’ and Druyan with Turner because Turner’s teaching of providing a shared memory for storing the data (from the processing device cache) needed for processing the portion of the workload for the hardware accelerator would have provided Rossbach, SHI, Shi ‘187’ and Druyan’s system with the advantage and capability to easily manage the workload data for hardware accelerators, which improves system efficiency and performance. 

Rossbach, SHI, Shi ‘187’, Druyan and Turner fail to specifically teach when accessing, it is to use a remote direct memory access (RDMA) protocol.

However, Sen teaches when accessing, it is to use a remote direct memory access (RDMA) protocol (Sen, [0022] lines 24-25, the Remote Direct Memory Access (RDMA); [0030] lines 2-33, provide an interface for an application executed by the compute device 102 to an accelerator device 308 on an accelerator sled 104. The remote accelerator manager 406 may communicate through the host fabric interface 210 of the compute device 102 with the host fabric interface 310 of the accelerator sled 104 using any suitable protocol or technique, such as TCP, RDMA, RoCE, RoCEv2, iWARP, etc. …The data portion may include the data to be written or data that has been read, a program to be loaded into the accelerator device 308, etc. In some embodiments, the data portion may be embodied as a scatter-gather list, which may be used, for example, with RDMA to transport RDMA keys and leverage RDMA read/write for direct data transfer).

It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined the teaching of Rossbach, SHI, Shi ‘187’, Druyan and Turner with Sen because Sen’s teaching of using the RDMA protocol would have provided Rossbach, SHI, Shi ‘187’, Druyan and Turner’s system with the advantage and capability to lower data transfer latency, which improves system performance and efficiency. 

Claims 8, 10, 12-13, 15, 17 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Rossbach et al. (US. Pub. 2013/0232495 A1) in view of SHI et al. (US. Pub. 2010/0223591 A1) and further in view of Shi et al. (US Patent. 10,104,187 B2; hereafter Shi ‘187’).
Rossbach and SHI were cited in the previous Office Action.

As per claim 8, Rossbach teaches the invention substantially as claimed including A method, comprising: 
generating, by a computing device, a directed acyclic graph (DAG) representing a workload assigned to a virtualized compute accelerator (Rossbach, Fig. 10, 1000 (computing device), Fig. 2, 200 (DAG); Fig. 3, 140 Accelerator interface (as virtualized compute accelerator); Abstract, lines 3-6, A graph is generated with a node corresponding to each of the accelerator tasks with edges that represent the data flow and data dependencies between the accelerator tasks; [0003] lines 2-3, encapsulate snippets of executable code of a program (as workload) into accelerator tasks; [0026] lines 1-4, the graph 200 provides information about dataflow and concurrency that may be used by the accelerator interface 140 to schedule the execution of the accelerator tasks on the accelerators 120a-c) wherein: 
the workload comprises a plurality of compute kernels and the DAG comprising a plurality of nodes and a plurality of edges (Rossbach, [0003] lines 2-3, encapsulate snippets of executable code of a program (as workload) into accelerator tasks (as plurality of compute kernels); Fig. 2, 207, 209, 211 (nodes), edges (between nodes); Abstract, lines 3-6, A graph is generated with a node corresponding to each of the accelerator tasks with edges that represent the data flow and data dependencies between the accelerator tasks), 
each of the nodes represents a respective compute kernel (Rossbach, [0003] lines 3-4, A graph is generated with a node corresponding to each of the accelerator tasks), 
each edge represents a dependency between a respective pair of the compute kernels (Rossbach, Abstract, lines 3-6, A graph is generated with a node corresponding to each of the accelerator tasks with edges that represent the data flow and data dependencies between the accelerator tasks), and 
the virtualized compute accelerator represents a logical interface for a plurality of compute accelerators (Rossbach, Fig. 3, 140 Accelerator interface (as virtualized compute accelerator), 350 available accelerators (as plurality of compute accelerators)); and
assign, by the computing device, the compute kernels to a compute accelerator (Rossbach, Fig. 1, 120a-120c accelerator; [0026] lines 10-13, the accelerator interface 140 may execute the accelerator tasks associated with the nodes 207 and 209 in parallel on the accelerators 120a and 120b).

Rossbach fails to specifically teach analyzing, by the computing device, the DAG to identify sets of dependent compute kernels, a respective set of dependent compute kernels being independent of other sets of dependent compute kernels and execution of at least one compute kernel in the respective set of dependent compute kernels depending on a previous execution of another compute kernel in the respective set of dependent compute kernels.

However, SHI teaches analyzing, by the computing device, the DAG to identify sets of dependent compute kernels, a respective set of dependent compute kernels being independent of other sets of dependent compute kernels and execution of at least one compute kernel in the respective set of dependent compute kernels depending on a previous execution of another compute kernel in the respective set of dependent compute kernels (SHI, Fig. 7, DAG; Fig. 8, Independent path groups; Abstract, lines 5-6, obtaining (as identify) an independent path group according to the DAG composition logic; [0062] lines 2-7, The composition logic obtained by the web service composition engine is represented in the DAG mode. Each node denotes operations of a web service. An edge denotes the output of the previous web service and the input of the next web service and depicts the execution dependency and data dependency between web services; [0128] Table 4, composition logic (DAG) expresses in a data sheet; node, data dependence; [0136] line 1, Group paths; [0138] line 1, Group={[1-2-3-8], [1-4-5-7-8], [1-4-6-7-8]} (as sets of dependent compute kernels); [0140] lines 3-4, obtain an independent path group: Group={[1-2-3-8], [1-4-5-7-8], [1-4-6-7-8]}; [Examiner noted: the execution of the node/web service task (as compute kernel, see Fig. 8, node 3) in a set of dependent compute kernels ([1-2-3-8]) depends on a previous execution of another compute kernel (node 3 depends on node 2 in the set [1-2-3-8]); and the set [1-2-3-8] is independent of the other set [1-4-5-7-8] since the two sets have no dependency relationship with each other, see Fig. 8]).
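
For illustration only, the path grouping cited from SHI can be sketched as an enumeration of every source-to-sink path of the DAG. The edge list below is an assumption inferred from the group listing at SHI [0138], and the function name `enumerate_paths` is hypothetical; the sketch is not SHI's implementation.

```python
# Edges are an assumption inferred from
# Group={[1-2-3-8], [1-4-5-7-8], [1-4-6-7-8]} (SHI [0138]).
EDGES = [(1, 2), (2, 3), (3, 8), (1, 4), (4, 5), (4, 6), (5, 7), (6, 7), (7, 8)]


def enumerate_paths(edges):
    """Return every source-to-sink path of a DAG as a list of node lists."""
    successors = {}
    nodes = set()
    has_predecessor = set()
    for src, dst in edges:
        successors.setdefault(src, []).append(dst)
        nodes.update((src, dst))
        has_predecessor.add(dst)

    # A source is a node that no edge points to.
    sources = sorted(n for n in nodes if n not in has_predecessor)
    paths = []

    def walk(node, prefix):
        path = prefix + [node]
        next_nodes = successors.get(node, [])
        if not next_nodes:  # sink reached: the path is complete
            paths.append(path)
            return
        for nxt in next_nodes:
            walk(nxt, path)

    for source in sources:
        walk(source, [])
    return paths


groups = enumerate_paths(EDGES)
for group in groups:
    print("-".join(str(n) for n in group))
```

Within each path, a node runs only after the node before it, which is the dependency relied upon in the Examiner's note above (node 3 depends on node 2 in [1-2-3-8]).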

It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined the teaching of Rossbach with SHI because SHI’s teaching of determining and analyzing independent path groups according to the DAG would have provided Rossbach’s system with the advantage and capability of executing the different independent path groups concurrently, which improves overall system performance and efficiency.

Rossbach and SHI fail to specifically teach determining, by the computing device, a set-specific computing resource profile for a particular set of dependent compute kernels by an analysis of resource requirements specified by a set of nodes corresponding to the particular set of dependent compute kernels; and, when assigning the compute kernels to a compute accelerator, assigning, by the computing device, the particular set of dependent compute kernels to a particular one of the plurality of compute accelerators comprising available resources corresponding to the set-specific computing resource profile.

However, Shi ‘187’ teaches determining, by the computing device, a set-specific computing resource profile for a particular set of dependent compute kernels by an analysis of resource requirements specified by a set of nodes corresponding to the particular set of dependent compute kernels (Shi ‘187’, Fig. 3, 300, B:15, D:30, E:25 (as set of nodes); Fig. 4, 402 (including B, D, E) (as a particular set of dependent compute kernels); Col 3, lines 7-11, a cost associated with the one or composite subsets of the services may be identified. To this end, the one or more composite subsets of the services may be output to the at least service node, based on the cost. Optionally, the cost may include a cost in resources; Col 6, lines 10-11, the composite subset(s) of the services may include information on a sum of costs associated with multiple services; Col 8, lines 20-23, identify a cost (in terms of resources, delay, efficiency and/or other factors) associated with each of the sub-graphs 218 (as determine a set-specific computing resource profile for a particular set of dependent compute kernels); Col 9, lines 20-26, the service graph 300 includes a plurality of services 302 each including an identifier 306 (e.g. A, B, C, D, etc.) as well as a numerical cost 308 (e.g. 20, 50, 15, etc.) associated therewith. As mentioned earlier, such numerical cost 308 may include a cost in terms of resources necessary for executing the associated service 302. Also included are a plurality of interdependencies 304 that each connect two services 302; Col 10, lines 18-21, the services 302 are divided into sub-graphs 402 (as a particular set of dependent compute kernels), in the manner shown. In FIG. 4, Service B, D and E are grouped together and the sub-graphs 402 thus represent a composite service that is created because of the interdependencies represented via Edge BD and Edge BE; also see Col 6, lines 10-11, the composite subset(s) of the services may include information on a sum of costs associated with multiple services; Fig. 5, B: 70 (cost: 70, (combined from 15, 30 and 25) as set-specific computing resource profile determined by an analysis of resource requirements specified by a set of nodes)), and
assign, by the computing device, the particular set of dependent compute kernels to a particular one of the plurality of compute accelerators comprising available resources corresponding to the set-specific computing resource profile (Shi ‘187’, Fig. 2, 202 (as computing device), 218A, 218B, 218C, sub-graph assigned to OASN 210A-C; Fig. 7, 700; Col 11, lines 6-8, all services inside one of the composite sub-graphs 502 may be assigned to a single OASN (compute accelerator was taught by Rossbach); Col 16, lines 35-44, determining a composite resource requirement for the composite service, the composite resource requirement being determined based on individual resource requirements of each service included in the composite service; determining a resource capacity of the single service node; and assigning the composite service to the single service node, based on the determined composite resource requirement and the determined resource capacity (as including at least the total resource for the set of dependent compute kernels)).
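
For illustration only, the composite cost relied upon above (B: 15, D: 30 and E: 25 combined into 70) and the capacity-based assignment can be sketched as follows. The node names and capacities are hypothetical, and the best-fit tie-break is an assumption of this sketch rather than a teaching of Shi ‘187’.

```python
# Per-service costs for the sub-graph {B, D, E}, as cited from Shi '187, Fig. 3.
SERVICE_COST = {"B": 15, "D": 30, "E": 25}

# Hypothetical free capacity per service node (not taken from the reference).
NODE_CAPACITY = {"node-1": 40, "node-2": 90, "node-3": 75}


def composite_requirement(costs):
    """Sum the individual resource requirements of the grouped services."""
    return sum(costs.values())


def assign_composite(costs, capacities):
    """Return the node that fits the whole group, or None if no node fits.

    Among the nodes with enough capacity, the one with the least capacity
    is chosen (best fit); this policy is an assumption of the sketch.
    """
    need = composite_requirement(costs)
    fitting = {name: cap for name, cap in capacities.items() if cap >= need}
    if not fitting:
        return None
    return min(fitting, key=lambda name: (fitting[name], name))


print(composite_requirement(SERVICE_COST))
print(assign_composite(SERVICE_COST, NODE_CAPACITY))
```

The group is placed as a whole on a single node, which mirrors the cited passage that all services inside one composite sub-graph may be assigned to a single OASN.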

It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined the teaching of Rossbach and SHI with Shi ‘187’ because the teaching of Shi ‘187’ of determining the total resource cost/requirement for the corresponding sub-graph (as set of dependent compute kernels) and assigning each sub-graph to a node that has enough resources for execution would have provided Rossbach and SHI’s system with the advantage and capability of easily determining the resource requirement for the set of tasks and preventing potential system failure due to a resource shortage, which improves system resource utilization and performance.

As per claim 10, Rossbach, SHI and Shi ‘187’ teach the invention according to claim 8 above. Rossbach further teaches determining that the respective one of the compute accelerators complies with a predefined criterion (Rossbach, [0048] lines 1-5, the scheduler 320 may determine an accelerator 120a-c of the available accelerators 350 that can support the selected accelerator task as determined by the accelerator parameters associated with the selected accelerator task; [0043] lines 2-10, each accelerator identified in the available accelerators 350 may have what is referred to herein as an associated strength. The strength of an accelerator may be a measure of the performance capabilities of the accelerator…Other performance indicators may be used to calculate the strength of the accelerator; [0051] lines 1-3, select the available accelerator with the greatest strength, or other criteria); 
selecting the respective one of the compute accelerators from the plurality of compute accelerators based on a determination that the respective one of the compute accelerators complies with the predefined criterion (Rossbach, [0048] lines 5-9, If multiple accelerators 120a-c of the available accelerators 350 can support (as complies) the accelerator task, then the scheduler 320 may select from the available accelerators 350 using one of a variety of accelerator 120a-c selection techniques; [0051] lines 1-3, select the available accelerator with the greatest strength, or other criteria); and 
sending respective compute kernels to the respective one of the compute accelerators (Rossbach, Fig. 1, 120a-120c accelerator; [0026] lines 10-13, the accelerator interface 140 may execute the accelerator tasks associated with the nodes 207 and 209 in parallel on the accelerators 120a and 120b).
In addition, SHI teaches a set of dependent compute kernels (SHI, Fig. 7, DAG; Fig. 8, Independent path groups; Abstract, lines 5-6, obtaining (as identify) an independent path group according to the DAG composition logic; [0062] lines 2-7, The composition logic obtained by the web service composition engine is represented in the DAG mode. Each node denotes operations of a web service. An edge denotes the output of the previous web service and the input of the next web service and depicts the execution dependency and data dependency between web services; [0128] Table 4, composition logic (DAG) expresses in a data sheet; node, data dependence; [0136] line 1, Group paths; [0138] line 1, Group={[1-2-3-8], [1-4-5-7-8], [1-4-6-7-8]} (as sets of dependent compute kernels); [0140] lines 3-4, obtain an independent path group: Group={[1-2-3-8], [1-4-5-7-8], [1-4-6-7-8]}). 
Further, Shi ‘187’ teaches that the sending comprises sending the respective set of dependent compute kernels (Shi ‘187’, Fig. 2, 218A, 218B, 218C, sub-graph assigned to OASN 210A-C; Col 11, lines 6-8, all services inside one of the composite sub-graphs 502 may be assigned to a single OASN (compute accelerator was taught by Rossbach); Col 16, lines 35-44, determining a composite resource requirement for the composite service, the composite resource requirement being determined based on individual resource requirements of each service included in the composite service; determining a resource capacity of the single service node; and assigning the composite service to the single service node, based on the determined composite resource requirement and the determined resource capacity).
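
For illustration only, the selection cited from Rossbach [0048] and [0051] (keep the accelerators that can support the task, then select the one with the greatest strength) can be sketched as follows. The `Accelerator` fields, the capability tags and the strength values are hypothetical and are not taken from Rossbach.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accelerator:
    name: str
    strength: float      # hypothetical performance score
    features: frozenset  # hypothetical capability tags


def select_accelerator(available, required_features):
    """Return the strongest accelerator that supports the task, or None."""
    # Predefined criterion (assumed here): the accelerator offers every
    # feature the task requires.
    candidates = [a for a in available if required_features <= a.features]
    if not candidates:
        return None
    return max(candidates, key=lambda a: a.strength)


AVAILABLE = [
    Accelerator("120a", 3.0, frozenset({"fp32"})),
    Accelerator("120b", 5.0, frozenset({"fp32", "fp64"})),
    Accelerator("120c", 4.0, frozenset({"fp32", "fp64"})),
]

chosen = select_accelerator(AVAILABLE, frozenset({"fp64"}))
print(chosen.name if chosen else "no accelerator complies")
```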

As per claim 12, Rossbach, SHI and Shi ‘187’ teach the invention according to claim 8 above. Rossbach further teaches determine that a dependent compute kernel is performing a predefined computation (Rossbach, Fig. 2, 207, 209, 211; [0052] lines 1-9, with respect to FIG. 2, the scheduler 320 may be selecting an available accelerator to execute the accelerator task associated with the node 211. The accelerator 120a and the accelerator 120b may both be identified in the available accelerators 350. The accelerator 120a may have just completed executing the accelerator task associated with the node 207. As shown, the accelerator task associated with the node 211 uses data from the execution of the accelerator tasks associated with the nodes 207 and 209 [Examiner noted: node 211 is using the data from previous node 207 and 209 (as predefined computation)]); 
select the respective one of the compute accelerators from the plurality of compute accelerators based on a determination that the set of the dependent compute kernels is performing the predefined computation (Rossbach, [0052] lines 10-16, the scheduler 320 may select the accelerator 120a to execute the selected accelerator task, because the data that was generated by the accelerator task associated with the node 207 is already at the accelerator 120a from the previous execution and only data from the execution of the accelerator task associated with the node 209 may be copied to the accelerator 120a by the datablock manager 310); and 
send the dependent compute kernel to the respective one of the compute accelerators (Rossbach, Fig. 1, 120a-120c accelerator; [0026] lines 10-13, the accelerator interface 140 may execute the accelerator tasks associated with the nodes 207 and 209 in parallel on the accelerators 120a and 120b).
In addition, SHI teaches a set of dependent compute kernels (SHI, Fig. 7, DAG; Fig. 8, Independent path groups; Abstract, lines 5-6, obtaining (as identify) an independent path group according to the DAG composition logic; [0062] lines 2-7, The composition logic obtained by the web service composition engine is represented in the DAG mode. Each node denotes operations of a web service. An edge denotes the output of the previous web service and the input of the next web service and depicts the execution dependency and data dependency between web services; [0128] Table 4, composition logic (DAG) expresses in a data sheet; node, data dependence; [0136] line 1, Group paths; [0138] line 1, Group={[1-2-3-8], [1-4-5-7-8], [1-4-6-7-8]} (as sets of dependent compute kernels); [0140] lines 3-4, obtain an independent path group: Group={[1-2-3-8], [1-4-5-7-8], [1-4-6-7-8]}). 
Further, Shi ‘187’ teaches that the sending comprises sending the set of dependent compute kernels to the respective one of the compute accelerators (Shi ‘187’, Fig. 2, 218A, 218B, 218C, sub-graph assigned to OASN 210A-C; Col 11, lines 6-8, all services inside one of the composite sub-graphs 502 may be assigned to a single OASN (compute accelerator was taught by Rossbach); Col 16, lines 35-44, determining a composite resource requirement for the composite service, the composite resource requirement being determined based on individual resource requirements of each service included in the composite service; determining a resource capacity of the single service node; and assigning the composite service to the single service node, based on the determined composite resource requirement and the determined resource capacity).

As per claim 13, Rossbach, SHI and Shi ‘187’ teach the invention according to claim 12 above. Rossbach further teaches wherein the predefined computation involves a modification to a predefined resource (Rossbach, [0052] lines 7-9, The accelerator 120a may have just completed executing the accelerator task associated with the node 207. As shown, the accelerator task associated with the node 211 uses data from the execution of the accelerator tasks associated with the nodes 207 and 209 [Examiner noted: the processing of the node 211 involves the modification/multiplication of data (as predefined resource) from the node 207 and 209 due to the tasks dependency]).

As per claims 15, 17 and 19, they are non-transitory, computer-readable medium claims of claims 8, 10 and 12 respectively above. Therefore, they are rejected for the same reason as claims 8, 10 and 12 respectively above.


Claims 9 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Rossbach, SHI and Shi ‘187’, as applied to claims 8 and 15 respectively above, and further in view of Ellis et al. (US. Patent 9,244,652 B1).
Ellis was cited in the previous Office Action.

As per claim 9, Rossbach, SHI and Shi ‘187’ teach the invention according to claim 8 above. Rossbach, SHI and Shi ‘187’ fail to specifically teach performing static analysis on an object code or a source code representation of the workload to identify the plurality of compute kernels; and performing static analysis on the object code or the source code representation of the workload to identify dependencies between pairs of the plurality of compute kernels.

However, Ellis teaches performing static analysis on an object code or a source code representation of the workload to identify the plurality of compute kernels (Ellis, Col 2, lines 27-30, program code that identifies three tasks (e.g., Task A, Task B, and Task C) that are to be performed by a group of N worker devices; Col 2, lines 42-48, determine a directed acyclic graph (DAG) based on the program code (e.g., the client device may display the DAG to the user, the client device may store information associated with the DAG, etc.), may execute the program code (e.g., the client device may send the program code to be added to a task queue associated with the group of N worker devices) [Examiner noted: perform static analysis (since the program code has not been executed) on program code to identify the different tasks (as plurality of compute kernels)]); and performing static analysis on the object code or the source code representation of the workload to identify dependencies between pairs of the plurality of compute kernels (Ellis, Col 2, lines 30-33, the program code may indicate an order (e.g., Task A → Task B → Task C) associated with the three tasks based on the program code; Col 2, lines 54-58, determine (e.g., based on information stored by the client device) the program code that identifies the three tasks, the order associated with the three tasks, and the dependencies associated with the three tasks).



As per claim 16, it is a non-transitory, computer-readable medium claim of claim 9 above. Therefore, it is rejected for the same reason as claim 9 above.


Claims 11 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Rossbach, SHI and Shi ‘187’, as applied to claims 10 and 17 respectively above, and further in view of Schumacher et al. (US Patent. 10,713,404 B1).
Schumacher was cited in the previous Office Action.

As per claim 11, Rossbach, SHI and Shi ‘187’ teach the invention according to claim 10 above. Rossbach further teaches the respective compute kernels sent to the respective one of the compute accelerators (Rossbach, Fig. 1, 120a-120c accelerator; [0026] lines 10-13, the accelerator interface 140 may execute the accelerator tasks associated with the nodes 207 and 209 in parallel on the accelerators 120a and 120b). In addition, SHI teaches set of dependent compute kernels (SHI, Fig. 7, DAG; Fig. 8, Independent path groups; Abstract, lines 5-6, obtaining (as identify) an independent path group according to the DAG composition logic; [0062] lines 2-7, The composition logic obtained by the web service composition engine is represented in the DAG mode. Each node denotes operations of a web service. An edge denotes the output of the previous web service and the input of the next web service and depicts the execution dependency and data dependency between web services; [0128] Table 4, composition logic (DAG) expresses in a data sheet; node, data dependence; [0136] line 1, Group paths; [0138] line 1, Group={[1-2-3-8], [1-4-5-7-8], [1-4-6-7-8]} (as sets of dependent compute kernels); [0140] lines 3-4, obtain an independent path group: Group={[1-2-3-8], [1-4-5-7-8], [1-4-6-7-8]}). 

Rossbach, SHI and Shi ‘187’ fail to specifically teach the respective compute kernels are encrypted/encrypting.

However, Schumacher teaches the respective compute kernels are encrypted/encrypting (Schumacher, Col 7, lines 55-57, If the accelerator is a crypto-accelerator, the application may transmit a batch of encrypted data).

It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined the teaching of Rossbach, SHI and Shi ‘187’ with Schumacher because Schumacher’s teaching of sending the encrypted data/tasks to the accelerator would have provided Rossbach, 

As per claim 18, it is a non-transitory, computer-readable medium claim of claim 11 above. Therefore, it is rejected for the same reason as claim 11 above.



Claims 14 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Rossbach, SHI and Shi ‘187’, as applied to claims 8 and 15 respectively above, and further in view of Turner et al. (US Pub. 2018/0081804 A1) and Sen et al. (US Pub. 2018/0219797 A1).
Turner and Sen were cited in the previous Office Action.

As per claim 14, Rossbach, SHI and Shi ‘187’ teach the invention according to claim 8 above. Rossbach teaches wherein the predefined criterion comprises the respective one of the compute accelerators being configured to access the working set (Rossbach, Fig. 1, 120a-c accelerators, 130a-c memory; Fig. 4, 405 b-c buffer; [0048] lines 1-5, the scheduler 320 may determine an accelerator 120a-c of the available accelerators 350 that can support the selected accelerator task as determined by the accelerator parameters associated with the selected accelerator task; [0043] lines 2-10, each accelerator identified in the available accelerators 350 may have what is referred to herein as an associated strength. The strength of an accelerator may be a measure of the performance capabilities of the accelerator…Other performance indicators may be used to calculate the strength of the accelerator; [0051] lines 1-3, criteria; [0036] lines 5-14, before the accelerator 120a begins executing the accelerator task, the datablock manager 310 may determine if current versions of the data associated with the datablocks 201 and 203 are stored in buffers (as working set) in the memory 130a of the accelerator 120a, and if not, the datablock manager 310 may copy the current versions of the data to buffers in the memory 130a of the accelerator 120a.  The datablock manager 310 may then update the pointers and/or indicators associated with the datablocks 201 and 203, and may allow the accelerator 120a to begin executing the accelerator task). 

Rossbach, SHI and Shi ‘187’ fail to specifically teach that the accessing is of a single copy of the working set.

However, Turner teaches that the accessing is of a single copy of the working set (Turner, Fig. 3, 306c Hardware accelerator; [0006] lines 15-17, executing a remaining portion of the offloaded workload by the hardware accelerator; [0049] lines 5-7, The data for the offloaded workload may be stored in the processing device cache (e.g., processing device cache 308 in FIG. 6); [0051] lines 1-4, To transmit the data for the offloaded workload to the hardware accelerator 306, the processing device 302 may implement a cache flush maintenance operation 400 to write the data to the shared memory (as copy the data (as working set) into a shared memory); [0052] lines 5-13, offloading a portion of the workload to the hardware accelerator 306 may include data reads and writes by the hardware accelerator 306 accessing the processing device or the shared memory…The hardware accelerator 306 may execute the offloaded workload using the data retrieved from the processing device cache and/or the shared memory without needing to cache the data locally).

It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined the teaching of Rossbach, SHI and Shi ‘187’ with Turner because Turner’s teaching of providing a shared memory for storing the data (from the processing device cache) that is needed for processing the portion of the workload for the hardware accelerator would have provided Rossbach, SHI and Shi ‘187’ with the advantage and capability of easily managing the workload data for hardware accelerators, which improves system efficiency and performance.

Rossbach, SHI, Shi ‘187’ and Turner fail to specifically teach that the accessing uses a remote direct memory access (RDMA) protocol.

However, Sen teaches that the accessing uses a remote direct memory access (RDMA) protocol (Sen, [0022] lines 24-25, the Remote Direct Memory Access (RDMA); [0030] lines 2-33, provide an interface for an application executed by the compute device 102 to an accelerator device 308 on an accelerator sled 104. The remote accelerator manager 406 may communicate through the host fabric interface 210 of the compute device 102 with the host fabric interface 310 of the accelerator sled 104 using any suitable protocol or technique, such as TCP, RDMA, RoCE, RoCEv2, RDMA to transport RDMA keys and leverage RDMA read/write for direct data transfer).

It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined the teaching of Rossbach, SHI, Shi ‘187’ and Turner with Sen because Sen’s teaching of using the RDMA protocol would have provided Rossbach, SHI, Shi ‘187’ and Turner’s system with the advantage and capability of lowering data transfer latency, which improves system performance and efficiency.

As per claim 20, it is a non-transitory, computer-readable medium claim of claim 14 above. Therefore, it is rejected for the same reason as claim 14 above.



Response to Arguments
Applicant’s arguments with respect to claims 1-20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.


Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ZUJIA XU whose telephone number is (571)272-0954. The examiner can normally be reached M-F 9:00-5:30 EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Meng-Ai An can be reached on (571) 272-3756. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/MENG AI T AN/Supervisory Patent Examiner, Art Unit 2195