Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Argument
Applicant’s arguments with respect to claims 2-4, 6-11, 14-17, and 19-25 have been considered but are moot because the arguments do not apply to any of the references being used in the current rejection.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103(a) which forms the basis for all obviousness rejections set forth in this Office action:
(a) A patent may not be obtained though the invention is not identically disclosed or described as set forth in section 102 of this title, if the differences between the subject matter sought to be patented and the prior art are such that the subject matter as a whole would have been obvious at the time the invention was made to a person having ordinary skill in the art to which said subject matter pertains.  Patentability shall not be negatived by the manner in which the invention was made.

Claims 2, 3, 10, 11, 16, 17, and 23 are rejected under 35 U.S.C. 103(a) as being unpatentable over Jung et al. (US 2018/0321859, hereinafter Jung) in view of Lo (US 2018/0276047, hereinafter Lo) and Rafique et al. (US 2015/0128136, hereinafter Rafique).

Regarding claim 2, Jung discloses 
a data processing method carried out while an application is running, via system software (Fig. 3 application), on a hardware platform that includes at least one processor and a plurality of coprocessors (Fig. 1 CPU and processors);
intercepting, by an intermediate software layer running logically between the application and the system software (Fig. 3, par. 59: Referring to FIG. 3, a host employs an accelerator driver (i.e., a device driver) 37 and a runtime library 36 as the software stack for the accelerator 35, and employs a flash firmware 34, a host block adaptor (HBA) driver 33, a file system 32, and an I/O runtime library 31 as the software stack to recognize the SSD 35 as a storage), at least one kernel, comprising a plurality of kernel tasks, dispatched within a data and command stream issued by the application (par. 58: single application task has to be split into multiple kernels due to capacity limit of the internal DRAM 26a of the accelerator 26, in turn serializing the execution and thereby deteriorating the degree of parallelism), each said kernel corresponding to instructions to an intended one of the coprocessors for execution on that intended coprocessor (par. 7: The supervisor processor maps a region of the first memory pointed by a data section of a first kernel to a region of the flash memory to allow first data to move between the region of the first memory and the region of the flash memory, based on a first message which is transferred in accordance with execution of the first kernel by a first processor among the plurality of processors; par. 15: A second processor among the plurality of processors may transfer to the supervisor processor a second message for writing second data to the flash memory in accordance with execution of a second kernel, and the second message may include a pointer to a data section of the second kernel);
determining compute functions within the at least one kernel the compute functions including a first compute function and a second compute function; automatically, and transparent to the application, determining data dependencies among the compute functions including determining that an input to the first compute function is an output from the second compute function (par. 58: a single application task has to be split into multiple kernels due to capacity limit of the internal DRAM 26a of the accelerator 26, in turn serializing the execution; par. 119: kernel in practice may be formed by multiple groups of code segments, referred to as microblocks. Each group has execution dependence on its input/output data); 
selecting at least one coprocessor to which the first and second compute functions are to be dispatched based at least in part on the determined data dependencies; dispatching the first and second compute functions to the selected at least one coprocessor (par. 52: each processor of the accelerator 300 may be a light-weight processor (LWP); par. 54: The computing device offloads various applications to the accelerator 300, which allows the accelerator 300 to directly execute kernels of the application; par. 58: a single application task has to be split into multiple kernels due to capacity limit of the internal DRAM 26a of the accelerator 26, in turn serializing the execution).
Jung does not teach an intermediate software layer, which is installed in a non-privileged, user space to run logically between the application and the system software, without modification of the application or of the system software running on the hardware platform. Lo teaches an intermediate software layer, which is installed in a non-privileged, user space to run logically between the application and the system software, without modification of the application or of the system software running on the hardware platform (par. 8: Although low latency operation is optimized for a specific application according to concepts of the present invention, the low latency optimization of embodiments does not target the application itself, but instead targets the user space level of the operating system (e.g., HAL, device driver frameworks, system services, etc.) for providing an environment for application execution that is optimized for low latency operation with respect to the particular application). It would have been obvious to one of ordinary skill in the art at the time the claimed invention was effectively filed to modify the teaching of Jung by implementing the HAL at the user space level of the operating system, as taught by Lo. The motivation would have been to provide dynamic implementation of low latency optimization configured to perform from the hardware layer up to an application (Lo par. 8).
Jung in view of Lo does not teach returning kernel results to the application. Rafique teaches returning kernel results to the application (par. 36: The or each graphics processing unit 3 is configured to execute one or more kernel computes 21 for an application 22 of one of the one or more virtual machines 2, and to return the results of the execution of the or each kernel compute 21 to the one of the one or more virtual machines 2). It would have been obvious to one of ordinary skill in the art at the time the claimed invention was effectively filed to modify the teaching of Jung in view of Lo by returning the results of the execution of the or each kernel compute to the one of the one or more virtual machines after executing each kernel compute by GPUs of 
Regarding claim 10, with reference to claim 2, Jung discloses a non-transitory computer readable medium comprising instructions that are executable in a computer system to cause a hardware platform of the computer system that includes at least one processor and a plurality of coprocessors to carry out a data processing method while an application is running thereon via system software, said method comprising: ... (Fig. 1).

Regarding claim 16, Jung discloses 
a data processing system comprising: a hardware platform that includes at least one processor and a plurality of coprocessors (Fig. 1 CPU and processors);
at least one application running on the hardware platform, via system software (Fig. 3 application);
a hardware abstraction layer running logically between the application and the system software (Fig. 3, par. 59: Referring to FIG. 3, a host employs an accelerator driver (i.e., a device driver) 37 and a runtime library 36 as the software stack for the accelerator 35, and employs a flash firmware 34, a host block adaptor (HBA) driver 33, a file system 32, and an I/O runtime library 31 as the software stack to recognize the SSD 35 as a storage), wherein the hardware abstraction layer is configured to:
intercept at least one kernel, comprising a plurality of kernel tasks, dispatched within a data and command stream issued by the application (par. 58: single application task has to be split into multiple kernels due to capacity limit of the internal DRAM 26a of the accelerator 26, in turn serializing the execution and thereby deteriorating the degree of parallelism), each said kernel corresponding to instructions to an intended one of the coprocessors for execution on that intended coprocessor (par. 7: The supervisor processor maps a region of the first memory pointed by a data section of a first kernel to a region of the flash memory to allow first data to move between the region of the first memory and the region of the flash memory, based on a first message which is transferred in accordance with execution of the first kernel by a first processor among the plurality of processors; par. 15: A second processor among the plurality of processors may transfer to the supervisor processor a second message for writing second data to the flash memory in accordance with execution of a second kernel, and the second message may include a pointer to a data section of the second kernel);
determine compute functions within the at least one kernel the compute functions including a first compute function and a second compute function; automatically, and transparent to the application, determine data dependencies among the compute functions including determining that an input to the first compute function is an output from the second compute function (par. 58: a single application task has to be split into multiple kernels due to capacity limit of the internal DRAM 26a of the accelerator 26, in turn ; 
select at least one coprocessor to which the first and second compute functions are to be dispatched based at least in part on the determined data dependencies; dispatch the first and second compute functions to the selected at least one coprocessor (par. 52: each processor of the accelerator 300 may be a light-weight processor (LWP); par. 54: The computing device offloads various applications to the accelerator 300, which allows the accelerator 300 to directly execute kernels of the application; par. 58: a single application task has to be split into multiple kernels due to capacity limit of the internal DRAM 26a of the accelerator 26, in turn serializing the execution).
Jung does not teach a hardware abstraction layer installed in a non-privileged, user space and running logically between the application and the system software ... without modification of the application or of the system software running on the hardware platform. Lo teaches a hardware abstraction layer installed in a non-privileged, user space and running logically between the application and the system software ... without modification of the application or of the system software running on the hardware platform (par. 8: Although low latency operation is optimized for a specific application according to concepts of the present invention, the low latency optimization of embodiments does not target the application itself, but instead targets the user space level of the operating system (e.g., HAL, device driver frameworks, system services, etc.) for providing an environment for 
Jung in view of Lo does not teach returning kernel results to the application. Rafique teaches returning kernel results to the application (par. 36: The or each graphics processing unit 3 is configured to execute one or more kernel computes 21 for an application 22 of one of the one or more virtual machines 2, and to return the results of the execution of the or each kernel compute 21 to the one of the one or more virtual machines 2). It would have been obvious to one of ordinary skill in the art at the time the claimed invention was effectively filed to modify the teaching of Jung in view of Lo by returning the results of the execution of the or each kernel compute to the one of the one or more virtual machines after executing each kernel compute by GPUs of Rafique. The motivation would have been to provide systems and methods to share GPU resources more readily between multiple applications (Rafique par. 10).
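For illustration only, the select-and-dispatch limitations recited in claims 2 and 16 can be sketched as follows. This is a minimal sketch under stated assumptions, not the method of the application or of any cited reference: the function name `plan_dispatch`, the coprocessor labels, and the policy of placing a consumer on the same coprocessor as its producer are all hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def plan_dispatch(dependencies, coprocessors):
    """Order compute functions and pick a coprocessor for each one.

    `dependencies` maps a function name to the set of function names whose
    output it consumes. Assumed policy (hypothetical): a consumer is placed
    on the coprocessor of one of its producers; functions with no producer
    are spread round-robin over the available coprocessors.
    """
    # Producers always appear before their consumers in this order.
    order = list(TopologicalSorter(dependencies).static_order())
    placement = {}
    next_free = 0
    for fn in order:
        producers = sorted(dependencies.get(fn, ()))
        if producers:
            # Co-locate with the (alphabetically) first producer.
            placement[fn] = placement[producers[0]]
        else:
            placement[fn] = coprocessors[next_free % len(coprocessors)]
            next_free += 1
    return order, placement


# "first" takes its input from the output of "second"; "third" is independent.
deps = {"first": {"second"}, "second": set(), "third": set()}
order, placement = plan_dispatch(deps, ["coprocessor0", "coprocessor1"])
```

In this sketch the dependent pair lands on one coprocessor and the independent function on the other; a real selection policy could weigh other factors as well.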

Regarding claims 3, 11, and 17, Jung discloses 
wherein at least two kernels are intercepted, each of the plurality of kernel tasks being defined by a respective one of the at least two kernels, and determination of the data dependencies among the compute functions is performed at kernel level granularity (paragraph [0058]: a single application 

Regarding claim 23, Jung discloses 
further comprising wherein the input to the first compute function is determined to be the output from the second compute function by analyzing the code of the at least one kernel (par. 58: a single application task has to be split into multiple kernels due to capacity limit of the internal DRAM 26a of the accelerator 26, in turn serializing the execution).

Claims 4, 7, 8, 14, and 19 are rejected under 35 U.S.C. 103(a) as being unpatentable over Jung in view of Lo and Rafique as applied to claims 2 and 15, and further in view of Dube et al. (US 2016/0085587, hereinafter Dube).

Regarding claim 4, Jung in view of Lo and Rafique does not teach in which the plurality of kernel tasks comprises at least two sub-tasks defined within a single one of the at least one kernel. Dube teaches in which the plurality of kernel tasks comprises at least two sub-tasks defined within a single one of the at least one kernel (par. 23: workload scheduling program 150 creates a graph, hereinafter referred to as a resource graph, which identifies tasks within computing job 120 that can be executed on each of the specific data processing elements (e.g., CPUs, GPUs, FPGAs, etc.) contained within heterogeneous computing device 110 … FPGAs and GPUs may be configured to perform a specific subset of computing tasks such as graphical 

Regarding claim 7, Jung in view of Lo and Rafique does not teach further comprising selecting which of the coprocessors to dispatch each compute function to as a function of relative performance characteristics of the respective coprocessors. Dube teaches further comprising selecting which of the coprocessors to dispatch each compute function to as a function of relative performance characteristics of the respective coprocessors (par. 23: workload scheduling program 150 creates a graph, hereinafter referred to as a resource graph, which identifies tasks within computing job 120 that can be executed on each of the specific data processing elements (e.g., CPUs, GPUs, FPGAs, etc.) contained within heterogeneous computing device 110 … FPGAs and GPUs may be configured to perform a specific subset of computing tasks such as graphical computation, video encoding, or data mining computation … the determination of whether or not a data processing element can perform a task is based in part on the current configuration of a data processing element to perform a specific subset of computing tasks, as well as a table which lists all the data processing elements available for use along with their capabilities for executing various types of tasks … one or more of the tasks in computing job 120 have data dependencies, workload scheduling program 150 indicates the location of required data on the resource graph generated; par. 25: Execution mappings are evaluated in order to determine a total job execution time and/or cost associated with a mapping such as a cost charged by a cloud services provider for utilizing resources such as 

Regarding claims 8, 14, and 19, Jung in view of Lo and Rafique does not teach wherein the at least one selected coprocessor includes the intended coprocessor. Dube teaches wherein the at least one selected coprocessor includes the intended coprocessor (par. 19: in embodiments where computing job 120 includes a task which includes heavy graphical computation, workload scheduling program 150 may select an execution mapping which utilizes a GPUs ability to perform graphical computations more efficiently and quickly than a CPU to allow the task to be .

Claim 6 is rejected under 35 U.S.C. 103(a) as being unpatentable over Jung in view of Lo and Rafique as applied to claim 2, and further in view of Kenney et al. (US 2016/0246598, hereinafter Kenney).

Regarding claim 6, Jung discloses 
wherein the input to the first compute function is determined to be the output from the second compute function (par. 58: a single application task has to be split into multiple kernels due to capacity limit of the internal DRAM 26a of the accelerator 26, in turn serializing the execution; par. 119: kernel in practice may be formed by multiple groups of code segments, referred to as microblocks. Each group has execution dependence on its input/output data);
Jung in view of Lo and Rafique does not teach by determining that a memory location to which the second compute function writes is the same memory location from which the first compute function reads. Kenney teaches wherein the input to the first compute function is determined to be the output from the second compute function by determining that a memory location to which the second compute function writes is the same memory location from which the first compute function reads (par. 73: the store1 instruction has been received. In response, dependency logic 330 has set an indicator in row 0 (corresponding to the store1 instruction) and column 3 (corresponding to the load2 instruction) because the load2 instruction specifies a read from the same region (region 1) to which the store1 instruction specifies a write). It would have been obvious to one of ordinary skill in the art at the time the claimed invention was effectively filed to modify the teaching of Jung in view of Lo and Rafique by determining, as taught by Kenney, that the load2 instruction specifies a read from the same region to which the store1 instruction specifies a write. The motivation would have been to provide techniques relating to handling dependencies between instructions (Kenney par. 6).
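For illustration only, the memory-location comparison recited in claim 6 can be sketched as a read-after-write check. This is a minimal sketch under stated assumptions, not the logic of Kenney's dependency logic 330 or of the application: the `ComputeFunction` record, the region labels, and `find_raw_dependencies` are hypothetical names introduced here.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ComputeFunction:
    """Hypothetical record of the memory regions a compute function touches."""
    name: str
    reads: frozenset = field(default_factory=frozenset)
    writes: frozenset = field(default_factory=frozenset)


def find_raw_dependencies(functions):
    """Return (producer, consumer) name pairs, scanning in issue order.

    A pair is reported when a later function reads a region that an
    earlier function writes, i.e. a read-after-write dependency.
    """
    deps = []
    for i, producer in enumerate(functions):
        for consumer in functions[i + 1:]:
            # A shared region means the consumer's input is the producer's output.
            if producer.writes & consumer.reads:
                deps.append((producer.name, consumer.name))
    return deps


# The second compute function writes region1; the first compute function reads it.
second = ComputeFunction("second", reads=frozenset({"region0"}), writes=frozenset({"region1"}))
first = ComputeFunction("first", reads=frozenset({"region1"}), writes=frozenset({"region2"}))
dependencies = find_raw_dependencies([second, first])
```

Here the input to the first compute function is identified as the output of the second compute function solely because both name the same region.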


Allowable Subject Matter
Claims 9, 15, 20-22, 24, and 25 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP 
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SISLEY NAHYUN KIM whose telephone number is (571)270-7832.  The examiner can normally be reached on Monday-Friday 8AM-5PM.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, EMERSON PUENTE, can be reached on (571)272-3652.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic 

/SISLEY N KIM/
Primary Examiner, Art Unit 2196
3/21/2021