DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
The present application was filed on June 27, 2018. 
This action is in response to amendments and/or remarks filed on October 8, 2021. In the current amendment, claims 1-3, 5, 8, 17, and 20 are amended. No claims have been canceled. Claims 1-20 are presented for examination and are pending.  
In response to amendments and/or remarks filed on October 8, 2021, the objection to claim 20 made in the previous office action has been withdrawn. 
In response to amendments and/or remarks filed on October 8, 2021, the double patenting rejection applied to claims 1, 5, 6, and 17 made in the previous office action has been withdrawn. 
In response to amendments and/or remarks filed on October 8, 2021, the 35 U.S.C. 112(b) rejection applied to claims 1-20 made in the previous office action has been withdrawn.

Information Disclosure Statement
The Information Disclosure Statement (IDS) was submitted on October 8, 2021. This submission is in compliance with the provisions of 37 CFR 1.97. Accordingly, the Information Disclosure Statement is being considered by the examiner.

Claim Rejections - 35 USC § 103
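In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.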
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 2, 4 – 7, 9 – 17, 19, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Ukidave et al. (“Mystic: Predictive Scheduling for GPU Based Cloud Servers using Machine Learning”, hereinafter “Ukidave”) in view of Wang et al. (“An Elastic CNN Inference Accelerator with Adaptive Trade-off between QoS and QoR”, hereinafter “Wang”), further in view of Wesolowski et al. (US 20190114537 A1, hereinafter “Wesolowski”) further in view of Wilt et al. (US 2017/0132746 A1, hereinafter “Wilt”).

As per claim 1, Ukidave teaches: A computer-implemented method, comprising: receiving, in a multi-tenant web services provider, an application instance configuration, an application of the application instance to utilize a portion of an attached graphics processing unit (GPU)’s compute capacity during execution of a machine learning model (Page 356: “Instead, Mystic initiates two short profiling runs for each incoming application to obtain metrics for two randomly selected CoIs (out of 6 identified CoIs). The profiler run needs be long enough to profile each distinct kernel in the application at least once… The short-profiles (∼5 seconds) for incoming applications are collected and stored in the Profile Information Table (PIT) in form of sparse rows, as metrics for only 2 random CoIs out of 6 are captured. The PIT is indexed by the application process ID (pid).” and Page 357: “The CF predictor takes the PIT and TRM as inputs. When a new application A0 is enqueued for execution on the system, the predictor identifies A0’s profile information by searching the PIT using the process-id (pid) of the application. The PIT returns a sparse vector v with the metrics obtained from the short profiles collected in Stage-1” teaches receiving Application A0’s profile information (application instance configuration) and A0 (application instance); Page 354: “We present Mystic, a framework enabling interference-aware scheduling for GPU workloads. Our work targets servers and cloud schedulers by utilizing machine learning algorithms. Mystic utilizes the concurrency features of modern GPUs exposed by programming frameworks such as CUDA 7.0.” teaches that Mystic is a multi-tenant web service provider because it targets servers and cloud schedulers; Page 354 – 355: “Requests from various frontend user applications can be aggregated into backend threads, which can be handled as a single GPU context (see Figure 2b). 
GPU components of all frontend applications co-executing on the GPU are assigned to separate backend threads. The backend threads map to the same device on a per-GPU context basis. This design enables GPU operations from different applications to be executed concurrently, which enables a single GPU to be shared in both space and time [13, 32].” teaches that an application is executed on a shared GPU context (portion of a GPU); Page 358: “We select 55 distinct workloads… In addition, we leverage tuned CUDA libraries such as cuDNN (deep learning libraries)” teaches that an application can be a machine learning model from the cuDNN library)

[Image: media_image1.png]


loading the machine learning model onto the portion of the GPU; (Page 357: “When a new application A0 is enqueued for execution on the system, the predictor identifies A0’s profile information by searching the PIT using the process-id (pid) of the application.” teaches enqueuing (loading) Application A0; Page 354 – 355: “Requests from various frontend user applications can be aggregated into backend threads, which can be handled as a single GPU context (see Figure 2b). GPU components of all frontend applications co-executing on the GPU are assigned to separate backend threads. The backend threads map to the same device on a per-GPU context basis. This design enables GPU operations from different applications to be executed concurrently, which enables a single GPU to be shared in both space and time [13, 32].” teaches that an application is executed on a shared GPU context (portion of a GPU); Page 358: “We select 55 distinct workloads… In addition, we leverage tuned CUDA libraries such as cuDNN (deep learning libraries)” teaches that an application can be a machine learning model from the cuDNN library) 
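
For illustration only, the profile lookup described in the quoted Ukidave passages can be sketched as follows. This is a minimal sketch and not Ukidave's implementation: the class name `ProfileInformationTable`, the function `short_profile`, the `measure` callback, and the pid value are all hypothetical, and only the facts taken from the quotes (6 CoIs, 2 sampled at random, rows indexed by pid) are reflected.

```python
import random

NUM_COIS = 6      # the quoted passage identifies 6 CoIs
SAMPLED_COIS = 2  # only 2 randomly selected CoIs are profiled per application


def short_profile(measure, rng):
    """Return a sparse row: metrics for 2 random CoIs, None for the rest.

    `measure` is a hypothetical callback mapping a CoI index to a metric value.
    """
    row = [None] * NUM_COIS
    for coi in rng.sample(range(NUM_COIS), SAMPLED_COIS):
        row[coi] = measure(coi)
    return row


class ProfileInformationTable:
    """Hypothetical PIT: sparse profile rows indexed by process ID (pid)."""

    def __init__(self):
        self._rows = {}

    def store(self, pid, row):
        self._rows[pid] = row

    def lookup(self, pid):
        # Mirrors the predictor "searching the PIT using the process-id".
        return self._rows[pid]


pit = ProfileInformationTable()
rng = random.Random(0)  # fixed seed only so the sketch is repeatable
pit.store(4242, short_profile(lambda coi: float(coi), rng))
sparse_vector = pit.lookup(4242)
```

The returned `sparse_vector` has six slots of which exactly two are populated, which corresponds to the "sparse vector v" that the quoted passage says the PIT returns.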

Ukidave does not appear to explicitly teach: 

and the application instance configuration specifying… and a processing speed to be used for graphics processing unit (GPU)-based acceleration of machine learning model inference;
the arithmetic precision being one of a plurality of arithmetic precision capabilities provided by the multi-tenant web services provider for graphics processing unit (GPU)-based acceleration of machine learning model inference,
and the processing speed being one of a plurality of processing speed capabilities provided by the multi-tenant web services provider for graphics processing unit (GPU)-based acceleration of machine learning model inference;
determining a portion of a GPU’s compute capacity to provision to the application based at least in part on both the arithmetic precision… indicated by the application instance configuration;
determining a portion of a GPU’s compute capacity to provision to the application based at least in part on both the… processing speed indicated by the application instance configuration;
provisioning the application instance and the portion of the GPU attached to the application instance, wherein the application instance is implemented using a physical compute instance in a first instance location, wherein the portion of the GPU is implemented using a physical GPU in the second location, and wherein the physical GPU is accessible to the physical compute instance over a network;
and performing inference using the loaded machine learning model of the application using the portion of the GPU on the attached GPU.

However, Wang teaches: 
and the application instance configuration specifying both an arithmetic precision… to be used for graphics processing unit (GPU)-based acceleration of machine learning model inference;  (Page 3, Section 3.1: “Afterwards, ELNA manager selects the CNN hyperparameters and the hardware operating mode, as the final configuration. When the configuration is decided, ELNA compiler will map the final model to the accelerator at the correct mode, and generate the on-line control bitstreams that will be used by the ELNA scheduler to direct the accelerator to execute the network inference.” teaches selecting a configuration for the convolutional neural network (application instance); Page 4, Section 3.3: “Second, the accelerator can operate in different precision/throughput modes. For example in Fig. 3, each PE can work in unison-mode as a single-issue 16-bit PE or in separate mode as double-issue 8-bit PE to suit the precision-mode of the reshaped CNN model. As shown in Fig. 3, the data path of ELNA accelerator is organized into lanes. For example in Fig.3 (a), after the nX16-byte data arrives in the input registers, the MSB half and LSB half goes into the different lanes of PE. Depending on the precision mode decided by the control bits, the PEs will choose to separate the final result into two or activate the bridge logics to generate the final one result. In this way, the data path can offer either higher computation throughput or word-level precision as decided by the synthesizer and compiler. For example in Fig. 3(a), in 8-bit operating mode, each PE processes two continuous pixels in the input feature maps in the {x, y} and {x+1, y+1} positions of all channels and then respectively sum them up into two pixels in the output channel of next layer. In contrast, in 16-bit operating mode, each PE also receives 16-bit input but only generate one pixel point. 
The data mapping and data-level parallelization schemes also become slightly different, and the instructions fed into the scheduler change accordingly.” teaches that the precision of the neural network (machine learning model) determines the mode (portion) of the accelerator to use in executing (perform machine learning inference) the neural network)

the arithmetic precision being one of a plurality of arithmetic precision capabilities provided by the multi-tenant web services provider for graphics processing unit (GPU)-based acceleration of machine learning model inference, (Page 4, Section 3.3: “Second, the accelerator can operate in different precision/throughput modes. For example in Fig. 3, each PE can work in unison-mode as a single-issue 16-bit PE or in separate mode as double-issue 8-bit PE to suit the precision-mode of the reshaped CNN model.” teaches that both 16-bit precision and 8-bit precision are supported)

determining a portion of a GPU’s compute capacity to provision to the application based at least in part on both the arithmetic precision… indicated by the application instance configuration; (Page 3, Section 3.1: “Afterwards, ELNA manager selects the CNN hyperparameters and the hardware operating mode, as the final configuration. When the configuration is decided, ELNA compiler will map the final model to the accelerator at the correct mode, and generate the on-line control bitstreams that will be used by the ELNA scheduler to direct the accelerator to execute the network inference.” teaches selecting a configuration for the convolutional neural network (application instance); Page 4, Section 3.3: “Second, the accelerator can operate in different precision/throughput modes. For example in Fig. 3, each PE can work in unison-mode as a single-issue 16-bit PE or in separate mode as double-issue 8-bit PE to suit the precision-mode of the reshaped CNN model. As shown in Fig. 3, the data path of ELNA accelerator is organized into lanes. For example in Fig.3 (a), after the nX16-byte data arrives in the input registers, the MSB half and LSB half goes into the different lanes of PE. Depending on the precision mode decided by the control bits, the PEs will choose to separate the final result into two or activate the bridge logics to generate the final one result. In this way, the data path can offer either higher computation throughput or word-level precision as decided by the synthesizer and compiler. For example in Fig. 3(a), in 8-bit operating mode, each PE processes two continuous pixels in the input feature maps in the {x, y} and {x+1, y+1} positions of all channels and then respectively sum them up into two pixels in the output channel of next layer. In contrast, in 16-bit operating mode, each PE also receives 16-bit input but only generate one pixel point. 
The data mapping and data-level parallelization schemes also become slightly different, and the instructions fed into the scheduler change accordingly.” teaches that the precision of the neural network (machine learning model) determines the mode (portion) of the accelerator to use in executing (perform machine learning inference) the neural network)

and performing inference using the loaded machine learning model of the application using the portion of the GPU on the attached GPU. (Page 3, Section 3.1: “Afterwards, ELNA manager selects the CNN hyperparameters and the hardware operating mode, as the final configuration. When the configuration is decided, ELNA compiler will map the final model to the accelerator at the correct mode, and generate the on-line control bitstreams that will be used by the ELNA scheduler to direct the accelerator to execute the network inference.” teaches performing inference using the convolutional neural network (machine learning model); Page 4, Section 3.3: “Second, the accelerator can operate in different precision/throughput modes. For example in Fig. 3, each PE can work in unison-mode as a single-issue 16-bit PE or in separate mode as double-issue 8-bit PE to suit the precision-mode of the reshaped CNN model.” teaches that inference is performed with a mode (portion) of the accelerator, depending on the precision of the neural network)
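
For illustration only, the two precision modes described in the quoted Wang passage can be sketched as simple integer arithmetic. This is a minimal sketch under stated assumptions and not Wang's hardware design: the function names `split_lanes` and `pe_multiply` are hypothetical, operands are treated as unsigned integers, a single scalar weight stands in for the PE's computation, and the bridge logic is reduced to a plain 16-bit multiply.

```python
def split_lanes(word16):
    """Split a 16-bit input word into its MSB and LSB 8-bit halves."""
    word16 &= 0xFFFF
    return (word16 >> 8) & 0xFF, word16 & 0xFF


def pe_multiply(word16, weight, mode):
    """Hypothetical PE model: one 16-bit product or two 8-bit products."""
    if mode == 16:
        # Unison mode: the lanes act together as a single-issue 16-bit PE.
        return [(word16 & 0xFFFF) * weight]
    if mode == 8:
        # Separate mode: each lane handles its own 8-bit operand (double issue).
        msb, lsb = split_lanes(word16)
        return [msb * weight, lsb * weight]
    raise ValueError("mode must be 8 or 16")


results_8bit = pe_multiply(0x0102, 3, mode=8)    # two results per input word
results_16bit = pe_multiply(0x0102, 3, mode=16)  # one result per input word
```

The same 16-bit input yields two results in 8-bit mode and one result in 16-bit mode, which is the throughput versus word-level precision trade-off the quoted passage describes.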

Ukidave and Wang are analogous art because they are directed to neural network accelerators. 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Wang’s multi-mode neural network accelerator into Ukidave’s Predictive Scheduling for GPU based Cloud Servers with a motivation to “…support CNNs of various topologies in this data-driven architecture.” (Wang, Page 4, Section 3.3)

The combination of Ukidave and Wang does not appear to explicitly teach: 

and the processing speed being one of a plurality of processing speed capabilities provided by the multi-tenant web services provider for graphics processing unit (GPU)-based acceleration of machine learning model inference;
determining a portion of a GPU’s compute capacity to provision to the application based at least in part on both the… processing speed indicated by the application instance configuration;
provisioning the application instance and the portion of the GPU attached to the application instance, wherein the application instance is implemented using a physical compute instance in a first instance location, wherein the portion of the GPU is implemented using a physical GPU in the second location, and wherein the physical GPU is accessible to the physical compute instance over a network;

However, Wesolowski teaches: 
and the application instance configuration specifying… and a processing speed to be used for graphics processing unit (GPU)-based acceleration of machine learning model inference; (Para [0026]: “In particular embodiments, the scheduler machine may distribute execution of a single machine learning model across multiple different computing machines, so that each computing machine trains a different portion (e.g., graph-segment) of the ML model and the different computing machines exchange processing data, as needed. In this case, the scheduler machine may monitor the performance of each computing machine, and if necessary, transfer execution of a portion of the machine learning model from one machine to a faster or slower machine, as necessary, to maintain optimal timing between the transferring of processing data between the machines…” and Para [0068]: “Some neural network model may require faster machines, or more memory, and each may generally require a different profile machine.” teaches that the processing speed requirement of the specific Para [0005]: “In particular embodiments, a master machine learning (ML) control system/server (e.g., a scheduler machine or master ML control system, or first computing system) establishes access to different types of computing systems configured for different types of primary tasks. Such systems may include, for example, a GPU-based or CPU-based ML training system (e.g., a second computing system)…” teaches that the computing machines can be GPU based)

and the processing speed being one of a plurality of processing speed capabilities provided by the multi-tenant web services provider for graphics processing unit (GPU)-based acceleration of machine learning model inference; (Para [0061]: “For example, ML Model 1 is submitted to training system 11, which is illustratively shown as having one GPU (e.g., daughter board). More specifically, training system 11 has one NVidia Corporation, Kepler K40 GPU-based board, which has a total of 2880 single instruction multiple data (SIMD) cores (e.g., processing cores). As model complexity increases, it may be necessary to add more GPU boards to a machine or to increase the number of machines in a training system… If higher power is needed, then, additional GPUs or more powerful GPUs may be used. For example, ML Model 4 is submitted to a training system 17 consisting of two computing systems 17A and 17B, each system having 8 NVidia Maxwell M40 GPUs, where each M40 GPU has a 3072 SIMD cores.” teaches that the accelerator service has servers that contain multiple GPUs with multiple processing speeds)

determining a portion of a GPU’s compute capacity to provision to the application based at least in part on both the… processing speed indicated by the application instance configuration; (Para [0026]: “In particular embodiments, the scheduler machine may distribute execution of a single machine learning model across multiple different computing machines, so that each computing machine trains a different portion (e.g., graph-segment) of the ML model and the different computing machines exchange processing data, as needed. In this case, the scheduler machine may monitor the performance of each computing machine, and if necessary, transfer execution of a portion of the machine learning model from one machine to a faster or slower machine, as necessary, to maintain optimal timing between the transferring of processing data between the machines…” teaches determining and allocating the model to be executed on different computing machines based on processing speed requirements, therefore the processing speed requirements of application instance determines what computing machine is allocated; Para [0005]: “In particular embodiments, a master machine learning (ML) control system/server (e.g., a scheduler machine or master ML control system, or first computing system) establishes access to different types of computing systems configured for different types of primary tasks. Such systems may include, for example, a GPU-based or CPU-based ML training system (e.g., a second computing system)…” teaches that the computing machines can be GPU based)
Ukidave, Wang, and Wesolowski are analogous art because they are directed to machine learning accelerators. 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Wesolowski’s system for distributed training and prediction using elastic resources into Ukidave’s Predictive Scheduling for GPU based Cloud Servers as modified by Wang with a motivation to “…provide for heterogeneous computing for training a machine learning model across different computing systems…” (Wesolowski, Para [0021])

The combination of Ukidave, Wang, and Wesolowski does not appear to explicitly teach:
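provisioning the application instance and the portion of the GPU attached to the application instance, wherein the application instance is implemented using a physical compute instance in a first instance location, wherein the portion of the GPU is implemented using a physical GPU in the second location, and wherein the physical GPU is accessible to the physical compute instance over a network;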

However, Wilt teaches: 
provisioning the application instance and the portion of the GPU attached to the application instance, wherein the application instance is implemented using a physical compute instance in a first instance location, wherein the portion of the GPU is implemented using a physical GPU in the second location, and wherein the physical GPU is accessible to the physical compute instance over a network; (Fig. 14 (shown below) and Para [0052]: “The instance provisioning functionality 130 may provision a virtual compute instance 141B with an attached virtual GPU 151B based on the specified instance type "B" and the specified virtual GPU class "B". The provisioned virtual compute instance 141B may be implemented by the compute virtualization functionality 140 using suitable physical resources such as a physical compute instance 142B, and the provisioned virtual GPU 151B may be implemented by the GPU virtualization functionality 150 using suitable physical resources such as a physical GPU 152B.” teaches provisioning the application instance and virtual GPU (portion of the GPU) attached to the application instance, wherein the application instance is implemented using a physical compute instance and the virtual GPU (portion of the GPU) is implemented using a physical GPU; Fig 14 and Para [0052]: “To implement the virtual compute instance 141B with the attached virtual GPU 151B, a physical compute instance 142B may communicate with a physical GPU 152B, e.g., over a network. The physical GPU 152B may be located in a different computing device than the physical compute instance 142B.” teaches that the physical compute instance is in a first instance location, the GPU is in a second instance location, and that the physical GPU communicates (is accessible) to the physical compute instance over a network)

[Image: media_image2.png (Wilt, Fig. 14)]

Ukidave, Wang, Wesolowski and Wilt are analogous art because they are directed to accelerators using GPUs. 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Wilt’s placement optimization for virtualized graphics processing into Ukidave’s Predictive Scheduling for GPU based Cloud Servers as modified by Wang and Wesolowski with a motivation to “allow a single physical computing device to be shared among multiple users by providing each user with one or more virtual machines hosted by the single physical computing device” (Wilt, Para [0001]). 

As per claim 2, the combination of Ukidave, Wang, Wesolowski and Wilt as shown above teaches The method of claim 1, 
Wang further teaches: 
prior to provisioning the portion of the GPU, evaluating the machine learning model to determine the arithmetic precision of the machine learning model. (Fig. 3 and Page 4, Section 3.3: “As shown in Fig. 3, the data path of ELNA accelerator is organized into lanes. For example in Fig.3 (a), after the nX16-byte data arrives in the input registers, the MSB half and LSB half goes into the different lanes of PE. Depending on the precision mode decided by the control bits, the PEs will choose to separate the final result into two or activate the bridge logics to generate the final one result. In this way, the data path can offer either higher computation throughput or word-level precision as decided by the synthesizer and compiler.” teaches that the ELNA accelerator evaluates the convolutional neural network (machine learning model) to determine the precision mode (portion of the accelerator) needed to execute inference for the neural network)

Ukidave, Wang, Wesolowski and Wilt are analogous art because they are directed to accelerators using GPUs. 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Wang’s multi-mode neural network accelerator into Ukidave’s Predictive Scheduling for GPU based Cloud Servers as modified by Wesolowski and Wilt with a motivation to “…support CNNs of various topologies in this data-driven architecture.” (Wang, Page 4, Section 3.3)

As per claim 4, the combination of Ukidave, Wang, Wesolowski and Wilt as shown above teaches The method of claim 1, 
Wilt further teaches: 
selecting a GPU location for a physical accelerator or an application instance location based at least in part on one or more placement criteria, wherein the multi-tenant web services provider comprises a plurality of instance locations for physical compute instances and a plurality of GPU locations for physical accelerators. (Para [0146]: “Based on one or more of the placement criteria 1425, a particular GPU location 1450A may be selected for a physical GPU 152A. Based on one or more of the placement criteria 1425, a particular instance location 1440A may be selected for a physical compute instance 141B.” teaches selecting a GPU location for a physical GPU (accelerator) and selecting an application instance location based on placement criteria; Para [0142]: “FIG. 14 illustrates an example system environment for placement optimization for virtualized graphics processing, including multiple instance locations and multiple GPU locations in a provider network, according to one embodiment. The provider network 100 may include a plurality of instance locations 1440A-1440N for a plurality of physical compute instances 142A-142N. The instance locations 1440A-1440N may represent a plurality of racks, a plurality of data centers, and/or a plurality of geographical regions.” teaches that the provider network comprises a plurality of instance locations for physical compute instances; Para [0143]: “The provider network 100 may also include a plurality of GPU locations 1450A-1450N for a plurality of physical GPUs 152A-152N (e.g., for graphics servers that include and provide access to the physical GPUs). 
The GPU locations 1450A-1450N may represent a plurality of racks, a plurality of data centers, and/or a plurality of geographical regions.” teaches that the provider network comprises a plurality of GPU locations for physical GPUs (accelerators); Fig 1 and Para [0041]: “The provider network 100 may implement or provide a multi-tenant environment such that multiple clients (e.g., using client devices 180A-180N) may access or use a particular resource in a substantially simultaneous manner.” teaches that the provider network is a multi-tenant web services provider)


Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Wilt’s placement optimization for virtualized graphics processing into Ukidave’s Predictive Scheduling for GPU based Cloud Servers as modified by Wang and Wesolowski with a motivation to “allow a single physical computing device to be shared among multiple users by providing each user with one or more virtual machines hosted by the single physical computing device” (Wilt, Para [0001]).

As per claim 5, Ukidave teaches: A computer-implemented method, comprising: receiving, in a multi-tenant web services provider, an application instance configuration, an application of the application instance to utilize a portion of an attached accelerator’s compute capacity during execution of a machine learning model (Page 356: “Instead, Mystic initiates two short profiling runs for each incoming application to obtain metrics for two randomly selected CoIs (out of 6 identified CoIs). The profiler run needs be long enough to profile each distinct kernel in the application at least once… The short-profiles (∼5 seconds) for incoming applications are collected and stored in the Profile Information Table (PIT) in form of sparse rows, as metrics for only 2 random CoIs out of 6 are captured. The PIT is indexed by the application process ID (pid).” and Page 357: “The CF predictor takes the PIT and TRM as inputs. When a new application A0 is enqueued for execution on the system, the predictor identifies A0’s profile information by searching the PIT using the process-id (pid) of the application. The PIT returns a sparse vector v with the metrics obtained from the short profiles collected in Stage-1” teaches receiving Application A0’s profile information (application instance configuration) and A0 (application instance); Page 354: “We present Mystic, a framework enabling interference-aware scheduling for GPU workloads. Our work targets servers and cloud schedulers by utilizing machine learning algorithms. Mystic utilizes the concurrency features of modern GPUs exposed by programming frameworks such as CUDA 7.0.” teaches that Mystic is a multi-tenant web service provider because it targets servers and cloud schedulers; Page 354 – 355: “Requests from various frontend user applications can be aggregated into backend threads, which can be handled as a single GPU context (see Figure 2b). 
GPU components of all frontend applications co-executing on the GPU are assigned to separate backend threads. The backend threads map to the same device on a per-GPU context basis. This design enables GPU operations from different applications to be executed concurrently, which enables a single GPU to be shared in both space and time [13, 32].” teaches that an application is executed on a shared GPU context (portion of a GPU (accelerator)); Page 358: “We select 55 distinct workloads… In addition, we leverage tuned CUDA libraries such as cuDNN (deep learning libraries)” teaches that an application can be a machine learning model from the cuDNN library)

[Image: media_image1.png]


loading the machine learning model onto the portion of the accelerator; (Page 357: “When a new application A0 is enqueued for execution on the system, the predictor identifies A0’s profile information by searching the PIT using the process-id (pid) of the application.” teaches enqueuing (loading) Application A0; Page 354 – 355: “Requests from various frontend user applications can be aggregated into backend threads, which can be handled as a single GPU context (see Figure 2b). GPU components of all frontend applications co-executing on the GPU are assigned to separate backend threads. The backend threads map to the same device on a per-GPU context basis. This design enables GPU operations from different applications to be executed concurrently, which enables a single GPU to be shared in both space and time [13, 32].” teaches that an application is executed on a shared GPU context (portion of a GPU (accelerator)); Page 358: “We select 55 distinct workloads… In addition, we leverage tuned CUDA libraries such as cuDNN (deep learning libraries)” teaches that an application can be a machine learning model from the cuDNN library)


Ukidave does not appear to explicitly teach: 
and the application instance configuration indicating both an arithmetic precision… to be used for hardware acceleration of machine learning model inference; 
and the application instance configuration specifying… and a processing speed to be used for hardware acceleration of machine learning model inference;
the arithmetic precision being one of a plurality of arithmetic precision capabilities provided by the multi-tenant web services provider for hardware acceleration of machine learning model inference,
and the processing speed being one of a plurality of processing speed capabilities provided by the multi-tenant web services provider for hardware acceleration of machine learning model inference;
determining a portion of an accelerator’s compute capacity to provision to the application based at least in part on both the arithmetic precision… indicated by the application instance configuration;

provisioning the application instance and the portion of the accelerator attached to the application instance, wherein the application instance is implemented using a physical compute instance in a first instance location, wherein the portion of the accelerator is implemented using a physical accelerator in the second location, and wherein the physical accelerator is accessible to the physical compute instance over a network;
and performing inference using the loaded machine learning model of the application using the portion of the accelerator on the attached accelerator.

However, Wang teaches: 
and the application instance configuration indicating both an arithmetic precision… to be used for hardware acceleration of machine learning model inference; (Page 3, Section 3.1: “Afterwards, ELNA manager selects the CNN hyperparameters and the hardware operating mode, as the final configuration. When the configuration is decided, ELNA compiler will map the final model to the accelerator at the correct mode, and generate the on-line control bitstreams that will be used by the ELNA scheduler to direct the accelerator to execute the network inference.” teaches selecting a configuration for the convolutional neural network (application instance); Page 4, Section 3.3: “Second, the accelerator can operate in different precision/throughput modes. For example in Fig. 3, each PE can work in unison-mode as a single-issue 16-bit PE or in separate mode as double-issue 8-bit PE to suit the precision-mode of the reshaped CNN model. As shown in Fig. 3, the data path of ELNA accelerator is organized into lanes. For example in Fig.3 (a), after the nX16-byte data arrives in the input registers, the MSB half and LSB half goes into the different lanes of PE. Depending on the precision mode decided by the control bits, the PEs will choose to separate the final result into two or activate the bridge logics to generate the final one result. In this way, the data path can offer either higher computation throughput or word-level precision as decided by the synthesizer and compiler. For example in Fig. 3(a), in 8-bit operating mode, each PE processes two continuous pixels in the input feature maps in the {x, y} and {x+1, y+1} positions of all channels and then respectively sum them up into two pixels in the output channel of next layer. In contrast, in 16-bit operating mode, each PE also receives 16-bit input but only generate one pixel point. 
The data mapping and data-level parallelization schemes also become slightly different, and the instructions fed into the scheduler change accordingly.” teaches that the precision of the neural network (machine learning model) determines the mode (portion) of the accelerator to use in executing the neural network)

the arithmetic precision being one of a plurality of arithmetic precision capabilities provided by the multi-tenant web services provider for hardware acceleration of machine learning model inference, (Page 4, Section 3.3: “Second, the accelerator can operate in different precision/throughput modes. For example in Fig. 3, each PE can work in unison-mode as a single-issue 16-bit PE or in separate mode as double-issue 8-bit PE to suit the precision-mode of the reshaped CNN model.” teaches that both 16-bit precision and 8-bit precision are supported)

determining a portion of an accelerator’s compute capacity to provision to the application based at least in part on both the arithmetic precision… indicated by the application instance configuration; (Page 3, Section 3.1: “Afterwards, ELNA manager selects the CNN hyperparameters and the hardware operating mode, as the final configuration. When the configuration is decided, ELNA compiler will map the final model to the accelerator at the correct mode, and generate the on-line control bitstreams that will be used by the ELNA scheduler to direct the accelerator to execute the network inference.” teaches selecting a configuration for the convolutional neural network (application instance); Page 4, Section 3.3: “Second, the accelerator can operate in different precision/throughput modes. For example in Fig. 3, each PE can work in unison-mode as a single-issue 16-bit PE or in separate mode as double-issue 8-bit PE to suit the precision-mode of the reshaped CNN model. As shown in Fig. 3, the data path of ELNA accelerator is organized into lanes. For example in Fig.3 (a), after the nX16-byte data arrives in the input registers, the MSB half and LSB half goes into the different lanes of PE. Depending on the precision mode decided by the control bits, the PEs will choose to separate the final result into two or activate the bridge logics to generate the final one result. In this way, the data path can offer either higher computation throughput or word-level precision as decided by the synthesizer and compiler. For example in Fig. 3(a), in 8-bit operating mode, each PE processes two continuous pixels in the input feature maps in the {x, y} and {x+1, y+1} positions of all channels and then respectively sum them up into two pixels in the output channel of next layer. In contrast, in 16-bit operating mode, each PE also receives 16-bit input but only generate one pixel point. 
The data mapping and data-level parallelization schemes also become slightly different, and the instructions fed into the scheduler change accordingly.” teaches that the precision of the neural network (machine learning model) determines the mode (portion) of the accelerator to use in executing (perform machine learning inference) the neural network)

and performing inference using the loaded machine learning model of the application using the portion of the accelerator on the attached accelerator. (Page 3, Section 3.1: “Afterwards, ELNA manager selects the CNN hyperparameters and the hardware operating mode, as the final configuration. When the configuration is decided, ELNA compiler will map the final model to the accelerator at the correct mode, and generate the on-line control bitstreams that will be used by the ELNA scheduler to direct the accelerator to execute the network inference.” teaches performing inference using the convolutional neural network (machine learning model); Page 4, Section 3.3: “Second, the accelerator can operate in different precision/throughput modes. For example in Fig. 3, each PE can work in unison-mode as a single-issue 16-bit PE or in separate mode as double-issue 8-bit PE to suit the precision-mode of the reshaped CNN model.” teaches that inference is performed with a mode (portion) of the accelerator, depending on the precision of the neural network)

Ukidave and Wang are analogous art because they are directed to neural network accelerators. 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Wang’s multi-mode neural network accelerator into Ukidave’s Predictive Scheduling for GPU based Cloud Servers with a motivation to “…support CNNs of various topologies in this data-driven architecture.” (Wang, Page 4, Section 3.3)
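For illustration only, the precision/throughput trade-off in the cited passage of Wang, Section 3.3 might be sketched as follows. The two modes (single-issue 16-bit and double-issue 8-bit) come from the quoted text. The names `PEMode`, `select_mode`, and `results_per_cycle`, and the number of processing elements, are hypothetical and do not appear in Wang or in the claims.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PEMode:
    name: str
    bits: int           # operand precision in bits
    issues_per_pe: int  # results each processing element (PE) produces per cycle


# The two modes named in the quoted passage; the dictionary layout is an assumption.
MODES = {
    16: PEMode("unison", 16, 1),   # single-issue 16-bit PE
    8: PEMode("separate", 8, 2),   # double-issue 8-bit PE
}


def select_mode(model_precision_bits: int) -> PEMode:
    """Return the narrowest mode whose width covers the model's precision."""
    for bits in sorted(MODES):
        if model_precision_bits <= bits:
            return MODES[bits]
    raise ValueError(f"no mode supports {model_precision_bits}-bit precision")


def results_per_cycle(mode: PEMode, num_pes: int) -> int:
    """Throughput of a hypothetical array of num_pes PEs in the given mode."""
    return mode.issues_per_pe * num_pes
```

Under these assumptions, an 8-bit model on the same array yields twice the per-cycle results of a 16-bit model, which is the sense in which the model's precision determines the accelerator mode.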

The combination of Ukidave and Wang does not appear to explicitly teach: 
and the application instance configuration specifying… and a processing speed to be used for hardware acceleration of machine learning model inference;
and the processing speed being one of a plurality of processing speed capabilities provided by the multi-tenant web services provider for hardware acceleration of machine learning model inference;
determining a portion of an accelerator’s compute capacity to provision to the application based at least in part on both the… processing speed indicated by the application instance configuration;
provisioning the application instance and the portion of the accelerator attached to the application instance, wherein the application instance is implemented using a physical compute instance in a first instance location, wherein the portion of the accelerator is implemented using a physical accelerator in the second location, and wherein the physical accelerator is accessible to the physical compute instance over a network;

However, Wesolowski teaches: 
and the application instance configuration specifying… and a processing speed to be used for hardware acceleration of machine learning model inference; (Para [0026]: “In particular embodiments, the scheduler machine may distribute execution of a single machine learning model across multiple different computing machines, so that each computing machine trains a different portion (e.g., graph-segment) of the ML model and the different computing machines exchange processing data, as needed. In this case, the scheduler machine may monitor the performance of each computing machine, and if necessary, transfer execution of a portion of the machine learning model from one machine to a faster or slower machine, as necessary, to maintain optimal timing between the transferring of processing data between the machines…” and Para [0068]: “Some neural network model may require faster machines, or more memory, and each may generally require a different profile machine.” teaches that the processing speed requirement of the specific model (application instance) for machine learning execution (inference) determines what computing machine is used; Para [0005]: “In particular embodiments, a master machine learning (ML) control system/server (e.g., a scheduler machine or master ML control system, or first computing system) establishes access to different types of computing systems configured for different types of primary tasks. Such systems may include, for example, a GPU-based or CPU-based ML training system (e.g., a second computing system)…” teaches that the computing machines can be GPU based)

and the processing speed being one of a plurality of processing speed capabilities provided by the multi-tenant web services provider for hardware acceleration of machine learning model inference; (Para [0061]: “For example, ML Model 1 is submitted to training system 11, which is illustratively shown as having one GPU (e.g., daughter board). More specifically, training system 11 has one NVidia Corporation, Kepler K40 GPU-based board, which has a total of 2880 single instruction multiple data (SIMD) cores (e.g., processing cores). As model complexity increases, it may be necessary to add more GPU boards to a machine or to increase the number of machines in a training system… If higher power is needed, then, additional GPUs or more powerful GPUs may be used. For example, ML Model 4 is submitted to a training system 17 consisting of two computing systems 17A and 17B, each system having 8 NVidia Maxwell M40 GPUs, where each M40 GPU has a 3072 SIMD cores.” teaches that the accelerator service has servers that contain multiple GPUs with multiple processing speeds)

determining a portion of an accelerator’s compute capacity to provision to the application based at least in part on both the… processing speed indicated by the application instance configuration; (Para [0026]: “In particular embodiments, the scheduler machine may distribute execution of a single machine learning model across multiple different computing machines, so that each computing machine trains a different portion (e.g., graph-segment) of the ML model and the different computing machines exchange processing data, as needed. In this case, the scheduler machine may monitor the performance of each computing machine, and if necessary, transfer execution of a portion of the machine learning model from one machine to a faster or slower machine, as necessary, to maintain optimal timing between the transferring of processing data between the machines…” teaches determining and allocating the model to be executed on different computing machines based on processing speed requirements, therefore the processing speed requirements of application instance determines what computing machine is allocated; Para [0005]: “In particular embodiments, a master machine learning (ML) control system/server (e.g., a scheduler machine or master ML control system, or first computing system) establishes access to different types of computing systems configured for different types of primary tasks. Such systems may include, for example, a GPU-based or CPU-based ML training system (e.g., a second computing system)…” teaches that the computing machines can be GPU based)

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Wesolowski’s system for distributed training and prediction using elastic resources into Ukidave’s Predictive Scheduling for GPU based Cloud Servers as modified by Wang with a motivation to “…provide for heterogeneous computing for training a machine learning model across different computing systems…” (Wesolowski, Para [0021])
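As a rough sketch of the speed-based allocation attributed to Wesolowski above, the following selects the least powerful machine that still meets a model's requirement. The SIMD core counts are those quoted from Para [0061]. Using core count as a stand-in for processing speed, and the names `Machine` and `pick_machine`, are assumptions made for illustration and are not drawn from the reference.

```python
from typing import NamedTuple, Optional, Sequence


class Machine(NamedTuple):
    name: str
    simd_cores: int  # assumed proxy for processing speed


def pick_machine(machines: Sequence[Machine], required_cores: int) -> Optional[Machine]:
    """Return the smallest machine meeting the requirement, or None if none does."""
    candidates = [m for m in machines if m.simd_cores >= required_cores]
    if not candidates:
        return None
    return min(candidates, key=lambda m: m.simd_cores)


# Core counts as quoted from Wesolowski, Para [0061].
POOL = [Machine("K40", 2880), Machine("M40", 3072)]
```

With this pool, a model needing 3000 cores would be placed on the M40, while a model needing 1000 cores would be placed on the K40.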

The combination of Ukidave, Wang, and Wesolowski does not appear to explicitly teach:
provisioning the application instance and the portion of the accelerator attached to the application instance, wherein the application instance is implemented using a physical compute instance in a first instance location, wherein the portion of the accelerator is implemented using a physical accelerator in the second location, and wherein the physical accelerator is accessible to the physical compute instance over a network;

However, Wilt teaches: 
provisioning the application instance and the portion of the accelerator attached to the application instance, wherein the application instance is implemented using a physical compute instance in a first instance location, wherein the portion of the accelerator is implemented using a physical accelerator in the second location, and wherein the physical accelerator is accessible to the physical compute instance over a network; (Fig. 14 (shown below) and Para [0052]: “The instance provisioning functionality 130 may provision a virtual compute instance 141B with an attached virtual GPU 151B based on the specified instance type "B" and the specified virtual GPU class "B". The provisioned virtual compute instance 141B may be implemented by the compute virtualization functionality 140 using suitable physical resources such as a physical compute instance 142B, and the provisioned virtual GPU 151B may be implemented by the GPU virtualization functionality 150 using suitable physical resources such as a physical GPU 152B.” teaches provisioning the application instance and virtual GPU (portion of the physical GPU (accelerator)) attached to the application instance, wherein the application instance is implemented using a physical compute instance and the virtual GPU (portion of the physical GPU (accelerator)) is implemented using a physical GPU (accelerator); Fig 14 and Para [0052]: “To implement the virtual compute instance 141B with the attached virtual GPU 151B, a physical compute instance 142B may communicate with a physical GPU 152B, e.g., over a network. The physical GPU 152B may be located in a different computing device than the physical compute instance 142B.” teaches that the physical compute instance is in a first instance location, the physical GPU (accelerator) is in a second instance location, and that the physical GPU (accelerator) communicates with (is accessible to) the physical compute instance over a network)

[Image: Wilt, Fig. 14]

Ukidave, Wang, Wesolowski, and Wilt are analogous art because they are directed to accelerators using GPUs. 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Wilt’s placement optimization for virtualized graphics processing into Ukidave’s Predictive Scheduling for GPU based Cloud Servers as modified by Wang and Wesolowski with a motivation to “allow a single physical computing device to be shared among multiple users by providing each user with one or more virtual machines hosted by the single physical computing device” (Wilt, Para [0001]).

As per claim 6, the combination of Ukidave, Wang, Wesolowski, and Wilt as shown above teaches The method of claim 5, 
Wang further teaches: 
wherein the machine learning model includes a description of a computation graph for inference and weights obtained from training. (Fig. 1 and Page 2, Section 2: “A typical CNN consists of multiple interconnected neural layers that process the 3D feature data shown in Fig.1 [4].” teaches that the convolutional neural network (machine learning model) includes a computation graph for inference and parameters (weights) obtained from training)

[Image: Wang, Fig. 1]

Ukidave, Wang, Wesolowski, and Wilt are analogous art because they are directed to accelerators using GPUs. 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Wang’s multi-mode neural network accelerator into Ukidave’s Predictive Scheduling for GPU based Cloud Servers as modified by Wilt and Wesolowski with a motivation to “…support CNNs of various topologies in this data-driven architecture.” (Wang, Page 4, Section 3.3)

As per claim 7, the combination of Ukidave, Wang, Wesolowski, and Wilt as shown above teaches The method of claim 5, 
Wang further teaches: 
prior to provisioning the portion of the accelerator, evaluating the machine learning model to determine the arithmetic precision of the machine learning model. (Fig. 3 and Page 4, Section 3.3: “As shown in Fig. 3, the data path of ELNA accelerator is organized into lanes. For example in Fig.3 (a), after the nX16-byte data arrives in the input registers, the MSB half and LSB half goes into the different lanes of PE. Depending on the precision mode decided by the control bits, the PEs will choose to separate the final result into two or activate the bridge logics to generate the final one result. In this way, the data path can offer either higher computation throughput or word-level precision as decided by the synthesizer and compiler.” teaches that the ELNA accelerator evaluates the convolutional neural network (machine learning model) to determine the precision mode (portion of the accelerator) needed to execute inference for the neural network)

Ukidave, Wang, Wesolowski, and Wilt are analogous art because they are directed to accelerators using GPUs. 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Wang’s multi-mode neural network accelerator into Ukidave’s Predictive Scheduling for GPU based Cloud Servers as modified by Wilt and Wesolowski with a motivation to “…support CNNs of various topologies in this data-driven architecture.” (Wang, Page 4, Section 3.3)


As per claim 9, the combination of Ukidave, Wang, Wesolowski, and Wilt as shown above teaches The method of claim 5, 
Ukidave further teaches: 
in the application, aggregating calls to the portion of the accelerator and sending the aggregated calls as a batch. (Page 354: “Requests from various frontend user applications can be aggregated into backend threads, which can be handled as a single GPU context” teaches aggregating requests into backend threads (batches) and then sending these backend threads to a shared GPU context (portion of a GPU (accelerator)))
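The aggregation recited in claim 9 can be pictured with a minimal, generic batching sketch. It is not Ukidave's backend-thread mechanism. The class `CallBatcher`, its `send` callback, and the fixed batch size are hypothetical and used only to show calls being collected and sent together.

```python
from typing import Callable, List


class CallBatcher:
    """Collects individual accelerator calls and forwards them as one batch."""

    def __init__(self, send: Callable[[List[str]], None], batch_size: int) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self._send = send              # hypothetical transport to the accelerator
        self._batch_size = batch_size
        self._pending: List[str] = []

    def submit(self, call: str) -> None:
        """Queue a call; send the batch once enough calls have accumulated."""
        self._pending.append(call)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Send any queued calls as a single batch."""
        if self._pending:
            batch, self._pending = self._pending, []
            self._send(batch)
```

In this sketch, two submitted calls with a batch size of 2 result in a single send carrying both calls rather than two separate sends.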

As per claim 10, the combination of Ukidave, Wang, Wesolowski, and Wilt as shown above teaches The method of claim 5, 
Wilt further teaches: 
prior to attaching the accelerator, selecting the accelerator based on computational capability of the accelerator. (Para [0035]: “The virtual GPU may be selected from a set of virtual GPUs (e.g., belonging to virtual GPU classes) having different capabilities for graphics processing.” and Para [0039]: “In one embodiment, the virtual GPU classes may represent subdivisions of graphics processing capabilities of a physical GPU, such as a full GPU, a half GPU, a quarter GPU, and so on.” teaches selecting the GPU (accelerator) based on the GPU’s processing capability)

Ukidave, Wang, Wesolowski, and Wilt are analogous art because they are directed to accelerators using GPUs. 
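Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Wilt’s placement optimization for virtualized graphics processing into Ukidave’s Predictive Scheduling for GPU based Cloud Servers as modified by Wang and Wesolowski with a motivation to “allow a single physical computing device to be shared among multiple users by providing each user with one or more virtual machines hosted by the single physical computing device” (Wilt, Para [0001]).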

As per claim 11, the combination of Ukidave, Wang, Wesolowski, and Wilt as shown above teaches The method of claim 5, 
Wilt further teaches: 
selecting an accelerator location for a physical accelerator or an application instance location based at least in part on one or more placement criteria, wherein the multi-tenant web services provider comprises a plurality of instance locations for physical compute instances and a plurality of accelerator locations for physical accelerators. (Para [0146]: “Based on one or more of the placement criteria 1425, a particular GPU location 1450A may be selected for a physical GPU 152A. Based on one or more of the placement criteria 1425, a particular instance location 1440A may be selected for a physical compute instance 141B.” teaches selecting a GPU (accelerator) location for a physical GPU (accelerator) and selecting an application instance location based on placement criteria; Para [0142]: “FIG. 14 illustrates an example system environment for placement optimization for virtualized graphics processing, including multiple instance locations and multiple GPU locations in a provider network, according to one embodiment. The provider network 100 may include a plurality of instance locations 1440A-1440N for a plurality of physical compute instances 142A-142N. The instance locations 1440A-1440N may represent a plurality of racks, a plurality of data centers, and/or a plurality of geographical regions.” teaches that the provider network comprises a plurality of instance locations for physical compute instances; Para [0143]: “The provider network 100 may also include a plurality of GPU locations 1450A-1450N for a plurality of physical GPUs 152A-152N (e.g., for graphics servers that include and provide access to the physical GPUs). 
The GPU locations 1450A-1450N may represent a plurality of racks, a plurality of data centers, and/or a plurality of geographical regions.” teaches that the provider network comprises a plurality of GPU (accelerator) locations for physical GPUs (accelerators); Fig 1 and Para [0041]: “The provider network 100 may implement or provide a multi-tenant environment such that multiple clients (e.g., using client devices 180A-180N) may access or use a particular resource in a substantially simultaneous manner.” teaches that the provider network is a multi-tenant web services provider)

Ukidave, Wang, Wesolowski, and Wilt are analogous art because they are directed to accelerators using GPUs. 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Wilt’s placement optimization for virtualized graphics processing into Ukidave’s Predictive Scheduling for GPU based Cloud Servers as modified by Wang and Wesolowski with a motivation to “allow a single physical computing device to be shared among multiple users by providing each user with one or more virtual machines hosted by the single physical computing device” (Wilt, Para [0001]).

As per claim 12, the combination of Ukidave, Wang, Wesolowski, and Wilt as shown above teaches The method of claim 11, 
Wilt further teaches: 
wherein the one or more placement criteria comprise improvement of one or more metrics. (Para [0147]: “The one or more placement criteria 1425 may include or be associated with optimization (e.g., improvement) of metrics for performance (e.g., to maximize performance), resource usage (e.g., to minimize resource usage), cost (e.g., to minimize cost or fit resource costs within a client-specified budget), energy usage (e.g., to minimize energy usage or prioritize "green" energy), network locality (e.g., to minimize networking proximity between two or more resources), and/or any other suitable metrics. Performance metrics and cost metrics used as placement criteria may often be associated with the use of the physical GPU by the physical compute instance.” teaches that the placement criteria comprise improvement of one or more metrics)

Ukidave, Wang, Wesolowski, and Wilt are analogous art because they are directed to accelerators using GPUs. 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Wilt’s placement optimization for virtualized graphics processing into Ukidave’s Predictive Scheduling for GPU based Cloud Servers as modified by Wang and Wesolowski with a motivation to “allow a single physical computing device to be shared among multiple users by providing each user with one or more virtual machines hosted by the single physical computing device” (Wilt, Para [0001]).

As per claim 13, the combination of Ukidave, Wang, Wesolowski, and Wilt as shown above teaches The method of claim 11, 
Wilt further teaches: 
wherein the one or more placement criteria are based at least in part on a performance metric associated with use of the physical accelerator by the physical compute instance. (Para [0147]: “The one or more placement criteria 1425 may include or be associated with optimization (e.g., improvement) of metrics for performance (e.g., to maximize performance), resource usage (e.g., to minimize resource usage), cost (e.g., to minimize cost or fit resource costs within a client-specified budget), energy usage (e.g., to minimize energy usage or prioritize "green" energy), network locality (e.g., to minimize networking proximity between two or more resources), and/or any other suitable metrics. Performance metrics and cost metrics used as placement criteria may often be associated with the use of the physical GPU by the physical compute instance.” teaches that the placement criteria are based on performance metrics associated with the use of the physical GPU (accelerator) by the physical compute instance)

Ukidave, Wang, Wesolowski, and Wilt are analogous art because they are directed to accelerators using GPUs. 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Wilt’s placement optimization for virtualized graphics processing into Ukidave’s Predictive Scheduling for GPU based Cloud Servers as modified by Wang and Wesolowski with a motivation to “allow a single physical computing device to be shared among multiple users by providing each user with one or more virtual machines hosted by the single physical computing device” (Wilt, Para [0001]).

As per claim 14, the combination of Ukidave, Wang, Wesolowski, and Wilt as shown above teaches The method of claim 11, 
Wilt further teaches: 
wherein the one or more placement criteria are based at least in part on an energy metric associated with use of the physical accelerator by the physical compute instance. (Para [0147]: “The one or more placement criteria 1425 may include or be associated with optimization (e.g., improvement) of metrics for performance (e.g., to maximize performance), resource usage (e.g., to minimize resource usage), cost (e.g., to minimize cost or fit resource costs within a client-specified budget), energy usage (e.g., to minimize energy usage or prioritize "green" energy), network locality (e.g., to minimize networking proximity between two or more resources), and/or any other suitable metrics. Performance metrics and cost metrics used as placement criteria may often be associated with the use of the physical GPU by the physical compute instance.” teaches that the placement criteria are based on an energy usage metric associated with the use of the physical GPU (accelerator) by the physical compute instance)

Ukidave, Wang, Wesolowski, and Wilt are analogous art because they are directed to accelerators using GPUs. 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Wilt’s placement optimization for virtualized graphics processing into Ukidave’s Predictive Scheduling for GPU based Cloud Servers as modified by Wang and Wesolowski with a motivation to “allow a single physical computing device to be shared among multiple users by providing each user with one or more virtual machines hosted by the single physical computing device” (Wilt, Para [0001]).

As per claim 15, the combination of Ukidave, Wang, Wesolowski, and Wilt as shown above teaches The method of claim 11, 
Wilt further teaches: 
wherein the accelerator location or the application instance location is selected based at least in part on network locality. (Para [0149]: “Placement optimization for network locality may attempt to group multiple resources (e.g., one or more physical compute instances and one or more physical GPUs) based (at least in part) on proximity within a network. Network locality may refer to one or more locations, connections, associations, or zones in a network to which a resource belongs.” and “Instance locations and/or GPU locations may be selected based (at least in part) on network locality.” teaches that the GPU (accelerator) location or the application instance location is selected based on network locality) 

Ukidave, Wang, Wesolowski, and Wilt are analogous art because they are directed to accelerators using GPUs. 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Wilt’s placement optimization for virtualized graphics processing into Ukidave’s Predictive Scheduling for GPU based Cloud Servers as modified by Wang and Wesolowski with a motivation to “allow a single physical computing device to be shared among multiple users by providing each user with one or more virtual machines hosted by the single physical computing device” (Wilt, Para [0001]).


As per claim 16, the combination of Ukidave, Wang, Wesolowski, and Wilt as shown above teaches The method of claim 11, 
Wilt further teaches: 
wherein the accelerator location is selected based at least in part on network latency between the physical accelerator and a client device. (Para [0147]: “As another example, a GPU location 1450A in a data center nearest the client device 180A may be selected to minimize latency between the physical GPU and the client device, where the proximity of the data center to the client device is measured based on anticipated or historical latency and/or on geographical proximity.” teaches that the GPU (accelerator) location is selected based on network latency between the physical GPU (accelerator) and a client device)


Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Wilt’s placement optimization for virtualized graphics processing into Ukidave’s Predictive Scheduling for GPU based Cloud Servers as modified by Wang and Wesolowski with a motivation to “allow a single physical computing device to be shared among multiple users by providing each user with one or more virtual machines hosted by the single physical computing device” (Wilt, Para [0001]).
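By way of illustration, the latency-based selection cited from Wilt, Para [0147] for claim 16 could be sketched as choosing the GPU location with the lowest anticipated or historical latency to the client device. The function name `select_gpu_location` and the latency figures in the usage note are hypothetical. The location labels reuse Wilt's reference numerals as identifiers only.

```python
from typing import Mapping


def select_gpu_location(latency_ms: Mapping[str, float]) -> str:
    """Return the location whose measured latency to the client is lowest."""
    if not latency_ms:
        raise ValueError("at least one candidate location is required")
    return min(latency_ms, key=lambda location: latency_ms[location])
```

For example, given assumed latencies of 12.0 ms for location "1450A" and 30.5 ms for location "1450B", the sketch selects "1450A".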

As per claim 17, Ukidave teaches: A system, comprising: storage to store an application, the application including a machine learning model; and (Page 356: “Mystic is implemented as a 3-stage control layer, hosted in the head node of a GPU cloud server or cluster. The framework is capable of predicting the interference between an incoming application and currently running applications on the server. Mystic guides the scheduler to optimize coexecution of applications on a GPU using the predicted interference.” teaches a computer based implementation; Page 358: “We select 55 distinct workloads… In addition, we leverage tuned CUDA libraries such as cuDNN (deep learning libraries)” teaches that an application can be a machine learning model from the cuDNN library)

receive, in a multi-tenant web services provider, an application instance configuration, an application of the application instance to utilize a portion of an attached accelerator’s compute capacity during execution of a machine learning model (Page 356: “Instead, Mystic initiates two short profiling runs for each incoming application to obtain metrics for two randomly selected CoIs (out of 6 identified CoIs). The profiler run needs be long enough to profile each distinct kernel in the application at least once… The short-profiles (∼5 seconds) for incoming applications are collected and stored in the Profile Information Table (PIT) in form of sparse rows, as metrics for only 2 random CoIs out of 6 are captured. The PIT is indexed by the application process ID (pid).” and Page 357: “The CF predictor takes the PIT and TRM as inputs. When a new application A0 is enqueued for execution on the system, the predictor identifies A0’s profile information by searching the PIT using the process-id (pid) of the application. The PIT returns a sparse vector v with the metrics obtained from the short profiles collected in Stage-1” teaches receiving Application A0’s profile information (application instance configuration) and A0 (application instance); Page 354: “We present Mystic, a framework enabling interference-aware scheduling for GPU workloads. Our work targets servers and cloud schedulers by utilizing machine learning algorithms. Mystic utilizes the concurrency features of modern GPUs exposed by programming frameworks such as CUDA 7.0.” teaches that Mystic is a multi-tenant web service provider because it targets servers and cloud schedulers; Page 354 – 355: “Requests from various frontend user applications can be aggregated into backend threads, which can be handled as a single GPU context (see Figure 2b). GPU components of all frontend applications co-executing on the GPU are assigned to separate backend threads. 
The backend threads map to the same device on a per-GPU context basis. This design enables GPU operations from different applications to be executed concurrently, which enables a single GPU to be shared in both space and time [13, 32].” teaches that an application is executed on a shared GPU context (portion of a GPU (accelerator)); Page 358: “We select 55 distinct workloads… In addition, we leverage tuned CUDA libraries such as cuDNN (deep learning libraries)” teaches that an application can be a machine learning model from the cuDNN library)

[Image omitted: media_image1.png]


load the machine learning model onto the portion of the accelerator; (Page 357: “When a new application A0 is enqueued for execution on the system, the predictor identifies A0’s profile information by searching the PIT using the process-id (pid) of the application.” teaches enqueuing (loading) Application A0; Page 354 – 355: “Requests from various frontend user applications can be aggregated into backend threads, which can be handled as a single GPU context (see Figure 2b). GPU components of all frontend applications co-executing on the GPU are assigned to separate backend threads. The backend threads map to the same device on a per-GPU context basis. This design enables GPU operations from different applications to be executed concurrently, which enables a single GPU to be shared in both space and time [13, 32].” teaches that an application is executed on a shared GPU context (portion of a GPU (accelerator)); Page 358: “We select 55 distinct workloads… In addition, we leverage tuned CUDA libraries such as cuDNN (deep learning libraries)” teaches that an application can be a machine learning model from the cuDNN library)
 
Ukidave does not appear to explicitly teach: 

and the application instance configuration indicating both an arithmetic precision… to be used in determining the portion of the accelerator to provision for hardware acceleration of machine learning model inference, 
and the application instance configuration indicating… and a processing speed to be used in determining the portion of the accelerator to provision for hardware acceleration of machine learning model inference,
the arithmetic precision being one of a plurality of arithmetic precision capabilities that the elastic inference service is configured to provide for hardware acceleration of machine learning model inference,
and the processing speed being one of a plurality of processing speed capabilities that the elastic inference service is configured to provide for hardware acceleration of machine learning model inference;
determine a portion of an accelerator’s compute capacity to provision to the application based at least in part on the arithmetic precision… indicated by the application instance configuration;
determine a portion of an accelerator’s compute capacity to provision to the application based at least in part on the… processing speed indicated by the application instance configuration;
provision the application instance and the portion of the accelerator attached to the application instance, wherein the application instance is implemented using a physical compute instance in a first instance location, wherein the portion of the accelerator is implemented using a physical accelerator in the second location, and wherein the physical accelerator is accessible to the physical compute instance;


However, Wang teaches: 
and the application instance configuration indicating both an arithmetic precision… to be used in determining the portion of the accelerator to provision for hardware acceleration of machine learning model inference, (Page 3, Section 3.1: “Afterwards, ELNA manager selects the CNN hyperparameters and the hardware operating mode, as the final configuration. When the configuration is decided, ELNA compiler will map the final model to the accelerator at the correct mode, and generate the on-line control bitstreams that will be used by the ELNA scheduler to direct the accelerator to execute the network inference.” teaches selecting a configuration for the convolutional neural network (application instance); Page 4, Section 3.3: “Second, the accelerator can operate in different precision/throughput modes. For example in Fig. 3, each PE can work in unison-mode as a single-issue 16-bit PE or in separate mode as double-issue 8-bit PE to suit the precision-mode of the reshaped CNN model. As shown in Fig. 3, the data path of ELNA accelerator is organized into lanes. For example in Fig.3 (a), after the nX16-byte data arrives in the input registers, the MSB half and LSB half goes into the different lanes of PE. Depending on the precision mode decided by the control bits, the PEs will choose to separate the final result into two or activate the bridge logics to generate the final one result. In this way, the data path can offer either higher computation throughput or word-level precision as decided by the synthesizer and compiler. For example in Fig. 3(a), in 8-bit operating mode, each PE processes two continuous pixels in the input feature maps in the {x, y} and {x+1, y+1} positions of all channels and then respectively sum them up into two pixels in the output channel of next layer. In contrast, in 16-bit operating mode, each PE also receives 16-bit input but only generate one pixel point. 
The data mapping and data-level parallelization schemes also become slightly different, and the instructions fed into the scheduler change accordingly.” teaches that the precision of the neural network (machine learning model) determines the mode (portion) of the accelerator to use in executing the neural network)

the arithmetic precision being one of a plurality of arithmetic precision capabilities that the elastic inference service is configured to provide for hardware acceleration of machine learning model inference, (Page 4, Section 3.3: “Second, the accelerator can operate in different precision/throughput modes. For example in Fig. 3, each PE can work in unison-mode as a single-issue 16-bit PE or in separate mode as double-issue 8-bit PE to suit the precision-mode of the reshaped CNN model.” teaches that both 16-bit precision and 8-bit precision are supported)
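For illustration only, the two precision modes described in the cited passage can be sketched as follows. This is a simplified software sketch and not Wang's datapath: the function name and mode labels are hypothetical, the arithmetic is unsigned, and a single multiply stands in for the processing element's operation. It shows only that the same 16-bit input yields one result in 16-bit mode and two results in 8-bit mode.

```python
from typing import Tuple


def pe_multiply(word16: int, weight: int, mode: str) -> Tuple[int, ...]:
    """Multiply a 16-bit input word by a weight in one of two precision modes.

    "16bit": the word is treated as one 16-bit operand (single issue).
    "8bit":  the word is split into its MSB and LSB halves, and each half is
             treated as an independent 8-bit operand (double issue).
    """
    if not 0 <= word16 <= 0xFFFF:
        raise ValueError("word16 must fit in 16 bits")
    if mode == "16bit":
        return (word16 * weight,)
    if mode == "8bit":
        msb = (word16 >> 8) & 0xFF  # upper 8 bits
        lsb = word16 & 0xFF         # lower 8 bits
        return (msb * weight, lsb * weight)
    raise ValueError("unknown precision mode: " + mode)


# The same input word produces one result or two, depending on the mode.
one_result = pe_multiply(0x0305, 2, "16bit")
two_results = pe_multiply(0x0305, 2, "8bit")
print(one_result, two_results)
```

In this sketch the 8-bit mode doubles the number of results per input word at reduced word-level precision, which is the precision/throughput trade-off the passage describes.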

determine a portion of an accelerator’s compute capacity to provision to the application based at least in part on the arithmetic precision… indicated by the application instance configuration; (Page 3, Section 3.1: “Afterwards, ELNA manager selects the CNN hyperparameters and the hardware operating mode, as the final configuration. When the configuration is decided, ELNA compiler will map the final model to the accelerator at the correct mode, and generate the on-line control bitstreams that will be used by the ELNA scheduler to direct the accelerator to execute the network inference.” teaches selecting a configuration for the convolutional neural network (application instance); Page 4, Section 3.3: “Second, the accelerator can operate in different precision/throughput modes. For example in Fig. 3, each PE can work in unison-mode as a single-issue 16-bit PE or in separate mode as double-issue 8-bit PE to suit the precision-mode of the reshaped CNN model. As shown in Fig. 3, the data path of ELNA accelerator is organized into lanes. For example in Fig.3 (a), after the nX16-byte data arrives in the input registers, the MSB half and LSB half goes into the different lanes of PE. Depending on the precision mode decided by the control bits, the PEs will choose to separate the final result into two or activate the bridge logics to generate the final one result. In this way, the data path can offer either higher computation throughput or word-level precision as decided by the synthesizer and compiler. For example in Fig. 3(a), in 8-bit operating mode, each PE processes two continuous pixels in the input feature maps in the {x, y} and {x+1, y+1} positions of all channels and then respectively sum them up into two pixels in the output channel of next layer. In contrast, in 16-bit operating mode, each PE also receives 16-bit input but only generate one pixel point. 
The data mapping and data-level parallelization schemes also become slightly different, and the instructions fed into the scheduler change accordingly.” teaches that the precision of the neural network (machine learning model) determines the mode (portion) of the accelerator to use in executing (perform machine learning inference) the neural network)

and perform inference using the loaded machine learning model of the application using the portion of the accelerator on the attached accelerator. (Page 3, Section 3.1: “Afterwards, ELNA manager selects the CNN hyperparameters and the hardware operating mode, as the final configuration. When the configuration is decided, ELNA compiler will map the final model to the accelerator at the correct mode, and generate the on-line control bitstreams that will be used by the ELNA scheduler to direct the accelerator to execute the network inference.” teaches performing inference using the convolutional neural network (machine learning model); Page 4, Section 3.3: “Second, the accelerator can operate in different precision/throughput modes. For example in Fig. 3, each PE can work in unison-mode as a single-issue 16-bit PE or in separate mode as double-issue 8-bit PE to suit the precision-mode of the reshaped CNN model.” teaches that inference is performed with a mode (portion) of the accelerator, depending on the precision of the neural network)

Ukidave and Wang are analogous art because they are directed to neural network accelerators. 


The combination of Ukidave and Wang does not appear to explicitly teach: 
and one or more electronic devices to implement an elastic inference service, the elastic inference service including an application instance and an accelerator appliance, the elastic inference service to:
and the application instance configuration indicating… and a processing speed to be used in determining the portion of the accelerator to provision for hardware acceleration of machine learning model inference,
and the processing speed being one of a plurality of processing speed capabilities that the elastic inference service is configured to provide for hardware acceleration of machine learning model inference;
determine a portion of an accelerator’s compute capacity to provision to the application based at least in part on the… processing speed indicated by the application instance configuration;
provision the application instance and the portion of the accelerator attached to the application instance, wherein the application instance is implemented using a physical compute instance in a first instance location, wherein the portion of the accelerator is implemented using a physical accelerator in the second location, and wherein the physical accelerator is accessible to the physical compute instance;

However, Wesolowski teaches: 
and the application instance configuration indicating… and a processing speed to be used in determining the portion of the accelerator to provision for hardware acceleration of machine learning model inference, (Para [0026]: “In particular embodiments, the scheduler machine may distribute execution of a single machine learning model across multiple different computing machines, so that each computing machine trains a different portion (e.g., graph-segment) of the ML model and the different computing machines exchange processing data, as needed. In this case, the scheduler machine may monitor the performance of each computing machine, and if necessary, transfer execution of a portion of the machine learning model from one machine to a faster or slower machine, as necessary, to maintain optimal timing between the transferring of processing data between the machines…” and Para [0068]: “Some neural network model may require faster machines, or more memory, and each may generally require a different profile machine.” teaches that the processing speed requirement of the specific model (application instance) for machine learning execution (inference) determines what computing machine is used; Para [0005]: “In particular embodiments, a master machine learning (ML) control system/server (e.g., a scheduler machine or master ML control system, or first computing system) establishes access to different types of computing systems configured for different types of primary tasks. Such systems may include, for example, a GPU-based or CPU-based ML training system (e.g., a second computing system)…” teaches that the computing machines can be GPU based)

and the processing speed being one of a plurality of processing speed capabilities that the elastic inference service is configured to provide for hardware acceleration of machine learning model inference; (Para [0061]: “For example, ML Model 1 is submitted to training system 11, which is illustratively shown as having one GPU (e.g., daughter board). More specifically, training system 11 has one NVidia Corporation, Kepler K40 GPU-based board, which has a total of 2880 single instruction multiple data (SIMD) cores (e.g., processing cores). As model complexity increases, it may be necessary to add more GPU boards to a machine or to increase the number of machines in a training system… If higher power is needed, then, additional GPUs or more powerful GPUs may be used. For example, ML Model 4 is submitted to a training system 17 consisting of two computing systems 17A and 17B, each system having 8 NVidia Maxwell M40 GPUs, where each M40 GPU has a 3072 SIMD cores.” teaches that the accelerator service has servers that contain multiple GPUs with multiple processing speeds)

determine a portion of an accelerator’s compute capacity to provision to the application based at least in part on the… processing speed indicated by the application instance configuration; (Para [0026]: “In particular embodiments, the scheduler machine may distribute execution of a single machine learning model across multiple different computing machines, so that each computing machine trains a different portion (e.g., graph-segment) of the ML model and the different computing machines exchange processing data, as needed. In this case, the scheduler machine may monitor the performance of each computing machine, and if necessary, transfer execution of a portion of the machine learning model from one machine to a faster or slower machine, as necessary, to maintain optimal timing between the transferring of processing data between the machines…” teaches determining and allocating the model to be executed on different computing machines based on processing speed requirements, therefore the processing speed requirements of application instance determines what computing machine is allocated; Para [0005]: “In particular embodiments, a master machine learning (ML) control system/server (e.g., a scheduler machine or master ML control system, or first computing system) establishes access to different types of computing systems configured for different types of primary tasks. Such systems may include, for example, a GPU-based or CPU-based ML training system (e.g., a second computing system)…” teaches that the computing machines can be GPU based)
Ukidave, Wang, and Wesolowski are analogous art because they are directed to machine learning accelerators. 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Wesolowski’s system for distributed training and prediction using elastic resources into Ukidave’s Predictive Scheduling for GPU based Cloud Servers as modified by Wang with a motivation to “…provide for heterogeneous computing for training a machine learning model across different computing systems…” (Wesolowski, Para [0021])

The combination of Ukidave, Wang, and Wesolowski does not appear to explicitly teach:
and one or more electronic devices to implement an elastic inference service, the elastic inference service including an application instance and an accelerator appliance, the elastic inference service to:
provision the application instance and the portion of the accelerator attached to the application instance, wherein the application instance is implemented using a physical compute instance in a first instance location, wherein the portion of the accelerator is implemented using a physical accelerator in the second location, and wherein the physical accelerator is accessible to the physical compute instance;

However, Wilt teaches: 
and one or more electronic devices to implement an elastic inference service, the elastic inference service including an application instance and an accelerator appliance, the elastic inference service to: (Fig. 1 teaches an elastic graphics service implemented by multiple devices where the elastic graphics service includes an application instance and a GPU (accelerator))

provision the application instance and the portion of the accelerator attached to the application instance, wherein the application instance is implemented using a physical compute instance in a first instance location, wherein the portion of the accelerator is implemented using a physical accelerator in the second location, and wherein the physical accelerator is accessible to the physical compute instance; (Fig. 14 (shown below) and Para [0052]: “The instance provisioning functionality 130 may provision a virtual compute instance 141B with an attached virtual GPU 151B based on the specified instance type "B" and the specified virtual GPU class "B". The provisioned virtual compute instance 141B may be implemented by the compute virtualization functionality 140 using suitable physical resources such as a physical compute instance 142B, and the provisioned virtual GPU 151B may be implemented by the GPU virtualization functionality 150 using suitable physical resources such as a physical GPU 152B.” teaches provisioning the application instance and virtual GPU (portion of the physical GPU (accelerator)) attached to the application instance, wherein the application instance is implemented using a physical compute instance and the virtual GPU (portion of the physical GPU (accelerator)) is implemented using a physical GPU (accelerator); Fig 14 and Para [0052]: “To implement the virtual compute instance 141B with the attached virtual GPU 151B, a physical compute instance 142B may communicate with a physical GPU 152B, e.g., over a network. The physical GPU 152B may be located in a different computing device than the physical compute instance 142B.” teaches that the physical compute instance is in a first instance location, the physical GPU (accelerator) is in a second instance location, and that the physical GPU (accelerator) communicates (is accessible) to the physical compute instance over a network)

[Image omitted: media_image2.png (Wilt, Fig. 14)]

Ukidave, Wang, Wesolowski, and Wilt are analogous art because they are directed to accelerators using GPUs. 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Wilt’s placement optimization for virtualized graphics processing into Ukidave’s Predictive Scheduling for GPU based Cloud Servers as modified by Wang and Wesolowski with a motivation to “allow a single physical computing device to be shared among multiple users by providing each user with one or more virtual machines hosted by the single physical computing device” (Wilt, Para [0001]).

As per claim 19, the combination of Ukidave, Wang, Wesolowski, and Wilt as shown above teaches The system of claim 17, 
Wilt further teaches: 
wherein the elastic inference service is to select an accelerator location for a physical accelerator or an application instance location based at least in part on one or more placement criteria, wherein the multi-tenant web services provider comprises a plurality of instance locations for physical compute instances and a plurality of accelerator locations for physical accelerators. (Para [0146]: “Based on one or more of the placement criteria 1425, a particular GPU location 1450A may be selected for a physical GPU 152A. Based on one or more of the placement criteria 1425, a particular instance location 1440A may be selected for a physical compute instance 141B.” teaches selecting a GPU (accelerator) location for a physical GPU (accelerator) and selecting an application instance location based on placement criteria; Para [0142]: “FIG. 14 illustrates an example system environment for placement optimization for virtualized graphics processing, including multiple instance locations and multiple GPU locations in a provider network, according to one embodiment. The provider network 100 may include a plurality of instance locations 1440A-1440N for a plurality of physical compute instances 142A-142N. The instance locations 1440A-1440N may represent a plurality of racks, a plurality of data centers, and/or a plurality of geographical regions.” teaches that the provider network comprises a plurality of instance locations for physical compute instances; Para [0143]: “The provider network 100 may also include a plurality of GPU locations 1450A-1450N for a plurality of physical GPUs 152A-152N (e.g., for graphics servers that include and provide access to the physical GPUs). 
The GPU locations 1450A-1450N may represent a plurality of racks, a plurality of data centers, and/or a plurality of geographical regions.” teaches that the provider network comprises a plurality of GPU (accelerator) locations for physical GPUs (accelerators); Fig 1 and Para [0041]: “The provider network 100 may implement or provide a multi-tenant environment such that multiple clients (e.g., using client devices 180A-180N) may access or use a particular resource in a substantially simultaneous manner.” teaches that the provider network is a multi-tenant web services provider)

Ukidave, Wang, Wesolowski, and Wilt are analogous art because they are directed to accelerators using GPUs. 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Wilt’s placement optimization for virtualized graphics processing into Ukidave’s Predictive Scheduling for GPU based Cloud Servers as modified by Wang and Wesolowski with a motivation to “allow a single physical computing device to be shared among multiple users by providing each user with one or more virtual machines hosted by the single physical computing device” (Wilt, Para [0001]).

As per claim 20, the combination of Ukidave, Wang, Wesolowski, and Wilt as shown above teaches The system of claim 17, 
Wilt further teaches: 
wherein the elastic inference service is to select the accelerator location or the application instance location based at least in part on network locality. (Para [0149]: “Placement optimization for network locality may attempt to group multiple resources (e.g., one or more physical compute instances and one or more physical GPUs) based (at least in part) on proximity within a network. Network locality may refer to one or more locations, connections, associations, or zones in a network to which a resource belongs.” and “Instance locations and/or GPU locations may be selected based (at least in part) on network locality.” teaches that the GPU (accelerator) location or the application instance location is selected based on network locality) 


Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Wilt’s placement optimization for virtualized graphics processing into Ukidave’s Predictive Scheduling for GPU based Cloud Servers as modified by Wang and Wesolowski with a motivation to “allow a single physical computing device to be shared among multiple users by providing each user with one or more virtual machines hosted by the single physical computing device” (Wilt, Para [0001]).


Claims 3, 8, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Ukidave in view of Wang, Wesolowski, and Wilt as shown above, further in view of Chen et al. (“TVM: End-to-End Optimization Stack for Deep Learning”, hereinafter “Chen”).

As per claim 3, the combination of Ukidave, Wang, Wesolowski, and Wilt as shown above teaches The method of claim 1, 
The combination of Ukidave, Wang, Wesolowski, and Wilt does not appear to explicitly teach: 
profiling the machine learning model by converting to an intermediate representation having GPU-independent optimization and converting from the intermediate representation to machine code with GPU-dependent optimizations.
However, Chen teaches: 
profiling the machine learning model by converting to an intermediate representation having GPU-independent optimization and converting from the intermediate representation to machine code with GPU-dependent optimizations. (Fig. 2 and Page 2: “We present TVM (shown in Figure 2), an end-to-end optimizing compiler stack to lower and fine-tune deep learning workloads to diverse hardware back-ends. TVM is designed to separate the algorithm description, schedule, and hardware interface.” and Page 3: “By combining these optimization layers, TVM can take model descriptions from most deep learning frameworks, perform joint high level and low-level optimizations, and generate hardware specific optimized code for back-ends such as the Raspberry Pi, GPUs, and FPGA-based specialized accelerators” teaches that TVM can convert the model into a high-level data flow (intermediate representation) for optimizations that are not specific to a GPU (GPU-independent optimization) and then generate hardware specific optimized code for a GPU; Page 2: “High-level dataflow rewriting: Different hardware devices may have vastly different memory hierarchies, so enabling strategies to fuse operators and optimize data layouts are crucial for optimizing memory access.” teaches that the high-level data flow (intermediate representation) has hardware-independent optimization.)
[Image omitted: media_image4.png]


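Ukidave, Wang, Wesolowski, Wilt, and Chen are analogous art because they are directed to optimizing accelerators using GPUs. 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Chen’s TVM: End-to-End optimization stack into Ukidave’s Predictive Scheduling for GPU based Cloud Servers as modified by Wang, Wesolowski, and Wilt with a motivation to “…to easily deploy deep learning workloads to all kinds of hardware targets, including embedded devices, GPUs, FPGAs, and ASICs…” (Chen, Page 1).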

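For illustration only, the two-stage flow relied upon above (a hardware-independent optimization on an intermediate representation, followed by hardware-specific code generation) can be sketched as follows. This is a toy sketch and does not use or reproduce the TVM API: the operator names, the fusion rule, the target labels, and the kernel naming scheme are all hypothetical.

```python
from typing import List


def fuse_conv_relu(ops: List[str]) -> List[str]:
    """Hardware-independent pass: fuse each conv2d that is directly followed by relu."""
    fused: List[str] = []
    i = 0
    while i < len(ops):
        if ops[i] == "conv2d" and i + 1 < len(ops) and ops[i + 1] == "relu":
            fused.append("conv2d_relu")
            i += 2  # consume both operators
        else:
            fused.append(ops[i])
            i += 1
    return fused


def lower(ops: List[str], target: str) -> List[str]:
    """Hardware-dependent step: map each operator to a hypothetical per-target kernel name."""
    if target not in {"gpu", "cpu"}:
        raise ValueError("unsupported target: " + target)
    return [target + "_" + op + "_kernel" for op in ops]


# Model description -> optimized intermediate representation -> target-specific code.
model = ["conv2d", "relu", "dense", "relu"]
ir = fuse_conv_relu(model)
code = lower(ir, "gpu")
print(ir)
print(code)
```

The fusion pass never inspects the target, while the lowering step depends only on the target, which mirrors the separation between hardware-independent and hardware-dependent optimization discussed above.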
As per claim 8, the combination of Ukidave, Wang, Wesolowski, and Wilt as shown above teaches The method of claim 5, 
The combination of Ukidave, Wang, Wesolowski, and Wilt does not appear to explicitly teach: 
profiling the machine learning model by converting to an intermediate representation having accelerator-independent optimization and converting from the intermediate representation to machine code with accelerator-dependent optimizations.
However, Chen teaches: 
profiling the machine learning model by converting to an intermediate representation having accelerator-independent optimization and converting from the intermediate representation to machine code with accelerator-dependent optimizations. (Fig. 2 and Page 2: “We present TVM (shown in Figure 2), an end-to-end optimizing compiler stack to lower and fine-tune deep learning workloads to diverse hardware back-ends. TVM is designed to separate the algorithm description, schedule, and hardware interface.” and Page 3: “By combining these optimization layers, TVM can take model descriptions from most deep learning frameworks, perform joint high level and low-level optimizations, and generate hardware specific optimized code for back-ends such as the Raspberry Pi, GPUs, and FPGA-based specialized accelerators” teaches that TVM can convert the model into a high-level data flow (intermediate representation) for optimizations that are not specific to an accelerator (accelerator-independent optimization) and then generate hardware specific optimized code for an accelerator; Page 2: “High-level dataflow rewriting: Different hardware devices may have vastly different memory hierarchies, so enabling strategies to fuse operators and optimize data layouts are crucial for optimizing memory access.” teaches that the high-level data flow (intermediate representation) has hardware-independent optimization.)
Ukidave, Wang, Wesolowski, Wilt, and Chen are analogous art because they are directed to optimizing accelerators using GPUs. 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Chen’s TVM: End-to-End optimization stack into Ukidave’s Predictive Scheduling for GPU based Cloud Servers as modified by Wang, Wesolowski, and Wilt with a motivation to “…to easily deploy deep learning workloads to all kinds of hardware targets, including embedded devices, GPUs, FPGAs, and ASICs…” (Chen, Page 1).

As per claim 18, the combination of Ukidave, Wang, Wesolowski, and Wilt as shown above teaches The system of claim 17, 
The combination of Ukidave, Wang, Wesolowski, and Wilt does not appear to explicitly teach: 
wherein the elastic inference service is to profile the machine learning model by converting to an intermediate representation having accelerator-independent optimization and converting from the intermediate representation to machine code with accelerator-dependent optimizations.
However, Chen teaches: 
wherein the elastic inference service is to profile the machine learning model by converting to an intermediate representation having accelerator-independent optimization and converting from the intermediate representation to machine code with accelerator-dependent optimizations. (Fig. 2 and Page 2: “We present TVM (shown in Figure 2), an end-to-end optimizing compiler stack to lower and fine-tune deep learning workloads to diverse hardware back-ends. TVM is designed to separate the algorithm description, schedule, and hardware interface.” and Page 3: “By combining these optimization layers, TVM can take model descriptions from most deep learning frameworks, perform joint high level and low-level optimizations, and generate hardware specific optimized code for back-ends such as the Raspberry Pi, GPUs, and FPGA-based specialized accelerators” teaches that TVM can convert the model into a high-level data flow (intermediate representation) for optimizations that are not specific to an accelerator (accelerator-independent optimization) and then generate hardware specific optimized code for an accelerator; Page 2: “High-level dataflow rewriting: Different hardware devices may have vastly different memory hierarchies, so enabling strategies to fuse operators and optimize data layouts are crucial for optimizing memory access.” teaches that the high-level data flow (intermediate representation) has hardware-independent optimization.)
Ukidave, Wang, Wesolowski, Wilt, and Chen are analogous art because they are directed to optimizing accelerators using GPUs. 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Chen’s TVM: End-to-End optimization stack into Ukidave’s Predictive Scheduling for GPU based Cloud Servers as modified by Wang, Wesolowski, and Wilt with a motivation to “…to easily deploy deep learning workloads to all kinds of hardware targets, including embedded devices, GPUs, FPGAs, and ASICs…” (Chen, Page 1).

Response to Arguments
Regarding Claim Objections: 
Applicant’s argument: 
“Claim 20 is objected to because it contains grammatical issues. Applicant has amended claim 20 fixing these issues and request the withdrawal of the objection to this claim.”
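Response: 
Applicant’s arguments have been fully considered and are persuasive. The objection to claim 20 made in the previous office action has been withdrawn.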

Regarding Double Patenting: 
Applicant’s argument: 
“In light of the amendments to the claims herein, there is no prima facie case of double patenting in the Office Action. Removal of the double patenting rejections is respectfully requested.”

Response: 
Applicant’s arguments have been fully considered and are persuasive. The double patenting rejection of claims 1, 5, 6, and 17 made in the previous office action has been withdrawn.

Regarding 35 U.S.C. 112: 
Applicant’s argument: 
“The Office Action notes that there is insufficient antecedent basis for “the second location.” This is corrected by the respective amendments to claims 1 and 5 herein.”
“The Office Action states that there is insufficient antecedent basis for “the portion of the accelerator.” This is corrected by the amendment to claim 2 herein.”
“The Office Action alleges that it is unclear what is converted to an intermediate representation. The respective amendments to claims 3, 8, and 18 herein resolves any potentially ambiguity in this regard.”

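Response: 
Applicant’s arguments have been fully considered and are persuasive. The 35 U.S.C. 112(b) rejection of claims 1-20 made in the previous office action has been withdrawn.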

Regarding 35 U.S.C. 103: 
Applicant’s argument: 
“determining a portion of a GPU’s compute capacity to provision to the application based at least in part on both the arithmetic precision and the processing speed indicated by the application instance configuration;”
“At least the above-bolded feature of claim 1 is not taught or suggested by Ukidave, Wang, and Wilt, individually or in combination.”

Response: 
Examiner respectfully disagrees. Although this limitation does not appear to be taught by the combination of Ukidave, Wang, and Wilt, this limitation is taught by the combination of Wang and Wesolowski. Please see pages 8-9 and 11-12 for a detailed analysis of this limitation. 

Regarding 35 U.S.C. 103: 
Applicant’s argument:
“All told, there is not sufficient evidence of record that one skilled in the art would have understood a combination of Ukidave, Wang, and Wilt to provide all that is claimed in claim 1. Based on the evidence of record, the only was one skilled in the art could arrive at the invention of claim 1 is to be informed by the Applicant’s own disclosure which, of course, is impermissible hindsight.”

Response: 
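In response to applicant’s argument that the examiner’s conclusion of obviousness is based upon improper hindsight reasoning, it must be recognized that any judgment on obviousness is in a sense necessarily a reconstruction based upon hindsight reasoning. But so long as it takes into account only knowledge which was within the level of ordinary skill at the time the claimed invention was made, and does not include knowledge gleaned only from the applicant’s disclosure, such a reconstruction is proper. See In re McLaughlin, 443 F.2d 1392, 170 USPQ 209 (CCPA 1971). 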
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Wang’s multi-mode neural network accelerator into Ukidave’s Predictive Scheduling for GPU based Cloud Servers with a motivation to “…support CNNs of various topologies in this data-driven architecture” (Wang, Page 4, Section 3.3), as Ukidave’s cloud based scheduling cannot accommodate neural networks having different precisions. 
Given the combination of Ukidave and Wang, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Wesolowski’s system for distributed training and prediction using elastic resources into Ukidave’s Predictive Scheduling for GPU based Cloud Servers as modified by Wang with a motivation to “…provide for heterogeneous computing for training a machine learning model across different computing systems…” (Wesolowski, Para [0021]).
Given the combination of Ukidave, Wang, and Wesolowski, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Wilt’s placement optimization for virtualized graphics processing into Ukidave’s Predictive Scheduling for GPU based Cloud Servers as modified by Wang and Wesolowski with a motivation to “allow a single physical computing device to be shared among multiple users by providing each user with one or more virtual machines hosted by the single physical computing device” (Wilt, Para [0001]).


Conclusion
The prior art made of record and not relied upon is considered pertinent to the applicant’s disclosure: 
Feng et al. (US 20170220949 A1) teaches a cloud based deep learning accelerator. 
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHOUN ABRAHAM whose telephone number is (571)272-8144.  The examiner can normally be reached on Mon - Fri 08:00-16:30.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kamran Afshar, can be reached on (571) 272-7796.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/S.J.A./Examiner, Art Unit 2125
/KAMRAN AFSHAR/Supervisory Patent Examiner, Art Unit 2125