DETAILED ACTION
This action is in response to the examiner interview on 3/31/2022. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
This application is effectively filed 6/27/2018. The assignee of record is Amazon Technologies, Inc. The listed inventor(s) is/are: SENGUPTA, Sudipta; PERUMALLA, Poorna Chand Srinivas; DIVAKARUNI, Dominic Rajeev; BSHARA, Nafea; DIRAC, Leo Parker; SAHA, Bratin; WOOD, Matthew James; OLGIATI, Andrea; SIVASUBRAMANIAN, Swaminathan.
Information Disclosure Statement
The information disclosure statement(s) (IDS) submitted on 1/3/2020, 4/14/2020 (2), 6/21/2021, & 10/14/2021 is/are in compliance with the provisions of 37 CFR 1.97.  Accordingly, the IDS(s) is/are being considered by the examiner.
REASONS FOR ALLOWANCE
Claims 1-2, 4-6, 8-16, and 18-20 are allowed. The closest prior art found fails to teach, singly or in combination, the claimed invention. The closest prior art found is listed below:
a. Ukidave et al. (“Mystic: Predictive Scheduling for GPU Based Cloud Servers using Machine Learning”, this reference was provided by applicant on record on 6/21/2021; hereinafter Uki)
Uki teaches a computer-implemented method, comprising: receiving, in a multi-tenant web services provider (Uki Page 354 – 355: “Requests from various frontend user applications can be aggregated into backend threads, which can be handled as a single GPU context (see Figure 2b). GPU components of all frontend applications co-executing on the GPU are assigned to separate backend threads. The backend threads map to the same device on a per-GPU context basis. This design enables GPU operations from different applications to be executed concurrently, which enables a single GPU to be shared in both space and time [13, 32]”), an application instance configuration, an application of the application instance to utilize a plurality of portions (Uki Page 358: “We select 55 distinct workloads… In addition, we leverage tuned CUDA libraries such as cuDNN (deep learning libraries)”) of at least one attached graphics processing unit (GPU) during execution of a machine learning model (Uki Page 356: “Instead, Mystic initiates two short profiling runs for each incoming application to obtain metrics for two randomly selected CoIs (out of 6 identified CoIs). The profiler run needs be long enough to profile each distinct kernel in the application at least once… The short-profiles (∼5 seconds) for incoming applications are collected and stored in the Profile Information Table (PIT) in form of sparse rows, as metrics for only 2 random CoIs out of 6 are captured. The PIT is indexed by the application process ID (pid).” and Page 357: “The CF predictor takes the PIT and TRM as inputs. When a new application A0 is enqueued for execution on the system, the predictor identifies A0’s profile information by searching the PIT using the process-id (pid) of the application. The PIT returns a sparse vector v with the metrics obtained from the short profiles collected in Stage-1.”

[Image: media_image1.png]
); 
loading the machine learning model onto the portions of the at least one GPU (Uki Page 357: “When a new application A0 is enqueued for execution on the system, the predictor identifies A0’s profile information by searching the PIT using the process-id (pid) of the application.”
Enqueuing (loading) application A0 is taught at Page 354 – 355: “Requests from various frontend user applications can be aggregated into backend threads, which can be handled as a single GPU context (see Figure 2b). GPU components of all frontend applications co-executing on the GPU are assigned to separate backend threads. The backend threads map to the same device on a per-GPU context basis. This design enables GPU operations from different applications to be executed concurrently, which enables a single GPU to be shared in both space and time [13, 32].”);

b. Wilt et al. (US 20170132746 A1, published 5/11/2017; hereinafter Wil).
Wil teaches provisioning the application instance and the portions of the at least one GPU attached to the application instance (Wil Fig. 14 (shown below) and Para [0052]: “The instance provisioning functionality 130 may provision a virtual compute instance 141B with an attached virtual GPU 151B based on the specified instance type "B" and the specified virtual GPU class "B". The provisioned virtual compute instance 141B may be implemented by the compute virtualization functionality 140 using suitable physical resources such as a physical compute instance 142B, and the provisioned virtual GPU 151B may be implemented by the GPU virtualization functionality 150 using suitable physical resources such as a physical GPU 152B.”

 
[Figure: Wil Fig. 14 (media_image2.png)]
).
receiving scoring data in the application (Wil ¶ [0178]: “In one embodiment, migration of resources such as virtual compute instances and/or virtual GPUs may be performed based on placement scoring…. The placement score may reflect a score on how close the current placement is with respect to the more optimal scenario (e.g., same network router). The score may be a composite of multiple different placement criteria, considering the impact on the resource, resource host, and/or distributed system as a whole.”); and
utilizing each of the portions of the attached at least one GPU to perform inference on the scoring data in parallel (Wil ¶ [0179]-[0180]: “For example, placement score(s) may be generated for the placement of the resource at possible destination resource host(s). In at least some embodiments, a subset of available resource hosts may have scores generated as a possible placement, while in other embodiments all available resource hosts may be considered by generating a placement scores. [0180] A difference between the placement score of the current placement of the resource and the scores of the possible placements may be determined and compared to an optimization threshold. For example, the difference may be a value which is compared to a threshold value (is difference >0.3).”) and only using one response from the portions of the GPU (Wil ¶ [0181]: “a priority for performing the migration of the resource to the destination resource host may be assigned”).
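For illustration only, the threshold comparison quoted from Wil ¶ [0179]-[0180] can be sketched as follows. The function name `pick_migration_target`, the host labels, and the convention that a higher placement score is better are assumptions made for this sketch and are not taken from Wil; only the comparison of a score difference against a threshold (e.g., 0.3) comes from the quoted passage.

```python
def pick_migration_target(current_score, candidate_scores, threshold=0.3):
    """Return (host, gain) for the best candidate host, or None.

    Assumption: a higher placement score is better. The gain is the
    difference between the best candidate's score and the current
    placement's score, and it must exceed the optimization threshold.
    """
    if not candidate_scores:
        return None
    # Hypothetical selection rule: take the highest-scoring candidate host.
    best_host = max(candidate_scores, key=candidate_scores.get)
    gain = candidate_scores[best_host] - current_score
    return (best_host, gain) if gain > threshold else None


# Example: the current placement scores 0.5; two candidate hosts are scored.
print(pick_migration_target(0.5, {"host-a": 0.9, "host-b": 0.6}))
```

As the sketch suggests, this comparison selects a destination for a resource; it does not select one of several inference responses.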

c. Wang et al. (“An Elastic CNN Inference Accelerator with Adaptive Trade-off between QoS and QoR”, this reference was provided by applicant on record on 6/21/2021, hereinafter Wan)
Wan teaches wherein the machine learning model includes a description of a computation graph for inference and weights obtained from training (Wan Fig. 1 and Page 2, Section 2: “A typical CNN consists of multiple interconnected neural layers that process the 3D feature data shown in Fig.1 [4].”
[Figure: Wan Fig. 1 (media_image3.png)]
).


d. US 20190086988 A1, METHODS AND SYSTEMS FOR MANAGING MACHINE LEARNING INVOLVING MOBILE DEVICES

EXAMINER’S AMENDMENT
An examiner’s amendment to the record appears below.  Should the changes and/or additions be unacceptable to applicant, an amendment may be filed as provided by 37 CFR 1.312.  To ensure consideration of such an amendment, it MUST be submitted no later than the payment of the issue fee. Authorization for examiner’s amendment was given in an interview with *** on ***.
Amendments to the Claims:
This listing of claims will replace all prior versions and listing of the claims in the application.
Listing of Claims:
CLAIMS
1. 	(Currently Amended) A computer-implemented method, comprising:
receiving, in a multi-tenant web services provider, an application instance configuration, wherein an application of the application instance is to utilize 
provisioning the application instance and the portions of the at least one GPU attached to the application instance; 
loading the machine learning model onto the portions of the at least one GPU; 
receiving scoring data in the application; [[and]]
utilizing each of the portions of the attached at least one GPU to perform inference on the scoring data in parallel and only using one response from the portions of the GPU; 
tracking a timing of responses from the portions of the attached at least one GPU; and
altering the provisioning of the portions of at least one GPU based on the tracked timing.
2.	(Original) The method of claim 1, wherein the one response to use is a temporally first response.  
3.	(Canceled) 
4. 	(Currently Amended) A computer-implemented method, comprising:
provisioning an application instance and portions of at least one accelerator attached to the application instance to execute a machine learning model of an application of the application instance;
loading the machine learning model onto the portions of the at least one accelerator; 
receiving scoring data in the application; [[and]]
utilizing each of the portions of the attached at least one accelerator to perform inference on the scoring data in parallel and only using one response from the portions of the accelerator;
tracking a timing of responses from the portions of the attached at least one accelerator; and
altering the provisioning of the portions of at least one accelerator based on the tracked timing.
5.	(Original) The method of claim 4, wherein the machine learning model includes a description of a computation graph for inference and weights obtained from training.  
6.	(Original) The method of claim 4, wherein the one response to use is a temporally first response.  
7.	(Canceled)
8.	(Currently Amended) The method of claim 4, wherein the altering the provisioning of the 
9.	(Currently Amended) The method of claim 4, further comprising:
receiving an application instance configuration, the application instance configuration to indicate the use of altering the provisioning of the 
10.	(Original) The method of claim 4, further comprising:
prior to attaching the accelerator, selecting the accelerator based on computational capability of the accelerator.
11.	(Currently Amended) The method of claim 4, further comprising:
selecting an accelerator location for a physical accelerator or an application instance location based at least in part on one or more placement criteria, wherein [[the]] a multi-tenant web services provider comprises a plurality of instance locations for physical compute instances and a plurality of accelerator locations for physical accelerators. 
12. 	(Original) The method of claim 11, wherein the one or more placement criteria are based at least in part on a performance metric associated with use of the physical accelerator by the physical compute instance.
13. 	(Original) The method of claim 11, wherein the accelerator location or the application instance location is selected based at least in part on network locality.
14. 	(Original) The method of claim 11, wherein the accelerator location is selected based at least in part on network latency between the physical accelerator and a client device.
15.	(Currently Amended) A system comprising:
a storage to store an application, the application including a machine learning model; and
an elastic inference service implemented by 
provision an application instance and portions of at least one accelerator attached to the application instance to execute [[a]] the machine learning model of an application of the application instance;
load the machine learning model onto the portions of the at least one accelerator; 
receive scoring data in the application; [[and]]
utilize each of the portions of the attached at least one accelerator to perform inference on the scoring data in parallel and only using one response from the portions of the accelerator;
track a timing of responses from the portions of the attached at least one accelerator; and
alter the provisioning of the portions of at least one accelerator based on the tracked timing.
16.	(Original) The system of claim 15, wherein the one response to use is a temporally first response.  
17.	(Canceled)
18. 	(Currently Amended) The system of claim [[16]] 15, wherein to alter the at least one accelerator based on the tracked timing the elastic inference service is to launch terminate 
19.	(Currently Amended) The system of claim 15, wherein the elastic inference service is further to receive an application instance configuration, the application instance configuration to indicate the use of altering the provisioning of the 
20. 	(Original) The system of claim 15, wherein the elastic inference service is further to, prior to attaching the accelerator, select the accelerator based on computational capability of the accelerator.
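
For illustration only, and forming no part of the claim listing above, the flow recited in the independent claims (parallel inference on each portion, use of only one response, tracking of response timing, and altering the provisioning based on the tracked timing) can be sketched as follows. The names `infer_first_response` and `alter_provisioning`, the representation of each accelerator portion as a Python callable, and the drop-if-slower-than-threshold policy are all assumptions of this sketch; the claims do not specify any particular implementation or policy.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def infer_first_response(portions, scoring_data):
    """Run inference on every portion in parallel and use only one response.

    `portions` maps a hypothetical portion name to a callable standing in
    for the model loaded on that accelerator portion. Returns the
    (name, response) pair that completed first, plus per-portion timings.
    """
    def timed_call(name, model_fn):
        start = time.perf_counter()
        response = model_fn(scoring_data)
        return name, response, time.perf_counter() - start

    timings = {}
    used = None
    with ThreadPoolExecutor(max_workers=len(portions)) as pool:
        futures = [pool.submit(timed_call, n, f) for n, f in portions.items()]
        for future in as_completed(futures):
            name, response, elapsed = future.result()
            timings[name] = elapsed          # track timing of every response
            if used is None:
                used = (name, response)      # temporally first response
    return used, timings


def alter_provisioning(timings, slow_threshold_s):
    """Hypothetical policy: keep only portions that responded within the threshold."""
    return sorted(n for n, t in timings.items() if t <= slow_threshold_s)
```

In this sketch the later responses are discarded but still timed, so the tracked timing covers all portions before any change in provisioning is decided.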


Conclusion
Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee. Such submissions should be clearly labeled "Comments on Statement of Reasons for Allowance." 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHAEL A. KELLER whose telephone number is (571)270-3863. The examiner can normally be reached on Mon - Thurs (7 AM - 5 PM). If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Brian Gillis, can be reached on 571-272-7952. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/MICHAEL A KELLER/
Primary Patent Examiner, Art Unit 2446