Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 06/21/2022 has been entered. 
	
DETAILED ACTION
Claims 1-3, 5-9, 11-18 and 20-23 are currently pending and have been examined.

Information Disclosure Statement
The information disclosure statements (IDS) submitted on 06/21/2022, 08/29/2022 and 10/25/2022 have been considered. The submissions are in compliance with the provisions of 37 CFR 1.97. Form PTO-1449 is signed and attached hereto.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.  Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2 and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Wilt et al. (U.S. Pub. No. 20170132746 A1) in view of Huynh et al. “DeepMon: Mobile GPU-based Deep Learning Framework for Continuous Vision Applications”, and further in view of Smith et al. (U.S. Pub. No. 20110134761 A1).
Wilt and Smith were cited in a previous Office Action.

As per claim 1, Wilt teaches the invention substantially as claimed including a computer-implemented method, comprising: 
attaching a first set of one or more graphical processing unit (GPU) slots of an accelerator appliance to an application instance according to an application instance configuration, the attached application instance remote from the accelerator appliance in a multi-tenant provider network, the accelerator appliance comprising a plurality of GPUs, the plurality of GPUs having a compute capacity, each GPU slot of the first set of GPU slots corresponding to a fraction of the compute capacity of the plurality of GPUs (par. 0033 … a virtual compute instance [equiv. to application instance] may be provisioned, and a first set of one or more GPU(s) may be attached to the instance to provide graphics processing. The first set of one or more virtual GPUs [GPU slots] may provide a particular level of graphics processing; par. 0035 … Using the techniques described herein, a virtual compute instance may be provisioned. The virtual compute instance may be configured to execute an application. The application may be associated with graphics requirements. For example, an application manifest [configuration] may specify a recommended graphics processing unit (GPU) class and/or size of video memory for the application, or analysis of execution of the application may determine graphics requirements for the application …  The virtual GPU may be implemented using a physical GPU that is connected to the virtual compute instance over a network [remote]);  
loading … [graphics processing] of the attached application instance onto the first set of GPU slots (par. 0079 In various embodiments … suitable technique(s) may be used to offload graphics processing from the virtual compute instance 141C to one or more physical GPUs used to implement the application-specific virtual GPUs 151C-151N);
migrating processing for the attached application instance from the first set of GPU slots to a second set of one or more GPU slots … (par. 0089 In one embodiment, the graphics processing provided by a first virtual GPU may be migrated to a second virtual GPU). 
Wilt does not expressly teach:
loading a machine learning model of the attached application instance onto the first set of GPU slots; including loading the machine learning model of the attached application instance onto the second set of GPU slots; handling a first set of one or more inference calls, made by an application of the attached application instance to the loaded machine learning model, using the first set of GPU slots; handling a second set of one or more inference calls, made by the application of the attached application instance to the loaded machine learning model, using the second set of GPU slots.
However, Huynh teaches: loading a machine learning model of the attached application instance onto the first set of GPU slots; including loading the machine learning model of the attached application instance onto the second set of GPU slots (page 83, left column, lines 23-26 we also developed a tool that automatically converts pre-trained legacy models and loads them to DeepMon with its various optimization strategies applied; pg. 83, right column, lines 4-6 developers can easily load pre-trained legacy models [machine learning models] on various mobile GPUs by using DeepMon’s model converting tool; Figure 3.); handling a first set of one or more inference calls, made by an application of the attached application instance to the loaded machine learning model, using the first set of GPU slots; handling a second set of one or more inference calls, made by the application of the attached application instance to the loaded machine learning model, using the second set of GPU slots (pg. 86, left column, section 6.1, lines 2-6 DeepMon works through two different phases: (1) the model conversion phase to convert existing models to run on mobile GPUs, and (2) the inference phase to process image streams using the converted model to recognize useful information).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Wilt to include methods for loading pre-trained legacy models on GPUs as set forth by Huynh because doing so would provide for efficiently offloading inference calls onto the GPUs and accelerating the processing, with predictable results.
Wilt and Huynh do not expressly teach: detecting a response timing related to handling the first set of inference calls using the first set of GPU slots.
However, Smith teaches: detecting a response timing related to handling the first set of inference calls using the first set of GPU slots (par. 0007 determining a response time for each of a plurality of virtual machines running on the plurality of compute nodes … migrating a virtual machine in response to a particular one of the virtual machines on a particular one of the compute nodes having a response time that exceeds a response time setpoint … to a target one of the compute nodes).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Wilt and Huynh by incorporating the technique of migrating virtual machines based on response time as set forth by Smith because it would provide for effectively migrating inference processing from a first set of vGPUs to a second set of vGPUs at least based on the response times in order to improve performance. 
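For illustration only, the combination relied upon above (loading a model onto a first set of GPU slots, timing each inference call, and migrating to a second set when the timing exceeds a setpoint) can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Wilt, Huynh, Smith, or the application: the names SlotSet, AttachedInstance, threshold_s and clock are hypothetical, and a plain Python callable stands in for the machine learning model.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class SlotSet:
    """Hypothetical stand-in for a set of GPU slots that can hold one model."""
    name: str
    model: Optional[Callable[[float], float]] = None

    def load(self, model: Callable[[float], float]) -> None:
        # "Loading" is modeled as simply keeping a reference to the callable.
        self.model = model

    def infer(self, x: float) -> float:
        if self.model is None:
            raise RuntimeError(f"no model loaded on slot set {self.name!r}")
        return self.model(x)


@dataclass
class AttachedInstance:
    """Hypothetical application instance attached to one active slot set."""
    model: Callable[[float], float]
    active: SlotSet
    standby: Optional[SlotSet]
    threshold_s: float                      # assumed response-time setpoint, in seconds
    clock: Callable[[], float] = time.perf_counter
    timings: List[float] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Load the model onto the first set of slots at attach time.
        self.active.load(self.model)

    def call(self, x: float) -> float:
        # Handle one inference call and record its response timing.
        start = self.clock()
        result = self.active.infer(x)
        elapsed = self.clock() - start
        self.timings.append(elapsed)
        # Migrate when the timing exceeds the setpoint and a second set exists.
        if elapsed > self.threshold_s and self.standby is not None:
            self.migrate()
        return result

    def migrate(self) -> None:
        # Load the model onto the second set, which then replaces the first.
        self.standby.load(self.model)
        self.active, self.standby = self.standby, None
```

In this sketch the call that exceeds the setpoint is still answered by the first set; only subsequent calls are handled by the second set. The clock parameter is injectable so that the timing logic can be exercised deterministically.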
 
As per claim 2, Wilt further teaches wherein the accelerator appliance includes at least one GPU and at least one other type of accelerator (par. 0061 … The virtual compute instance may be implemented using central processing unit (CPU) resources and memory resources of a physical compute instance. The virtual GPU may be implemented using a physical GPU).

As per claim 21, Smith teaches: determining that the response timing related to performing inference using the loaded machine learning model of the application using the first set of one or more GPU slots exceeds a threshold; and performing the migrating based on the response timing exceeding the threshold (par. 0007 determining a response time for each of a plurality of virtual machines running on the plurality of compute nodes … migrating a virtual machine in response to a particular one of the virtual machines on a particular one of the compute nodes having a response time that exceeds a response time setpoint … to a target one of the compute nodes).

Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Wilt in view of Huynh and Smith as applied to claim 1, and further in view of Gopalakrishnan et al. (US Pub. No. 20180089794 A1).
Gopalakrishnan was cited in a previous Office Action.

As per claim 3, Wilt, Huynh and Smith teach the limitations of claim 1. Wilt further teaches while attaching the first set of GPU slots of the accelerator appliance to an attached application instance (par. 0033 … a virtual compute instance [equiv. to application instance] may be provisioned, and a first set of one or more GPU(s) may be attached to the instance to provide graphics processing).
Wilt, Huynh and Smith do not expressly teach: updating at least one software version used by the set of GPU slots to be compatible with the machine learning model.
However, Gopalakrishnan teaches updating at least one software version used by the GPU slot to be compatible with the machine learning model (par. 0099 the use of firmware in the GPU, which can be updated … to include the desired function).
It would have been obvious to one of ordinary skill before the effective filing date of the claimed invention to modify the teaching of Wilt, Huynh and Smith to include the method of updating firmware used by the GPU as disclosed by Gopalakrishnan because it would provide for efficiently updating software used by GPU slots so as to include updated functions.

Claims 5-6, 11-18 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Wilt et al. (U.S. Pub. No. 20170132746 A1) in view of Huynh et al. “DeepMon: Mobile GPU-based Deep Learning Framework for Continuous Vision Applications”.

As per claim 5, Wilt teaches the invention substantially as claimed including a method, comprising: 
attaching a first set of one or more accelerator slots of an accelerator appliance to an application instance according to an application instance configuration, the attached application instance attached to the accelerator appliance in a multi-tenant provider network, the accelerator appliance comprising a plurality of accelerators, the plurality of accelerators having a compute capacity, each accelerator slot of the first set of accelerator slots corresponding to a fraction of the compute capacity of the plurality of accelerators (par. 0033 … a virtual compute instance [equiv. to application instance] may be provisioned, and a first set of one or more GPU(s) may be attached to the instance to provide graphics processing. The first set of one or more virtual GPUs [GPU slots] may provide a particular level of graphics processing; par. 0035 … Using the techniques described herein, a virtual compute instance may be provisioned. The virtual compute instance may be configured to execute an application. The application may be associated with graphics requirements. For example, an application manifest [configuration] may specify a recommended graphics processing unit (GPU) class and/or size of video memory for the application, or analysis of execution of the application may determine graphics requirements for the application … The virtual GPU may be implemented using a physical GPU that is connected to the virtual compute instance over a network [remote]);
loading … [graphics processing] of the attached application instance onto the first set of accelerator slots (par. 0079 In various embodiments … suitable technique(s) may be used to offload graphics processing from the virtual compute instance 141C to one or more physical GPUs used to implement the application-specific virtual GPUs 151C-151N);
migrating processing for the attached application instance from the first set of accelerator slots to a second set of one or more accelerator slots … (par. 0089 In one embodiment, the graphics processing provided by a first virtual GPU may be migrated to a second virtual GPU). 
Wilt does not expressly teach:
loading a machine learning model of the attached application instance onto the first set of accelerator slots; handling a first set of one or more inference calls, made by an application of the attached application instance to the loaded machine learning model, using the first set of accelerator slots; handling a second set of inference calls, made by the application of the attached application instance to the loaded machine learning model, using the second set of accelerator slots; including loading the machine learning model of the attached application instance onto the second set of accelerator slots.
However, Huynh teaches: loading a machine learning model of the attached application instance onto the first set of accelerator slots; including loading the machine learning model of the attached application instance onto the second set of accelerator slots (page 83, left column, lines 23-26 we also developed a tool that automatically converts pre-trained legacy models and loads them to DeepMon with its various optimization strategies applied; pg. 83, right column, lines 4-6 developers can easily load pre-trained legacy models [machine learning models] on various mobile GPUs by using DeepMon’s model converting tool; Figure 3.); 
handling a first set of one or more inference calls, made by an application of the attached application instance to the loaded machine learning model, using the first set of accelerator slots; handling a second set of inference calls, made by the application of the attached application instance to the loaded machine learning model, using the second set of accelerator slots (pg. 86, left column, section 6.1, lines 2-6 DeepMon works through two different phases: (1) the model conversion phase to convert existing models to run on mobile GPUs, and (2) the inference phase to process image streams using the converted model to recognize useful information).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Wilt to include methods for loading pre-trained legacy models on GPUs as set forth by Huynh because doing so would provide for efficiently offloading inference calls onto the GPUs and accelerating the processing, with predictable results.

As per claim 6, Wilt further teaches wherein managing resources of the accelerator appliance includes managing a central processing unit, memory, and ingress network bandwidth (par. 0038 … In one embodiment, the provider network 100 may offer virtual compute instances 141A-141Z with varying computational and/or memory resources. In one embodiment, each of the virtual compute instances 141A-141Z may correspond to one of several instance types. An instance type may be characterized by its computational resources (e.g., number, type, and configuration of central processing units [CPUs] or CPU cores), memory resources (e.g., capacity, type, and configuration of local memory), storage resources (e.g., capacity, type, and configuration of locally accessible storage), network resources).

As per claim 11, Wilt further teaches wherein processing for the application instance is migrated from the first set of accelerator slots to the second set of accelerator slots for the application instance due to a change in requirements (par. 0089 In one embodiment, the graphics processing provided by a first virtual GPU may be migrated to a second virtual GPU; par. 0033 In one embodiment, the migration of graphics processing may be performed based (at least in part) on user input representing a change in GPU requirements).
 
As per claim 12, Wilt teaches: wherein the change in requirements is specified by a user of the attached application instance (par. 0033 In one embodiment, the migration of graphics processing may be performed based (at least in part) on user input representing a change in GPU requirements).

As per claim 13, Wilt further teaches wherein processing for the attached application instance is migrated from the first set of accelerator slots to the second set of accelerator slots due to a degradation of performance (par. 0033 In one embodiment, the migration of graphics processing may be performed based (at least in part) on detection of an increase in graphics workload, where increase in workload results in performance degradation).

As per claim 14, Wilt teaches: wherein the second set of accelerator slots provides a different level of processing relative to the first set of accelerator slots (par. 0033 The first set of one or more virtual GPUs may provide a particular level of graphics processing. After a change in GPU requirements for the instance is determined, the second set of one or more virtual GPU(s) may be selected and attached to the virtual compute instance to replace the graphics processing of the first virtual GPU(s) with a different level of graphics processing).

As per claim 15, Wilt teaches: wherein migrating processing for the application instance from the first set of accelerator slots to the second set of accelerator slots comprises causing the second set of one or more accelerator slots to assume operation in place of the first set of one or more accelerator slots (par. 0104 Migration of graphics processing may represent replacing the graphics processing provided by the local GPU with the graphics processing provided by the virtual GPU with respect to one or more applications).

As per claim 16, Wilt teaches: wherein the accelerator appliance includes accelerators of different capabilities (par. 0035 The virtual GPU may be selected from a set of virtual GPUs (e.g., belonging to virtual GPU classes) having different capabilities for graphics processing).

As per claim 17, it is a system having similar limitations as claim 5. Thus, claim 17 is rejected for the same rationale as applied to claim 5. Wilt further teaches storage to store an application (Fig. 18, System Memory 3020); and an elastic inference service implemented by a second one or more electronic devices, the elastic inference service including an application instance and an accelerator appliance (Fig. 1, Elastic Graphics Service 110). 

As per claim 18, it is a system having similar limitations as claim 6. Thus, claim 18 is rejected for the same rationale as applied to claim 6.

As per claim 20, it is a system having similar limitations as claim 13. Thus, claim 20 is rejected for the same rationale as applied to claim 13.

Claims 7 and 8 are rejected under 35 U.S.C. 103 as being unpatentable over Wilt in view of Huynh, as applied to claim 5 above, and further in view of Jain et al. “Dynamic Space-Time Scheduling for GPU Inference”.
Jain was cited in a previous Office Action.

As per claims 7 and 8, Wilt and Huynh do not expressly teach wherein managing resources of the accelerator appliance includes spatially multiplexing one or more accelerator slots; wherein managing resources of the accelerator appliance includes temporally multiplexing a tensor processing block into a single accelerator slot.
However, Jain teaches managing resources of the accelerator appliance includes spatially multiplexing one or more accelerator slots, wherein managing resources of the accelerator appliance includes temporally multiplexing a tensor processing block into a single accelerator slot (Abstract; page 3, Section 3, 3 space and time multiplexing).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Wilt and Huynh to include the technique of managing resources as set forth by Jain because doing so would leverage both temporal and spatial multiplexing to improve GPU utilization for deep learning inference workloads.

Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Wilt in view of Huynh as applied to claim 5, and further in view of Gopalakrishnan et al. (US Pub. No. 20180089794 A1).
Gopalakrishnan was cited in a previous Office Action.

As per claim 9, Wilt and Huynh teach the limitations of claim 5. Wilt and Huynh do not expressly teach: updating at least one software version used by the accelerator slot to execute the machine learning model.
However, Gopalakrishnan teaches updating at least one software version used by the accelerator slot to execute the machine learning model (par. 0099 the use of firmware in the GPU, which can be updated … to include the desired function).
It would have been obvious to one of ordinary skill before the effective filing date of the claimed invention to modify the teaching of Wilt and Huynh to include the method of updating firmware used by the GPU as disclosed by Gopalakrishnan because it would provide for efficiently updating software used by GPU slots so as to include updated functions.

Claims 22-23 are rejected under 35 U.S.C. 103 as being unpatentable over Wilt in view of Huynh as applied to claims 5 and 17, and further in view of Smith et al. (U.S. Pub. No. 20110134761 A1).

As per claim 22, Wilt and Huynh teach the limitations of claim 5. Wilt and Huynh do not expressly teach detecting a response timing related to handling the first set of inference calls using the first set of accelerator slots; determining that the response timing exceeds a threshold; and performing the migrating based on the response timing exceeding the threshold.
However, Smith teaches: detecting a response timing related to handling the first set of inference calls using the first set of accelerator slots; determining that the response timing exceeds a threshold; and performing the migrating based on the response timing exceeding the threshold (par. 0007 determining a response time for each of a plurality of virtual machines running on the plurality of compute nodes … migrating a virtual machine in response to a particular one of the virtual machines on a particular one of the compute nodes having a response time that exceeds a response time setpoint … to a target one of the compute nodes). 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Wilt and Huynh by incorporating the technique of migrating virtual machines based on response time as set forth by Smith because it would provide for effectively migrating inference processing between virtual GPUs at least based on the response times in order to improve performance. 

As per claim 23, it is a system having similar limitations as claim 22. Thus, claim 23 is rejected for the same rationale as applied to claim 22.

Response to Arguments
Applicant's arguments with respect to claims 1, 5 and 17 have been considered but are moot in view of the new ground(s) of rejection.

Examiner’s Amendment Proposal
The following amendment proposal was presented to Applicant’s representative during an interview conducted on 11/03/2022.

1. (Currently Amended) A computer-implemented method, comprising: 
attaching a first set of in a multi-tenant provider network, the accelerator appliance comprising a plurality of GPUs, the plurality of GPUs having a compute capacity, each GPU slot of the first set of GPU slots corresponding to a fraction of the compute capacity of the plurality of GPUs; 
loading a machine learning model of the attached application instance onto the first set of GPU slots; 
handling a first set of 
tracking responses, , by calculating a timing of the responses associated with each inference call of the first set of inference calls; 
determining that the timing of one or more responses associated with one or more inference calls of the first set of inference calls exceeds a threshold;
migrating processing for the attached application instance from the first set of GPU slots to a second set of , wherein migrating further comprises detaching the first set of GPU slots, attaching the second set of GPU slots to the application instance, and causing the second set of GPU slots to assume operation in place of the first set of GPU slots; and
handling a second set of 

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
NPL “Glow: Graph Lowering Compiler Techniques for Neural Networks”

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Willy W. Huaracha whose telephone number is (571)270-5510.  The examiner can normally be reached on M-F 8:30-5:00pm.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Meng-Ai An can be reached on (571) 272-3756.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/WH/
Examiner, Art Unit 2195

/MENG AI T AN/
Supervisory Patent Examiner, Art Unit 2195