Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION
Claims 1-3, 6-9, 11-18 and 20-23 are currently pending and have been examined.

Information Disclosure Statement
The information disclosure statements (IDS) submitted on 10/14/2021 and 12/02/2021 have been considered. The submissions are in compliance with the provisions of 37 CFR 1.97. Form PTO-1449 is signed and attached hereto.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.  Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2 and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Wilt et al. (U.S. Pub. No. 20170132746 A1) in view of Yang et al. (U.S. Pub. No. 20190005606 A1), and further in view of Smith et al. (U.S. Pub. No. 20110134761 A1).
Wilt and Yang were cited in a previous Office Action.

As per claim 1, Wilt teaches the invention substantially as claimed including a computer-implemented method, comprising: 
attaching a first set of one or more graphical processing unit (GPU) slots of an accelerator appliance to an application instance according to an application instance configuration, the application instance remote from the accelerator appliance in a multi-tenant provider network, the accelerator appliance comprising a plurality of GPUs, the plurality of GPUs having a compute capacity, each GPU slot of the first set of GPU slots corresponding to a fraction of the compute capacity of the plurality of GPUs (par. 0033 … a virtual compute instance [equiv. to application instance] may be provisioned, and a first set of one or more GPU(s) may be attached to the instance to provide graphics processing. The first set of one or more virtual GPUs [GPU slots] may provide a particular level of graphics processing; par. 0035 … Using the techniques described herein, a virtual compute instance may be provisioned. The virtual compute instance may be configured to execute an application. The application may be associated with graphics requirements. For example, an application manifest [configuration] may specify a recommended graphics processing unit (GPU) class and/or size of video memory for the application, or analysis of execution of the application may determine graphics requirements for the application …  The virtual GPU may be implemented using a physical GPU that is connected to the virtual compute instance over a network [remote]); 
loading [graphic processing] of the application instance onto the first set of one or more GPU slots (par. 0079 In various embodiments … suitable technique(s) may be used to offload graphics processing from the virtual compute instance 141C to one or more physical GPUs used to implement the application-specific virtual GPUs 151C-151N); 
migrating processing for the application instance from the first set of one or more GPU slots to a second set of one or more GPU slots (par. 0089 In one embodiment, the graphics processing provided by a first virtual GPU may be migrated to a second virtual GPU). 
Wilt does not expressly teach: a machine learning model; performing inference using the loaded machine learning model of the application instance using the first set of one or more GPU slots.
However, Yang teaches: a machine learning model; performing inference using the loaded machine learning model of the application instance using the first set of one or more GPU slots (par. 0111 For example, FIG. 7 is a diagram of a computer system 300 that is useful for machine learning applications. The system 300 includes a CPU 302, an accelerator [GPUs] 304, and a storage device 306. The accelerator 304 can include one or more graphics processing units … In this example, training orchestration frameworks 318 can be executed on the CPU 302, while model training and validation are performed on the accelerator 304 … Training data 316 are stored in the storage device 306. By using the DMA techniques described above, the training data 316 can be transferred from the storage device 306 to the memory of the accelerator).
It would have been obvious to one of ordinary skill in the art before the effective 
Wilt and Yang do not expressly teach: detecting a response timing related to performing inference using the loaded machine learning model of the application using the first set of one or more GPU slots. 
However, Smith teaches: detecting a response timing related to performing inference using the loaded machine learning model of the application using the first set of one or more [nodes] (par. 0007 determining a response time for each of a plurality of virtual machines running on the plurality of compute nodes … migrating a virtual machine in response to a particular one of the virtual machines on a particular one of the compute nodes having a response time that exceeds a response time setpoint … to a target one of the compute nodes).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Wilt and Yang by incorporating the technique of migrating virtual machines based on response time as set forth by Smith because it would provide for effectively migrating inference processing between virtual GPUs at least based on the response times in order to improve performance. 

As per claim 2, Wilt further teaches wherein the accelerator appliance includes at least one GPU and at least one other type of accelerator (par. 0061 … The virtual 

As per claim 21, Smith teaches: determining that the response timing related to performing inference using the loaded machine learning model of the application using the first set of one or more GPU slots exceeds a threshold; and performing the migrating based on the response timing exceeding the threshold (par. 0007 determining a response time for each of a plurality of virtual machines running on the plurality of compute nodes … migrating a virtual machine in response to a particular one of the virtual machines on a particular one of the compute nodes having a response time that exceeds a response time setpoint … to a target one of the compute nodes).
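For illustration only, the threshold-based migration recited in claim 21 can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Wilt, Yang, or Smith: the names GpuSlot, ApplicationInstance, should_migrate, migrate, and check_and_migrate are hypothetical, and the use of a mean response time compared against a threshold in milliseconds is an assumed measurement policy.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GpuSlot:
    """Hypothetical slot: a fraction of an appliance's total GPU compute capacity."""
    slot_id: str
    capacity_fraction: float  # e.g. 0.25 means one quarter of the capacity


@dataclass
class ApplicationInstance:
    """Hypothetical application instance and the slots currently attached to it."""
    name: str
    slots: List[GpuSlot] = field(default_factory=list)


def should_migrate(response_times_ms: List[float], threshold_ms: float) -> bool:
    """Return True when the mean observed response time exceeds the threshold."""
    if not response_times_ms:
        return False  # no measurements, so no basis for migrating
    mean_ms = sum(response_times_ms) / len(response_times_ms)
    return mean_ms > threshold_ms


def migrate(instance: ApplicationInstance, target_slots: List[GpuSlot]) -> List[GpuSlot]:
    """Replace the attached slots with the target slots; return the previous slots."""
    previous = instance.slots
    instance.slots = list(target_slots)
    return previous


def check_and_migrate(instance: ApplicationInstance,
                      response_times_ms: List[float],
                      threshold_ms: float,
                      target_slots: List[GpuSlot]) -> bool:
    """Migrate only if the response timing exceeds the threshold (assumed policy)."""
    if should_migrate(response_times_ms, threshold_ms):
        migrate(instance, target_slots)
        return True
    return False


if __name__ == "__main__":
    first_set = [GpuSlot("slot-a", 0.25)]
    second_set = [GpuSlot("slot-b", 0.50)]
    app = ApplicationInstance("app-1", list(first_set))
    moved = check_and_migrate(app, [120.0, 140.0, 160.0], 100.0, second_set)
    print(moved, [s.slot_id for s in app.slots])
```

In this sketch, the second set of slots simply takes the place of the first set once the measured timing crosses the threshold; how timing is sampled and how a target set is selected are left open.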

Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Wilt in view of Yang and Smith as applied to claim 1, and further in view of Gopalakrishnan et al. (US Pub. No. 20180089794 A1).
Gopalakrishnan was cited in a previous Office Action.

As per claim 3, Wilt, Yang and Smith teach the limitations of claim 1. Wilt further teaches while attaching the first set of one or more GPU slots of the accelerator appliance to an application instance (par. 0033 … a virtual compute instance [equiv. to application instance] may be provisioned, and a first set of one or more GPU(s) may be attached to the instance to provide graphics processing).
Wilt, Yang and Smith do not expressly teach updating at least one software version used by the GPU slot to be compatible with the machine learning model.
However, Gopalakrishnan teaches updating at least one software version used by the GPU slot to be compatible with the machine learning model (par. 0099 the use of firmware in the GPU, which can be updated … to include the desired function). 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Wilt, Yang and Smith to include the method of updating firmware used by the GPU as disclosed by Gopalakrishnan because it would provide for efficiently updating software used by GPU slots so as to include updated functions.

Claims 5-6, 11-18 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Wilt et al. (U.S. Pub. No. 20170132746 A1) in view of Yang et al. (U.S. Pub. No. 20190005606 A1).
Wilt and Yang were cited in a previous Office Action.

As per claim 5, Wilt teaches the invention substantially as claimed including a computer-implemented method, comprising: 
attaching a first set of one or more accelerator slots of an accelerator appliance to an application instance according to an application instance configuration, the application instance attached to the accelerator appliance in a multi-tenant provider network, the accelerator appliance comprising a plurality of accelerators, the plurality of accelerators having a compute capacity, each accelerator slot of the first set of accelerator slots corresponding to a fraction of the compute capacity of the plurality of accelerators (par. 0033 … a virtual compute instance [equiv. to application instance] may be provisioned, and a first set of one or more GPU(s) may be attached to the instance to provide graphics processing. The first set of one or more virtual GPUs [GPU slots] may provide a particular level of graphics processing; par. 0035 … Using the techniques described herein, a virtual compute instance may be provisioned. The virtual compute instance may be configured to execute an application. The application may be associated with graphics requirements. For example, an application manifest [configuration] may specify a recommended graphics processing unit (GPU) class and/or size of video memory for the application, or analysis of execution of the application may determine graphics requirements for the application …  The virtual GPU may be implemented using a physical GPU that is connected to the virtual compute instance over a network [remote]).
loading [graphic processing] of the application instance onto the first set of one or more accelerator slots (par. 0079 In various embodiments … suitable technique(s) may be used to offload graphics processing from the virtual compute instance 141C to one or more physical GPUs used to implement the application-specific virtual GPUs 151C-151N).
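migrating processing for the application instance from the first set of one or more accelerator slots to a second set of one or more accelerator slots (par. 0089 In one embodiment, the graphics processing provided by a first virtual GPU may be migrated to a second virtual GPU).
Wilt does not expressly teach: a machine learning model; performing inference using the loaded machine learning model of the application instance using the first set of one or more accelerator slots.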
However, Yang teaches: a machine learning model; performing inference using the loaded machine learning model of the application instance using the first set of one or more accelerator slots (par. 0111 For example, FIG. 7 is a diagram of a computer system 300 that is useful for machine learning applications. The system 300 includes a CPU 302, an accelerator [GPUs] 304, and a storage device 306. The accelerator 304 can include one or more graphics processing units … In this example, training orchestration frameworks 318 can be executed on the CPU 302, while model training and validation are performed on the accelerator 304 … Training data 316 are stored in the storage device 306. By using the DMA techniques described above, the training data 316 can be transferred from the storage device 306 to the memory of the accelerator).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Wilt to include methods for performing model training using one or more GPUs of an accelerator as set forth by Yang because it would provide for processing of large amounts of training data faster, with predictable results.

As per claim 6, Wilt further teaches wherein managing resources of the accelerator appliance includes managing a central processing unit, memory, and ingress network bandwidth (par. 0038 … In one embodiment, the provider network 100 

As per claim 11, Wilt further teaches wherein processing for the application instance is migrated from the first set of one or more accelerator slots to the second set of one or more accelerator slots for the application instance due to a change in requirements (par. 0089 In one embodiment, the graphics processing provided by a first virtual GPU may be migrated to a second virtual GPU; par. 0033 In one embodiment, the migration of graphics processing may be performed based (at least in part) on user input representing a change in GPU requirements).

As per claim 12, Wilt further teaches: wherein the change in requirements is specified by a user of the application instance (par. 0033 In one embodiment, the migration of graphics processing may be performed based (at least in part) on user input representing a change in GPU requirements).

As per claim 13, Wilt further teaches: wherein processing for the application 

As per claim 14, Wilt further teaches: wherein the second set of one or more accelerator slots provides a different level of processing relative to the first set of one or more accelerator slots (par. 0033 The first set of one or more virtual GPUs may provide a particular level of graphics processing. After a change in GPU requirements for the instance is determined, the second set of one or more virtual GPU(s) may be selected and attached to the virtual compute instance to replace the graphics processing of the first virtual GPU(s) with a different level of graphics processing).

As per claim 15, Wilt teaches: wherein migrating processing for the application instance from the first set of one or more accelerator slots to the second set of one or more accelerator slots comprises causing the second set of one or more accelerator slots to assume operation in place of the first set of one or more accelerator slots (par. 0104 Migration of graphics processing may represent replacing the graphics processing provided by the local GPU with the graphics processing provided by the virtual GPU with respect to one or more applications). 

As per claim 16, Wilt teaches: wherein the accelerator appliance includes accelerators of different capabilities (par. 0035 The virtual GPU may be selected from a set of virtual GPUs (e.g., belonging to virtual GPU classes) having different capabilities for graphics processing).

As per claim 17, it is a system having similar limitations as claim 1. Thus, claim 17 is rejected for the same rationale as applied to claim 1. Wilt further teaches storage to store an application (Fig. 18, System Memory 3020); and an elastic inference service implemented by a second one or more electronic devices, the elastic inference service including an application instance and an accelerator appliance (Fig. 1, Elastic Graphics Service 110). Yang teaches application including a machine learning model (par. 0111 For example, FIG. 7 is a diagram of a computer system 300 that is useful for machine learning applications. The system 300 includes a CPU 302, an accelerator 304, and a storage device 306. The accelerator 304 can include one or more graphics processing units … In this example, training orchestration frameworks 318 can be executed on the CPU 302, while model training and validation are performed on the accelerator 304).

As per claim 18, it is a system having similar limitations as claim 6. Thus, claim 18 is rejected for the same rationale as applied to claim 6.

As per claim 20, it is a system having similar limitations as claim 13. Thus, claim 20 is rejected for the same rationale as applied to claim 13.

Claims 7 and 8 are rejected under 35 U.S.C. 103 as being unpatentable over Wilt in view of Yang, as applied to claim 5 above, and further in view of Jain et al. “Dynamic Space-Time Scheduling for GPU Inference”.
Jain was cited in a previous Office Action.

As per claims 7 and 8, Wilt and Yang do not expressly teach wherein managing resources of the accelerator appliance includes spatially multiplexing one or more accelerator slots; wherein managing resources of the accelerator appliance includes temporally multiplexing a tensor processing block into a single accelerator slot.
However, Jain teaches managing resources of the accelerator appliance includes spatially multiplexing one or more accelerator slots, wherein managing resources of the accelerator appliance includes temporally multiplexing a tensor processing block into a single accelerator slot (Abstract; page 3, Section 3, 3 space and time multiplexing).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Wilt and Yang to include the technique of managing resources as set forth by Jain because it would provide for leveraging both temporal and spatial multiplexing to improve GPU utilization for deep learning inference workloads.
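For illustration only, the distinction between temporal and spatial multiplexing relied upon above can be sketched as follows. This is a simplified sketch and is not the scheduler described by Jain: the function names temporal_multiplex and spatial_multiplex are hypothetical, and the round-robin ordering and proportional scaling are assumed policies chosen for brevity.

```python
from typing import Dict, List


def temporal_multiplex(jobs: List[str], num_slices: int) -> List[str]:
    """Time-share one slot: each time slice runs exactly one job, in round-robin order."""
    if not jobs:
        return []
    return [jobs[i % len(jobs)] for i in range(num_slices)]


def spatial_multiplex(demands: Dict[str, float], capacity: float = 1.0) -> Dict[str, float]:
    """Space-share capacity: all jobs run concurrently, each on a fraction of the capacity.

    If the total demand exceeds the capacity, every demand is scaled down
    proportionally (an assumed policy).
    """
    total = sum(demands.values())
    if total <= capacity:
        return dict(demands)
    scale = capacity / total
    return {job: demand * scale for job, demand in demands.items()}


if __name__ == "__main__":
    # One slot shared over time by two hypothetical models.
    print(temporal_multiplex(["model-a", "model-b"], 5))
    # Two hypothetical models running at the same time on fractions of the capacity.
    print(spatial_multiplex({"model-a": 1.0, "model-b": 1.0}))
```

Temporal multiplexing gives the whole slot to one workload at a time, whereas spatial multiplexing divides the capacity among workloads that execute concurrently.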

Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Wilt in view of Yang as applied to claim 5, and further in view of Gopalakrishnan et al. (US Pub. No. 20180089794 A1).
Gopalakrishnan was cited in a previous Office Action.

As per claim 9, Wilt and Yang teach the limitations of claim 5. Wilt and Yang do not expressly teach: updating at least one software version used by the accelerator slot to execute the machine learning model.
However, Gopalakrishnan teaches updating at least one software version used by the accelerator slot to execute the machine learning model (par. 0099 the use of firmware in the GPU, which can be updated … to include the desired function). 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Wilt and Yang to include the method of updating firmware used by the GPU as disclosed by Gopalakrishnan because it would provide for efficiently updating software used by GPU slots so as to include updated functions.

Claims 22-23 are rejected under 35 U.S.C. 103 as being unpatentable over Wilt in view of Yang as applied to claims 5 and 17, and further in view of Smith et al. (U.S. Pub. No. 20110134761 A1).

As per claim 22, Wilt and Yang teach the limitations of claim 5. Wilt and Yang do not expressly teach detecting a response timing related to performing inference using the loaded machine learning model of the application using the first set of one or more accelerator slots; determining that the response timing exceeds a threshold; and performing the migrating based on the response timing exceeding the threshold.
Smith teaches: detecting a response timing related to performing inference using the loaded machine learning model of the application using the first set of one or more accelerator slots; determining that the response timing exceeds a threshold; and performing the migrating based on the response timing exceeding the threshold (par. 0007 determining a response time for each of a plurality of virtual machines running on the plurality of compute nodes … migrating a virtual machine in response to a particular one of the virtual machines on a particular one of the compute nodes having a response time that exceeds a response time setpoint … to a target one of the compute nodes). 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Wilt and Yang by incorporating the technique of migrating virtual machines based on response time as set forth by Smith because it would provide for effectively migrating inference processing between virtual GPUs at least based on the response times in order to improve performance. 

As per claim 23, it is a system having similar limitations as claim 22. Thus, claim 23 is rejected for the same rationale as applied to claim 22.

Response to Arguments
Applicant's arguments filed 12/02/2021 have been fully considered but they are not persuasive.
(1) The applicant argues in page 12 for claim 1 that the combination of Wilt and 
As per point 1, the examiner respectfully submits that, upon closer review, Wilt clearly teaches migrating graphics processing between virtual GPUs (par. 0089 In one embodiment, the graphics processing provided by a first virtual GPU may be migrated to a second virtual GPU). Therefore, Applicant’s arguments are unpersuasive.
(2) The applicant appears to argue on pages 13-14 for claim 5 that the combination of Wilt and Yang does not teach “migrating processing for an application instance from a first set of one or more accelerator slots to a second set of one or more accelerator slots.” 
As per point 2, the examiner respectfully submits that, upon closer review, Wilt clearly describes this limitation (par. 0089 In one embodiment, the graphics processing provided by a first virtual GPU may be migrated to a second virtual GPU), wherein GPUs are accelerators. Therefore, Applicant’s arguments are also unpersuasive.
(3) The applicant's arguments on page 14 for claim 17 are similar to the arguments regarding claim 5 in point 2 above, and therefore a similar response applies to the arguments regarding claim 17. 

Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Willy W. Huaracha whose telephone number is (571)270-5510.  The examiner can normally be reached on M-F 8:30-5:00pm. 
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Meng-Ai An can be reached on (571) 272-3756.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a 


/WH/
Examiner, Art Unit 2195

/MENG AI T AN/
Supervisory Patent Examiner, Art Unit 2195