DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This Office Action is in response to the application filed on 6/27/2018, wherein claims 1-20 are pending.

Double Patenting
Claims 1-20 of this application are patentably indistinct from claims 1-20 of Application No. 16/020788. Pursuant to 37 CFR 1.78(f), when two or more applications filed by the same applicant or assignee contain patentably indistinct claims, elimination of such claims from all but one application may be required in the absence of good and sufficient reason for their retention during pendency in more than one application. Applicant is required to either cancel the patentably indistinct claims from all but one application or maintain a clear line of demarcation between the applications. See MPEP § 822.
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/process/file/efs/guidance/eTD-info-I.jsp.
Claims 1-20 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1-20 of U.S. Patent Application No. 16/020788. Although the claims at issue are not identical, they are not patentably distinct from each other.
This is a provisional nonstatutory double patenting rejection because the patentably indistinct claims have not been patented.
A claim has been mapped out below as an example:

Reference Application (16/020788)
1. A computer-implemented method, comprising:
receiving, in a multi-tenant web services provider, an application instance configuration, an application of the application instance to utilize a portion of an attached graphics processing unit (GPU) during execution of a machine learning model and the application instance configuration including
an arithmetic precision of the machine learning model to be used in determining the portion of the GPU to provision;
provisioning the application instance and the portion of the GPU attached to the application instance, wherein the application instance is implemented using a physical compute instance in a first instance location, wherein the portion of the GPU is implemented using a physical GPU in the second location, and wherein the physical GPU is accessible to the physical compute instance over a network;
loading the machine learning model onto the portion of the GPU; and
performing inference using the loaded machine learning model of the application using the portion of the GPU on the attached GPU.

Instant Application (16/020776)
1. A computer-implemented method, comprising:
receiving, in a multi-tenant web services provider, an application instance configuration, an application of the application instance to utilize a portion of an attached graphics processing unit (GPU) during execution of a machine learning model and the application instance configuration including:
an indication of the central processing unit (CPU) capability to be used,
an arithmetic precision of the machine learning model to be used,
an indication of the GPU capability to be used,
a storage location of the application, and
an indication of an amount of random access memory to use;
provisioning the application instance and the portion of the GPU attached to the application instance, wherein the application instance is implemented using a physical compute instance in a first instance location, wherein the portion of the GPU is implemented using a physical GPU in the second location, and wherein the physical GPU is accessible to the physical compute instance over a network;
attaching the portion of the GPU to the application instance;
loading the machine learning model onto the attached portion of the GPU; and
performing inference using the loaded machine learning model of the application using the portion of the GPU on the attached GPU.


Regarding claim 1, claim 1 of the reference application does not teach that the request includes an indication of the central processing unit capability to be used, an indication of the GPU capability to be used, a storage location of the application, and an indication of an amount of random access memory to use, or attaching the portion of the GPU to the application instance. Fong et al. (US PGPUB 2018/0276044) teaches request requirements including an indication of the central processing unit capability to be used, an indication of the GPU capability to be used, a storage location of the application, and an indication of an amount of random access memory to use (paragraph 26), and attaching the portion of the GPU to the application instance (paragraph 30). One of ordinary skill in the art would have been motivated to make this modification in order to improve network topology-aware cloud scheduling of machine learning workloads (Fong, paragraph 3).
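For illustration only, the difference between the two versions of claim 1 can be pictured as a set comparison. The following is a minimal sketch in which every field name and step name is a hypothetical label chosen for this example; none of these identifiers appears in either application or in Fong.

```python
# Illustrative sketch only: the keys below are hypothetical shorthand for the
# recited claim limitations, not identifiers taken from either application.
REFERENCE_CLAIM_1 = {
    "config": {"arithmetic_precision"},
    "steps": ["receive_config", "provision", "load_model", "perform_inference"],
}

INSTANT_CLAIM_1 = {
    "config": {
        "cpu_capability",
        "arithmetic_precision",
        "gpu_capability",
        "storage_location",
        "ram_amount",
    },
    "steps": [
        "receive_config",
        "provision",
        "attach_gpu",
        "load_model",
        "perform_inference",
    ],
}


def added_limitations(reference, instant):
    """Return the config fields and steps found only in the instant claim."""
    extra_config = sorted(instant["config"] - reference["config"])
    extra_steps = [s for s in instant["steps"] if s not in reference["steps"]]
    return extra_config, extra_steps


extra_config, extra_steps = added_limitations(REFERENCE_CLAIM_1, INSTANT_CLAIM_1)
print(extra_config)  # the four configuration items relied on from Fong, paragraph 26
print(extra_steps)   # the attaching step relied on from Fong, paragraph 30
```

Under these assumed labels, the items returned by the comparison are the limitations for which the secondary reference is relied upon in the double patenting rejection.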
As for claims 5 and 16, they contain similar limitations as claim 1 above.  Thus, they are rejected under the same rationales.
As for claims 2-4, 6-15, and 17-20, they contain limitations that are similarly obvious over claims 2-4, 6-16, and 18-20 of the reference application and do not offer additional limitations that render them non-obvious in light of the reference application and Fong et al.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1, 5, and 15-16 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor, or for pre-AIA the applicant, regards as the invention.
The following claim limitations are unclear and indefinite:
Claim 14: It is unclear what is meant by “prior performing inference using the loaded machine learning model of the application using the portion of the accelerator on the attached accelerator…” because the sequence of words in bold does not make any linguistic sense in English. It is entirely unclear what structure or function the applicant is attempting to claim in this limitation. The Specification merely repeats the same problematic claim language. Thus, for the purpose of examination, the examiner will assume that the applicant means just “performing inference using….”
The following claim limitations lack antecedent basis:
Claims 1, 5, and 16: “the second location”
Claim 15: “the machine learning model format”
Claim 15: “the version number”

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1, 3, 5-6, 9-14, and 16-19 are rejected under 35 U.S.C. 103 as being unpatentable over Fong et al. (US PGPUB 2018/0276044), in view of Wang et al. (“An Elastic CNN inference Accelerator with Adaptive Trade-off between QoS and QoR”, DAC ’17, June 18-22, 2017, hereafter “Wang”).

As for claim 1, Fong teaches a computer-implemented method, comprising:
receiving, in a multi-tenant web services provider, an application instance configuration [scheduling requirement], an application of the application instance to utilize a portion of an attached graphics processing unit (GPU) during execution of a machine learning model (Fig. 1 – receive a work scheduling request 101, paragraphs 1-2,  and paragraphs 25-26, “The present invention relates….to a workload scheduling method…based on analysis of requirements on a central processing unit (CPU) memory and an accelerator such as a Graphics processing unit…”, “cognitive computing relies on computing infrastructure…accelerators such as GPUs and FPGAs are currently used…by many cognitive applications and services (e.g., machine learning and deep learning)….”, “…a work-scheduling request is received…analysis of scheduling requirement is performed…the workload may specify explicit requirements, for example, a minimum or maximum number of CPU cores…GPUs…memory…hardware architecture type…a GPU type…and GPU-GPU communication…SLA requirements…”) 
the application instance configuration including: an indication of the central processing unit (CPU) capability to be used, an indication of the GPU capability to be used, and an indication of an amount of random-access memory to use (paragraph 26, “…analysis of scheduling requirement on the number of GPU/CPU and topology/usage pattern…explicit requirements….minimum or maximum number of …CPU cores…GPUs…desired amount of memory”);
provisioning the application instance and the portion of the GPU attached to the application instance (Fig. 1 – requirement matching with availability of resources + GPU assignment preference rules -> allocation 104, paragraphs 29-30, “…the resources are allocated to the workload…” “…selected based on dynamic attributes of resources…current number of GPUs for exclusive usage matched requiring quantity of this workload, etc.….”), wherein the application instance is implemented using a physical compute instance in a first instance location, wherein the portion of the GPU is implemented using a physical GPU in the second location, and wherein the physical GPU is accessible to the physical compute instance over a network (paragraph 30 in view of Fig. 2B, “…selected based on dynamic attributes…GPU preference assignment rules are applied based on the GPU topology for performance gain…GPUs on different socket would be preferred for workload with low rate/load of cross-GPU data transfer…” Topological configurations are considered based on characteristics of the workload. These include a preference for GPUs on different sockets and, depending on whether the workload is CPU-GPU data transfer intensive, placement of the CPU-GPU pair on the same or different sockets. Thus, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the application to recognize that the allocated GPUs can be on a different socket than the allocated CPU, because doing so improves resource allocation that matches workload resource usage patterns. Fig. 2B depicts that the CPU and GPU can be connected over a network fabric.);
attaching the portion of the GPU to the application instance (paragraphs 16, 27, and 30. “…collecting information on workload resource requirements for execution including GPUs for offloading the computation to, matching the resource requirements …and dispatching the workload to allocated GPU, CPU, and memory resources…” “…for example, a workload…requires execution using one container of two GPUs ….” and “necessary containers are created from feasible resources…” Assigning GPUs to the container running the application instance is understood as attaching the GPUs to it.);
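For illustration only, the socket-based preference quoted from Fong, paragraph 30, can be pictured as a simple selection rule. This is a minimal sketch, not Fong's implementation: the `Gpu` type, the `pick_gpu_pair` function, and the same-socket preference for workloads with high cross-GPU traffic are all assumptions made for this example. Only the different-socket preference for low cross-GPU traffic comes from the quoted passage.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Gpu:
    name: str
    socket: int  # CPU socket the GPU is attached to (hypothetical field)


def pick_gpu_pair(free_gpus, cross_gpu_traffic_high):
    """Pick two free GPUs using a hypothetical socket-preference rule.

    Low cross-GPU traffic: prefer GPUs on different sockets (quoted passage).
    High cross-GPU traffic: prefer GPUs on the same socket (assumption).
    Falls back to the first available pair if the preferred layout is absent.
    """
    pairs = list(combinations(free_gpus, 2))
    if not pairs:
        return None  # fewer than two GPUs are free
    want_same_socket = cross_gpu_traffic_high
    for a, b in pairs:
        if (a.socket == b.socket) == want_same_socket:
            return (a, b)
    return pairs[0]


free = [Gpu("GPU1", 0), Gpu("GPU2", 0), Gpu("GPU3", 1)]
low_traffic_pair = pick_gpu_pair(free, cross_gpu_traffic_high=False)
high_traffic_pair = pick_gpu_pair(free, cross_gpu_traffic_high=True)
```

With the three hypothetical GPUs above, the low-traffic workload receives a pair that spans both sockets, and the high-traffic workload receives the two GPUs that share a socket.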

Fong teaches running workloads such as machine learning workloads (paragraph 2). Thus, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the application to recognize that running a machine learning workload necessarily involves loading the machine learning model, which is fundamental to any machine learning workload, into the container and, by extension, onto the GPUs assigned to the container running the workload. Nevertheless, in the interest of compact prosecution, Examiner notes that Fong does not explicitly state that the workload requirements include an arithmetic precision of the machine learning model to be used or a storage location of the application, or explicitly teach loading the machine learning model onto the attached portion of the GPU or performing inference using the loaded machine learning model of the application using the portion of the GPU on the attached GPU.
However, Wang teaches a known method of executing a machine learning workload in a cloud environment in which the workload requirements include an arithmetic precision of the machine learning model to be used (Page 3, Section 3.1: “…ELNA manager selects the CNN hyperparameters and the hardware operating mode, as the final configuration…map the final model to the accelerator at the correct mode, and generate the online-control bitstreams…” Page 4, Section 3.3, “the accelerator can operate in different precision/throughput modes.  For example in Fig. 3, each PE can work in unison-mode as a single-issue 16-bit PE or in separate mode as double-issue 8-bit PE…depending on the precision mode decided by the control bits…” teaches that the precision of the neural network (ML model) determines the mode of the accelerator to use in executing the neural network), and a storage location of the application (Page 4, Section 3.3, “…ELNA compiler prepares the layout of weight through data tiling and partitioning to improve locality.  The details of network mapping can be referred to [3] and [8]…” teaches that the application’s data locality dictates the partitioning of the work. While not explicitly stated, in order to take data locality into account, there must necessarily be locality information to take into account. Thus, the storage location of the application is implicitly and obviously disclosed, because it is basic information necessary to perform locality-based mapping of an application for execution in a networked environment.);
loading the machine learning model [model] onto the attached portion of the GPU and performing inference [network inference] using the loaded machine learning model of the application using the portion of the GPU on the attached GPU (Page 3, Section 3.1: “…Afterwards, ELNA manager selects the CNN hyperparameters and the hardware operating mode, as the final configuration.  When the configuration is decided, ELNA compiler will map the final model to the accelerator at the correct mode, and generate the online control bitstreams that will be used by the ELNA scheduler to direct the accelerator to execute the network inference…” and Pg. 4, Section 3.3: “…Second, the accelerator can operate in different precision/throughput modes…to suit the precision-mode of the reshaped CNN model.” teaches the inference is performed with a mode of the accelerator, depending on the precision of the neural network.). This known technique is applicable to the system of Fong as they both share characteristics and capabilities, namely, they are directed to execution of ML workloads utilizing neural network accelerators.
One of ordinary skill in the art before the effective filing date of the application would have recognized that applying the known technique of Wang would have yielded predictable results and resulted in an improved system. It would have been recognized that applying the technique of Wang to the teachings of Fong would have yielded predictable results because the level of ordinary skill in the art demonstrated by the references applied shows the ability to incorporate such neural network execution features into similar systems. Further, applying workload requirements that include the arithmetic precision of the model and the storage location of the ML workload, together with loading the ML model and performing inference using the loaded ML model, to Fong’s receiving of an application instance configuration and executing of an ML model on the attached portion of the GPU in a cloud environment would have been recognized by those of ordinary skill in the art as resulting in an improved system that allows improved support for accelerators of various topologies to execute ML workloads (Wang, Pg. 4, Section 3.3).
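For illustration only, the precision-dependent operating mode quoted from Wang, Section 3.3, can be pictured with a short sketch. The function name and the dictionary keys are hypothetical, and only the two modes named in the quoted passage (single-issue 16-bit "unison" mode and double-issue 8-bit "separate" mode) are modeled. This is not presented as Wang's control logic.

```python
def pe_mode_for_precision(precision_bits):
    """Map a model's arithmetic precision to a processing element (PE) mode.

    Hypothetical helper: covers only the 16-bit and 8-bit cases mentioned in
    the quoted passage and rejects any other precision.
    """
    if precision_bits == 16:
        # Unison mode: the PE acts as one single-issue 16-bit unit.
        return {"mode": "unison", "issue_width": 1, "operand_bits": 16}
    if precision_bits == 8:
        # Separate mode: the PE acts as a double-issue 8-bit unit.
        return {"mode": "separate", "issue_width": 2, "operand_bits": 8}
    raise ValueError(f"unsupported precision: {precision_bits} bits")


mode_16 = pe_mode_for_precision(16)
mode_8 = pe_mode_for_precision(8)
```

The sketch reflects the examiner's reading that the arithmetic precision of the model determines how the accelerator is operated during inference.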


As for claims 5 and 16, they contain similar limitations as claim 1 above.  Thus, they are rejected under the same rationales.
In addition, Fong also teaches that the accelerator contains GPUs (paragraph 1).

As for claim 3, Wang teaches the machine learning model includes a description of a computation graph for inference and weights obtained from training (Pg. 2, Section 3.1, “…neural weight and data to carry out convolution and other functions in a set of processing elements…”).

As for claims 6 and 19, they contain similar limitations as claim 3 above.  Thus, they are rejected under the same rationales.


As for claim 9, Fong also teaches the accelerator is one of a plurality of accelerators of an accelerator appliance (paragraph 22 and Fig. 2C).

As for claim 10, Wang also teaches the accelerator appliance includes accelerators of different capabilities (Pg. 3, Section 3.2, “…multiple precision modes, changing the model of high-precision operation into lower-bit-width operation…” teaches accelerators having different capabilities to operate in different modes).

As for claim 11, Fong also teaches a central processing unit of the accelerator appliance is shared proportional to capabilities of the plurality of accelerators (Fig. 2B, where GPU1 and GPU2 each have 80GB/s bandwidth to the CPU, thus sharing it proportionally in terms of the bandwidth of data that can be sent to or from the CPU for processing).

As for claim 12, Wang also teaches deattaching the portion of an attached accelerator; and migrating the machine learning model to a different portion of the attached accelerator (Page 3, Section 3.2, “…reconfigurable ELNA accelerator that supports multiple precision modes, changing the model of high-precision operation into lower bit-width operation…” teaches changing the operation from a portion of the accelerator performing high-precision operation to a portion performing lower bit-width operation.).

As for claim 17, it contains similar limitations as claim 12 above.  Thus, it is rejected under the same rationales.

As for claim 13, Fong also teaches deattaching the portion of an attached accelerator; and migrating the machine learning model to a portion of a different accelerator (Fig. 3F and 3G.  Workload B was migrated from GPU 1 to GPU4 connected to different CPUs).

As for claim 18, it contains similar limitations as claim 13 above.  Thus, it is rejected under the same rationales.

As for claim 14, Wang also teaches prior performing inference using the loaded machine learning model of the application using the portion of the accelerator on the attached accelerator, determining an inference engine to use based on the loaded machine learning model (Page 3, Section 3.1, “…ELNA manager selects the CNN hyper-parameters and the hardware operating mode, as the final configuration.  When the configuration is decided, ELNA compiler will map the final model to the accelerator at the correct mode, and generate the on-line control bitstreams that will be used by the ELNA scheduler to direct the accelerator to execute the network inference…” teaches the performance of inference using the loaded model, where the inference engine corresponds to the ML model loaded).

Claims 2 and 8 are rejected under 35 U.S.C. 103 as being unpatentable over Fong and Wang, in view of Thampy (US PGPUB 2019/0068627).

As for claim 2, Fong and Wang do not explicitly teach the application instance and the portion of the attached GPU are within different virtual networks.
However, Thampy teaches a cloud-based deep learning method in which the application instance and the portion of the attached GPU are within different virtual networks (paragraphs 70, 316, 321, “…client devices 406 can access a service while on a network of an organization…while on a network 450 external to the computing environment of the organization….or while connected to the network of the organization when on a network 450 external to the organization…in the latter case the client device maybe…connected to a virtual private network of the organization” teaches three embodiments; in the second, the client is on a different network than that of the organization and, in contrast to the third embodiment, is not connected to the network of the organization via a VPN. “…activity data may be obtained for one or more services…obtained for one or more users.” “…one or more client applications to interact with the server 2112 to use the services provided by these components…” teaches an application instance offloading workload to a server that includes the security monitoring and control system, including sending data for the server to process. See, e.g., paragraph 324. “….examples of algorithms include one or more classifier algorithms…based on one or more models of patterns…based on one or more supervised learning techniques, one or more unsupervised learning techniques…a deep learning toolkit that implements neural network…” teaches the security monitoring and control system running deployed models, wherein the security monitoring and control system is backed by GPUs. See paragraph 79.). This known technique is applicable to the system of Fong and Wang, as they share characteristics and capabilities; namely, they are all directed to execution of ML workloads.
One of ordinary skill in the art before the effective filing date of the application would have recognized that applying the known technique of Thampy would have yielded predictable results and resulted in an improved system. It would have been recognized that applying the technique of Thampy to the teachings of Fong and Wang would have yielded predictable results because the level of ordinary skill in the art demonstrated by the references applied shows the ability to incorporate such neural network execution features into similar systems. Further, applying an application instance that is remote from the network where the model/inference engine is running to Fong and Wang’s running of an ML model/inference engine on a GPU in a cloud environment would have been recognized by those of ordinary skill in the art as resulting in an improved system that allows improved security (Thampy, paragraph 5).

As for claim 8, it contains similar limitations as claim 2 above.  Thus, it is rejected under the same rationales.

Claims 4, 7, 15, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Fong and Wang, in view of Ravi et al. (US PGPUB 2020/0125956).

As for claim 4, the format in which the ML model is represented is merely a design choice, not dissimilar to the arbitrary choice of one language among multiple common programming languages to implement a feature that runs on the same underlying software. Nevertheless, in the interest of compact prosecution, Examiner notes that Fong and Wang do not explicitly teach that the machine learning model is in TensorFlow, MXNet, or ONNX format.
However, Ravi teaches a known method of running a machine learning model in a distributed, multi-tenant environment in which the machine learning model is in TensorFlow, MXNet, or ONNX format (paragraph 103, “the model…TensorFlow graph of a model…”).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the application to incorporate Ravi’s teaching of a machine learning model written in TensorFlow, MXNet, or ONNX format for implementing the ML models because doing so allows for cross-compatibility through conversion between known standard formats of representation (Ravi, paragraph 103).

As for claims 7 and 20, they contain similar limitations as claim 4 above.  Thus, they are rejected under the same rationales.

As for claim 15, Ravi teaches wherein the inference engine is compatible with the version number of the machine learning model format (paragraph 103. Examiner notes that the claim limitation here is recited with no other details directly related to it and without any teaching of checking version numbers. In contrast, Examiner notes that the Specification teaches an AIM and ASM determining compatibility, where, in response to determining compatibility, the inference app then performs actions. See paragraph 109. The inference engine itself is never checked for compatibility with the ML model format. Instead, the Specification teaches checking whether the ML model is compatible with the components that support running the model where the model is run and inference is made. See, e.g., paragraph 54. Here, the prior art teaches checking and, if necessary, converting the version of the model representation to something compatible with the ML services).

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to KEVIN X LU whose telephone number is (571)270-1233.  The examiner can normally be reached on M-F 10am-6pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.  
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Lewis Bullock, can be reached at (571)272-3759.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/KEVIN X LU/
Examiner, Art Unit 2199

/LEWIS A BULLOCK JR/
Supervisory Patent Examiner, Art Unit 2199