Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This office action concerns application number 17142407, filed on January 6, 2021, in which claims 1-20 are pending.


Drawings
The drawing filed on January 6, 2021 is accepted by the Examiner.

Allowable Subject Matter
Claims 1-5, 7-12 and 14-19 are allowable over the prior art of record.

Information Disclosure Statement
The information disclosure statement (IDS) was submitted on 03/10/2021. The submission is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

EXAMINER'S AMENDMENT
An examiner's amendment to the record appears below. Should the changes and/or additions be unacceptable to applicant, an amendment may be filed as provided by 37 CFR 1.312. To ensure consideration of such an amendment, it MUST be submitted no later than the payment of the issue fee.
Authorization for this examiner's amendment was given in a telephone interview with Ryan P. McCarthy (Reg. No. 50,636) on July 13, 2022. The application has been amended as follows:


In the claims:
1. (Currently Amended) A computer-implemented method for deployment of a multi-model machine learning (ML) inference service in a cloud environment, the method comprising: receiving, by an application programming interface (API) server of a plurality of API servers, a prediction request from a client system, each of the plurality of API servers comprising a stateless server; selecting, by the API server, a model server from a plurality of model servers based on the prediction request, each of the plurality of model servers comprising a stateful server; calling, by the API server, the model server to execute inference using a ML model loaded to memory of the model server; receiving, by the API server, an inference result from the ML model; sending, by the API server, the inference result to the client system; periodically calculating a ratio based on a number of cache hits and a number of cache misses; and adjusting a number of model servers in the plurality of model servers based on the ratio.

2. The method of claim 1, wherein selecting a model server comprises: providing a list of model servers indicating one or more model servers that are currently deployed; calculating a hash value for each model server in the list of model servers; sorting the list of model servers based on hash values to provide a sorted list; and selecting the model server based on the sorted list.

3. The method of claim 2, wherein each hash value is calculated based on a concatenation of a node identifier of a respective model server and a model identifier of the ML model.

4. The method of claim 1, further comprising determining, by the model server, that the ML model is loaded in memory, and in response, incrementing a cache hit.

5. The method of claim 1, further comprising determining, by the model server, that the ML model is not loaded in memory, and in response: incrementing a cache miss, retrieving the ML model from a model storage, and loading the ML model to the memory of the model server.

6. Cancelled

7. The method of claim 1, wherein at least one of a maximum number of model servers and a minimum number of model servers is determined based on the ratio, and the number of model servers is adjusted based on the maximum number of model servers and the minimum number of model servers.

8. (Currently Amended) A non-transitory computer-readable storage medium coupled to one or more processors and having instructions stored thereon which, when executed by the one or more processors, cause the one or more processors to perform operations for deployment of a multi-model machine learning (ML) inference service in a cloud environment, the operations comprising: receiving, by an application programming interface (API) server of a plurality of API servers, a prediction request from a client system, each of the plurality of API servers comprising a stateless server; selecting, by the API server, a model server from a plurality of model servers based on the prediction request, each of the plurality of model servers comprising a stateful server; calling, by the API server, the model server to execute inference using a ML model loaded to memory of the model server; receiving, by the API server, an inference result from the ML model; sending, by the API server, the inference result to the client system; periodically calculating a ratio based on a number of cache hits and a number of cache misses; and adjusting a number of model servers in the plurality of model servers based on the ratio.

9. The computer-readable storage medium of claim 8, wherein selecting a model server comprises: providing a list of model servers indicating one or more model servers that are currently deployed; calculating a hash value for each model server in the list of model servers; sorting the list of model servers based on hash values to provide a sorted list; and selecting the model server based on the sorted list.

10. The computer-readable storage medium of claim 9, wherein each hash value is calculated based on a concatenation of a node identifier of a respective model server and a model identifier of the ML model.

11. The computer-readable storage medium of claim 8, wherein operations further comprise determining, by the model server, that the ML model is loaded in memory, and in response, incrementing a cache hit.

12. The computer-readable storage medium of claim 8, wherein operations further comprise determining, by the model server, that the ML model is not loaded in memory, and in response: incrementing a cache miss, retrieving the ML model from a model storage, and loading the ML model to the memory of the model server.

13. Cancelled

14. The computer-readable storage medium of claim 8, wherein at least one of a maximum number of model servers and a minimum number of model servers is determined based on the ratio, and the number of model servers is adjusted based on the maximum number of model servers and the minimum number of model servers.

15. (Currently Amended) A system, comprising: a computing device; and a computer-readable storage device coupled to the computing device and having instructions stored thereon which, when executed by the computing device, cause the computing device to perform operations for deployment of a multi-model machine learning (ML) inference service in a cloud environment, the operations comprising: receiving, by an application programming interface (API) server of a plurality of API servers, a prediction request from a client system, each of the plurality of API servers comprising a stateless server; selecting, by the API server, a model server from a plurality of model servers based on the prediction request, each of the plurality of model servers comprising a stateful server; calling, by the API server, the model server to execute inference using a ML model loaded to memory of the model server; receiving, by the API server, an inference result from the ML model; sending, by the API server, the inference result to the client system; periodically calculating a ratio based on a number of cache hits and a number of cache misses; and adjusting a number of model servers in the plurality of model servers based on the ratio.

16. The system of claim 15, wherein selecting a model server comprises: providing a list of model servers indicating one or more model servers that are currently deployed; calculating a hash value for each model server in the list of model servers; sorting the list of model servers based on hash values to provide a sorted list; and selecting the model server based on the sorted list.

17. The system of claim 16, wherein each hash value is calculated based on a concatenation of a node identifier of a respective model server and a model identifier of the ML model.

18. The system of claim 15, wherein operations further comprise determining, by the model server, that the ML model is loaded in memory, and in response, incrementing a cache hit.

19. The system of claim 15, wherein operations further comprise determining, by the model server, that the ML model is not loaded in memory, and in response: incrementing a cache miss, retrieving the ML model from a model storage, and loading the ML model to the memory of the model server.

20. Cancelled
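
For illustration only, and forming no part of the claims or of the examiner's amendment, the server selection recited in claims 2, 3, 9, 10, 16 and 17 can be sketched as follows. The sketch hashes a concatenation of each node identifier with the model identifier, sorts the deployed servers by hash value, and selects from the sorted list. The function name `select_model_server`, the use of SHA-256, the ":" separator, and the choice of the first entry of the sorted list are all assumptions made for this sketch; the claims do not specify them.

```python
import hashlib


def select_model_server(node_ids, model_id):
    """Pick a model server for model_id from the currently deployed node_ids.

    Hypothetical sketch: the hash function, the separator, and taking the
    first entry of the sorted list are assumptions, not claim language.
    """
    if not node_ids:
        raise ValueError("no model servers are currently deployed")

    def hash_value(node_id):
        # Concatenate the node identifier and the model identifier, then hash.
        key = f"{node_id}:{model_id}".encode("utf-8")
        return hashlib.sha256(key).hexdigest()

    # Sort the list of model servers by hash value and select from the result.
    sorted_servers = sorted(node_ids, key=hash_value)
    return sorted_servers[0]
```

Under these assumptions, the same model identifier maps to the same server regardless of the order in which servers are listed, and removing a server that was not selected leaves the selection unchanged, which would tend to keep a given model loaded in the memory of one stateful server.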


Reasons for Allowance
The following is an examiner’s statement of reasons for allowance:
The closest prior art found for this application is Herrera et al. (US 7849045 B2), which describes a method for building a rulebase that includes receiving a plurality of rulebase components and merging the rulebase components to create a consolidated rulebase.
The next closest prior art found for this application is Martin et al. (US 20200394566 A1), which describes a lightweight machine learning model (MLM) microservice hosted in a cloud computing environment suitable for large-scale data processing. A client system can utilize the MLM service to run an MLM on a dataset in the cloud computing environment. The MLM can be already developed, trained, and tested using any appropriate ML libraries on the client side or the server side. However, no data schema is required to be provided from the client side. Further, neither the MLM nor the dataset needs to be persisted on the server side. When a request to run an MLM is received by the MLM service from a client system, a data schema is inferred from a dataset provided with the MLM. The MLM is run on the dataset utilizing the inferred data schema to generate a prediction, which is then returned by the MLM service to the client system.
The next closest prior art found for this application is Januschowski et al. (US 11120361 B1), which discloses that an input data set with a plurality of item descriptors comprising respective time series observations is identified. A routing directive indicating a predicate to be evaluated to determine whether a particular item descriptor is to be included in a training data set for a first learning algorithm is obtained. A plurality of learning algorithms are trained using training data sets derived from the input data set according to respective routing directives, and the trained algorithms are stored.

None of these prior art references, individually or in combination, explicitly teaches or suggests the claimed invention of “receiving, by an application programming interface (API) server of a plurality of API servers, a prediction request from a client system, each of the plurality of API servers comprising a stateless server; selecting, by the API server, a model server from a plurality of model servers based on the prediction request, each of the plurality of model servers comprising a stateful server; calling, by the API server, the model server to execute inference using a ML model loaded to memory of the model server; receiving, by the API server, an inference result from the ML model; sending, by the API server, the inference result to the client system; periodically calculating a ratio based on a number of cache hits and a number of cache misses; and adjusting a number of model servers in the plurality of model servers based on the ratio.”, as recited in independent claims 1, 8 and 15.
The dependent claims 2-5, 7, 9-12, 14 and 16-19 are also distinct from the prior art for the same reasons.
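
For illustration only, and not as part of the record of reasons for allowance, the distinguishing limitation of periodically calculating a ratio from cache hits and cache misses and adjusting the number of model servers based on that ratio can be sketched as follows. The names `CacheStats` and `adjust_server_count`, the definition of the ratio as hits divided by total lookups, the thresholds, and the fixed minimum and maximum bounds are all assumptions made for this sketch; the claims do not specify the form of the ratio or the direction of adjustment, and dependent claims 7 and 14 contemplate bounds that are themselves determined from the ratio.

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    """Counts cache hits and misses reported by model servers (hypothetical)."""
    hits: int = 0
    misses: int = 0

    def record(self, model_loaded: bool) -> None:
        # A hit means the ML model was already loaded in memory; a miss means
        # it had to be retrieved from model storage and loaded first.
        if model_loaded:
            self.hits += 1
        else:
            self.misses += 1

    def hit_ratio(self) -> float:
        # Assumed ratio: hits / (hits + misses); 1.0 when nothing was recorded.
        total = self.hits + self.misses
        return self.hits / total if total else 1.0


def adjust_server_count(current, stats, low=0.8, high=0.95,
                        min_servers=1, max_servers=10):
    """Return a new model server count based on the ratio (assumed policy)."""
    ratio = stats.hit_ratio()
    if ratio < low:
        target = current + 1   # many misses: add a server to hold more models
    elif ratio > high:
        target = current - 1   # nearly all hits: capacity may be excessive
    else:
        target = current
    # Clamp to the assumed minimum and maximum number of model servers.
    return max(min_servers, min(max_servers, target))
```

A scheduler would call `adjust_server_count` at a fixed interval and reset or window the counters between calls; that scheduling, like the thresholds, is left open here.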
Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee. Such submission should be clearly labeled “Comments on Statement of Reasons for Allowance”.




Conclusion
Patent applicants with problems or questions regarding electronic images that can be viewed in the Patent Application Information Retrieval system (PAIR) can now contact the USPTO's Patent Electronic Business Center (Patent EBC) for assistance.  
Representatives are available to answer your questions daily from 6 am to midnight (EST). The toll free number is (866) 217-9197. When calling please have your application serial or patent number, the type of document you are having an image problem with, the number of pages and the specific nature of the problem.  The Patent Electronic Business Center will notify applicants of the resolution of the problem within 5-7 business days. Applicants can also check PAIR to confirm that the problem has been corrected.  The USPTO's Patent Electronic Business Center is a complete service center supporting all patent business on the Internet. The USPTO's PAIR system provides Internet-based access to patent application status and history information. It also enables applicants to view the scanned images of their own application file folder(s) as well as general patent information available to the public. 
For all other customer support, please call the USPTO Call Center (UCC) at 800-786-9199.  The USPTO's official fax number is 571-272-8300.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Noosha Arjomandi, whose telephone number is (571) 272-9784.  The examiner can normally be reached on Monday-Friday from 8 A.M. to 4 P.M.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Robert Beausoliel, can be reached at (571) 272-3645.
July 16, 2022
/NOOSHA ARJOMANDI/Primary Examiner, Art Unit 2167