Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION
Claims 1-20 are pending in the application.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-4, 6-11, and 13-20 are rejected under 35 U.S.C. 103 as being unpatentable over US Patent 10,805,179 to Guim Bernat et al. (hereafter Bernat) in view of PG Pub. 2019/0156247 to Faulhaber Jr. et al. (hereafter Faulhaber).

As to claim 1, Bernat teaches the invention substantially as claimed including a computer-implemented method, comprising: 
determining, with at least one processor based on an inference request, at least one model to process the inference request on a plurality of computing platforms [receiving a request identifying an inference model with a requirement and determining a platform of a plurality of platforms to handle the request, the model being implemented on various platforms; the gateway maintains information on the platforms/models and the model characteristics/capabilities, which it uses to map and route the request to the appropriate platform/model, abstract; Figs. 3-4 and corresponding text; col. 4, lines 21-30]; 
obtaining, with the at least one processor, profile information of the at least one model, the profile information including measured characteristics of the at least one model [the type of model requested is received and identified, along with its associated performance characteristics and capabilities, abstract; col. 9, lines 5-8; Fig. 3 and corresponding text]; 
dynamically determining, with the at least one processor, a selected computing platform from between the plurality of computing platforms for responding to the inference request based on an optimized objective associated with a status of the computing platform and the profile information [the type of model requested is received and identified, along with its associated performance characteristic and capability requirements, and a particular platform of a particular type and performance capability/capacity or availability is selected, abstract; col. 9, lines 5-8; Fig. 3 and corresponding text]; 
routing, with the at least one processor, the inference request to the selected computing platform [routing the request to the appropriate platform/model, abstract; Figs. 3-4 and corresponding text; col. 4, lines 21-30]; and 
processing the inference request [handling the inference request by an inference model of a platform, col. 16, lines 27-45].  

Bernat does not specifically teach the platforms being a CPU and a GPU, or selecting the platform from between the CPU and the GPU. However, Bernat discloses various model types implemented by different accelerators, such as an FPGA [col. 4, lines 21-30; col. 9, lines 5-8]. Furthermore, Faulhaber teaches a model selector that dynamically routes inference requests to particular ML models based at least in part on characteristics of the inference request and on ML model performance, such as accuracy [paragraphs 19, 32, 34, 52, 72], and a machine for training a machine learning model being of a GPU or CPU instance type [paragraph 77]. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Bernat's assignment of inference requests to include a variety of additional platforms, including a GPU and a CPU, as taught by Faulhaber, to extend the applicability of Bernat's teaching.

As to claim 2, Bernat as modified teaches the invention substantially as claimed, including further comprising determining the optimized objective by balancing throughput associated with processing the at least one model based on at least one of the following: a latency target, a power budget, a server node count, or any combination thereof [balancing load by distributing incoming inference requests to a platform that can satisfy the request by meeting the timing constraint (i.e., latency), col. 9, lines 22-44; col. 15, lines 8-49].
  
As to claim 3, Bernat as modified teaches the invention substantially as claimed, including wherein the profile information is obtained based on the at least one model and comprises a runtime deployment model trained for at least one of the plurality of computing platforms [the gateway maintaining information on platforms/models and model characteristics/capabilities, abstract; Figs. 3-4 and corresponding text; col. 4, lines 21-30].

As to claim 4, Bernat as modified teaches the invention substantially as claimed, including wherein the measured characteristics of the at least one model comprise at least one of the following: a throughput, a size, a power efficiency, an accuracy level, a model type, or any combination thereof [a request for a type of model is received and the model is identified, along with its associated performance characteristic and capability requirements, and a particular platform is selected based on its type, performance capability/capacity or availability, and processing power, abstract; col. 3, lines 28-37; col. 9, lines 5-8; Fig. 3 and corresponding text].

As to claim 6, Bernat as modified teaches the invention substantially as claimed, including wherein the status of the computing platform is determined based on at least one of the following: CPU utilization, GPU utilization, RAM utilization, inference throughput, number of jobs, latency data, historical completion data, CPU type, GPU type, a size of the at least one model, a memory footprint of the at least one model, or any combination thereof [platform model/CPU/accelerator type and resource usage, abstract; Fig. 3 and corresponding text; col. 3, lines 26-41; col. 8, lines 51-56; col. 9, lines 5-8 and 57-60; platform processing latency, col. 15, lines 8-49].

As to claim 7, Bernat as modified teaches the invention substantially as claimed, including further comprising at least one of: sending the inference request to the at least one CPU if a throughput is less than a threshold; sending the inference request to the at least one GPU if the at least one GPU is accumulating inference jobs and the inference request can fit into a batch based on the measured characteristics; sending the inference request to the at least one CPU if the at least one GPU is unavailable to process the at least one model; or creating a new job for the at least one GPU [a platform that is not overloaded and can satisfy the request receives the request, such that load is balanced by distributing incoming inference requests to a platform that can satisfy the request by meeting the timing constraint (i.e., latency), col. 9, lines 22-44; col. 15, lines 8-49].

As to claims 8-11 and 13-20, Bernat as modified teaches the method of routing an inference request to a selected computing platform as recited in claims 1-4 and 6-7; therefore, Bernat as modified teaches the system and computer program product for implementing the method.

Claims 5 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Bernat and Faulhaber, further in view of US Patent 10,417,350 to Mohamed et al. (hereafter Mohamed).
 
As to claim 5, Bernat and Faulhaber do not specifically teach wherein the at least one GPU is configured in a blocking state until a job size is maximized to start inferences. However, Mohamed teaches requests to train ML models being handled as batch jobs [col. 13, lines 61-66]. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined Bernat, Faulhaber, and Mohamed because they are in the same field of endeavor, directed to machine learning, and because batch execution allows more efficient use of resources by reducing the overhead associated with communication.

As to claim 12, this claim is rejected for the same reason as claim 5 above.

Response to Arguments
Applicant's remarks filed 10/21/22 have been fully considered. 
Applicant argued in substance that:
(a) Bernat is directed to a fixed system that determines an inference appliance that can best meet the constraints of a static SLA, which does not include a dynamically determined computing platform based on an optimized objective associated with a status of the computing platform.

The examiner respectfully traverses Applicant's remarks:
As to point (a), the examiner respectfully disagrees and submits that the "optimized objective associated with a status of the (selected) computing platform" is broadly interpreted as meeting a performance objective or goal based on the current operating status of the platform. As disclosed in Bernat, factors such as load-based performance characteristics and capabilities vary over time, so that the load condition and performance capabilities of a platform at the particular point in time when the gateway receives an incoming request are dynamic. Therefore, selecting a particular platform of a particular type having a particular performance capability/capacity or availability at that time to meet a constraint is also dynamic [abstract; col. 9, lines 5-8 and 22-44; Fig. 3 and corresponding text]. Hence, Bernat satisfies the limitations as claimed.

THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to QING YUAN WU whose telephone number is (571)272-3776.  The examiner can normally be reached on M-F 9AM-6PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Lewis Bullock can be reached on 571-272-3759.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.


/QING YUAN WU/Primary Examiner, Art Unit 2199