DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
Applicant's submission filed on 2022-11-22 has been entered.  The status of the claims is as follows:
Claims 1-20 remain pending in the application.
Claims 1, 7, and 20 have been amended.
Response to Arguments
Applicant’s first argument with respect to the rejections under 35 USC 103, on Remarks Page 9, has been considered but is moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.  Applicant’s newly amended limitation, “evicting other resident models from the resident memory of the prediction service server”, changes the scope of the claims and necessitates a change in the applied art.  A new reference has been applied to teach this matter, as shown in the rejections below.
Applicant’s remaining arguments with respect to the rejections under 35 USC 103 have been fully considered but are not persuasive.
Applicant argues on Remarks Pages 9-13 that the cited art does not teach the “selecting…in response to a prediction request” limitation.  In particular, on Remarks Page 11, Applicant argues, regarding Liu, that “the selection of the actual model to be processed for combination of criteria is not made before processing starts and when the prediction request is received, but instead such selection made for further processing.  Therefore, unlike the claimed invention, Liu does not make a selection of the version of the model and then store into a resident memory for processing. Instead as noted below as a problem with the art, the model has to be loaded first for processing before further selection.”  Examiner points out that “further processing” is still processing, and thus the selection is made before processing starts.  Nevertheless, this argument is moot because Chu, not Liu, was relied upon to teach the “selecting…in response to a prediction request” limitation.  
Applicant continues on Remarks Page 13, evidently with respect to the limitation “selecting…according to compression and fidelity of the AI model”, arguing regarding Chu that “The cited art merely mentions different criteria, but no specifics. Therefore, is no specific mention or suggestion of a relationship to compression and the loading into the resident memory based on the selection.”  Examiner points out that, for this limitation, Liu, not Chu, was relied upon.  Liu recites a relationship to compression and fidelity.  Liu, Col 8 Line 65 – Col 9 Line 12, discloses: “highest accuracy, the highest resource savings as compared to the original uncompressed model, or a combination of both”, and thus discloses providing several versions of each of the available AI models in differing levels of compression and fidelity.  A “combination of both” suggests a balance between resource savings and accuracy (predicted performance).  “Resource savings” may comprise response speed, as Liu discloses in Col 15 Lines 46-62:  “One non-limiting advantage of the error tolerant compression features described is that the compressed model requires a fewer resources than the uncompressed model. For example, the compressed model may require less memory to store than the uncompressed version. As another example, the compressed model may require less bandwidth to transmit than the uncompressed version. As yet another example, the compressed model may be executed more efficiently by a target system, such as an ASR system, than the uncompressed version. The efficiency may be measured by the amount of processing needed to obtain a prediction from the model. The amount of processing may be indicated by, for example, a number of processor cycles, a quantity of memory used, or a duration of time used to obtain a result from the model. Another non-limiting advantage of the features described is that the compression avoids degrading the accuracy of the compressed model.”  Above, Liu describes “fewer resources” and gives an example of “a duration of time used to obtain a result from the model”. Thus, via the “combination of both”, Liu discloses a policy according to a balance between predicted response speed and predicted performance of the model.
Applicant argues on Remarks Pages 13-14 that “The cited art also fails to teach or suggest (e.g., claim 2), ‘wherein the determining of which version of the AI model to serve for the request is determined by a policy according to balance between a predicted response speed and predicted performance of processing the AI model.’ Liu, in Col 8 Line 65 - Col 9 Line 12 shown above, discloses: "Each of the compressed models has an accuracy and resource requirements. The compressor 300 may then select the model having the highest accuracy, the highest resource savings as compared to the original uncompressed model, or a combination of both for further processing. "  However, Liu does not teach or suggest determining of which version of the Al model to serve for the request is determined by a policy according to balance between a predicted response speed and predicted performance of processing the Al model.”  
Examiner respectfully disagrees.  Liu discloses in Col 15 Lines 46-62:  “One non-limiting advantage of the error tolerant compression features described is that the compressed model requires a fewer resources than the uncompressed model. For example, the compressed model may require less memory to store than the uncompressed version. As another example, the compressed model may require less bandwidth to transmit than the uncompressed version. As yet another example, the compressed model may be executed more efficiently by a target system, such as an ASR system, than the uncompressed version. The efficiency may be measured by the amount of processing needed to obtain a prediction from the model. The amount of processing may be indicated by, for example, a number of processor cycles, a quantity of memory used, or a duration of time used to obtain a result from the model. Another non-limiting advantage of the features described is that the compression avoids degrading the accuracy of the compressed model.”  Above, Liu describes “fewer resources” and gives an example of “a duration of time used to obtain a result from the model”. Thus, Liu discloses a policy according to a balance between predicted response speed and predicted performance of the model.
Applicant argues on Remarks Pages 14-15 that “The cited art (and other references cited in combination), fail to teach or suggest (e.g., claim 4), ‘wherein in determining which version of the AI model to use further comprises initially loading into a working memory for processing compressed versions from among a plurality of AI models before loading a full version of a selected AI model to use for processing the received prediction request, and wherein the plurality of AI models are compressed at several levels of fidelity, and the processor dynamically determines which one of the AI models to evict from the working memory and subsequently loads a requested AI model with a selected level of fidelity’".  Applicant continues that “Chu is completely silent as to the memory usage in determining the Al model to use. The mechanism of initial loading is not shown or suggested. Moreover, the mechanism of dynamically determines which one of the Al models to evict from the working memory and subsequently loads a requested Al model with selected fidelity, from low to high fidelity is also not taught or suggested as seen in paragraphs [0166]-[0170] of Chu.”  Examiner respectfully disagrees.  Chu is not completely silent on memory usage in determining which AI model to use, as Chu, Para [0169], discloses:  “In some examples, the system can select as the candidate champion model (i) the most accurate model among the candidate models, (ii) the model that requires the least amount of computation time among the candidate models, (iii) the model that requires the least amount of memory usage among the candidate models”.  Applicant continues: “The mechanism of initial loading is not shown or suggested. 
Moreover, the mechanism of dynamically determines which one of the Al models to evict from the working memory and subsequently loads a requested Al model with selected fidelity, from low to high fidelity is also not taught or suggested as seen in paragraphs [0166]-[0170] of Chu.”  Examiner refers Applicant to the rejection below, but also summarizes thusly.  Chu Para [0080] discloses data “resident in memory”, and discloses that models use memory in [0168]. Chu, Para [0182], discloses “retiring” of projects, which comprises “deleting” or “moving” the project.  One of ordinary skill in the art will appreciate that “deleting” comprises removing from memory.  Chu, Para [0175], discloses to “abandon” the champion model, thus suggesting replacing it with a more confident champion model, as Chu [0184] discloses:  “In some examples, the system can receive a new model-building tool, template, or other software at any point during the process discussed above. This may automatically trigger a retrain or rebuild of the champion model, or the creation of a new model.” Thus, Chu discloses both evicting models from memory (“evicting”) and subsequently loading a requested AI model with a selected fidelity.  Finally, Applicant argues that “Liu is also silent as to the limitations mentioned as seen in Col 8 Line 65 - Col 9 Line 12 since the selection of the model is not before processing and loading into memory is performed as show above”.  Examiner repeats from the first argument above that Chu [0184] discloses choosing a model in response to a request.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-6 and 14-15 are rejected under 35 U.S.C. 103 as being unpatentable over Chu et al. (US 2018/0060759 A1; hereinafter “Chu”) in view of Liu et al. (US 10,229,356 B1; hereinafter “Liu”) and Kandoi et al. (US 2020/0104750 A1; hereinafter “Kandoi”).
As per Claim 1, Chu teaches a method, comprising:  storing at least one artificial intelligence (AI) model in a model store memory in a plurality of different versions of a same Al model, each different version having a different level of fidelity including different levels of model performance in relation to model compression (Chu, Para [0166], discloses:  “In block 1312, the system selects a candidate champion model to be used with the new version of the project. For example, the system can create multiple versions of the model, which can be referred to as candidate models. The system can then compare the candidate models to determine the best model among multiple candidate models according to a predefined criterion. The system can then select the best model as the candidate champion model, and use the candidate champion model to perform one or more tasks associated with the project.”  Here, Chu discloses storing a plurality of different versions (“the system can create multiple versions of the model, which can be referred to as candidate models”).  Chu, Para [0169], discloses:  “In some examples, the system can select as the candidate champion model (i) the most accurate model among the candidate models, (ii) the model that requires the least amount of computation time among the candidate models, (iii) the model that requires the least amount of memory usage among the candidate models, (iv) the model that requires the least amount of processing power or processing cycles among the multiple candidate models, (v) the model that is most easily interpreted according to predefined criteria, (vi) the model that has a least amount of predictors, or (vii) any combination of these. 
The system can select more than one candidate champion model in some examples.”  Here, Chu discloses each different version having a different level of fidelity including different levels of model performance in relation to model compression, as Chu discloses compression (“the model that requires the least amount of memory usage among the candidate models”), performance (“the most accurate model among the candidate models”), and fidelity, defined by Applicant as “performance in relation to model compression” (“any combination of these”).  Note that “compression” is a broad term that is given no special definition in the Specification, although the Specification does specify that “Relative to model compression, compressed models are typically faster and smaller in terms of memory usage than an original model”, and thus Chu’s disclosure of “requires the least amount of memory usage among the candidate models” discloses compression.
Examiner note:  Liu, which will be combined with Chu below, also explicitly recites levels of compression.)
receiving a prediction request to process the Al model from a client device (Chu, Para [0145], discloses:  “In block 1110, new data is received.”  Chu, Para [0146], discloses:  “In block 1112, the trained machine-learning model is used to analyze the new data and provide a result. For example, the new data can be provided as input to the trained machine-learning model. The trained machine-learning model can analyze the new data and provide a result that includes a classification of the new data into a particular class, a clustering of the new data into a particular group, a prediction based on the new data, or any combination of these.”  Here, Chu discloses receiving (“new data is received”) a prediction request to process the AI model (“trained machine-learning model can analyze the new data and provide a result that includes… a prediction based on the new data”)  Chu also discloses from a client device, as Chu [0145] states: “In some examples, the new data is…input by a user.”)
selecting, using a processor on a computer of a prediction service server, which version of the Al model to use for processing the received prediction request [according to compression and fidelity of the Al model], and loading into a resident memory of the prediction service server for processing from the model store memory in response to the prediction request (Chu, Para [0045], discloses:  “The computing environment 114 can include one or more processing devices (e.g., distributed over one or more networks or otherwise in communication with one another) that, in some examples, can collectively be referred to as a processor or a processing device.”  Here, Chu discloses a processor.  Chu, Para [0184], discloses:  “After creating the new model, the system can compare the new model to the existing champion model to determine which of the models is the “best” to use in the new version of the project. For example, the system can provide an input value to the new model and to the champion model, and compare outputs from the new model and the champion model to a desired output value that corresponds to the input value. The system can select, as a new champion model, whichever of the two models has an output that is closest to the desired output value or meets some other predefined criterion. For example, if the new model has an output that is closer to the desired output value than an output from the champion model, the system can select the new model as a new champion model for future use (e.g., in performing a task associated with the new version of the project), disregard the existing champion model, or both of these. 
This process may iterate each time a new model-building tool is added to the system.”  Here, Chu discloses selecting which version of the AI model to use (“the system can select the new model”) in response to the prediction request (“the system can provide an input value to the new model and to the champion model, and compare outputs from the new model and the champion model to a desired output value that corresponds to the input value”).  Chu also discloses a prediction service server (“servers 2004-n, capable of handling the model-building request”), as Chu [0200] states:  “The system 2000 includes a model-building dispatcher 2002 for receiving a model-building request and forwarding the model-building request to a server, such as servers 2004-n, capable of handling the model-building request.”  Also recall that Chu [0146], above, discloses receiving a prediction request.)
Examiner also points out that Chu suggests that the selection is according to compression and fidelity of the AI model, as Chu, Para [0169], discloses:  “In some examples, the system can select as the candidate champion model…(vii) any combination of these”, and thus suggests a combination of the compression and fidelity in the decision.  However, Examiner will rely on Liu below for the explicit teaching.
Chu, Para [0080], discloses resident memory:  “A gridded computing environment may be employed in a distributed system with non-interactive workloads where data resides in memory on the machines, or compute nodes. In such an environment, analytic code, instead of a database management system, can control the processing performed by the nodes. Data is co-located by pre-distributing it to the grid nodes, and the analytic code on each node loads the local data into memory.”  Here, Chu discloses that data “resides in memory”.  Chu also discloses that the model is loaded into resident memory, as Chu describes the model requiring memory usage in [0169]:  “In some examples, the system can select as the candidate champion model (i) the most accurate model among the candidate models, (ii) the model that requires the least amount of computation time among the candidate models, (iii) the model that requires the least amount of memory usage among the candidate models”.
using the processor to process input data accompanying the received prediction request and using the determined version of the Al model (Chu, Para [0045], discloses:  “The computing environment 114 can include one or more processing devices (e.g., distributed over one or more networks or otherwise in communication with one another) that, in some examples, can collectively be referred to as a processor or a processing device.”  Here, Chu discloses a processor.  Chu, Para [0145], discloses:  “In block 1110, new data is received.”  Here, Chu discloses input data accompanying the received prediction request.  Chu, Para [0146], discloses:  “In block 1112, the trained machine-learning model is used to analyze the new data and provide a result. For example, the new data can be provided as input to the trained machine-learning model. The trained machine-learning model can analyze the new data and provide a result that includes a classification of the new data into a particular class, a clustering of the new data into a particular group, a prediction based on the new data, or any combination of these.”  Here, Chu discloses using a model to process the data for a prediction request (“trained machine-learning model can analyze the new data and provide a result that includes… a prediction based on the new data”).  Chu, Para [0173], discloses:  “In block 1317, the system uses the champion model to perform one or more tasks associated with the project.”  Here, Chu discloses using the determined version of the AI model (“uses the champion model”)).
responding to the received prediction request with a result of the processing of the input data using the determined Al model version to the client device.  (Chu, Para [0145], discloses:  “In block 1110, new data is received.”  Here, Chu discloses input data accompanying the received prediction request.  Chu, Para [0146], discloses:  “In block 1112, the trained machine-learning model is used to analyze the new data and provide a result. For example, the new data can be provided as input to the trained machine-learning model. The trained machine-learning model can analyze the new data and provide a result that includes a classification of the new data into a particular class, a clustering of the new data into a particular group, a prediction based on the new data, or any combination of these.”  Here, Chu discloses using a model to process the data for a prediction request (“trained machine-learning model can analyze the new data and provide a result that includes… a prediction based on the new data”).  Chu, Para [0173], discloses:  “In block 1317, the system uses the champion model to perform one or more tasks associated with the project.”  Here, Chu discloses using the determined version of the AI model (“uses the champion model”).  Chu, Para [0180], discloses:  “In some examples, the system can analyze outputs from the champion model to determine a frequency at which to retrain the champion model, and then retrain the champion model at that frequency.”  Here, Chu discloses responding to the request with a result of the processing (“analyze outputs from the champion model.”)  Recall above Chu also discloses input to a client device, as Chu [0145] states: “In some examples, the new data is…input by a user.”  Chu also discloses output to the client device at the end of [0164]:  “The reports can be presented in the form of tables, charts, graphs, or any combination of these, which may make the information more digestible for a user.”)
Chu suggests, but does not explicitly teach, selecting, using a processor on a computer of a prediction service server, which version of the AI model to use for processing the received prediction request according to compression and fidelity of the AI model.
Liu explicitly teaches selecting, using a processor on a computer of a prediction service server, which version of the Al model to use according to compression and fidelity of the AI model (Recall above that Chu discloses a prediction service server.  Liu, Col 12 Line 66 – Col 13 Line 5, discloses a processor:  “In some implementations, the compressor 300 may include a computer-readable memory configured to store executable instructions. In such a configuration, the compressor 300 may further include a processor in data communication with the computer-readable memory. The processor may be programmed by the executable instructions to implement the features described.”  Liu, Col 8 Line 65 – Col 9 Line 12, discloses:  “In some implementations, the selecting may include weighing two or more compression methods. The selection may be performed by the compressor 300 to identify an “optimal” set of quantization parameters. The optimal set of parameters may be identified as the set of parameters providing a compressed model having the highest accuracy, the highest resource savings as compared to the original model, or a combination of both. For example, the compressor 300 may perform multiple compressions for the model using different parameters. Each of the compressed models has an accuracy and resource requirements. The compressor 300 may then select the model having the highest accuracy, the highest resource savings as compared to the original uncompressed model, or a combination of both for further processing.”  Here, Liu discloses different versions of the AI model (“the compressor 300 may perform multiple compressions for the model using different parameters”).  
Liu then discloses determining which version of the AI model to use according to compression and fidelity of the AI model (“select the model having the highest accuracy, the highest resource savings as compared to the original uncompressed model, or a combination of both for further processing.”)  Here, Liu explicitly discloses using compression, and a combination of compression and accuracy (or “fidelity”) (“combination of both”), in order to determine which model to use.)
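For illustration only, the kind of selection described in the cited passage of Liu (highest accuracy, highest resource savings relative to the original model, or a combination of both) can be sketched as below.  This is a minimal sketch under stated assumptions and is not taken from Liu, Chu, or the claims: the names ModelVersion, resource_savings, and select_version, the use of model size as the resource measure, the weighted-sum scoring, and all numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelVersion:
    """One stored version of a model (all fields are illustrative)."""
    name: str
    accuracy: float   # fraction of correct predictions, between 0 and 1
    size_mb: float    # memory footprint, used here as the resource measure


def resource_savings(version, original):
    """Fraction of the original model's footprint saved by this version."""
    return 1.0 - version.size_mb / original.size_mb


def select_version(versions, original, accuracy_weight=0.5):
    """Pick the version with the best weighted score.

    accuracy_weight=1.0 selects on accuracy alone, 0.0 on resource
    savings alone, and values in between on a combination of both.
    The weighted sum is an assumed scoring rule, not one from the record.
    """
    def score(v):
        savings = resource_savings(v, original)
        return accuracy_weight * v.accuracy + (1.0 - accuracy_weight) * savings

    return max(versions, key=score)


# Hypothetical versions of one model at three compression levels.
original = ModelVersion("uncompressed", accuracy=0.95, size_mb=400.0)
int8 = ModelVersion("int8", accuracy=0.94, size_mb=100.0)
int4 = ModelVersion("int4", accuracy=0.80, size_mb=50.0)
versions = [original, int8, int4]

print(select_version(versions, original, accuracy_weight=1.0).name)
print(select_version(versions, original, accuracy_weight=0.0).name)
print(select_version(versions, original, accuracy_weight=0.5).name)
```

With these assumed numbers, weighting accuracy alone favors the uncompressed version, weighting savings alone favors the most compressed version, and an equal weighting favors the intermediate version.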
Chu and Liu are analogous art because they are both in the field of endeavor of machine learning.
It would have been obvious before the effective filing date of the claimed invention to combine the teachings of Chu and Liu.  One of ordinary skill in the art would be motivated to do so in order to save on time and resources while not sacrificing accuracy of the results (Liu, Col 3 Line 41 – Col 4 Line 45: “Some training systems compress the model to facilitate efficient storage and transfer…In view of the constraints and limitations of NN model compression discussed above, improved devices and methods for error tolerant NN model compression are desirable. The error tolerance may be provided to allow a floating point DNN model to be compressed such that the precision of the model is higher than a conventionally quantized DNN model.”)
However, the combination of Chu and Liu does not explicitly teach evicting other resident models from the resident memory of the prediction service server.
Kandoi teaches evicting other resident models from the resident memory of the prediction service server (Kandoi, Para [0046], discloses:  “FIG. 5 illustrates embodiments of an inference service on a single host. The entry point of the inference service 500 is the inference orchestration service 501. The inference orchestrator service 501 generates interpretations from text. In some embodiments, the inference orchestrator service 501 comprises a plurality of software modules to perform artifact/bundle management, pre-processing, recognizing, resolving (slot resolution), context managing (e.g., dialog act support, context carryover), connecting with the data hub 311 to provide results of an inference, and connecting with the MASS 313 to bring in a ML model to disk 511 or cache 505/509 or evict a model to the MASS 313.”  Here, Kandoi discloses a prediction service server (“inference service” running on “a single host”).  Kandoi also discloses “bring in a ML model to disk 511 or cache 505/509 or evict a model to the MASS 313”.  Kandoi provides more detail in [0050]:  “When a model inference request comes in that there is not a corresponding model in either the loaded model caches 505 or 509, or the overflow model cache 511, a call is made to the MASS to fetch the model, the least frequently used model is evicted from the caches, and the fetched model is loaded for execution and used to generate an inference.”  Here, Kandoi discloses that a model other than the “fetched model” is “evicted”, namely the model that is “least frequently used”.)
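For illustration only, the behavior quoted from Kandoi [0050] (on a cache miss, fetch the requested model, evict the least frequently used model, and load the fetched model) can be sketched as a least-frequently-used cache.  This is a minimal sketch under stated assumptions and is not Kandoi's implementation: the class name ResidentModelCache, the fetch callable standing in for the call to the model store, the single-level cache, and the use counter are all hypothetical.

```python
class ResidentModelCache:
    """Fixed-capacity cache of loaded models with least-frequently-used eviction."""

    def __init__(self, capacity, fetch):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.fetch = fetch        # hypothetical callable: model_id -> model object
        self.models = {}          # model_id -> loaded model
        self.use_counts = {}      # model_id -> number of requests served

    def get(self, model_id):
        """Return the model for model_id, loading it (and evicting) on a miss."""
        if model_id not in self.models:
            if len(self.models) >= self.capacity:
                # Evict the resident model that has served the fewest requests.
                victim = min(self.models, key=lambda m: self.use_counts[m])
                del self.models[victim]
                del self.use_counts[victim]
            self.models[model_id] = self.fetch(model_id)
            self.use_counts[model_id] = 0
        self.use_counts[model_id] += 1
        return self.models[model_id]


# Hypothetical usage: two resident slots, three models requested.
cache = ResidentModelCache(capacity=2, fetch=lambda model_id: f"model:{model_id}")
cache.get("a")
cache.get("a")
cache.get("b")
cache.get("c")  # cache is full; "b" has been used least, so it is evicted
print(sorted(cache.models))
```

In this usage, the request for the third model causes the less frequently used of the two resident models to be removed before the requested model is loaded.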
Kandoi and the combination of Chu and Liu are analogous art because they are both in the field of endeavor of hosting machine learning models.
It would have been obvious before the effective filing date of the claimed invention to combine the candidate and champion models at different levels of fidelity of Chu and Liu, with the evicting from memory of a subset of a plurality of models of Kandoi.  One of ordinary skill in the art would be motivated to do so in order to conserve computing resources (Kandoi [0001]:  “Further, memory sizes for a single host typically do not allow for all models to be cached, or for all models to be cached economically.”)

As per Claim 2, the combination of Chu, Liu, and Kandoi teaches the method of claim 1.  Liu teaches wherein the different versions of the Al model comprise the Al model at different levels of compression, including a version having no compression (Liu, Col 8 Line 65 – Col 9 Line 12, discloses:  “In some implementations, the selecting may include weighing two or more compression methods. The selection may be performed by the compressor 300 to identify an “optimal” set of quantization parameters. The optimal set of parameters may be identified as the set of parameters providing a compressed model having the highest accuracy, the highest resource savings as compared to the original model, or a combination of both. For example, the compressor 300 may perform multiple compressions for the model using different parameters. Each of the compressed models has an accuracy and resource requirements. The compressor 300 may then select the model having the highest accuracy, the highest resource savings as compared to the original uncompressed model, or a combination of both for further processing.”  Here, Liu discloses different versions of the AI model at different levels of compression (“perform multiple compressions for the model using different parameters”).  Liu also discloses a version having no compression (“the original uncompressed model”)).
wherein the determining of which version of the AI model to serve for the request is determined by a policy agreed upon by a user or device making the request, the user or device thereby selecting a policy that implements a tradeoff between a response speed and a response performance accuracy (Liu, Col 8 Line 65 – Col 9 Line 12 shown above, discloses: “Each of the compressed models has an accuracy and resource requirements. The compressor 300 may then select the model having the highest accuracy, the highest resource savings as compared to the original uncompressed model, or a combination of both for further processing.”  Here, Liu discloses a policy agreed upon by a user or device making the request (“accuracy and resource requirements”) that implements a tradeoff between a response speed and a response performance accuracy (“highest accuracy, the highest resource savings as compared to the original uncompressed model, or a combination of both for further processing”), where “resource savings” includes saving time, and thus is a tradeoff regarding response speed.)
further comprising managing available AI models from among the plurality of AI models and providing several versions of each of the available AI models in differing levels of compression and fidelity in a backend storage before use in a working memory wherein the determining of which version of the AI model to serve for the request is determined by a policy according to balance between a predicted response speed and predicted performance of processing the AI model. (Liu, Col 8 Line 65 – Col 9 Line 12 shown above, discloses: “highest accuracy, the highest resource savings as compared to the original uncompressed model, or a combination of both”, and thus discloses providing several versions of each of the available AI models in differing levels of compression and fidelity.  These models are stored in memory, and thus in “backend storage”.  If the model is “selected”, then it is executed for a machine learning problem, and is thus in “working memory”, which is where programs are executed.  Liu, as shown above, discloses “accuracy”, “resource savings”, and “a combination of both”.  A “combination of both” suggests a balance between resource savings and accuracy (predicted performance).  “Resource savings” may comprise response speed, as Liu discloses in Col 15 Lines 46-62:  “One non-limiting advantage of the error tolerant compression features described is that the compressed model requires a fewer resources than the uncompressed model. For example, the compressed model may require less memory to store than the uncompressed version. As another example, the compressed model may require less bandwidth to transmit than the uncompressed version. As yet another example, the compressed model may be executed more efficiently by a target system, such as an ASR system, than the uncompressed version. The efficiency may be measured by the amount of processing needed to obtain a prediction from the model. The amount of processing may be indicated by, for example, a number of processor cycles, a quantity of memory used, or a duration of time used to obtain a result from the model. Another non-limiting advantage of the features described is that the compression avoids degrading the accuracy of the compressed model.”  Above, Liu describes “fewer resources” and gives an example of “a duration of time used to obtain a result from the model”. Thus, Liu discloses a policy according to a balance between predicted response speed and predicted performance of the model.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Chu and Liu for at least the reasons recited in Claim 1.

As per Claim 3, the combination of Chu, Liu, and Kandoi teaches the method of claim 1.  Liu teaches wherein the determining of which version of the AI model to serve for the request is determined by a policy agreed upon by a user or device making the request, the user or device thereby selecting a policy that implements a tradeoff between a response speed and a response performance accuracy wherein the determining of which version of the AI model to serve for the request is determined by a variable policy (Liu, Col 8 Line 65 – Col 9 Line 12 shown above, discloses: “Each of the compressed models has an accuracy and resource requirements. The compressor 300 may then select the model having the highest accuracy, the highest resource savings as compared to the original uncompressed model, or a combination of both for further processing.”  Here, Liu discloses a policy agreed upon by a user or device making the request (“accuracy and resource requirements”) that implements a tradeoff between a response speed and a response performance accuracy (“highest accuracy, the highest resource savings as compared to the original uncompressed model, or a combination of both for further processing”), where “resource savings” includes saving time, and thus is a tradeoff regarding response speed.  Liu discloses several possibilities (“highest accuracy, the highest resource savings as compared to the original uncompressed model, or a combination”), and thus discloses a variable policy).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Chu and Liu for at least the reasons recited in Claim 1.
However, Liu does not explicitly teach and further comprising determining whether an AI model currently resident in a resident memory will need to be evicted from the resident memory and, if so, which currently-resident AI model will be evicted to accommodate the received request. 
Kandoi teaches and further comprising determining whether an AI model currently resident in a resident memory will need to be evicted from the resident memory and, if so, which currently-resident AI model will be evicted to accommodate the received request.  (Kandoi, Para [0050], discloses:  “When a model inference request comes in that there is not a corresponding model in either the loaded model caches 505 or 509, or the overflow model cache 511, a call is made to the MASS to fetch the model, the least frequently used model is evicted from the caches, and the fetched model is loaded for execution and used to generate an inference.”  Here, Kandoi discloses determining whether a model will need to be evicted (“When a model inference request comes in that there is not a corresponding model in either the loaded model caches 505 or 509, or the overflow model cache 511”), and if so, determining which one will be evicted (“the least frequently used model is evicted from the caches”)).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Kandoi with Chu and Liu for at least the reasons recited in Claim 1.

As per Claim 4, the combination of Chu, Liu, and Kandoi teaches the method of claim 1.  Chu teaches providing a confidence score to the user or device (Chu, Para [0174], discloses:  “The predicted values can then be compared to the actual values from the sensors or other electronic devices to determine the accuracy of the champion model. In some such examples, if the accuracy of the mode is high, the champion model can be assigned a high performance score, and if the accuracy of the model is low, the champion model can be assigned a low performance score.”)
providing the user or device with a mechanism to be served by a higher fidelity version of the AI model (Chu, Para [0175], discloses:  “For example, the system can determine that the champion model has an accuracy that is below a predetermined threshold. Based on this determination, the system can assign the champion model a low performance score, abandon the champion model, temporarily stop using the champion model, or any combination of these.”  Here, Chu discloses, if the confidence score is below a given threshold, to “abandon” the champion model, thus suggesting replacing it with a more confident champion model.  Chu, Para [0171], discloses:  “For example, if the candidate champion model was disapproved for being too inaccurate, the system can further train the candidate champion model until the candidate champion model is sufficiently accurate”.)
However, Chu suggests, but does not explicitly teach, wherein determining which version of the AI model to use further comprises initially loading into a working memory for processing compressed versions from among a plurality of AI models before loading a full version of a selected AI model to use for processing the received prediction request; and wherein the plurality of AI models are compressed at several levels of fidelity and the processor dynamically determines which one of the AI models to evict from the working memory and subsequently loads a requested AI model with a selected level of fidelity, wherein the determining of which version of the AI model to serve for the request is determined by a policy according to a predicted response speed and a predicted performance of processing the AI model.
Liu teaches wherein determining which version of the AI model to use further comprises initially loading into a working memory for processing compressed versions from among a plurality of AI models before loading a full version of a selected AI model to use for processing the received prediction request wherein the determining of which version of the AI model to serve for the request is determined by a policy according to a predicted response speed and a predicted performance of processing the AI model. (Liu, Col 8 Line 65 – Col 9 Line 12, discloses:  “In some implementations, the selecting may include weighing two or more compression methods. The selection may be performed by the compressor 300 to identify an “optimal” set of quantization parameters. The optimal set of parameters may be identified as the set of parameters providing a compressed model having the highest accuracy, the highest resource savings as compared to the original model, or a combination of both. For example, the compressor 300 may perform multiple compressions for the model using different parameters. Each of the compressed models has an accuracy and resource requirements. The compressor 300 may then select the model having the highest accuracy, the highest resource savings as compared to the original uncompressed model, or a combination of both for further processing.”  Here, Liu discloses different versions of the AI model at different levels of compression (“perform multiple compressions for the model using different parameters”).  Liu also discloses a version having no compression (“the original uncompressed model”).  Liu compares these by “accuracy”, and in order to compare accuracy, the model must be run on some predictions, and thus is put in working memory.  This is done for several models, and all the models may be done in any order, including before the “full version” of the model is also loaded and run (“the original uncompressed model”)).
wherein the determining of which version of the AI model to serve for the request is determined by a policy according to a predicted response speed and a predicted performance of processing the AI model. (Liu, as shown above, discloses “accuracy”, “resource savings”, and “a combination of both”.  A “combination of both” suggests a balance between resource savings and accuracy (predicted performance).  “Resource savings” may comprise response speed, as Liu discloses in Col 15 Lines 46-62:  “One non-limiting advantage of the error tolerant compression features described is that the compressed model requires a fewer resources than the uncompressed model. For example, the compressed model may require less memory to store than the uncompressed version. As another example, the compressed model may require less bandwidth to transmit than the uncompressed version. As yet another example, the compressed model may be executed more efficiently by a target system, such as an ASR system, than the uncompressed version. The efficiency may be measured by the amount of processing needed to obtain a prediction from the model. The amount of processing may be indicated by, for example, a number of processor cycles, a quantity of memory used, or a duration of time used to obtain a result from the model. Another non-limiting advantage of the features described is that the compression avoids degrading the accuracy of the compressed model.”  Above, Liu describes “fewer resources” and gives an example of “a duration of time used to obtain a result from the model”. Thus, Liu discloses a policy according to a balance between predicted response speed and predicted performance of the model.)
The combination of Chu and Liu further teaches and wherein the plurality of AI models are compressed at several levels of fidelity and the processor dynamically determines which one of the AI models to evict from the working memory and subsequently loads a requested AI model with a selected level of fidelity (Liu, Col 8 Line 65 – Col 9 Line 12 as shown above, discloses compressed at several levels of fidelity (“may perform multiple compressions for the model using different parameters”).  Chu, Para [0182], discloses evicting models from memory.  Chu, Para [0175], discloses:  “For example, the system can determine that the champion model has an accuracy that is below a predetermined threshold. Based on this determination, the system can assign the champion model a low performance score, abandon the champion model, temporarily stop using the champion model, or any combination of these.”  Here, Chu discloses, if the confidence score is below a given threshold, to “abandon” the champion model, thus suggesting replacing it with a more confident champion model, as Chu [0184] discloses:  “In some examples, the system can receive a new model-building tool, template, or other software at any point during the process discussed above. This may automatically trigger a retrain or rebuild of the champion model, or the creation of a new model.” Thus, Chu discloses both evicting models from memory (“evicting”) and subsequently loading a requested AI model with a selected fidelity.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Chu and Liu for at least the reasons recited in Claim 1.

As per Claim 5, the combination of Chu, Liu, and Kandoi teaches the method of claim 1.  Chu teaches wherein the determining of which version of the AI model to use is based on a decision model that implements a tradeoff between any of: a memory usage; a latency in providing a response to the received request; a performance accuracy; a confidence level of the response; a power consumption of the processing; and a consideration of concurrent requests for processing, wherein the determining of which version of the AI model to serve for the request is determined by a policy according at least a predicted performance of processing the AI model.  (Chu, Para [0169], discloses:  “In some examples, the system can select as the candidate champion model (i) the most accurate model among the candidate models, (ii) the model that requires the least amount of computation time among the candidate models, (iii) the model that requires the least amount of memory usage among the candidate models, (iv) the model that requires the least amount of processing power or processing cycles among the multiple candidate models, (v) the model that is most easily interpreted according to predefined criteria, (vi) the model that has a least amount of predictors, or (vii) any combination of these.”  Among these is “the most accurate model among the candidate models”, and thus Chu discloses a policy according to at least a predicted performance of processing the AI model.)
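For illustration only, the kind of selection among candidate models that Chu Para [0169] describes (most accurate, least computation time, least memory usage, or a combination) can be sketched as a weighted score. This sketch is not taken from Chu; the names `Candidate` and `select_candidate`, the weights, and the linear scoring rule are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    # Hypothetical description of one candidate model version.
    name: str
    accuracy: float         # higher is better
    compute_seconds: float  # lower is better
    memory_mb: float        # lower is better


def select_candidate(candidates, w_accuracy=1.0, w_time=0.0, w_memory=0.0):
    """Pick the candidate with the best weighted tradeoff (assumed linear score)."""
    def score(c):
        # Accuracy is rewarded; computation time and memory usage are penalized.
        return (w_accuracy * c.accuracy
                - w_time * c.compute_seconds
                - w_memory * c.memory_mb)
    return max(candidates, key=score)


candidates = [
    Candidate("full", accuracy=0.95, compute_seconds=2.0, memory_mb=800.0),
    Candidate("compressed", accuracy=0.90, compute_seconds=0.2, memory_mb=100.0),
]
most_accurate = select_candidate(candidates)          # accuracy only
fastest_tradeoff = select_candidate(candidates, w_time=1.0)  # accuracy vs. time
```

With accuracy as the only criterion the uncompressed candidate is chosen; once computation time is also weighted, the compressed candidate is chosen instead.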

As per Claim 6, the combination of Chu, Liu, and Kandoi teaches the method of claim 1.  Kandoi teaches wherein the determining of which version of the AI model to use comprises one or more of: 
determining whether any version of the AI model is currently stored in a resident memory of the computer as available and appropriate to process the input data of the received request; 
determining whether a version of the AI model stored in the model store memory needs to be served by loading it into the resident memory; 
determining whether an AI model currently resident in the resident memory will need to be evicted from the resident memory and, if so, which currently-resident AI model will be evicted to accommodate the received request.  (Kandoi, Para [0050], discloses:  “When a model inference request comes in that there is not a corresponding model in either the loaded model caches 505 or 509, or the overflow model cache 511, a call is made to the MASS to fetch the model, the least frequently used model is evicted from the caches, and the fetched model is loaded for execution and used to generate an inference.”  Here, Kandoi discloses determining whether a model will need to be evicted (“When a model inference request comes in that there is not a corresponding model in either the loaded model caches 505 or 509, or the overflow model cache 511”), and if so, determining which one will be evicted (“the least frequently used model is evicted from the caches”)).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Kandoi with Chu and Liu for at least the reasons recited in Claim 1.
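For illustration only, the behavior quoted from Kandoi Para [0050] (on a cache miss, fetch the model, evict the least frequently used model, and load the fetched model) can be sketched as follows. This sketch is not taken from Kandoi; the class `LfuModelCache`, the `fetch_model` callable standing in for a call to a backing model store, and the fixed capacity are assumptions introduced here.

```python
from collections import Counter


class LfuModelCache:
    """Hypothetical resident-memory cache that evicts the least frequently used model."""

    def __init__(self, capacity, fetch_model):
        self.capacity = capacity
        self.fetch_model = fetch_model  # assumed stand-in for a backing model store
        self.resident = {}              # model_id -> loaded model
        self.use_counts = Counter()     # model_id -> number of requests served

    def get(self, model_id):
        """Return (model, evicted_id); evicted_id is None if nothing was evicted."""
        evicted = None
        if model_id not in self.resident:
            # Step 1: the requested model is not resident, so it must be loaded.
            if len(self.resident) >= self.capacity:
                # Step 2: memory is full, so pick the least frequently used victim.
                evicted = min(self.resident, key=lambda m: self.use_counts[m])
                del self.resident[evicted]
                del self.use_counts[evicted]
            # Step 3: fetch the model and make it resident.
            self.resident[model_id] = self.fetch_model(model_id)
        self.use_counts[model_id] += 1
        return self.resident[model_id], evicted


cache = LfuModelCache(capacity=2, fetch_model=lambda m: f"weights:{m}")
cache.get("a")
cache.get("a")
cache.get("b")
model, evicted = cache.get("c")  # "b" has the fewest uses and is evicted
```

The three commented steps correspond to the three determinations recited in the claim: whether a version is resident, whether one must be loaded, and whether and which resident model must be evicted.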

As per Claim 14, the combination of Chu, Liu, and Kandoi teaches the method of claim 1.  Chu teaches as implemented as a cloud service (Chu, Para [0054], discloses:  “Data transmission network 100 may also include one or more cloud networks 116. Cloud network 116 may include a cloud infrastructure system that provides cloud services.”)

As per Claim 15, the combination of Chu, Liu, and Kandoi teaches the method of claim 1.  Chu teaches as embodied in a set of machine-readable instructions stored in a non-transitory memory device and executable on a processor. (Chu, Para [0005], discloses:  “In another example, a non-transitory computer-readable medium can include instructions that are executable by a processing device for causing the processing device to perform operations”).

Claims 7, 8, 10-11, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Chu, Liu, and Kandoi in view of Petev et al. (US 2006/0248124 A1; hereinafter “Petev”).
As per Claim 7, the combination of Chu, Liu, and Kandoi teaches the method of claim 6.  Kandoi teaches wherein the evicting includes evicting other resident models that are previously stored from the resident memory of the prediction service server, is according to the [preset eviction/loading policy including] memory space and popularity of a version. (Kandoi, Para [0050], discloses:  “When a model inference request comes in that there is not a corresponding model in either the loaded model caches 505 or 509, or the overflow model cache 511, a call is made to the MASS to fetch the model, the least frequently used model is evicted from the caches, and the fetched model is loaded for execution and used to generate an inference.”  Here, Kandoi discloses evicting other resident models that are previously stored (“evicted from the caches”), and this is based on memory space (intrinsically, as loading and removing is done from memory in order to manage used memory space) and on popularity of a version (“the least frequently used model”)).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Kandoi with Chu and Liu for at least the reasons recited in Claim 1.
However, the combination of Chu, Liu, and Kandoi does not explicitly teach wherein the determining of an eviction and a loading is based on a preset eviction/loading policy.
Petev teaches wherein the determining of an eviction and a loading is based on a preset eviction/loading policy (Petev Para [0087]:  “Several functionalities are also associated with eviction policy plug-in 610. These functionalities include Sorting 611, Eviction Timing 612, and Object Key Attribution 613. The various functionalities of eviction policy plug-in 610, which also define a treatment of objects in local memory cache 630 and shared memory cache 632, are described in greater detail further below with respect to FIGS. 13a,b-15.”)
Petev and the combination of Chu, Liu, and Kandoi are analogous art because the problem faced by Petev is pertinent to the problem faced by Chu, Liu, and Kandoi (see MPEP 2141.01(a):  “Rather, a reference is analogous art to the claimed invention if: (1) the reference is from the same field of endeavor as the claimed invention (even if it addresses a different problem); or (2) the reference is reasonably pertinent to the problem faced by the inventor (even if it is not in the same field of endeavor as the claimed invention). See Bigio, 381 F.3d at 1325, 72 USPQ2d at 1212.”)  Also, Petev, like Kandoi, discusses loading and eviction from a cache.
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the combination of Chu, Liu, and Kandoi to use an eviction policy plug-in as disclosed in Petev to manage the loading and evicting of the champion models. The combination would have been obvious because a person of ordinary skill in the art would want to achieve efficient memory management, by making room for faster usage of more commonly used models (Petev [0105]:  “Caches, either local or shared, have limited storage capacities. As such, a cache may require a procedure to remove lesser used objects in order, for example, to add new objects to the cache”).

As per Claim 8, the combination of Chu, Liu, Kandoi, and Petev teaches the method of claim 7.  Petev teaches wherein loading and eviction decisions are determined in accordance with an eviction/loading policy that comprises one of: a first-in/first-out (FIFO) policy; a least-recently-used (LRU) policy; and a potential gain in a confidence level between different versions of an AI model stored in the memory. (Petev, Para [0109], discloses:  “FIG. 13B illustrates a more detailed perspective of various types of sorting components 611 that may be chosen for use within a particular eviction policy plug-in 603. In one embodiment, four types of queues may be implemented by sorting component 611: 1) a Least Recently Used (LRU) queue 617; 2) a Least Frequently Used (LFU) queue 618; 3) a size-based queue 619; and, 4) a First In First Out (FIFO) queue 621.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Petev with the combination of Chu, Liu, and Kandoi for at least the reasons recited in Claim 7.
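For illustration only, the four queue types that Petev Para [0109] lists for the sorting component (LRU, LFU, size-based, and FIFO) can be sketched as interchangeable orderings over the same cache entries. This sketch is not taken from Petev; the names `Entry`, `SORT_KEYS`, and `choose_victim` are hypothetical, and treating the largest entry as the first victim under the size-based ordering is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    # Hypothetical bookkeeping for one cached object.
    key: str
    size: int           # bytes (or any consistent unit)
    inserted_at: int    # logical time of insertion
    last_used_at: int   # logical time of most recent access
    use_count: int      # number of accesses


# Each policy is a sort key; the entry with the smallest key is evicted first.
SORT_KEYS = {
    "LRU": lambda e: e.last_used_at,   # least recently used goes first
    "LFU": lambda e: e.use_count,      # least frequently used goes first
    "SIZE": lambda e: -e.size,         # assumption: largest object goes first
    "FIFO": lambda e: e.inserted_at,   # oldest insertion goes first
}


def choose_victim(entries, policy):
    """Return the key of the entry that the named policy would evict."""
    return min(entries, key=SORT_KEYS[policy]).key


entries = [
    Entry("a", size=10, inserted_at=1, last_used_at=9, use_count=5),
    Entry("b", size=50, inserted_at=2, last_used_at=3, use_count=7),
    Entry("c", size=20, inserted_at=3, last_used_at=8, use_count=1),
]
victims = {policy: choose_victim(entries, policy) for policy in SORT_KEYS}
```

The same entries yield different victims depending on which preset policy is selected, which is the sense in which the eviction decision is based on the policy.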

As per Claim 10, the combination of Chu, Liu, Kandoi, and Petev teaches the method of claim 7 as well as a compressed model (see Rejection to Claim 1) and a preset eviction/loading policy (see Rejection to Claim 7).  Chu teaches wherein the preset eviction/loading policy comprises a quick load policy that always first loads a low fidelity, compressed model that quickly returns a lower confidence score. (While Petev discloses cache management, the combination with Chu’s selection of a compressed model based on various metrics suggests first loading a model based on those metrics.  Chu, Para [0169], discloses:  “In some examples, the system can select as the candidate champion model (i) the most accurate model among the candidate models, (ii) the model that requires the least amount of computation time among the candidate models, (iii) the model that requires the least amount of memory usage among the candidate models, (iv) the model that requires the least amount of processing power or processing cycles among the multiple candidate models, (v) the model that is most easily interpreted according to predefined criteria, (vi) the model that has a least amount of predictors, or (vii) any combination of these.”  If a more compressed model is chosen, then the accuracy will be lower, which would then return a lower confidence score.  Chu, Para [0174], discloses a confidence score:  “The predicted values can then be compared to the actual values from the sensors or other electronic devices to determine the accuracy of the champion model. In some such examples, if the accuracy of the mode is high, the champion model can be assigned a high performance score, and if the accuracy of the model is low, the champion model can be assigned a low performance score.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Petev with the combination of Chu, Liu, and Kandoi for at least the reasons recited in Claim 7.

As per Claim 11, the combination of Chu, Liu, Kandoi, and Petev teaches the method of claim 10.  Chu teaches wherein the quick load policy further then loads a higher fidelity model for subsequent requests by the user or device.  (While Petev discloses cache management, the combination with Chu’s selection of a compressed model based on various metrics suggests loading a model based on those metrics.  Chu, Para [0169], discloses:  “In some examples, the system can select as the candidate champion model (i) the most accurate model among the candidate models, (ii) the model that requires the least amount of computation time among the candidate models, (iii) the model that requires the least amount of memory usage among the candidate models, (iv) the model that requires the least amount of processing power or processing cycles among the multiple candidate models, (v) the model that is most easily interpreted according to predefined criteria, (vi) the model that has a least amount of predictors, or (vii) any combination of these.”  Here, Chu discloses that a higher fidelity model may be chosen (“the most accurate model among the candidate models”)).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Petev with the combination of Chu, Liu, and Kandoi for at least the reasons recited in Claim 7.

As per Claim 20, the combination of Chu, Liu, and Petev teaches the method of claim 17.  Petev teaches the preset eviction/loading policy comprising one of: a first-in/first-out (FIFO) policy; a least-recently-used (LRU) policy; and a policy that considers a potential gain in confidence level or fidelity level between different versions of AI models (Petev, Para [0109], discloses:  “FIG. 13B illustrates a more detailed perspective of various types of sorting components 611 that may be chosen for use within a particular eviction policy plug-in 603. In one embodiment, four types of queues may be implemented by sorting component 611: 1) a Least Recently Used (LRU) queue 617; 2) a Least Frequently Used (LFU) queue 618; 3) a size-based queue 619; and, 4) a First In First Out (FIFO) queue 621.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Petev with the combination of Chu and Liu for at least the reasons recited in Claim 17.
However, Petev does not teach and further comprising evicting versions of the AI models that are not selected from the resident memory.
Kandoi teaches and further comprising evicting versions of the AI models that are not selected from the resident memory. (Kandoi, Para [0046], discloses:  “FIG. 5 illustrates embodiments of an inference service on a single host. The entry point of the inference service 500 is the inference orchestration service 501. The inference orchestrator service 501 generates interpretations from text. In some embodiments, the inference orchestrator service 501 comprises a plurality of software modules to perform artifact/bundle management, pre-processing, recognizing, resolving (slot resolution), context managing (e.g., dialog act support, context carryover), connecting with the data hub 311 to provide results of an inference, and connecting with the MASS 313 to bring in a ML model to disk 511 or cache 505/509 or evict a model to the MASS 313.”  Here, Kandoi discloses a prediction service server (“inference service” running on “a single host”).  Kandoi also discloses “bring in a ML model to disk 511 or cache 505/509 or evict a model to the MASS 313”.  Kandoi provides more detail in [0050]:  “When a model inference request comes in that there is not a corresponding model in either the loaded model caches 505 or 509, or the overflow model cache 511, a call is made to the MASS to fetch the model, the least frequently used model is evicted from the caches, and the fetched model is loaded for execution and used to generate an inference.”  Here, Kandoi discloses that a model other than the “fetched model” is “evicted”, the other model being that which is “least frequently used”.)
Kandoi and the combination of Chu and Liu are analogous art because they are both in the field of endeavor of hosting machine learning models.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the candidate and champion models at different levels of fidelity of Chu and Liu, with the evicting from memory of a subset of a plurality of models of Kandoi.  One of ordinary skill in the art would be motivated to do so in order to conserve computing resources (Kandoi [0001]:  “Further, memory sizes for a single host typically do not allow for all models to be cached, or for all models to be cached economically.”)

Claims 9 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Chu, Liu, Kandoi, and Petev in view of Li et al. (US 9,946,462 B1; hereinafter “Li”).
As per Claim 9, the combination of Chu, Liu, Kandoi, and Petev teaches the method of claim 7 as well as a preset eviction/loading policy (see Rejection to Claim 7) and an original, uncompressed version of the AI model for a requested prediction and evicts models as necessary (see Rejection to Claim 2).  However, the combination of Chu, Liu, Kandoi, and Petev does not explicitly teach wherein the preset eviction/loading policy comprises a load first policy that always loads an original, uncompressed version of the AI model for a requested prediction and evicts models as necessary, thereby providing a policy of a priority of an accuracy over a latency.
Li teaches wherein the preset eviction/loading policy comprises a load first policy that always loads an original, uncompressed version (Li, Col 15 Line 49 – 63, discloses:  “For example, the order in which the data groups may be compressed (or decompressed) may be configurable based on demand. If the data corresponding to data group two is needed to be updated or retrieved prior to data corresponding to data group one, the CDC can apply priority to access requests. The priority may be based on first in, first out (“FIFO”), a weighted priority based on a total number of access requests corresponding to a group, or other priority. Further, the CDC may keep a cache of uncompressed data for fast retrieval and updating. In this case, the CDC may choose which group to remove from the cache based on parameters of access to those groups. In an example, the CDC may evict a least recently used (“LRU”) group from its cache. The evicted group may then be compressed and stored to the mapping table. There may be many different caching eviction policies that can be used.” Here, Li discloses a load first policy that always loads an original, uncompressed version (“The priority may be based on first in, first out (“FIFO”), a weighted priority based on a total number of access requests corresponding to a group, or other priority. Further, the CDC may keep a cache of uncompressed data for fast retrieval and updating.”).  Examiner notes that the uncompressed model will be the most accurate, and thus this achieves “a policy of a priority of an accuracy over a latency”.  However, Examiner also notes that the claim language “thereby providing a policy of a priority of an accuracy over a latency” is not limiting (see MPEP 2111.04(I):  “Claim scope is not limited by claim language that suggests or makes optional but does not require steps to be performed”).)
Li and the combination of Chu, Liu, Kandoi, and Petev are analogous art because the problem faced by Li is pertinent to the problem faced by Chu, Liu, Kandoi, and Petev (see MPEP 2141.01(a):  “Rather, a reference is analogous art to the claimed invention if: (1) the reference is from the same field of endeavor as the claimed invention (even if it addresses a different problem); or (2) the reference is reasonably pertinent to the problem faced by the inventor (even if it is not in the same field of endeavor as the claimed invention). See Bigio, 381 F.3d at 1325, 72 USPQ2d at 1212.”)
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the combination of Chu, Liu, Kandoi, and Petev to use a load first policy of an uncompressed version as disclosed in Li to manage the loading of the candidate models. The combination would have been obvious because a person of ordinary skill in the art would want to be able to quickly access the uncompressed model, which is the most accurate model (Li, Col 15 Lines 62-63:  “Further, the CDC may keep a cache of uncompressed data for fast retrieval and updating.”).

As per Claim 12, the combination of Chu, Liu, and Kandoi teaches the method of claim 1.  However, the combination of Chu, Liu, and Kandoi does not explicitly teach further comprising making a provision for defining a predetermined eviction/loading policy, the predetermined eviction/loading policy comprising one of: a load first policy that always loads an original, uncompressed model for a prediction and evicts models as necessary, thereby prioritizing an accuracy over a latency; and a quick load policy that always first loads a low fidelity, compressed model to quickly return a prediction result but with a lower confidence score, while immediately thereafter loading a higher fidelity model for subsequent requests.
Petev, which covers centralized cache configuration, also teaches wherein the determining of an eviction and a loading is based on a preset eviction/loading policy (Petev, Para [0087]:  “Several functionalities are also associated with eviction policy plug-in 610. These functionalities include Sorting 611, Eviction Timing 612, and Object Key Attribution 613. The various functionalities of eviction policy plug-in 610, which also define a treatment of objects in local memory cache 630 and shared memory cache 632, are described in greater detail further below with respect to FIGS. 13a,b-15.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Petev with the combination of Chu, Liu, and Kandoi for at least the reasons recited in Claim 7.
However, the combination of Chu, Liu, Kandoi, and Petev thus far fails to teach the predetermined eviction/loading policy comprising one of: a load first policy that always loads an original, uncompressed model for a prediction and evicts models as necessary, thereby prioritizing an accuracy over a latency; and a quick load policy that always first loads a low fidelity, compressed model to quickly return a prediction result but with a lower confidence score, while immediately thereafter loading a higher fidelity model for subsequent requests.
Li teaches the predetermined eviction/loading policy comprising one of: a load first policy that always loads an original, uncompressed model (Li, Col 15 Line 49 – 63, discloses:  “For example, the order in which the data groups may be compressed (or decompressed) may be configurable based on demand. If the data corresponding to data group two is needed to be updated or retrieved prior to data corresponding to data group one, the CDC can apply priority to access requests. The priority may be based on first in, first out (“FIFO”), a weighted priority based on a total number of access requests corresponding to a group, or other priority. Further, the CDC may keep a cache of uncompressed data for fast retrieval and updating. In this case, the CDC may choose which group to remove from the cache based on parameters of access to those groups. In an example, the CDC may evict a least recently used (“LRU”) group from its cache. The evicted group may then be compressed and stored to the mapping table. There may be many different caching eviction policies that can be used.” Here, Li discloses a load first policy that always loads an original, uncompressed version (“The priority may be based on first in, first out (“FIFO”), a weighted priority based on a total number of access requests corresponding to a group, or other priority. Further, the CDC may keep a cache of uncompressed data for fast retrieval and updating.”).  Examiner notes that the uncompressed model will be the most accurate, and thus this achieves “a policy of a priority of an accuracy over a latency”.  However, Examiner also notes that the claim language “thereby providing a policy of a priority of an accuracy over a latency” is not limiting (see MPEP 2111.04(I):  “Claim scope is not limited by claim language that suggests or makes optional but does not require steps to be performed”.))
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Li with the combination of Chu, Liu, Kandoi, and Petev for at least the reasons recited in Claim 9.

Claim 13 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Chu, Liu, and Kandoi in view of Lee (US 2013/0054810 A1; hereinafter “Lee”).
As per Claim 13, the combination of Chu, Liu, and Kandoi teaches the method of claim 1.  Chu teaches as implemented by a prediction service (Chu, Para [0146], discloses:  “In block 1112, the trained machine-learning model is used to analyze the new data and provide a result. For example, the new data can be provided as input to the trained machine-learning model. The trained machine-learning model can analyze the new data and provide a result that includes a classification of the new data into a particular class, a clustering of the new data into a particular group, a prediction based on the new data, or any combination of these.”  Here, Chu discloses “a prediction based on the new data”, and thus discloses that Chu’s method may be called a “prediction service”).
further comprising managing a loading of non-resident AI models from a store of the plurality of AI models to a resident memory by dynamically determining which AI models to evict and which models to load based on policies (Chu, Para [0182], discloses:  “As another example, if the system creates yet another version of the project after determining that the champion model is to be retrained in block 1322, the system can determine that the existing version of the project is to be retired. Retiring the new version of the project can include deleting the new version of the project, removing the new version of the project from the production environment, moving the new version of the project to a repository of retired or unused versions of the project, or any combination of these.” Here, Chu discloses determining that a given currently-resident AI model is to be “retired”, which comprises “deleting” or “moving” the project, based on policies (“system can determine”)).
However, Chu does not explicitly teach which include consideration of service level agreements as related to a plurality of users in order to determine which version of the AI model to use for processing a received prediction request.
Lee teaches which include consideration of service level agreements as related to a plurality of users in order to determine which [version of the AI model] resource to use for processing a received prediction request. (Recall above that Chu discloses versions of an AI model.  Lee, Para [0035], discloses:  “Although in the example shown in FIG. 1, two media service delivery apparatuses 1a and 1b are provided for convenience of explanation, there may be present more than two media service delivery apparatuses, and there may be present two or more users and service providers. In the presence of a plurality of users and service providers, service resources are provided from different service networks in accordance with the service level agreement, and the provided service resources are analyzed to offer an optimal service resource to the user 2.”  Here, Lee discloses consideration of service level agreements as related to a plurality of users in order to determine a resource.)
Lee and the combination of Chu, Liu, and Kandoi are analogous art because the problem faced by Lee is pertinent to the problem faced by Chu, Liu, and Kandoi. (see MPEP 2141.01(a):  “Rather, a reference is analogous art to the claimed invention if: (1) the reference is from the same field of endeavor as the claimed invention (even if it addresses a different problem); or (2) the reference is reasonably pertinent to the problem faced by the inventor (even if it is not in the same field of endeavor as the claimed invention). See Bigio, 381 F.3d at 1325, 72 USPQ2d at 1212.”)
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the combination of Chu, Liu, and Kandoi to use a service level agreement as disclosed in Lee to manage the loading of the candidate models. The combination would have been obvious because a person of ordinary skill in the art would want to be able to maximize user satisfaction (Lee, Para [0035]:  “offer an optimal service resource to the user 2”).

Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over Chu in view of Liu.
As per Claim 16, Chu teaches A method, comprising: storing at least one artificial intelligence (AI) model in a model store memory in a plurality of different versions, each different version having a different level of fidelity including different levels of model performance in relation to model compression[, including an original version with no loss of fidelity] (Chu, Para [0166], discloses:  “In block 1312, the system selects a candidate champion model to be used with the new version of the project. For example, the system can create multiple versions of the model, which can be referred to as candidate models. The system can then compare the candidate models to determine the best model among multiple candidate models according to a predefined criterion. The system can then select the best model as the candidate champion model, and use the candidate champion model to perform one or more tasks associated with the project.”  Here, Chu discloses storing a plurality of different versions (“the system can create multiple versions of the model, which can be referred to as candidate models”).  Chu, Para [0169], discloses:  “In some examples, the system can select as the candidate champion model (i) the most accurate model among the candidate models, (ii) the model that requires the least amount of computation time among the candidate models, (iii) the model that requires the least amount of memory usage among the candidate models, (iv) the model that requires the least amount of processing power or processing cycles among the multiple candidate models, (v) the model that is most easily interpreted according to predefined criteria, (vi) the model that has a least amount of predictors, or (vii) any combination of these. 
The system can select more than one candidate champion model in some examples.”  Here, Chu discloses each different version having a different level of fidelity including different levels of model performance in relation to model compression, as Chu discloses compression (“the model that requires the least amount of memory usage among the candidate models”), performance (“the most accurate model among the candidate models”), and fidelity, defined by Applicant as “performance in relation to model compression” (“any combination of these”).  Note that “compression” is a broad term that is given no special definition in the Specification, although the Specification does specify that “Relative to model compression, compressed models are typically faster and smaller in terms of memory usage than an original model”, and thus Chu’s disclosure of “requires the least amount of memory usage among the candidate models” discloses compression.
Examiner note:  Liu, which will be combined with Chu below, also explicitly recites levels of compression.  An original with no loss of fidelity will also be taught by Liu.)
receiving a request to process the AI model, the request including input data to be processed by the AI model (Chu, Para [0145], discloses:  “In block 1110, new data is received.”  Chu, Para [0146], discloses:  “In block 1112, the trained machine-learning model is used to analyze the new data and provide a result. For example, the new data can be provided as input to the trained machine-learning model. The trained machine-learning model can analyze the new data and provide a result that includes a classification of the new data into a particular class, a clustering of the new data into a particular group, a prediction based on the new data, or any combination of these.”  Here, Chu discloses receiving (“new data is received”) a prediction request to process the AI model (“trained machine-learning model can analyze the new data and provide a result that includes… a prediction based on the new data”) including input data to be processed by the model (“new data can be provided as input”)).
selecting, using a processor on a computer, which version of the AI model to use for responding to the request (Chu, Para [0045], discloses:  “The computing environment 114 can include one or more processing devices (e.g., distributed over one or more networks or otherwise in communication with one another) that, in some examples, can collectively be referred to as a processor or a processing device.”  Here, Chu discloses a processor.  Chu, Para [0184], discloses:  “After creating the new model, the system can compare the new model to the existing champion model to determine which of the models is the “best” to use in the new version of the project. For example, the system can provide an input value to the new model and to the champion model, and compare outputs from the new model and the champion model to a desired output value that corresponds to the input value. The system can select, as a new champion model, whichever of the two models has an output that is closest to the desired output value or meets some other predefined criterion. For example, if the new model has an output that is closer to the desired output value than an output from the champion model, the system can select the new model as a new champion model for future use (e.g., in performing a task associated with the new version of the project), disregard the existing champion model, or both of these. This process may iterate each time a new model-building tool is added to the system.”  Here, Chu discloses selecting which version of the AI model to use (“the system can select the new model”) in response to the prediction request (“the system can provide an input value to the new model and to the champion model, and compare outputs from the new model and the champion model to a desired output value that corresponds to the input value”)).
loading into a resident memory for processing the selected version of the AI model according to the request; (Chu, Para [0080], discloses resident memory:  “A gridded computing environment may be employed in a distributed system with non-interactive workloads where data resides in memory on the machines, or compute nodes. In such an environment, analytic code, instead of a database management system, can control the processing performed by the nodes. Data is co-located by pre-distributing it to the grid nodes, and the analytic code on each node loads the local data into memory.” Here Chu discloses data “resides in memory”.  Chu also discloses that the model is loaded into resident memory, as they describe the model requiring memory usage in [0168]:  “In some examples, the system can select as the candidate champion model (i) the most accurate model among the candidate models, (ii) the model that requires the least amount of computation time among the candidate models, (iii) the model that requires the least amount of memory usage among the candidate models”.)
processing the input data using the determined version of the AI model; (Chu, Para [0045], discloses:  “The computing environment 114 can include one or more processing devices (e.g., distributed over one or more networks or otherwise in communication with one another) that, in some examples, can collectively be referred to as a processor or a processing device.”  Here, Chu discloses a processor.  Chu, Para [0145], discloses:  “In block 1110, new data is received.”  Here, Chu discloses input data accompanying the received prediction request.  Chu, Para [0146], discloses:  “In block 1112, the trained machine-learning model is used to analyze the new data and provide a result. For example, the new data can be provided as input to the trained machine-learning model. The trained machine-learning model can analyze the new data and provide a result that includes a classification of the new data into a particular class, a clustering of the new data into a particular group, a prediction based on the new data, or any combination of these.”  Here, Chu discloses using a model to process the data for a prediction request (“trained machine-learning model can analyze the new data and provide a result that includes… a prediction based on the new data”).  Chu, Para [0173], discloses:  “In block 1317, the system uses the champion model to perform one or more tasks associated with the project.”  Here, Chu discloses using the determined version of the AI model (“uses the champion model”)).
and providing a result of the processing to the response to the request (Chu, Para [0145], discloses:  “In block 1110, new data is received.”  Here, Chu discloses input data accompanying the received prediction request.  Chu, Para [0146], discloses:  “In block 1112, the trained machine-learning model is used to analyze the new data and provide a result. For example, the new data can be provided as input to the trained machine-learning model. The trained machine-learning model can analyze the new data and provide a result that includes a classification of the new data into a particular class, a clustering of the new data into a particular group, a prediction based on the new data, or any combination of these.”  Here, Chu discloses using a model to process the data for a prediction request (“trained machine-learning model can analyze the new data and provide a result that includes… a prediction based on the new data”).  Chu, Para [0173], discloses:  “In block 1317, the system uses the champion model to perform one or more tasks associated with the project.”  Here, Chu discloses using the determined version of the AI model (“uses the champion model”).  Chu, Para [0180], discloses:  “In some examples, the system can analyze outputs from the champion model to determine a frequency at which to retrain the champion model, and then retrain the champion model at that frequency.”  Here, Chu discloses responding to the request with a result of the processing (“analyze outputs from the champion model”)).
wherein the determining which version of the AI model to use comprises implementing at least one of: 
a determination policy preselected; 
a preset eviction/loading policy that determines whether to evict an AI model currently in a resident memory to accommodate the received request and, if so, which AI model to evict; 
and a preset policy that implements a preset tradeoff involving predetermined ones of any of: a latency, a model performance (accuracy), a confidence, a memory usage, a power consumption, a central processing unit (CPU) usage, and a consideration of a concurrent processing.
(Chu discloses a determination policy preselected, in [0168]:  “In some examples, the system can select as the candidate champion model (i) the most accurate model among the candidate models, (ii) the model that requires the least amount of computation time among the candidate models, (iii) the model that requires the least amount of memory usage among the candidate models, (iv) the model that requires the least amount of processing power or processing cycles among the multiple candidate models, (v) the model that is most easily interpreted according to predefined criteria, (vi) the model that has a least amount of predictors, or (vii) any combination of these.”  One or a combination of these is “preselected”, as it is selected before the model is used.)
However, Chu does not explicitly teach including an original version with no loss of fidelity.
Liu explicitly teaches including an original version with no loss of fidelity (Liu, Col 8 Line 65 – Col 9 Line 12, discloses:  “In some implementations, the selecting may include weighing two or more compression methods. The selection may be performed by the compressor 300 to identify an “optimal” set of quantization parameters. The optimal set of parameters may be identified as the set of parameters providing a compressed model having the highest accuracy, the highest resource savings as compared to the original model, or a combination of both. For example, the compressor 300 may perform multiple compressions for the model using different parameters. Each of the compressed models has an accuracy and resource requirements. The compressor 300 may then select the model having the highest accuracy, the highest resource savings as compared to the original uncompressed model, or a combination of both for further processing.”  Here, Liu discloses “original uncompressed model”.)
Liu also explicitly teaches levels of compression (Liu, Col 8 Line 65 – Col 9 Line 12, as shown above, discloses “multiple compressions for the model using different parameters”).
Liu also explicitly teaches and a preset policy that implements a preset tradeoff involving predetermined ones of any of: a latency, a model performance (accuracy), a confidence, a memory usage, a power consumption, a central processing unit (CPU) usage, and a consideration of a concurrent processing.  (Liu, as shown above, discloses “accuracy”, “resource savings”, and “a combination of both”.  A “combination of both” suggests a balance between resource savings and accuracy (predicted performance).  “Resource savings” may comprise response speed (latency), as Liu discloses in Col 15 Lines 46-62:  “One non-limiting advantage of the error tolerant compression features described is that the compressed model requires a fewer resources than the uncompressed model. For example, the compressed model may require less memory to store than the uncompressed version. As another example, the compressed model may require less bandwidth to transmit than the uncompressed version. As yet another example, the compressed model may be executed more efficiently by a target system, such as an ASR system, than the uncompressed version. The efficiency may be measured by the amount of processing needed to obtain a prediction from the model. The amount of processing may be indicated by, for example, a number of processor cycles, a quantity of memory used, or a duration of time used to obtain a result from the model. Another non-limiting advantage of the features described is that the compression avoids degrading the accuracy of the compressed model.”  Above, Liu describes “fewer resources” and gives an example of “a duration of time used to obtain a result from the model”. Thus, Liu discloses a policy that implements a preset tradeoff involving a latency and a model performance (accuracy)).
Chu and Liu are analogous art because they are both in the field of endeavor of machine learning.
It would have been obvious before the effective filing date of the claimed invention to combine the teachings of Chu and Liu.  One of ordinary skill in the art would be motivated to do so in order to save on time and resources while not sacrificing accuracy of the results (Liu, Col 3 Line 41 – Col 4 Line 45: “Some training systems compress the model to facilitate efficient storage and transfer…In view of the constraints and limitations of NN model compression discussed above, improved devices and methods for error tolerant NN model compression are desirable. The error tolerance may be provided to allow a floating point DNN model to be compressed such that the precision of the model is higher than a conventionally quantized DNN model.”)

Claims 17-19 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Chu, Liu, and Petev.
As per Claim 17, Chu teaches a method in a prediction service (Chu, Para [0146], discloses:  “In block 1112, the trained machine-learning model is used to analyze the new data and provide a result. For example, the new data can be provided as input to the trained machine-learning model. The trained machine-learning model can analyze the new data and provide a result that includes a classification of the new data into a particular class, a clustering of the new data into a particular group, a prediction based on the new data, or any combination of these.”  Here, Chu discloses “a prediction based on the new data”, and thus discloses that Chu’s method may be called a “prediction service”).
the method comprising: 
storing a plurality of artificial intelligence (AI) models in a model store memory, each AI model being stored in a plurality of different versions of a same one of the AI models, each different version having a different level of fidelity according to compression[, including an original version with no loss of fidelity] (Chu, Para [0166], discloses:  “In block 1312, the system selects a candidate champion model to be used with the new version of the project. For example, the system can create multiple versions of the model, which can be referred to as candidate models. The system can then compare the candidate models to determine the best model among multiple candidate models according to a predefined criterion. The system can then select the best model as the candidate champion model, and use the candidate champion model to perform one or more tasks associated with the project.”  Here, Chu discloses storing a plurality of different versions (“the system can create multiple versions of the model, which can be referred to as candidate models”).  Chu, Para [0169], discloses:  “In some examples, the system can select as the candidate champion model (i) the most accurate model among the candidate models, (ii) the model that requires the least amount of computation time among the candidate models, (iii) the model that requires the least amount of memory usage among the candidate models, (iv) the model that requires the least amount of processing power or processing cycles among the multiple candidate models, (v) the model that is most easily interpreted according to predefined criteria, (vi) the model that has a least amount of predictors, or (vii) any combination of these. The system can select more than one candidate champion model in some examples.”  Here, Chu discloses each different version having a different level of fidelity including different levels of model performance in relation to model compression, as Chu discloses compression (“the model that requires the least amount of memory usage among the candidate models”), performance (“the most accurate model among the candidate models”), and fidelity, defined by Applicant as “performance in relation to model compression” (“any combination of these”).  Note that “compression” is a broad term that is given no special definition in the Specification, although the Specification does specify that “Relative to model compression, compressed models are typically faster and smaller in terms of memory usage than an original model”, and thus Chu’s disclosure of “requires the least amount of memory usage among the candidate models” discloses compression.  Examiner note:  Liu, as applied below, teaches an original version with no loss of fidelity.)
receiving a prediction request for processing a requested AI model of the plurality of AI models, the prediction request including input data for the processing of the requested AI model (Chu, Para [0145], discloses:  “In block 1110, new data is received.”  Chu, Para [0146], discloses:  “In block 1112, the trained machine-learning model is used to analyze the new data and provide a result. For example, the new data can be provided as input to the trained machine-learning model. The trained machine-learning model can analyze the new data and provide a result that includes a classification of the new data into a particular class, a clustering of the new data into a particular group, a prediction based on the new data, or any combination of these.”  Here, Chu discloses receiving a prediction request to process the AI model (“trained machine-learning model can analyze the new data and provide a result that includes… a prediction based on the new data”), the request including input data (“new data is received”)).
selecting, using a processor on a computer, which version of the requested AI model will be used to process the input data included with the prediction request after receiving the prediction request (Chu, Para [0045] as shown above, discloses a processor.  Chu, Para [0169], discloses:  “In some examples, the system can select as the candidate champion model (i) the most accurate model among the candidate models, (ii) the model that requires the least amount of computation time among the candidate models, (iii) the model that requires the least amount of memory usage among the candidate models, (iv) the model that requires the least amount of processing power or processing cycles among the multiple candidate models, (v) the model that is most easily interpreted according to predefined criteria, (vi) the model that has a least amount of predictors, or (vii) any combination of these. The system can select more than one candidate champion model in some examples.”  Here, Chu discloses each different version having a different level of fidelity including different levels of model performance in relation to model compression, as Chu discloses compression (“the model that requires the least amount of memory usage among the candidate models”), performance (“the most accurate model among the candidate models”), and fidelity, defined by Applicant as “performance in relation to model compression” (“any combination of these”).  Here, Chu discloses determining which version of the AI model to use for processing the received prediction request (“select as the candidate champion model”). Chu, Para [0184], discloses:  “After creating the new model, the system can compare the new model to the existing champion model to determine which of the models is the “best” to use in the new version of the project. For example, the system can provide an input value to the new model and to the champion model, and compare outputs from the new model and the champion model to a desired output value that corresponds to the input value. The system can select, as a new champion model, whichever of the two models has an output that is closest to the desired output value or meets some other predefined criterion. For example, if the new model has an output that is closer to the desired output value than an output from the champion model, the system can select the new model as a new champion model for future use (e.g., in performing a task associated with the new version of the project), disregard the existing champion model, or both of these. This process may iterate each time a new model-building tool is added to the system.”  Here, Chu discloses selecting which version of the AI model to use (“the system can select the new model”) in response to the prediction request (“the system can provide an input value to the new model and to the champion model, and compare outputs from the new model and the champion model to a desired output value that corresponds to the input value”)).
and whether the version needs to be loaded from the model store memory into a resident memory for processing the input data (Chu, Para [0182], discloses moving models between memory locations:  “If the system determines that the champion model is not to be retrained or rebuilt, the process can proceed to block 1324, where the system can determine if the new version of the project is to be retired. For example, the system can receive input from a user or client device indicating that the new version of the project is no longer being used. The system can respond to the input by determining that the project is to be retired. As another example, if the system creates yet another version of the project after determining that the champion model is to be retrained in block 1322, the system can determine that the existing version of the project is to be retired. Retiring the new version of the project can include deleting the new version of the project, removing the new version of the project from the production environment, moving the new version of the project to a repository of retired or unused versions of the project, or any combination of these.”  Here, Chu discloses “removing the new version of the project from the production environment, moving the new version of the project to a repository of retired or unused versions of the project, or any combination of these.”)
when the requested AI model version is to be loaded from the model store memory into the resident memory, determining whether another AI model currently resident in the resident memory will need to be evicted from the resident memory to accommodate moving the version of the requested AI model into the resident memory (Chu, Para [0182] above, discloses “existing version of the project is to be retired” which may comprise “removing the new version of the project from the production environment, moving the new version of the project to a repository of retired or unused versions of the project, or any combination of these.”  Here, Chu discloses “retiring” models and moving them between different memory locations.)
processing the input data to provide a prediction result; (Chu, Para [0045], discloses:  “The computing environment 114 can include one or more processing devices (e.g., distributed over one or more networks or otherwise in communication with one another) that, in some examples, can collectively be referred to as a processor or a processing device.”  Here, Chu discloses a processor.  Chu, Para [0145], discloses:  “In block 1110, new data is received.”  Here, Chu discloses input data accompanying the received prediction request.  Chu, Para [0146], discloses:  “In block 1112, the trained machine-learning model is used to analyze the new data and provide a result. For example, the new data can be provided as input to the trained machine-learning model. The trained machine-learning model can analyze the new data and provide a result that includes a classification of the new data into a particular class, a clustering of the new data into a particular group, a prediction based on the new data, or any combination of these.”  Here, Chu discloses using a model to process the data for a prediction request (“trained machine-learning model can analyze the new data and provide a result that includes… a prediction based on the new data”).  Chu, Para [0173], discloses:  “In block 1317, the system uses the champion model to perform one or more tasks associated with the project.”  Here, Chu discloses using the determined version of the AI model (“uses the champion model”)).
and responding to the prediction request by transmitting the prediction result. (Chu, Para [0180], discloses:  “In some examples, the system can analyze outputs from the champion model to determine a frequency at which to retrain the champion model, and then retrain the champion model at that frequency.”  Here, Chu discloses responding to the request with a result of the processing (“analyze outputs from the champion model”).  Chu, Para [0044], discloses transmitting data (“Data transmission network 100 is a specialized computer system that may be used for processing large amounts of data where a large number of computer processing cycles are required.”))
However, Chu does not explicitly teach and, when so, determining which currently-resident AI model will be evicted to accommodate the received request, using a preset eviction/loading policy according to compression and fidelity of the AI model version;
Liu teaches including an original version with no loss of fidelity (Liu, Col 8 Line 65 – Col 9 Line 12, discloses:  “In some implementations, the selecting may include weighing two or more compression methods. The selection may be performed by the compressor 300 to identify an “optimal” set of quantization parameters. The optimal set of parameters may be identified as the set of parameters providing a compressed model having the highest accuracy, the highest resource savings as compared to the original model, or a combination of both. For example, the compressor 300 may perform multiple compressions for the model using different parameters. Each of the compressed models has an accuracy and resource requirements. The compressor 300 may then select the model having the highest accuracy, the highest resource savings as compared to the original uncompressed model, or a combination of both for further processing.”  Here, Liu discloses “original uncompressed model”.)
using a [preset eviction/loading] policy according to compression and fidelity of the AI model version (Liu, Col 8 Line 65 – Col 9 Line 12, discloses:  “In some implementations, the selecting may include weighing two or more compression methods. The selection may be performed by the compressor 300 to identify an “optimal” set of quantization parameters. The optimal set of parameters may be identified as the set of parameters providing a compressed model having the highest accuracy, the highest resource savings as compared to the original model, or a combination of both. For example, the compressor 300 may perform multiple compressions for the model using different parameters. Each of the compressed models has an accuracy and resource requirements. The compressor 300 may then select the model having the highest accuracy, the highest resource savings as compared to the original uncompressed model, or a combination of both for further processing.”  Here, Liu discloses different versions of the AI model (“the compressor 300 may perform multiple compressions for the model using different parameters”).  Liu then discloses determining which version of the AI model to use according to compression and fidelity of the AI model (“select the model having the highest accuracy, the highest resource savings as compared to the original uncompressed model, or a combination of both for further processing.”)  Here, Liu explicitly discloses using compression and a combination of compression and accuracy (or “fidelity”) (“combination of both”) in order to determine which model to use.)
Chu and Liu are analogous art because they are both in the field of endeavor of machine learning.
It would have been obvious before the effective filing date of the claimed invention to combine the teachings of Chu and Liu.  One of ordinary skill in the art would be motivated to do so in order to save on time and resources while not sacrificing accuracy of the results (Liu, Col 3 Line 41 – Col 4 Line 45: “Some training systems compress the model to facilitate efficient storage and transfer…In view of the constraints and limitations of NN model compression discussed above, improved devices and methods for error tolerant NN model compression are desirable. The error tolerance may be provided to allow a floating point DNN model to be compressed such that the precision of the model is higher than a conventionally quantized DNN model.”)
Petev teaches when so, determining which [currently-resident AI] model will be evicted to accommodate the received request, using a preset eviction/loading policy (Recall above that Chu discloses a currently resident AI model.  Petev Para [0087]:  “Several functionalities are also associated with eviction policy plug-in 610. These functionalities include Sorting 611, Eviction Timing 612, and Object Key Attribution 613. The various functionalities of eviction policy plug-in 610, which also define a treatment of objects in local memory cache 630 and shared memory cache 632, are described in greater detail further below with respect to FIGS. 13a,b-15.”)
Petev and the combination of Chu and Liu are analogous art because the problem faced by Petev is pertinent to the problem faced by Chu and Liu (see MPEP 2141.01(a):  “Rather, a reference is analogous art to the claimed invention if: (1) the reference is from the same field of endeavor as the claimed invention (even if it addresses a different problem); or (2) the reference is reasonably pertinent to the problem faced by the inventor (even if it is not in the same field of endeavor as the claimed invention). See Bigio, 381 F.3d at 1325, 72 USPQ2d at 1212.”)  Also, Petev, like Kandoi, discusses loading and eviction from a cache.
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the combination of Chu and Liu to use an eviction policy plug-in as disclosed in Petev to manage the loading and evicting of the champion models. The combination would have been obvious because a person of ordinary skill in the art would want to achieve efficient memory management, by making room for faster usage of more commonly used models (Petev [0105]:  “Caches, either local or shared, have limited storage capacities. As such, a cache may require a procedure to remove lesser used objects in order, for example, to add new objects to the cache”).

As per Claim 18, the combination of Chu, Liu, and Petev teaches the method of claim 17. Liu teaches wherein the determining of which version of the AI model to use is based on a policy preset by a user or device that provided the prediction request, the preset policy defining a tradeoff between a speed of receiving the prediction result and a performance of the version of the model for the prediction result. (Liu, Col 8 Line 65 – Col 9 Line 12 shown above, discloses: “Each of the compressed models has an accuracy and resource requirements. The compressor 300 may then select the model having the highest accuracy, the highest resource savings as compared to the original uncompressed model, or a combination of both for further processing.”  Here, Liu discloses a policy agreed upon by a user or device making the request (“accuracy and resource requirements”) that implements a tradeoff between a response speed and a response performance accuracy (“highest accuracy, the highest resource savings as compared to the original uncompressed model, or a combination of both for further processing”), where “resource savings” includes saving time, and thus is a tradeoff regarding response speed.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Liu with Chu for at least the reasons recited in Claim 17.

As per Claim 19, the combination of Chu, Liu, and Petev teaches the method of claim 17.  Liu teaches wherein the preset eviction/loading policy defines a tradeoff between a lower latency period for loading AI models from the model store memory and a lower memory usage versus a degradation of a performance and confidence of prediction results. (Liu, Col 8 Line 65 – Col 9 Line 12 shown above, discloses: “Each of the compressed models has an accuracy and resource requirements. The compressor 300 may then select the model having the highest accuracy, the highest resource savings as compared to the original uncompressed model, or a combination of both for further processing.”  Here, Liu discloses a policy agreed upon by a user or device making the request (“accuracy and resource requirements”) that implements a tradeoff between a response speed and a response performance accuracy (“highest accuracy, the highest resource savings as compared to the original uncompressed model, or a combination of both for further processing”), where “resource savings” includes saving memory, and thus is a tradeoff regarding latency, as one of ordinary skill in the art will appreciate that the more memory an entity uses, the more time required to load it from memory (latency), and thus Liu teaches a tradeoff between latency for loading the models, and performance accuracy.  Also recall that Chu explicitly discloses memory usage in [0169]:  “the model that requires the least amount of memory usage among the candidate models.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Liu with Chu for at least the reasons recited in Claim 17.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to LEONARD A SIEGER whose telephone number is (571)272-9710. The examiner can normally be reached M-F 8:00 am - 5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ann Lo, can be reached at (571) 272-9767. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/L.A.S./Examiner, Art Unit 2126
/VIKER A LAMARDO/Primary Examiner, Art Unit 2126