DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Examiner notes the entry of the following papers:
Amended claims filed 9/21/2022.
Applicant’s arguments/remarks made in amendment filed 9/21/2022.
Claims 1, 15, and 20 are amended.
Claims 17 and 19 stand cancelled.
Claims 1-16, 18, and 20-22 are presented for examination.
Response to Arguments
Applicant presents several arguments.  Each is addressed.
Applicant’s arguments that the prior art of record does not disclose the amended limitations to independent claim 1 and by extension, independent claims 15 and 20 are moot in view of new grounds of rejection.  See detailed rejection below.
Applicant argues that “Plainly, the Examiner’s only incentive or motivation for modifying You using the teachings of Jacobs in the manner suggested in the Office Action results from using Appellant’s disclosure as a blueprint to reconstruct the claimed invention out of isolated teaching in the prior art…” (Remarks, page 12, paragraph 2, line 1.)  However, You is directed to knowledge distillation from an ensemble of expert machine learning models, specifically neural networks. Likewise, Jacobs is directed to an ensemble of expert neural networks.  In addition, Jacobs teaches evaluating a respective performance of each of the expert models based on their outputs, and then determining whether to include the outputs.  This is a feature that could be beneficial to You.  As such, it makes sense and is obvious to combine what is disclosed by Jacobs into You.  Therefore, the rejection is proper and maintained.
Applicant argues that “Only with Applicants’ specification could the structure of claim 1 be attained, and any attempt to arrive at the structure of claim 1 through study of the cited references is only reachable from improper hindsight analysis after viewing Applicants’ specification.” (Remarks, page 12, paragraph 3, line 1.) Applicant seems to be arguing that only Applicant could combine the various limitations of previously disclosed art to arrive at the claimed invention. However, the field of knowledge distillation, in some form, has been around for years, with numerous researchers and engineers actively producing publications and implementations that improve on earlier work by combining and/or adding to ideas from previous implementations. In addition, Applicant fails to point out specifically why or how the motivation for combination presented by Examiner would not be obvious to one of ordinary skill in the art in arriving at the presently described limitations of claim 1. Therefore, the rejection is proper and maintained.
Applicant argues that “Moreover, Jacobs teaches training the expert models and the gating network together (Jacobs pp. 3-4).  Therefore, the gating network Jacobs differs from the machine-learned trust model as claimed (e.g., ‘wherein the one or more machine-learned trust models were trained to evaluate pre-trained machine-learned model performance based on the set of outputs for the respective pre-trained machine-learned model, wherein the one or more machine-learned trust models were trained with a validation dataset, wherein the validation dataset differs from the plurality of respective pre-trained model training datasets’).” (Remarks, page 13, paragraph 3, line 1.) Jacobs was mapped to the first limitation “wherein the one or more machine-learned trust models were trained to evaluate pre-trained machine-learned model performance based on the set of outputs for the respective pre-trained machine-learned model”. (Jacobs, Figure 1. In other words, the gating networks (trust models) of Jacobs were trained to evaluate expert networks (pre-trained machine-learned models) on the set of outputs (performance) of the respective experts.) However, Jacobs was not mapped to the second limitation “wherein the one or more machine-learned trust models were trained with a validation dataset, wherein the validation dataset differs from the plurality of respective pre-trained model training datasets”; Hinton was relied upon for that limitation. (See subparagraph e.)
Applicant argues that “Hinton fails to teach or disclose a machine-learned trust model as claimed, but instead, the cited passage teach training a generalist model on all data and training special models on subsets of the data for very big datasets.” (Remarks, page 13, paragraph 3, line 8.) However, the limitation that describes the trust model recites “one or more machine-learned trust models were trained with a validation dataset, wherein the validation dataset differs from the plurality of respective pre-trained model training datasets” (Claim 1, line 18.) This limitation describes which dataset was used to train the machine-learned trust model and requires that it differ from the plurality of respective pre-trained model training datasets. Hinton teaches this. (Hinton, page 6, paragraph 2, line 1 “When the number of classes is very large, it makes sense for the cumbersome model to be an ensemble that contains one generalist model trained on all the data and many ‘specialist’ models, each of which is trained on data that is highly enriched in examples from a very confusable subset of the classes…”. In other words, Examiner interprets the generalist model as the one or more machine-learned trust models, trained on all the data as trained with the validation dataset, the specialist model as the pre-trained model, and trained on data that is highly enriched as a dataset that differs from the validation dataset.) Therefore, the rejection is proper and maintained. See detailed rejection.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 2, 6-11, 13-15, 20, and 21 are rejected under 35 U.S.C. 103 as being unpatentable over You et al (Learning from Multiple Teacher Networks, herein You), Ba, et al (Do Deep Nets Really Need to be Deep?, herein Ba), Jacobs et al (Adaptive Mixtures of Local Experts, herein Jacobs), and Hinton et al (Distilling the Knowledge in a Neural Network, herein Hinton). 
Regarding claim 1, 
	You teaches a computer-implemented method for performing knowledge distillation (You, Figure 1, and page 1285, column 1, paragraph 1, line 7 “In this paper, we present a method to train a thin deep network by incorporating multiple teacher networks not only in output layer by averaging the softened outputs (dark knowledge) from different networks, but also in the intermediate layers by imposing a constraint about the dissimilarity among examples.” In other words, method is computer-implemented method, and to train a thin deep network is for performing knowledge distillation.), the method comprising: 


[Image: media_image1.png]

	obtaining, by one or more computing devices, an initial training dataset that comprises a set of training examples (You, page 1287, column 2, paragraph 4, line 5 “Formally, given a student network NS parametrized by ΘS and a training dataset [image: media_image2.png],…” In other words, training dataset is initial dataset that comprises a set of training examples.);
	obtaining, by the one or more computing devices, a plurality of sets of outputs respectively produced for the set of training examples by processing the set of examples with a plurality of pre-trained machine-learned models (You, page 1285, column 2, paragraph 2, line 2 “The well-trained wide deep networks are naturally regarded as teachers, which have the capability of guiding the training of a new student network of the smaller size.” And, page 1287, column 2, paragraph 4, line 5 “Formally, given a student network NS parametrized by ΘS and a training dataset [image: media_image2.png], our goal is to regularize its both output layer and intermediate layers using the knowledge transferred from multiple teacher networks [image: media_image3.png] parameterized by [image: media_image4.png].” And, page 1287, column 1, paragraph 3, line 10 “Moreover, y refers to the ground-truth label vector; [image: media_image5.png] and [image: media_image6.png] are the softened outputs of [image: media_image7.png] and [image: media_image8.png],” In other words, [image: media_image5.png] is output of teacher network, [image: media_image6.png] is output of student network, well-trained wide deep networks is pre-trained machine-learned models, [image: media_image2.png] is the set of training examples, and [image: media_image3.png] is a plurality of sets of outputs respectively produced for the set of training examples by processing the set of examples with a plurality of pre-trained machine-learned models.),
each of the plurality of pre-trained machine-learned models having been previously trained to perform a respective task based on a respective pre-trained model training dataset (You, page 1285, column 1, paragraph 1, line 7 “In this paper, we present a method to train a thin deep network by incorporating multiple teacher networks not only in output layer by averaging the softened outputs (dark knowledge) from different networks, but also in the intermediate layers by imposing a constraint about the dissimilarity among examples.” In other words, multiple teacher networks is plurality of pre-trained machine-learned models, and train a thin deep network…. by incorporating multiple teacher networks is by a plurality of pre-trained machine-learned models. Examiner notes that in order to be used for training a thin deep network the teacher models, by necessity, must have been trained as well.), wherein
the initial training dataset differs from one or more of the plurality of respective pre-trained model training datasets (You, page 1287, column 1, paragraph 2, line 1 “By regarding the network with large depth but thin width as a student network, off-the-shelf teacher networks can be applied to boost its training process and the resulting performance.”  In other words, off-the-shelf teacher networks are a plurality of respective pre-trained models trained with different datasets than the initial training dataset of the student network.); wherein
[the plurality of sets of outputs are obtained without accessing the plurality of respective pre-trained machine-learned models;]
[evaluating, using one or more machine-learned trust models and by the one or more computing devices, a respective performance of each pre-trained machine-learned model based at least in part on the set of outputs generated by the respective pre-trained machine-learned model; wherein]
[the one or more machine-learned trust models were trained to evaluate pre-trained machine-learned model performance based on the set of outputs for the respective pre-trained machine-learned model,] 
[wherein the one or more machine-learned trust models were trained with a validation dataset, wherein the validation dataset differs from the plurality of respective pre-trained model training datasets;]
[determining, by the one or more computing devices, for the set of outputs generated by each pre-trained machine-learned model, whether to include one or more outputs of the set of outputs in a distillation training dataset based at least in part on the respective performance of such pre-trained machine-learned model as determined by the one or more machine-learned trust models,] 
	wherein the distillation training dataset comprises a first output generated by a first pre-trained machine-learned model associated with a first training example and a second output generated by a second pre-trained machine-learned model associated with a second training example (You, page 1286, column 1, paragraph 3, line 1 “In this paper, we investigate training a thin and deep student network by integrating the knowledge from multiple teacher networks.” And, page 1285, column 1, paragraph 1, line 7 “In this paper, we present a method to train a thin deep network by incorporating multiple teacher networks not only in output layer by averaging the softened outputs (dark knowledge) from different networks, but also in the intermediate layers by imposing a constraint about the dissimilarity among examples.” And, page 1290, column 2, paragraph 6, line 8 “…then we train the ultimate model by the whole training dataset (including validation dataset) under the selected parameter configuration.” In other words, the whole training dataset is distillation training dataset, teacher is pre-trained machine-learned model, multiple teachers produce multiple different outputs, dissimilarity among examples is a first training example different from a second training example, and incorporating multiple teacher networks… in output layer is a first output generated by a first pre-trained machine-learned model associated with a first training example and a second output generated by a second machine-learned model associated with a second training example.) ; and
training, by the one or more computing devices, a distilled machine-learned model using at least a portion of the distillation training dataset (You, page 1285, column 1, paragraph 1, line 7 “In this paper, we present a method to train a thin deep network by incorporating multiple teacher networks not only in output layer by averaging the softened outputs (dark knowledge) from different networks, but also in the intermediate layers by imposing a constraint about the dissimilarity among examples.” In other words, a thin deep network is a distilled machine-learned model, and softened outputs is at least a portion of the distillation dataset.).
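For illustration only, the averaging of softened teacher outputs relied upon in the above mapping of You may be sketched as follows. The sketch assumes that each teacher's softened output is already available as a probability vector and that the student is scored with a cross-entropy objective; the function names `average_soft_targets` and `cross_entropy` and all numeric values are hypothetical and do not appear in You.

```python
import math

def average_soft_targets(teacher_probs):
    """Average per-class probabilities across teachers for one training example."""
    n = len(teacher_probs)
    return [sum(col) / n for col in zip(*teacher_probs)]

def cross_entropy(target, student_probs, eps=1e-12):
    """Cross-entropy of the student's probabilities against the averaged target."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, student_probs))

# Hypothetical softened outputs of three teachers for one 3-class example.
teacher_probs = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.8, 0.1, 0.1]]
target = average_soft_targets(teacher_probs)

# A student that agrees with the averaged target incurs a lower loss
# than a student that disagrees with it.
loss_close = cross_entropy(target, [0.7, 0.2, 0.1])
loss_far = cross_entropy(target, [0.1, 0.2, 0.7])
```

In this sketch the averaged vector serves as the training target for the student, which is one simple reading of "averaging the softened outputs" in the quoted passage.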
Thus far, You does not explicitly teach the plurality of sets of outputs are obtained without accessing the plurality of respective pre-trained machine-learned models.
Ba teaches the plurality of sets of outputs are obtained without accessing the plurality of respective pre-trained machine-learned models (Ba, page 2, paragraph 3, line 1 “On both TIMIT and CIFAR-10 we use model compression to train shallow mimic nets using data labeled by either a deep net, or an ensemble of deep nets, trained on the original TIMIT or CIFAR-10 training data.”  In other words, data labeled by an ensemble of deep nets is a plurality of outputs, and training the mimic nets on the data is training without accessing the pre-trained machine-learned models.);
Both Ba and You are directed to knowledge distillation, among other things. You teaches training a student network from multiple teacher networks, but does not explicitly teach that the student is trained without accessing the multiple teacher networks. Ba teaches knowledge distillation in which the student network is trained from a dataset comprising the outputs of the pre-trained teacher networks, without accessing the pre-trained teacher networks. In view of the teaching of You, it would be obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Ba into You. This would result in a knowledge distillation method that can train a student network from the outputs of the teacher networks, without accessing the teacher networks.
	One of ordinary skill in the art would be motivated to do this because inference with shallow neural networks is faster and requires less space, but is typically less accurate.  Compression allows for the possibility that you can train a shallow neural network from a pre-trained deep neural network, or a pre-trained ensemble of deep neural networks, thereby reducing training time, and speeding up execution, without reducing accuracy. (Ba, page 2, paragraph 2 “Surprisingly, often it is not (yet) possible to train a small neural net on the original training data to be as accurate as the complex model, nor as accurate as the mimic model. Compression demonstrates that a small neural net could, in principle, learn the more accurate function, but current learning algorithms are unable to train a model with that accuracy from the original training data; instead, we must train the complex intermediate model first and then train the neural net to mimic it. Clearly, when it is possible to mimic the function learned by a complex model with a small net, the function learned by the complex model wasn’t truly too complex to be learned by a small net. This suggests to us that the complexity of a learned model, and the size and architecture of the representation best used to learn that model, are different things.”)
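As a minimal sketch of the combination discussed above, a student may be fit to stored teacher outputs without the teacher being invoked during student training. The sketch assumes a one-dimensional linear student and a squared-error objective; the name `train_student`, the learning rate, and the stored records are hypothetical and are not taken from Ba or You.

```python
def train_student(records, lr=0.1, epochs=500):
    """Fit y ~ w*x + b by gradient descent on stored (input, teacher_output) pairs."""
    w, b = 0.0, 0.0
    n = len(records)
    for _ in range(epochs):
        # Gradients of the mean squared error between student and stored teacher output.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in records) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in records) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical records labeled earlier by a teacher; only the stored outputs
# are read here, and no teacher model is called.
stored = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = train_student(stored)
```

The point of the sketch is only that the training loop consumes a dataset of previously produced outputs; the student architecture and loss would differ in practice.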
	Thus far, the combination of You and Ba does not explicitly teach evaluating, using one or more machine-learned trust models and by the one or more computing devices, a respective performance of each pre-trained machine-learned model based at least in part on the set of outputs generated by the pre-trained machine-learned model.
Jacobs teaches evaluating, using one or more machine-learned trust models and by the one or more computing devices, a respective performance of each pre-trained machine-learned model based at least in part on the set of outputs generated by the respective pre-trained machine-learned model (Jacobs, page 1, paragraph 2, line 3 “If we know in advance that a set of training cases may be naturally divided into subsets that correspond to distinct subtasks, interference can be reduced by using a system composed of several different “expert” networks plus a gating network that decides which of the experts should be used for each training case. Hampshire and Waibel (1989) have described a system of this kind that can be used when the division into subtasks is known prior to training, and Jacobs, Jordan and Barto (1990) have described a related system that learns how to allocate cases to experts.   The idea behind such a system is that the gating network allocates a new case to one or a few experts, and, if the output is incorrect, the weight changes are localized to those experts (and the gating network).  So there is no interference with the weights of other experts that specialize in quite different cases.  The experts are therefore local in the sense that the weights in one expert are decoupled from the weights in other experts.  In addition, they will often be local in the sense that each expert will be allocated to only a small local region of the space of possible input vectors.” And, page 4, paragraph 1, “Notice that in this new error function, each expert is required to produce the whole of the output vector rather than a residual.  
As a result, the goal of a local expert on a given training case is not directly affected by the weights within other local experts… If both the gating network and the local experts are trained by gradient descent in this new error function, the system tends to devote a single expert to each training case.” In other words, gating network is machine-learned trust model, expert networks are pre-trained machine-learned models, each expert is required to produce the whole of the output vector is the set of outputs generated by the respective pre-trained machine-learned model, and gating network that decides which of the experts should be used for each training case is evaluating using one or more machine-learned trust models a respective performance of each pre-trained machine-learned model based at least in part on the set of outputs generated by the pre-trained machine-learned model.)
Jacobs teaches wherein the one or more machine-learned trust models were trained to evaluate pre-trained machine-learned model performance based on the set of outputs for the respective pre-trained machine-learned model, (Jacobs, see above mapping. In other words, gating network is one or more machine-learned trust models, expert networks are pre-trained machine-learned models, output vector is set of outputs, and gating network…trained by gradient descent is machine-learned trust model trained to evaluate the pre-trained machine-learned model based on the set of outputs for the respective pre-trained machine-learned model.)
Jacobs teaches determining, by the one or more computing devices, for the set of outputs generated by each pre-trained machine-learned model, whether to include one or more outputs of the set of outputs in a [distillation training dataset (see You, page 1290, column 2, paragraph 6, line 8; page 7 of Office action)] based at least in part on the respective performance of such pre-trained machine-learned model as determined by the one or more machine-learned trust models (Jacobs, Figure 1, and page 2, paragraph 4, line 1 “Instead of linearly combining the outputs of the separate experts, we imagine that the gating network makes a stochastic decision about which single expert to use on each occasion (see figure 1).” In other words, expert is pre-trained machine-learned model, gating network is machine-learned trust model, and gating network…compares the performance of different experts… to know how to revise its assignment probabilities is determining…which expert to use for a particular input…based at least in part on the respective performance of the pre-trained machine-learned models, as determined by the machine-learned trust model.)
Both Jacobs and the combination of You and Ba are directed to machine learning using ensemble schemes, among other things.  The combination of You and Ba teaches a method for knowledge distillation, but does not explicitly teach a method for a mixture of experts. Jacobs teaches a method for a mixture of experts.  Jacobs teaches determining, for the set of outputs generated by each pre-trained machine-learned model, whether to include one or more outputs of the set of outputs… based at least in part on the respective performance of such pre-trained machine-learned model as determined by the one or more machine-learned trust models but does not teach a distillation training set. In view of the teaching of the combination of You and Ba, it would be obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Jacobs into the combination of You and Ba. This would result in being able to select which teacher network to use for a specific training example based at least in part on the prior output of the respective network to produce the distillation training set. 
One of ordinary skill in the art would be motivated to do this because if a training set can be divided into subsets that correspond to distinct subtasks, interference can be reduced by using a system composed of several different “expert” networks thus improving efficiency. (Jacobs, page 1, paragraph 2, line 1 “If backpropagation is used to train a single, multilayer network to perform different subtasks on different occasions, there will generally be strong interference effects which lead to slow learning and poor generalization.  If we know in advance that a set of training cases may be naturally divided into subsets that correspond to distinct subtasks, interference can be reduced by using a system composed of several different “expert” networks plus a gating network that decides which of the experts should be used for each training case.”)
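The stochastic selection of a single expert by a gating network, as quoted from Jacobs above, may be sketched as follows. The sketch assumes the gating network has already produced a probability for each expert; the name `pick_expert`, the fixed seed, and the probability values are hypothetical and are not drawn from Jacobs.

```python
import random

def pick_expert(gate_probs, rng):
    """Sample the index of one expert according to the gating probabilities."""
    return rng.choices(range(len(gate_probs)), weights=gate_probs, k=1)[0]

# Hypothetical gating probabilities for three experts on one training case.
gate_probs = [0.05, 0.90, 0.05]
rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Over many draws, the expert favored by the gate is selected most often.
counts = [0, 0, 0]
for _ in range(1000):
    counts[pick_expert(gate_probs, rng)] += 1
```

Sampling, rather than always taking the highest-probability expert, mirrors the "stochastic decision" language of the quoted passage.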
	Thus far, the combination of You, Ba, and Jacobs does not explicitly teach wherein the one or more machine-learned trust models were trained with a validation dataset, wherein the validation dataset differs from the plurality of respective pre-trained model training datasets.
	Hinton teaches wherein the one or more machine-learned trust models were trained with a validation dataset, wherein the validation dataset differs from the plurality of respective pre-trained model training datasets (Hinton, page 1, paragraph 1, line 11 “We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse.  Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.” And, page 2, paragraph 5, line 1 “The transfer set that is used to train the small model could consist entirely of unlabeled data [1] or we could use the original training set.”  And, page 2, paragraph 6, line 1, “Neural networks typically produce class probabilities by using a “softmax” output layer that converts the logit, zi, computed for each class into a probability, qi, by comparing zi with the other logits.

$$q_i = \frac{\exp(z_i/T)}{\sum_j \exp(z_j/T)}$$ [image: media_image9.png]”

And, page 2, paragraph 3, line 1 “An obvious way to transfer the generalization ability of the cumbersome model to a small model is to use the class probabilities…” and, page 6, paragraph 2, line 1 “When the number of classes is very large, it makes sense for the cumbersome model to be an ensemble that contains one generalist model trained on all the data and many “specialist” models, each of which is trained on data that is highly enriched in examples from a very confusable subset of the classes (like different types of mushroom).” In other words, generalist model is machine-learned trust model, trained on all the data is trained with the validation set, specialist model is pre-trained model, and trained on data that is highly enriched is dataset that differs from the validation dataset.)
	Both Hinton and the combination of You, Ba, and Jacobs are directed to distilling knowledge from teacher networks, among other things. In view of the teaching of the combination of You, Ba, and Jacobs, it would be obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Hinton into the combination of You, Ba, and Jacobs.  This would result in being able to evaluate pre-trained machine-learned models (also known as specialists or experts) based on their performance for the purpose of inclusion in the distillation training set.
	One of ordinary skill in the art would be motivated to do this because it would improve the accuracy of the model as well as the speed of training. (Hinton, page 1, paragraph 1, line 11 “We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse.  Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.”)
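The softmax output layer quoted from Hinton above converts logits z_i into class probabilities q_i; Hinton's distillation additionally divides the logits by a temperature T. The following is a minimal sketch of that computation, in which the function name `softmax_with_temperature` and the example logits are hypothetical.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Return q_i = exp(z_i / T) / sum_j exp(z_j / T) for a list of logits."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 3-class example.
logits = [4.0, 1.0, 0.0]
q_sharp = softmax_with_temperature(logits, T=1.0)
q_soft = softmax_with_temperature(logits, T=5.0)
```

A higher temperature yields a softer distribution over classes, so `q_soft` places less probability on the top class and more on the others than `q_sharp` does.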
Regarding claim 2,
	the combination of You, Ba, Jacobs, and Hinton teaches the computer-implemented method of claim 1,
	wherein the plurality of pre-trained machine-learned models comprise pre-trained classifier models configured to infer one or more classification labels for each training example as an output.  (You, Figure 1, and page 1285, column 2, paragraph 2, line 2 “The well-trained wide deep networks are naturally regarded as teachers, which have the capability of guiding the training of a new student network of the smaller size.” In other words, teachers are pre-trained machine-learned models that are pre-trained classifier models, configured to infer one or more classification labels for each training example as an output.)
Regarding claim 6, 
	the combination of You, Ba, Jacobs, and Hinton teaches the computer-implemented method of claim 1,
	wherein evaluating, by the one or more computing devices, the respective performance of each pre-trained machine-learned model based at least in part on the set of outputs generated by the pre-trained machine-learned model comprises training, by the one or more computing devices using the validation dataset, the one or more machine-learned trust models to evaluate the respective performance of each pre-trained machine-learned model. (Hinton, page 6, paragraph 2, line 1 “When the number of classes is very large, it makes sense for the cumbersome model to be an ensemble that contains one generalist model trained on all the data and many “specialist” models, each of which is trained on data that is highly enriched in examples from a very confusable subset of the classes (like different types of mushroom).” And, page 8, paragraph 4, line 3 “At test time we can use the predictions from the generalist model to decide which specialists are relevant and only these specialists need to be run.” In other words, generalist is trust model, trained on all the data is using the validation set, specialist model is pre-trained model, and predictions from the generalist model to decide which  specialists are relevant is evaluating the respective performance of each pre-trained machine-learned model based at least in part on the set of outputs.)
Regarding claim 7,
	the combination of You, Ba, Jacobs, and Hinton teaches the computer-implemented method of claim 6,
	wherein determining, by the one or more computing devices for the set of outputs generated by each pre-trained machine-learned model, whether to include one or more outputs of the set of outputs in the distillation training dataset based at least in part on the respective performance of such pre-trained machine-learned model comprises, for each pre-trained machine-learned model: providing, by the one or more computing devices, each respective training example as an input into at least one of the one or more trust models; and (Hinton, page 6, paragraph 2, line 1, “When the number of classes is very large, it makes sense for the cumbersome model to be an ensemble that contains one generalist model trained on all the data and many “specialist” models, each of which is trained on data that is highly enriched in examples from a very confusable subset of the classes (like different types of mushroom).” In other words, generalist model is trust model, predictions from the generalist model to decide which specialists are relevant (from claim 6 above) is evaluating the respective performance of each pre-trained machine-learned model based at least in part on the set of outputs, and generalist model trained on all the data is providing… each respective training example as an input into at least one of the one or more trust models.)
	receiving, by the one or more computing devices, an output from the at least one of the one or more trust models that indicates whether the corresponding output generated by the pre-trained machine-learned model for the respective training example should be included in the distillation training dataset. (Hinton, page 6, paragraph 6, line 2 “In addition to the specialist models, we always have a generalist model so that we can deal with classes for which we have no specialists and so that we can decide which specialists to use.” In other words, deciding which specialists to use is indicates whether the corresponding output…should be included in the distillation training dataset.)
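The two steps recited in claim 7 (providing each training example to a trust model, and receiving an indication of whether the corresponding teacher output should be included) may be sketched as follows. This is an illustration under stated assumptions only: the trust model is replaced by a fixed stand-in function `toy_trust` that scores an output by its maximum class probability, and the names `build_distillation_dataset`, `toy_trust`, the threshold, and the data are hypothetical and appear in neither the claims nor the cited references.

```python
def toy_trust(teacher_name, example, output):
    """Stand-in for a learned trust model: score an output by its top probability."""
    return max(output)

def build_distillation_dataset(examples, teacher_outputs, trust_fn, threshold=0.7):
    """Keep (example, teacher, output) triples whose trust score meets the threshold."""
    dataset = []
    for i, example in enumerate(examples):
        for name, outputs in teacher_outputs.items():
            if trust_fn(name, example, outputs[i]) >= threshold:
                dataset.append((example, name, outputs[i]))
    return dataset

# Hypothetical stored outputs of two teachers for two training examples.
examples = ["x0", "x1"]
teacher_outputs = {
    "A": [[0.9, 0.1], [0.55, 0.45]],
    "B": [[0.5, 0.5], [0.2, 0.8]],
}
dataset = build_distillation_dataset(examples, teacher_outputs, toy_trust)
```

With these values the resulting dataset holds teacher A's output for the first example and teacher B's output for the second, so different teachers contribute outputs for different training examples.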
Regarding claim 8,
	the combination of You, Ba, Jacobs, and Hinton teaches the computer-implemented method of claim 6,
	wherein the initial training dataset comprises a first portion that is labeled and a second portion that is not labeled (You, page 1290, column 2, paragraph 4, line 1 “Remark 2. Like the generalized distillation [23], our method also can be extended to semi-supervised cases.  Since the losses related with teacher networks in objective Eq. (7) are both label-free, the numerous unlabeled examples can still be involved in the training.”  In other words, a semi-supervised case is one in which a first portion of the training dataset is labeled and a second portion is not labeled.), and wherein
the first portion of the initial training dataset is used as the validation dataset (You, page 1290, column 2, paragraph 4, line 7 “Then the labeled examples are capable of fine tuning the student network to further improve the performance.”  In other words, the labeled examples is the first portion, and fine tuning the student network is used as the validation set.).
Regarding claim 9,
	the combination of You, Ba, Jacobs, and Hinton teaches the computer-implemented method of claim 6, wherein
	the one or more machine-learned trust models each comprise a neural network (Hinton, page 8, paragraph 4, line 3 “At test time we can use the predictions from the generalist model to decide which specialists are relevant and only these specialists need to be run.” And, page 1, paragraph 1, line 3 “Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets.” In other words, the generalist model is trust model, and the generalist model is a neural network.) .
Regarding claim 10,
	the combination of You, Ba, Jacobs, and Hinton teaches the computer-implemented method of claim 1, wherein
	evaluating, by the one or more computing devices, a respective performance of each pre-trained machine-learned model comprises: selecting, by the one or more computing devices, one or more expert models from the plurality of pre-trained machine-learned models by comparing a population statistic determined in part from the plurality of sets of outputs to the set of outputs determined by at least one of the pre-trained machine-learned models.  (Hinton, page 8, paragraph 4, line 3 “At test time we can use the predictions from the generalist model to decide which specialists are relevant and only these specialists need to be run.” And, page 2, paragraph 6, line 1, “Neural networks typically produce class probabilities by using a “softmax” output layer that converts the logit, zi, computed for each class into a probability, qi, by comparing zi with the other logits.

[Equation image: softmax formula, media_image9.png]”

And, page 2, paragraph 3, line 1 “An obvious way to transfer the generalization ability of the cumbersome model to a small model is to use the class probabilities…” In other words, decide which specialists are relevant is selecting…by comparing, class probabilities is population statistic, and comparing zi with the other logits is determined in part… from the plurality of sets of outputs.)
Regarding claim 11,
	the combination of You, Ba, Jacobs, and Hinton teaches the computer-implemented method of claim 10 wherein
	selecting, by the one or more computing devices, one or more expert models comprises comparing the population statistic to each set of outputs determined respectively by each pre-trained machine-learned model.  (Hinton, page 8, paragraph 4, line 3 “At test time we can use the predictions from the generalist model to decide which specialists are relevant and only these specialists need to be run.” And, page 2, paragraph 6, line 1, “Neural networks typically produce class probabilities by using a “softmax” output layer that converts the logit, zi, computed for each class into a probability, qi, by comparing zi with the other logits.

[Equation image: softmax formula, media_image9.png]”

And, page 2, paragraph 3, line 1 “An obvious way to transfer the generalization ability of the cumbersome model to a small model is to use the class probabilities…” In other words, decide which specialists are relevant is selecting… one or more expert models, and use class probabilities is comparing the population statistic to each set of outputs.)
Regarding claim 13,
	the combination of You, Ba, Jacobs, and Hinton teaches the computer-implemented method of claim 1, wherein
	at least 50% of the initial training dataset comprises unlabeled or weakly labeled training examples (Hinton, page 2, paragraph 5, line 1 “The transfer set that is used to train the small model could consist entirely of unlabeled data [1] or we could use the original training set.”  In other words, a transfer set that consists entirely of unlabeled data is a dataset in which at least 50% comprises unlabeled or weakly labeled training examples.).
Regarding claim 14,
	the combination of You, Ba, Jacobs, and Hinton teaches the computer-implemented method of claim 1, wherein
	obtaining the plurality of sets of outputs comprises respectively performing inference on the set of training examples with the plurality of pre-trained machine-learned models (You, page 1285, column 1, paragraph 1, line 7 “In this paper, we present a method to train a thin deep network by incorporating multiple teacher networks not only in output layer by averaging the softened outputs (dark knowledge) from different networks, but also in the intermediate layers by imposing a constraint about the dissimilarity among examples.” In other words, softened outputs is sets of outputs, multiple teacher networks is plurality of pre-trained machine-learned models, and incorporating…output layer by averaging the softened outputs is generated by performing inference on the set of training examples.) .
Claim 15 is a computing system claim corresponding to the computer-implemented method of claim 1. Otherwise, they are the same.  It is implicit that a computer-implemented method requires a computing system comprising one or more processors and one or more non-transitory computer-readable media that collectively store instructions for the one or more processors to execute.  Therefore, claim 15 is rejected for the same reasons as claim 1.
Claim 20 is a non-transitory computer-readable medium claim that corresponds to the computer-implemented method of claim 1.  Otherwise, they are the same.  It is implicit that a computer-implemented method requires one or more non-transitory computer-readable media that store instructions for execution.  Therefore, claim 20 is rejected for the same reasons as claim 1.
Regarding claim 21,
the combination of You, Ba, Jacobs, and Hinton teaches the computing system of claim 15, wherein
whether to include the one or more outputs of the set of outputs is based at least in part on a learning rate of the distilled machine-learned model (Hinton, page 1, paragraph 1, line 11 “We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse.  Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.” In other words, trained rapidly and in parallel is learning rate, and unlike a mixture of experts, these specialist models can be trained rapidly is deciding whether to include one or more outputs is based at least in part on the learning rate.) .
Claims 3, 4, 5, and 22 are rejected under 35 U.S.C. 103 as being unpatentable over You, Ba, Jacobs, and Hinton, in view of Avnimelech et al (Boosted Mixture of Experts: An Ensemble Learning Scheme, herein Avnimelech).
Regarding claim 3,
	the combination of You, Ba, Jacobs, and Hinton teaches the computer-implemented method of claim 2, wherein
	Thus far, the combination of You, Ba, Jacobs, and Hinton does not explicitly teach evaluating, by the one or more computing devices, the respective performance of each pre-trained machine-learned model based at least in part on the set of outputs generated by the pre-trained machine-learned model comprises evaluating, by the one or more computing devices, the respective performance of each pre-trained machine-learned model on a per-classification label basis.
	Avnimelech teaches evaluating, by the one or more computing devices, the respective performance of each pre-trained machine-learned model based at least in part on the set of outputs generated by the pre-trained machine-learned model comprises evaluating, by the one or more computing devices, the respective performance of each pre-trained machine-learned model on a per-classification label basis.  (Avnimelech, page 486, paragraph 1, line 2 “The gating function assigns probability to each of the experts based on the current input.  In the training stage, this value states the probability of a pattern’s appearing in an expert’s training set.  In the test stage, it defines the relative contribution of each expert to the ensemble.  The training attempts to achieve two goals: (1) for a given expert, find the optimal gating function, and (2) for a given gating function, train each expert to achieve maximal performance on the distribution assigned to it by the gating function.” And, page 487, paragraph 2, line 3 “In boosting, the first classifier is trained on all patterns, and the localization criterion for the distributions presented to the two other classifiers is the level of difficulty of the patterns as measured by classification performance.” In other words, gating function is evaluating, classification performance is respective performance where the respective performance is on a per-classification label basis.)
	Both Avnimelech and the combination of You, Ba, Jacobs, and Hinton are directed to ensemble learning schemes, among other things.  Avnimelech calls the ensemble a Mixture of Experts whereas the combination of You, Ba, Jacobs, and Hinton refer to an Ensemble of Teachers or Multiple Teachers. In view of the teaching of the combination of You, Ba, Jacobs, and Hinton it would be obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Avnimelech into the combination of You, Ba, Jacobs, and Hinton.  This would result in being able to evaluate outputs of pre-trained machine-learned models (also known as predictors, or teachers) for the purpose of determining their relative contribution to the distillation training set.
	One of ordinary skill in the art would be motivated to do this in order to improve the performance of the ensemble trained model by partitioning parts of the training set to different classifiers.  (Avnimelech, paragraph 4, line 6 “A different approach is training the classifiers on different parts of the training set, partitioned in a manner such that their distributions differ. Such an approach, which is presented here, combines two algorithms: boosting and mixture of experts (also known as BME).” And, page 494, paragraph 6, line 1 “The results indicate that the performance of an ensemble machine trained with the BME algorithm (and combined appropriately) is significantly better than a standard ensemble (parallel machine).”)
Regarding claim 4,
	the combination of You, Ba, Jacobs, Hinton, and Avnimelech teaches the computer-implemented method of claim 3, wherein
	determining, by the one or more computing devices for the set of outputs generated by each pre-trained machine-learned model, whether to include one or more outputs of the plurality set of outputs in the distillation training dataset comprises determining, by the one or more computing devices for each pre-trained machine-learned model and for each classification label, whether to include in the distillation training dataset all outputs of the set of outputs with the classification label.  (Avnimelech, page 484, paragraph 2, line 1 “Other methods use dynamic linear combination models, using a confidence measure of the ensemble members regarding each pattern.  Different measures of the confidence of each predictor can be used for determining the relative contribution of each expert (Tresp & Taniguchi, 1995; Shimshoni & Intrator, 1996).” And, page 484, paragraph 3, line 6 “A different approach is training the classifiers on different parts of the training set, partitioned in a manner such that their distributions differ.  Such an approach, which is presented here, combines two algorithms: boosting and mixture of experts.” In other words, using a confidence measure for each predictor to determine relative contribution is determining… whether to include one or more outputs in a distillation dataset based, at least in part, on the respective performance of the pre-trained machine-learned model, and training the classifiers on different parts of the training set is for each classification label.)
Regarding claim 5,
	the combination of You, Ba, Jacobs, Hinton, and Avnimelech teaches the computer-implemented method of claim 3, wherein
	determining, by the one or more computing devices for the set of outputs generated by each pre-trained machine-learned model, whether to include one or more outputs of the plurality set of outputs in the distillation training dataset comprises (Avnimelech, page 484, paragraph 2, line 1 “Other methods use dynamic linear combination models, using a confidence measure of the ensemble members regarding each pattern.  Different measures of the confidence of each predictor can be used for determining the relative contribution of each expert (Tresp & Taniguchi, 1995; Shimshoni & Intrator, 1996).” And, page 484, paragraph 3, line 6 “A different approach is training the classifiers on different parts of the training set, partitioned in a manner such that their distributions differ.  Such an approach, which is presented here, combines two algorithms: boosting and mixture of experts.” In other words, determining relative contribution is determining… whether to include… one or more outputs of the plurality set of outputs, and training the classifiers on different parts is for each classification label.), for each classification label:
	selecting, by the one or more computing devices, a highest-performing pre-trained machine-learned model for such classification label; and (Hinton, page 8, paragraph 4, line 3 “At test time we can use the predictions from the generalist model to decide which specialists are relevant and only these specialists need to be run.” In other words, decide which specialists are relevant is selecting…a highest-performing pre-trained machine-learned model for such classification label.)
	including, by the one or more computing devices, in the distillation training dataset all outputs of the set of outputs generated by the highest-performing pre-trained machine-learned model that have such classification label (Hinton, page 6, paragraph 2, line 1 “When the number of classes is very large, it makes sense for the cumbersome model to be an ensemble that contains one generalist model trained on all the data and many “specialist” models, each of which is trained on data that is highly enriched in examples from a very confusable subset of the classes (like different types of mushroom).” And, page 6, paragraph 6, line 12 “The distribution pm is a distribution over all the specialist classes of m plus a single dustbin class, so when computing its KL divergence from the full q distribution we sum all of the probabilities that the full q distribution assigns to all the classes in m’s dustbin.” In other words, specialist models are highest performing pre-trained machine-learned models, and sum all of the probabilities that the full q distribution assigns is including all of the specialist outputs for their specifically assigned classification labels.) .
Regarding claim 22,
the combination of You, Ba, Jacobs, Hinton, and Avnimelech teaches the computing system of claim 15, wherein
	evaluating the respective performance of each pre-trained machine-learned model comprises determining at least one of a respective accuracy or a precision of one or more of the set of outputs for each of the plurality of pre-trained machine-learned models (Avnimelech, page 484, paragraph 2, line 1 “Other methods use dynamic linear combination models, using a confidence measure of the ensemble members regarding each pattern.  Different measures of the confidence of each predictor can be used for determining the relative contribution of each expert (Tresp & Taniguchi, 1995; Shimshoni & Intrator, 1996).” In other words, a confidence measure is based on evaluating respective performance, ensemble member is pre-trained machine-learned model, different measures is a respective accuracy or a precision, and confidence of each predictor is evaluating the respective performance of each pre-trained machine-learned model.) .
Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over You, Ba, Jacobs, and Hinton, in view of Guyon et al (Feature Extraction, Foundations and Applications, herein Guyon).
Regarding claim 12,
	The combination of You, Ba, Jacobs, and Hinton teaches the computer-implemented method of claim 10, wherein
	Thus far the combination of You, Ba, Jacobs, and Hinton does not explicitly teach determining, by the one or more computing devices, whether to include one or more outputs of the plurality of sets of outputs in the distillation training dataset is further based at least in part on a weighting applied to each of the sets of outputs generated by the one or more expert models.  
	Guyon teaches determining, by the one or more computing devices, whether to include one or more outputs of the plurality of sets of outputs in the distillation training dataset is further based at least in part on a weighting applied to each of the sets of outputs generated by the one or more expert models (Guyon, Chapter 11, Ensembles of Regularized Least Squares Classifiers for High-Dimensional Problems, page 311, paragraph 6, Section 11.6.3 Ensemble Postprocessing, line 1 “A well-known avenue to improve the accuracy of an ensemble is to replace the simple averaging of individual experts by a weighting scheme.  Instead of giving equal weight to each expert, the outputs of more reliable experts are weighted up.  Linear regression can be applied to learn these weights.” In other words, using a weighting scheme is based at least in part on a weighting applied to each of the sets of outputs generated by the one or more expert models.) .
	Both Guyon and the combination of You, Ba, Jacobs, and Hinton are directed to using variations of ensemble methods as classifiers. In view of the teaching of the combination of You, Ba, Jacobs, and Hinton, it would be obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Guyon into the combination of You, Ba, Jacobs, and Hinton.  This would result in being able to improve the accuracy of an ensemble by using a weighting to the outputs.
	One of ordinary skill in the art would be motivated to do this to improve the accuracy of the ensemble (also known as pre-trained machine-learned models). (Guyon, page 311, paragraph 6, line 1 “A well-known avenue to improve the accuracy of an ensemble is to replace the simple averaging of individual experts by a weighting scheme.”)
Claims 16 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over You, Ba, Jacobs, and Hinton, in view of Caruana et al (Ensemble Selection from Libraries of Models, herein Caruana).
Regarding claim 16,
	the combination of You, Ba, Jacobs, and Hinton teaches the computing system of claim 15, wherein
	evaluating the respective performance of each pre-trained machine-learned model based at least in part on the set of outputs generated by the pre-trained machine-learned model comprises: training, using the validation dataset comprising ground truth data, the one or more machine-learned trust models to evaluate the respective performance of each pre-trained machine-learned model (Hinton, page 8, paragraph 4, line 3 “At test time we can use the predictions from the generalist model to decide which specialists are relevant and only these specialists need to be run.” And, page 4, paragraph 6, line 1 “Although it is possible (and desirable) to train the DNN in such a way that the decoder (and, thus, the language model) is taken into account by marginalizing over all possible paths, it is common to train the DNN to perform frame-by-frame classification by (locally) minimizing the cross entropy between the predictions made by the net and the labels given by a forced alignment with the ground truth sequence of states for each observation: 
[Equation image: media_image10.png]
where θ are the parameters of our acoustic model P which maps acoustic observations at time t, st, to a probability, P(ht|st; θ’), of the “correct” HMM state ht, which is determined by a forced alignment with the correct sequence of words.” In other words, decide which specialists are relevant is determining… whether to include one or more outputs of the plurality set of outputs in the distillation training dataset, train the DNN to perform…classification… given by a forced alignment with the ground truth is comprising ground truth data, and generalist model is trust model.); and
Thus far, the combination of You, Ba, Jacobs, and Hinton does not explicitly teach determining a respective binary classification for each of the set of outputs, wherein the binary classification is descriptive of a comparison between an output and ground truth data.
Caruana teaches determining a respective binary classification for each of the set of outputs, wherein the binary classification is descriptive of a comparison between an output and ground truth data (Caruana, page 7, paragraph 2, line 1 “Ensemble selection is straightforward for binary classification and regression.  If the base-level models make predictions for multiple classes, no modification is necessary for multi-class problems. If some base-level models make predictions one dichotomy at a time (e.g. SVMs), ensemble selection is easiest if the base-level models are combined so that they return a predicted probability for each class.” In other words, ensemble selection…for binary classification is determining a respective binary classification, and, from the mapping above (Hinton, page 4, paragraph 6, line 1), a forced alignment with the ground truth is a comparison between an output and ground truth data.).
	Both Caruana and the combination of You, Ba, Jacobs, and Hinton are directed to using some variation of ensemble to speed up inference and improve the accuracy of classifiers.  In view of the teaching of the combination of You, Ba, Jacobs, and Hinton, it would be obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Caruana into the combination of You, Ba, Jacobs, and Hinton.  This would result in being able to perform binary classification for comparison between output and ground truth data.
	One of ordinary skill in the art would be motivated to do so in order to improve the accuracy of the method by increasing the diversity of the ensemble by being able to select from thousands of models. (Caruana, page 1, column 1, paragraph 1, line 1 “We present a method for constructing ensembles from libraries of thousands of models.  Model libraries are generated using different learning algorithms and parameter settings.  Forward stepwise selection is used to add to the ensemble the models that maximize its performance.  Ensemble selection allows ensembles to be optimized to performance metrics such as accuracy, cross entropy, mean precision, or ROC Area.”)
Regarding claim 18,
	The combination of You, Ba, Jacobs, Hinton, and Caruana teaches the computing system of claim 15, wherein
	Thus far, the combination of You, Ba, Jacobs, Hinton, and Caruana does not explicitly teach obtaining the plurality of sets of outputs comprises accessing the plurality of sets of outputs from a database that stores previously generated inferences for the plurality of pre-trained machine-learned models.  
	Caruana teaches obtaining the plurality of sets of outputs comprises accessing the plurality of sets of outputs from a database that stores previously generated inferences for the plurality of pre-trained machine-learned models (Caruana, page 1, column 1, line 1 “We present a method for constructing ensembles from libraries of thousands of models. Model libraries are generated using different learning algorithms and parameter settings.  Forward stepwise selection is used to add to the ensemble the models that maximize its performance.” In other words, libraries is database that stores previously generated inferences for the plurality of pre-trained machine-learned models.) .
	Both Caruana and the combination of You, Ba, Jacobs, Hinton, and Caruana are directed to using some variation of ensemble to speed up inference and improve the accuracy of classifiers.  In view of the teaching of the combination of You, Ba, Jacobs, Hinton, and Caruana it would be obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Caruana into the combination of You, Ba, Jacobs, Hinton, and Caruana.  This would result in being able to select from libraries of machine learning models in order to build an accurate ensemble.
	One of ordinary skill in the art would be motivated to do so in order to improve the accuracy of the method by increasing the diversity of the ensemble. (Caruana, page 1, column 1, paragraph 2, line 3 “Dietterich (2000) states that ‘A necessary and sufficient condition for an ensemble of classifiers to be more accurate than any of its individual members is if the classifiers are accurate and diverse.’”)
Conclusion
	Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
	Any inquiry concerning this communication or earlier communications from the examiner should be directed to BART RYLANDER whose telephone number is (571)272-8359. The examiner can normally be reached Monday - Thursday 8:00 to 5:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang, can be reached on 571-270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/B.I.R./Examiner, Art Unit 2124

/MIRANDA M HUANG/Supervisory Patent Examiner, Art Unit 2124