DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 11/06/2020 is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.
Specification
The disclosure is objected to because of the following informalities: 
Page 9, [0046], line 5: “function wth” should read “function with”
Page 12, [0059], lines 3 and 6: “Liptschitz” should read “Lipschitz”
Page 16, [0076], line 1: “doe snot” should read “does not”
Appropriate correction is required.
Drawings
The drawings are objected to because element 2 from Fig. 2 does not appear in the Specification, and element 712 appears in line 6 of [0069] on page 15 of the Specification but does not appear in the drawings. Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. The figure or figure number of an amended drawing should not be labeled as “amended.” If a drawing figure is to be canceled, the appropriate figure must be removed from the replacement sheet, and where necessary, the remaining figures must be renumbered and appropriate changes made to the brief description of the several views of the drawings for consistency. Additional replacement sheets may be necessary to show the renumbering of the remaining figures. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1, 3-4, 11, and 13-14 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Wierzynski (Pub. No. US 20170024641 A1).
Regarding claim 1, Wierzynski teaches a method for distilling knowledge from a first neural network to train a second neural network (Spec. page 1, [0014]), the method comprising: 
receiving a plurality of training samples corresponding to a first set of pre-defined classes from a given dataset (Spec. page 5, [0061] discloses that the model receives training samples, second data. Page 6, [0071], lines 7-9; the second data can be identical to the first data, and the first data corresponds to labels, i.e. classes, as disclosed in [0067], lines 1-4. Therefore the second data can also correspond to the pre-defined first classes of the first dataset); 
retrieving the first neural network that is pre-trained to classify input samples into the first set of pre-defined classes (Spec. page 6, [0067], lines 1-4; the first neural network is trained prior to training the second model to classify data x into labels, or predefined classes, y from the first training set); 
generating one or more out-of-distribution training samples from the plurality of training samples (Spec. page 6, [0076]; a third training set is generated with a new out-of-distribution training sample, a new car model); 
generate a second set of classes by adding an out-of-distribution class to the first set of pre-defined classes (Spec. page 6, [0076]; a new out-of-distribution class is added to account for the new sample); 
obtaining a first plurality of classifications by feeding the plurality of training samples to the first neural network (Spec. page 6, [0072], lines 1-6; the second data, i.e. the first plurality of training samples, are fed to the first neural network to generate the second training set of second data and second labels, i.e. the first plurality of classifications); and 
training the second neural network defined with the second set of classes based on the plurality of training samples, the one or more out-of-distribution training samples and the obtained first plurality of classifications from the first neural network (Spec. page 6, [0072], lines 8-11; the second neural network is trained with the second training set, which includes the second data, i.e. the plurality of training samples, and the second labels, i.e. the first plurality of classifications, from the first network. [0073], lines 1-3; the second neural network may be trained additionally with the third training set which has the out-of-distribution training sample as described above).
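For illustration only, the sequence of steps recited in claim 1 may be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Wierzynski or from the application: the class names, the toy `teacher` rule, and the character-shuffling `make_ood` helper are all hypothetical placeholders chosen to keep the example self-contained.

```python
import random

# Hypothetical stand-ins; none of these names come from the cited reference.
CLASSES = ["sedan", "truck"]        # first set of pre-defined classes
OOD = "out_of_distribution"         # the added out-of-distribution class


def teacher(sample):
    """Toy pre-trained first network: maps a text sample to a label in CLASSES."""
    return "truck" if "truck" in sample else "sedan"


def make_ood(sample, rng):
    """Toy out-of-distribution generator (assumption): shuffle the characters."""
    chars = list(sample)
    rng.shuffle(chars)
    return "".join(chars)


def build_student_training_set(samples, seed=0):
    """Assemble the class set and labeled data used to train the second network."""
    rng = random.Random(seed)
    second_classes = CLASSES + [OOD]                     # second set of classes
    teacher_labels = [teacher(s) for s in samples]       # first plurality of classifications
    ood_samples = [make_ood(s, rng) for s in samples]    # out-of-distribution samples
    data = list(zip(samples, teacher_labels))
    data += [(s, OOD) for s in ood_samples]
    return second_classes, data
```

The second network would then be trained on `data` with an output layer sized to `second_classes`; the training loop itself is omitted here.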

Regarding claim 3, Wierzynski further teaches the method of claim 1, wherein the second neural network has a smaller size than the first neural network, and the second neural network is implementable on a central processing unit (Spec. page 2, [0035], lines 1-5; system is implemented on a chip with a CPU. Page 6, [0072], lines 11-15; the second neural network may be smaller than the first neural network).

Regarding claim 4, Wierzynski further teaches the method of claim 1, further comprising: 
training, using a customer dataset (Spec. page 6, [0069], lines 1-3; the first training set used to train the first neural network as described above with respect to claim 1 may be unavailable for external distribution because of licensing restriction; under broadest reasonable interpretation, the license holder of the training set may be considered a customer of the party training the neural networks), the first neural network to classify input samples into the first set of pre-defined classes, wherein the customer dataset includes the plurality of training samples (Spec. page 6, [0066-67], lines 1-2; the first neural network is trained to classify input samples into the first set of pre-defined classes on the first training set, which includes data samples. Page 6, [0071], lines 7-9; the second data, i.e. the plurality of training samples, can be identical to the first data, therefore the first neural network is trained on a dataset that can include the plurality of training samples).

Regarding claim 11, the claim is directed to a system for distilling knowledge from a first neural network to train a second neural network, the system comprising: 
a communication interface that receives a plurality of training samples; 
a memory containing machine readable medium storing machine executable code; and 
one or more processors coupled to the memory and configurable to execute the machine executable code to cause the one or more processors to perform the features presented in the claimed method of claim 1. Wierzynski teaches a system comprising these elements (Spec. page 1, [0017], lines 1-3; Spec. page 5, [0060], lines 1-2; an example device for running the applications with a model detailed in the disclosure is a smartphone, which has a communication interface. [0061] discloses that the model receives training samples) for performing the method of claim 1; therefore, claim 11 is rejected under the same grounds.

Regarding claim 13, the claim is directed to the system of claim 11 detailed above corresponding to the claimed method of claim 3 and is rejected under the same grounds.

Regarding claim 14, the claim is directed to the system of claim 11 detailed above corresponding to the claimed method of claim 4 and is rejected under the same grounds.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 2 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Wierzynski in view of Sun S, Cheng Y, Gan Z, Liu J. "Patient knowledge distillation for bert model compression." arXiv preprint arXiv:1908.09355. 2019 Aug 25., hereinafter Sun.
Regarding claim 2, Wierzynski teaches all of the elements of the current invention according to the method of claim 1 as stated above. However, Wierzynski does not explicitly teach wherein the first neural network includes any combination of a bidirectional encoder representation from transformers (BERT) model and embeddings from language models (ELMO).
Sun teaches an approach for patiently distilling knowledge from multiple layers of a large teacher model to a smaller student model (Abstract). Sun further discloses that the proposed approach uses BERT as the teacher model as an example, but can also use other pre-trained models as the teacher (Sect. 2 Related Work, “Language Model Pre-training,” paragraph 4), citing ELMo as a pre-trained language model (Sect. 2 Related Work, “Language Model Pre-training,” paragraph 3). 
Adapting Wierzynski’s method for knowledge distillation to use BERT and ELMo as the teacher models as taught by Sun discloses the method of claim 1 (as detailed above), wherein the first neural network includes any combination of a bidirectional encoder representation from transformers (BERT) model and embeddings from language models (ELMO) (the method of Wierzynski, now adapted such that the teacher neural network is a combination of BERT and ELMo, as used for teacher models by Sun).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wierzynski to incorporate the teachings of Sun to provide the features of claim 2. Sun is considered to be analogous to Wierzynski as both disclosures are directed to the distillation of knowledge from one larger teacher model to a smaller student model. Wierzynski teaches the use of deep learning architecture with a hierarchy of features that may recognize spoken phrases (Spec. page 3, [0040], lines 1-2, 12-13). Wierzynski further teaches that new architectures may boost the performance of deep learning (Spec. page 4, [0053], lines 6-8). Sun provides models, ELMo and BERT, that can be used for language processing and are suitable for use as teacher models for knowledge distillation. Therefore, it would have been obvious to combine the features of both disclosures to improve the performance of the neural networks by teaching the student model with a teacher model comprising a combination of BERT and ELMo.

Regarding claim 12, the claim is directed to the system of claim 11 detailed above corresponding to the claimed method of claim 2 and is rejected under the same grounds.

Claims 5-6 and 15-16 are rejected under 35 U.S.C. 103 as being unpatentable over Wierzynski in view of Masataki et al. (Pub. No. US 20190244604 A1), hereinafter Masataki.
Regarding claim 5, Wierzynski teaches the method of claim 1, wherein the training the second neural network defined with the second set of classes comprises: 
generating a second plurality of classification outputs by feeding the plurality of training samples to the second neural network (Spec. page 6, [0072], lines 8-11; the second neural network outputs labels, i.e. classes, for the input training samples, i.e. it generates a second plurality of classification outputs); and 
using backpropagation to update parameters (Spec. page 4, [0046], the disclosure teaches using backpropagation in the training of neural networks to adjust weights, i.e. parameters, by computing a gradient vector to reduce error). 
However, Wierzynski does not explicitly teach computing a knowledge distillation loss between the first plurality of classification outputs and the second plurality of classification outputs; or using backpropagation on the second neural network by the knowledge distillation loss to update parameters for the second neural network.
Masataki discloses a model learning device which uses a parameter of a learned first model including a neural network to set a parameter of a second model including a neural network having a same network structure as the first model (Abstract). Masataki further discloses that the system calculates a cross entropy between the first output probability distribution corresponding to the first neural network and the second output probability distribution corresponding to the second neural network and obtains a weighted sum of a second loss function and the cross entropy. The weighted sum can be considered to be a knowledge distillation loss between the first plurality of classification outputs and the second plurality of classification outputs. The system updates the parameters of the second model to reduce the weighted sum (Spec. page 3, [0017]).
Adapting Wierzynski’s method for knowledge distillation to use the knowledge distillation loss calculated as in Masataki discloses the method of claim 1 (as detailed above), wherein the training the second neural network defined with the second set of classes comprises: 
generating a second plurality of classification outputs by feeding the plurality of training samples to the second neural network (Spec. page 6, [0072], lines 8-11; the second neural network outputs labels, i.e. classes, for the input training samples, i.e. it generates a second plurality of classification outputs);
computing a knowledge distillation loss between the first plurality of classification outputs and the second plurality of classification outputs (the method of Wierzynski, which computes a gradient vector to reduce error for adjusting the weights of a neural network with backpropagation, now adapted to compute a knowledge distillation loss between the first plurality of classification outputs and the second plurality of classification outputs as taught by Masataki in Spec. page 3, [0017] as detailed above); and 
using backpropagation on the second neural network by the knowledge distillation loss to update parameters for the second neural network (the teachings of Wierzynski in the Spec. page 4, [0046] for using backpropagation in the training of neural networks to adjust weights, i.e. parameters, now adapted to use the knowledge distillation loss calculated to update parameters for the second neural network as in Masataki Spec. page 3, [0017]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wierzynski to incorporate the teachings of Masataki to provide the features of claim 5. Masataki is considered to be analogous to Wierzynski as both disclosures are directed to the distillation of knowledge from a teacher model to a student model. Wierzynski recognizes the need to fine tune a network by adjusting the parameters to reduce error and teaches backpropagation to accomplish this (Spec. page 4, [0045], lines 13-15, [0046]); however, Wierzynski is not explicit about applying this to fine tune the student network. Similarly, Masataki teaches the use of backpropagation to adjust a network and applies this process to a second neural network acting as a student for knowledge distillation (Spec. page 12, [0146]). Therefore, it would have been obvious to combine the features of both disclosures to improve the performance of the neural networks by computing a knowledge distillation loss between the first plurality of classification outputs and the second plurality of classification outputs, and using backpropagation on the second neural network by the knowledge distillation loss to update parameters for the second neural network.
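The weighted-sum loss and parameter update discussed above may be illustrated by the following minimal sketch. It assumes a softmax student output, a mixing weight `alpha`, and a learning rate `lr`, none of which are specified in the cited passages; for brevity the "parameters" being updated are the student logits themselves rather than network weights. It is one plausible reading, not Masataki's implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, predicted, eps=1e-12):
    """Cross entropy between a target distribution and a predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


def distillation_loss(teacher_probs, student_logits, one_hot, alpha=0.5):
    """Weighted sum of a teacher/student cross entropy and a label-based loss.

    alpha is a hypothetical mixing weight (assumption).
    """
    p = softmax(student_logits)
    soft = cross_entropy(teacher_probs, p)   # teacher output vs. student output
    hard = cross_entropy(one_hot, p)         # correct label vs. student output
    return alpha * soft + (1.0 - alpha) * hard


def gradient_step(teacher_probs, student_logits, one_hot, alpha=0.5, lr=1.0):
    """One gradient-descent step on the student logits.

    For softmax outputs and targets that each sum to 1, the gradient of the
    weighted loss with respect to the logits is softmax(z) minus the mixed
    target (ignoring the small eps term).
    """
    p = softmax(student_logits)
    mixed = [alpha * t + (1.0 - alpha) * y for t, y in zip(teacher_probs, one_hot)]
    return [z - lr * (pi - mi) for z, pi, mi in zip(student_logits, p, mixed)]
```

In a full network the same gradient would be propagated backward through the layers to update the weights; the single-step update here only shows that following the gradient reduces the weighted loss.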

Regarding claim 6, the combination of Wierzynski and Masataki further teaches the method of claim 5, further comprising: 
generating one or more additional classification outputs by feeding the one or more out-of-distribution training samples to the second neural network (Wierzynski, Spec. page 6, [0073], lines 1-3; the second neural network may be trained additionally with the third training set which has the out-of-distribution training sample as described above. [0076]; a new out-of-distribution class is added to account for the new sample. As the second neural network outputs a second plurality of classification outputs when given the input training samples, it must also output additional classification outputs when given out-of-distribution training samples in the third training data); 
computing a loss metric between the one or more additional classification outputs and a classification distribution corresponding to the added out-of-distribution class (Wierzynski, now adapted such that after generating one or more additional classification outputs, the method includes calculating a loss metric according to Masataki, Spec. page 3, [0017]; the system calculates a second loss function from correct information corresponding to the learning data used to train the models and from the second output probability distribution); and 
incorporating the loss metric into the knowledge distillation loss (the second loss function is used to calculate the weighted sum, i.e. the knowledge distillation loss, as detailed above in Masataki, Spec. page 3, [0017]).
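The limitation of claim 6, in which a loss on the out-of-distribution samples is folded into the knowledge distillation loss, may be sketched as below. The sketch follows the claim language rather than any formula in Masataki or Wierzynski, and it assumes that the added class occupies the last output position, that teacher outputs are padded with zero probability for that class, and that a hypothetical weight `beta` scales the added term.

```python
import math


def cross_entropy(target, predicted, eps=1e-12):
    """Cross entropy between a target distribution and a predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


def combined_loss(in_dist, ood_outputs, beta=1.0):
    """Knowledge distillation loss plus an out-of-distribution loss term.

    in_dist:     list of (teacher_probs over K classes, student_probs over K+1 classes)
    ood_outputs: list of student_probs over K+1 classes for out-of-distribution samples
    beta:        hypothetical weight on the out-of-distribution term (assumption)
    """
    # Teacher outputs are padded with zero mass for the added class (assumption).
    kd = sum(cross_entropy(list(t) + [0.0], s) for t, s in in_dist)
    ood = 0.0
    for s in ood_outputs:
        # Target distribution for the added class: all mass on the last index.
        target = [0.0] * (len(s) - 1) + [1.0]
        ood += cross_entropy(target, s)
    return kd + beta * ood
```

A student that assigns out-of-distribution samples to the added class incurs almost no additional loss, while one that assigns them to an in-distribution class is penalized heavily.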

Regarding claim 15, the claim is directed to the system of claim 11 detailed above corresponding to the claimed method of claim 5 and is rejected under the same grounds.

Regarding claim 16, the claim is directed to the system of claim 11 detailed above corresponding to the claimed method of claim 6 and is rejected under the same grounds.

Claims 7 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Wierzynski in view of Yim, J., Joo, D., Bae, J. and Kim, J., 2017. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4133-4141), hereinafter Yim.
Regarding claim 7, Wierzynski teaches the method of claim 1 as stated above, wherein the generating one or more out-of-distribution training samples from the plurality of training samples comprises: 
identifying one or more elements within an in-distribution training sample that are relevant to in-distribution classification (Spec. page 7, [0077]; model type is identified as an element of the in-distribution training sample that is relevant to in-distribution classification). 
However, Wierzynski does not explicitly teach generating the one or more out-of-distribution training samples by replacing the one or more elements from the in-distribution training sample with one or more random elements.
Yim discloses a technique for knowledge distillation from a pretrained deep neural network to a student deep neural network. Yim further discloses the creation of training samples, i.e. out-of-distribution samples, by randomly cropping 32 x 32 pixel images; that is, an element of the in-distribution samples, namely the image boundaries, was randomly replaced (Page 4137, Sect. 4.1.1 CIFAR-10, paragraph 1).
Adapting Wierzynski’s method for knowledge distillation to use the technique for generating out-of-distribution training samples of Yim discloses the method of claim 1, wherein the generating one or more out-of-distribution training samples from the plurality of training samples comprises: 
identifying one or more elements within an in-distribution training sample that are relevant to in-distribution classification (Wierzynski, Spec. page 7, [0077]; model type is identified as an element of the in-distribution training sample that is relevant to in-distribution classification); and 
generating the one or more out-of-distribution training samples by replacing the one or more elements from the in-distribution training sample with one or more random elements (the method of Wierzynski for generating out-of-distribution training samples as detailed above, now adapted to use the technique of Yim as detailed above to replace an element of the in-distribution training sample with one or more random elements).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wierzynski to incorporate the teachings of Yim to provide the features of claim 7. Yim is considered to be analogous to Wierzynski as both disclosures are directed to the distillation of knowledge from a teacher model to a student model. Wierzynski recognizes that it is desirable to augment training sets for incremental learning (Spec. page 6, [0069-70]). Yim provides a method of sample augmentation for training. Therefore, it would have been obvious to combine the features of both disclosures to augment in-distribution training samples to generate out-of-distribution training samples. 
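The claimed replacement of class-relevant elements with random elements may be illustrated by the following minimal sketch. It assumes text samples represented as token lists, a caller-supplied set of relevant tokens, and a hypothetical replacement vocabulary; neither Wierzynski nor Yim describes this token-level procedure, and Yim's disclosure concerns image cropping.

```python
import random


def make_ood_sample(tokens, relevant, vocabulary, seed=0):
    """Replace every class-relevant token with a randomly drawn vocabulary token.

    tokens:     the in-distribution sample as a list of tokens
    relevant:   set of tokens deemed relevant to in-distribution classification
    vocabulary: hypothetical pool of random replacement tokens (assumption);
                it should not itself contain any relevant token
    """
    rng = random.Random(seed)
    return [rng.choice(vocabulary) if tok in relevant else tok for tok in tokens]
```

For example, with `relevant = {"flight", "paris"}` the sample `["book", "a", "flight", "to", "paris"]` keeps its first, second, and fourth tokens and receives random replacements in the third and fifth positions.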

Regarding claim 17, the claim is directed to the system of claim 11 detailed above corresponding to the claimed method of claim 7 and is rejected under the same grounds.


Claims 8 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Wierzynski in view of Sau BB, Balasubramanian VN. Deep model compression: Distilling knowledge from noisy teachers. arXiv preprint arXiv:1610.09650. 2016 Oct 30., hereinafter Sau.
Regarding claim 8, Wierzynski teaches the method of claim 1 as detailed above. However, Wierzynski does not teach wherein the training the second neural network defined with the second set of classes further comprises: 
preprocessing the plurality of training samples or the one or more out-of-distribution training samples by adding a Gaussian noise component before feeding the plurality of training samples or the one or more out-of-distribution training samples to the second neural network.
Sau discloses a method for including a noise-based regularizer while training the student from the teacher in knowledge distillation to improve the performance of the student network (Abstract). Sau further discloses the addition of Gaussian noise to the outputs of the teacher model before inputting to the student network (Page 3, Sect. 3.3. ‘Noisy Teachers’: Student Learning using Logit Perturbation, paragraph 1, lines 10-13; paragraph 2). 
Adapting Wierzynski to incorporate the teachings of Sau provides the method of claim 1 (as taught by Wierzynski, detailed above), wherein the training the second neural network defined with the second set of classes further comprises: 
preprocessing the plurality of training samples or the one or more out-of-distribution training samples by adding a Gaussian noise component before feeding the plurality of training samples or the one or more out-of-distribution training samples to the second neural network (the method of Wierzynski for feeding the plurality of training samples or the one or more out-of-distribution training samples to the second neural network, now adapted to first preprocess the samples by adding Gaussian noise as taught by Sau on Page 3, Sect. 3.3. ‘Noisy Teachers’: Student Learning using Logit Perturbation, paragraph 1, lines 10-13; paragraph 2).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wierzynski to incorporate the teachings of Sau to provide the features of claim 8. Sau is considered to be analogous to Wierzynski as both disclosures are directed to the distillation of knowledge from a teacher model to a student model. Sau further discloses that it is well-known in the art to add noise to input for training models to regularize the model for improved performance (Page 4, Sect. 3.4. Equivalence to Noise-Based Regularization, paragraph 1). Therefore, it would have been obvious to combine the features of both disclosures to preprocess the plurality of training samples or the one or more out-of-distribution training samples by adding Gaussian noise to the input samples before training the second neural network to achieve the well-known improvements of noise-based regularization.
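The preprocessing step recited in claim 8 may be illustrated by the following minimal sketch, which adds zero-mean Gaussian noise to each feature of a numeric input sample before it is fed to the second neural network. The function name, the standard deviation `sigma`, and the optional `seed` are assumptions made for illustration and do not appear in Sau or Wierzynski.

```python
import random


def add_gaussian_noise(sample, sigma=0.1, seed=None):
    """Return a copy of a numeric sample with zero-mean Gaussian noise added.

    sigma is a hypothetical noise standard deviation (assumption); passing a
    seed makes the perturbation reproducible.
    """
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, sigma) for x in sample]
```

The noisy copy, rather than the original sample, would then be supplied as input during training of the second neural network.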

Regarding claim 18, the claim is directed to the system of claim 11 detailed above corresponding to the claimed method of claim 8 and is rejected under the same grounds.

Claims 9 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Wierzynski in view of Lezama J, Qiu Q, Musé P, Sapiro G. Ole: Orthogonal low-rank embedding-a plug and play geometric loss for deep learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2018 (pp. 8109-8118)., hereinafter Lezama.
Regarding claim 9, Wierzynski teaches the method of claim 1 as detailed above. However, Wierzynski does not teach wherein the training the second neural network defined with the second set of classes further comprises:27Attorney Docket No. 70689.96US01salesforce.com, inc. Reference No. A4508US 
generating a number of reference class vectors corresponding to the first set of pre-defined classes; and 
determining whether an input sample belongs to the added out-of-distribution class based on whether a vector representation of the input sample is orthogonal to the number of reference class vectors.
Lezama discloses a method to enforce intra-class similarity and inter-class margin of learned deep representations for deep neural networks. Lezama further discloses for each class, the deep features are collapsed into a learned linear subspace, or union of them, and inter-class subspaces are pushed to be as orthogonal as possible to improve classification performance (Abstract).
Adapting Wierzynski to incorporate the teachings of Lezama provides the method of claim 1, wherein the training the second neural network defined with the second set of classes further comprises: 
generating a number of reference class vectors corresponding to the first set of pre-defined classes (the training of the second neural network of Wierzynski, now adapted to operate as taught in: Lezama last paragraph of page 8109 into 8110; the decision boundary for the softmax loss is determined by the angle between the feature vector and the vectors corresponding to each class, i.e. reference class vectors corresponding to the first set of pre-defined classes, in the last linear classifier); and 
determining whether an input sample belongs to the added out-of-distribution class based on whether a vector representation of the input sample is orthogonal to the number of reference class vectors (the training of the second neural network of Wierzynski, now adapted to operate as taught in: Lezama last paragraph of page 8109 into 8110; the features are embedded into orthogonal, low-dimensional linear subspaces, aligned with the classifier vector of each class, therefore the determination of whether a feature vector of the input is in a class is determined by the orthogonality of the feature vector to the classifier vector).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wierzynski to incorporate the teachings of Lezama to provide the features of claim 9. Lezama is considered to be analogous to Wierzynski as both disclosures are directed to deep learning neural networks. Lezama further discloses that the proposed method affords improvement of the model at detecting novel classes outside the training set (page 8110, final paragraph of Sect. 1. Introduction), an advantage which would improve the performance of the models in Wierzynski for training the second neural network on the third training set.
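The orthogonality-based determination recited in claim 9 may be illustrated by the following minimal sketch. It assumes nonzero feature and reference vectors and a hypothetical tolerance `tol` on the absolute cosine similarity; the test is a plain reading of the claim language and is not an implementation of Lezama's loss.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def is_out_of_distribution(feature, reference_vectors, tol=0.1):
    """Flag a sample whose feature vector is near-orthogonal to every class vector.

    tol is a hypothetical threshold on |cosine| (assumption).
    """
    return all(abs(cosine(feature, ref)) <= tol for ref in reference_vectors)
```

With reference vectors along the first two coordinate axes, a feature vector along the third axis is flagged as out-of-distribution, while one aligned with either reference vector is not.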

Regarding claim 19, the claim is directed to the system of claim 11 detailed above corresponding to the claimed method of claim 9 and is rejected under the same grounds.

Claims 10 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Wierzynski in view of Dighe P, Asaei A, Bourlard H. Low-rank and sparse subspace modeling of speech for DNN based acoustic modeling. Speech Communication. 2019 May 1;109:34-45., hereinafter Dighe.
Regarding claim 10, Wierzynski teaches the method of claim 1 as detailed above. However, Wierzynski does not teach wherein the training the second neural network defined with the second set of classes further comprises: 
training the second neural network using the plurality of training samples having a first feature dimension; 
in response to receiving an input sample having the first feature dimension, using a Gaussian distribution based sparsification vector to reduce the first feature dimension to a second feature dimension; and 
generating, via the second neural network, an output based on the input sample having the reduced second feature dimension.
Dighe is directed to improving acoustic modeling for automatic speech recognition by investigating the use of low-rank and sparse modeling approaches to model senone subspaces in deep neural network (DNN) posteriors. Dighe further discloses training a student deep neural network (DNN) with training data acoustic features as input and corresponding enhanced DNN posteriors as soft targets (page 39, left column, paragraph 1), the enhanced DNN posteriors being DNN posteriors enhanced by the use of sparse modeling to perform dimension reduction (page 35, left column, paragraph 1). 
Adapting Wierzynski to incorporate the teachings of Dighe provides the method of claim 1 (as taught by Wierzynski, detailed above), wherein the training the second neural network defined with the second set of classes further comprises: 
training the second neural network using the plurality of training samples having a first feature dimension (training the second neural network with the training samples as taught by Wierzynski, detailed above); 
in response to receiving an input sample having the first feature dimension, using a Gaussian distribution based sparsification vector to reduce the first feature dimension to a second feature dimension (the training method of Wierzynski, now adapted such that the input samples are reduced to a low dimension with sparse modeling, i.e. the enhanced DNN posteriors, as taught by Dighe); and 
generating, via the second neural network, an output based on the input sample having the reduced second feature dimension (the second neural network of Wierzynski generating output, now adapted to do so using the reduced second feature dimension, the enhanced DNN posteriors, as taught by Dighe).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wierzynski to incorporate the teachings of Dighe to provide the features of claim 10. Dighe is considered to be analogous to Wierzynski as both disclosures are directed to the distillation of knowledge from a larger model to a student model. Wierzynski teaches that the second neural network may be smaller than the first neural network (Spec. page 6, [0072], lines 11-15). Dighe discloses that low-dimensionality and sparse modeling is used to compress a model to a smaller footprint implementation (page 35, Sect. 1.3. Prior research, paragraph 1, lines 1-3). Therefore, it would have been obvious to combine the features of both disclosures to reduce the first feature dimension to a second feature dimension and generate with the second neural network an output based on the input sample having the reduced second feature dimension to compress the second model. 
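The dimension-reduction step recited in claim 10 may be illustrated by the following minimal sketch, which reads the claimed "Gaussian distribution based sparsification vector" as a random projection with Gaussian-distributed entries. That reading is an assumption; the function names, the scaling factor, and the `seed` parameter are hypothetical, and the sketch does not reproduce the sparse modeling of DNN posteriors described in Dighe.

```python
import random


def gaussian_projection(d_in, d_out, seed=0):
    """Build a d_out x d_in matrix of zero-mean Gaussian entries.

    The 1/sqrt(d_out) standard deviation is a common scaling for random
    projections and is an assumption here.
    """
    rng = random.Random(seed)
    scale = 1.0 / (d_out ** 0.5)
    return [[rng.gauss(0.0, scale) for _ in range(d_in)] for _ in range(d_out)]


def reduce_dimension(sample, matrix):
    """Project a sample from the first feature dimension to the second."""
    return [sum(w * x for w, x in zip(row, sample)) for row in matrix]
```

An input sample of dimension 8 projected with a 3 x 8 matrix yields a 3-dimensional vector, which would then be passed to the second neural network to generate an output.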

Regarding claim 20, the claim is directed to the system of claim 11 detailed above corresponding to the claimed method of claim 10 and is rejected under the same grounds.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Kang et al. (Pub No. US 20170083829 A1) discloses a model training method including selecting a teacher model from a plurality of teacher models and training a student model based on the output of the selected teacher model (Abstract).
Sawada et al. (Pub. No. US 20180025271 A1) discloses a learning apparatus which (a) obtains a first neural network that has learned by using source learning data and obtains target learning data, the target learning data including a plurality of first data items each of which is given a first label and a plurality of second data items each of which is given a second label, (b) obtains a plurality of first output vectors by inputting the plurality of first data items to a second neural network and obtains a plurality of second output vectors by inputting the plurality of second data items to the second neural network, and (c) generates a first relation vector corresponding to the first label by using the plurality of first output vectors and generates a second relation vector corresponding to the second label by using the plurality of second output vectors (Abstract).
Choi et al. (Pub. No. US 20180268292 A1) discloses a computer-implemented method for training fast models with knowledge transfer by learning a student model from a teacher model with a weighted cross-entropy layer for classification (Abstract).
 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to PARKER L MAYFIELD whose telephone number is (571)272-4745. The examiner can normally be reached Monday - Friday 7:30 AM-5:30 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders, can be reached on (571)272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/PARKER L MAYFIELD/
Examiner
Art Unit 2655



/ANDREW C FLANDERS/
Supervisory Patent Examiner, Art Unit 2655