Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Bae (US 20200097850 A1) in view of Chen (US 20200334520 A1) and Wang (US 20200134506 A1).
With respect to claims 1, 8, and 15, Bae teaches a method/system/non-transitory machine-readable medium for transfer of knowledge from a teacher model to a student model, the method comprising:
a memory storing machine executable code ([0062] As described above, the present invention may be implemented in an aspect of an apparatus or a method, and in particular, the function or process of each component in the embodiments of the present invention may be implemented as a hardware element comprising at least one of a DSP (digital signal processor), a processor, a controller, an ASIC (application specific IC), a programmable logic device (such as an FPGA, etc.), other electronic devices and a combination thereof. It is also possible to implement in combination with a hardware element or as independent software, and such software may be stored in a computer-readable recording medium.); and
one or more processors  coupled to the memory and configurable to execute the machine executable code to cause the one or more processors ([0062] As described above, the present invention may be implemented in an aspect of an apparatus or a method, and in particular, the function or process of each component in the embodiments of the present invention may be implemented as a hardware element comprising at least one of a DSP (digital signal processor), a processor, a controller, an ASIC (application specific IC), a programmable logic device (such as an FPGA, etc.), other electronic devices and a combination thereof. It is also possible to implement in combination with a hardware element or as independent software, and such software may be stored in a computer-readable recording medium.) to:
initializing one or more shared layers of the teacher model ([0054] The sequential learning method for multiple pieces of hint information is a method for sequentially forwarding hint information one by one from the lowest layer to the highest layer for the L multi-layer pairs selected in the same way as in FIG. 8A. In this method, first, learning is performed such that the Euclidean loss function for the output results of the feature maps (323-1; 343-1) between the teacher model 32 and the student model 34 at the lowest layer, i.e., layer 1 (321-1; 341-1) as shown in FIG. 8B, and learning variables are saved. Next, after loading the saved learning variables as they are, the learning variables from layer 1 (321-1; 341-1) to layer 2 (321-2; 341-2) are randomly initialized, and then, learning is performed such that the Euclidean loss function for the output results of the feature maps (323-2; 343-2) between the teacher model 32 and the student model 34 at the next higher layer 2 (321-2; 341-2) as shown in FIG. 8C, and learning variables are saved. Then, after loading the saved learning variables as they are and randomly initializing the remaining learning variables up to the next higher layer 3 (not shown), the above sequential procedures are repeated until the highest layer L (321-L; 341-L) is reached.); 
refining multiple task layers of the teacher model, each task layer capable of performing a respective task ([0017] The disclosed implementations provide machine learning models that can perform multiple different tasks, referred to herein as “multi-task” models. For example, as discussed more below, a neural network can have different task-specific layers that perform task-specific operations. [0059] Method 800 begins at block 802, where candidate teacher instances of a model can be trained [training refines tasks]. For example, the candidate teacher instances can be different instances of a multi-task machine learning model such as shown above in FIGS. 1 and/or 3.); 
randomly initializing parameters of the student model ([0054] The sequential learning method for multiple pieces of hint information is a method for sequentially forwarding hint information one by one from the lowest layer to the highest layer for the L multi-layer pairs selected in the same way as in FIG. 8A. In this method, first, learning is performed such that the Euclidean loss function for the output results of the feature maps (323-1; 343-1) between the teacher model 32 and the student model 34 at the lowest layer, i.e., layer 1 (321-1; 341-1) as shown in FIG. 8B, and learning variables are saved. Next, after loading the saved learning variables as they are, the learning variables from layer 1 (321-1; 341-1) to layer 2 (321-2; 341-2) are randomly initialized, and then, learning is performed such that the Euclidean loss function for the output results of the feature maps (323-2; 343-2) between the teacher model 32 and the student model 34 at the next higher layer 2 (321-2; 341-2) as shown in FIG. 8C, and learning variables are saved. Then, after loading the saved learning variables as they are and randomly initializing the remaining learning variables up to the next higher layer 3 (not shown), the above sequential procedures are repeated until the highest layer L (321-L; 341-L) is reached.); 
Bae does not explicitly recite, but Chen teaches separating data for the multiple tasks of the teacher model into a plurality of batches, wherein each batch is specific to a respective task ([0042] In each epoch, a mini-batch bt of labeled task-specific data is selected, and the multi-task machine learning model is updated according to the task-specific objective for that task t. This can approximately optimize the sum of the multi-task objectives across the different tasks performed by the multi-task machine learning model.);
for each task-specific batch: predicting[[ logits]] from the teacher model ([0042] In each epoch, a mini-batch bt of labeled task-specific data is selected, and the multi-task machine learning model is updated according to the task-specific objective for that task t. This can approximately optimize the sum of the multi-task objectives across the different tasks performed by the multi-task machine learning model… and [0062] Method 800 continues at block 808, where task-specific outputs are obtained for each of the selected teacher instances. For example, each selected teacher instance for a given task can be used to process additional labeled data for that task. The selected teacher instances can output different values representing assessments of each labeled data instance, e.g., the probabilities of each possible label [prediction] )
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify Bae in view of Chen, in order to separate data for the multiple tasks of the teacher model into a plurality of batches, wherein each batch is specific to a respective task, to improve robustness ([0040], Chen).
Bae and Chen do not explicitly recite, but Wang teaches predicting logits from the teacher model ([0046] A difference between output of the teacher model and output of the student model may be indicated by a loss function. The common loss function includes: (1) Logit loss; (2) feature L2 loss; and (3) student model softmax loss.); and
updating the student model according to the predicted logits from the teacher model ([0035] In the method of training a student model [updating the student model] according to the embodiment of the present disclosure, the knowledge distillation is also deployed based on a difference between output of a teacher model and output of a student model, to train a small and quick student model, thereby forcing the student model to learn the expression capability of the teacher model. The method shown in FIG. 2 differs from the conventional method of training a student model shown in FIG. 1 in that, a variation Δ is added to input of the student model.).
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify Bae and Chen in view of Wang, in order to predict logits from the teacher model, to increase robustness of the trained student model ([0011], Wang).
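For illustration only, the sequence of steps mapped to claim 1 above (task-specific batches, teacher logits per batch, student update from those logits) can be sketched as follows. This is a minimal sketch under stated assumptions and is not the implementation of the application, Bae, Chen, or Wang: the linear task layers, the task names, the helper names (make_task_batches, predict_logits, update_student, max_logit_gap), the squared-error update rule, and the learning rate are all hypothetical choices made for the example.

```python
import random
from collections import defaultdict


def make_task_batches(examples, batch_size):
    """Split (task, features) pairs into batches that each hold one task only."""
    by_task = defaultdict(list)
    for task, features in examples:
        by_task[task].append(features)
    batches = []
    for task, rows in by_task.items():
        for start in range(0, len(rows), batch_size):
            batches.append((task, rows[start:start + batch_size]))
    return batches


def predict_logits(task_weights, features):
    """Return one raw score (logit) per output class for a linear task layer."""
    return [sum(w * x for w, x in zip(row, features)) for row in task_weights]


def update_student(student, teacher, task, batch, lr=0.1):
    """Nudge the student's logits toward the teacher's on one task-specific batch."""
    for features in batch:
        target = predict_logits(teacher[task], features)  # teacher logits
        pred = predict_logits(student[task], features)    # student logits
        for k, row in enumerate(student[task]):
            err = pred[k] - target[k]
            for j, x in enumerate(features):
                row[j] -= lr * err * x  # gradient step on 0.5 * err ** 2 (assumed rule)


def max_logit_gap(student, teacher, batches):
    """Largest absolute difference between student and teacher logits."""
    return max(
        abs(s - t)
        for task, batch in batches
        for features in batch
        for s, t in zip(predict_logits(student[task], features),
                        predict_logits(teacher[task], features))
    )


# Hypothetical teacher with two task layers (two classes, two input features).
teacher = {
    "sentiment": [[2.0, -1.0], [0.5, 1.5]],
    "similarity": [[-1.5, 2.0], [1.0, 3.0]],
}

# Student parameters start from random values.
rng = random.Random(0)
student = {task: [[rng.uniform(-1, 1) for _ in row] for row in rows]
           for task, rows in teacher.items()}

# Mixed-task data, which is then separated into task-specific batches.
examples = [("sentiment", [1.0, 0.0]), ("similarity", [0.0, 1.0]),
            ("sentiment", [0.0, 1.0]), ("similarity", [1.0, 0.0]),
            ("sentiment", [1.0, 1.0])]
batches = make_task_batches(examples, batch_size=2)

gap_before = max_logit_gap(student, teacher, batches)
for _ in range(200):
    for task, batch in batches:
        update_student(student, teacher, task, batch)
gap_after = max_logit_gap(student, teacher, batches)
```

In this sketch every batch carries a single task label, so the teacher's logits for a batch always come from the task layer matching that batch, and the student's matching task layer is the only one updated.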

With respect to claims 2, 9, and 16, Chen further teaches wherein the tasks comprise at least one natural language processing task ([0069] As discussed above, in a natural language processing context, the output layers can perform functions such as single or pairwise classification, text similarity, relevance ranking, etc. Generally, the knowledge distillation process can involve training these output layers using objective functions defined by the soft targets alone, and/or a combination of the soft targets and the hard targets, i.e., the correct target labels.).
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify Bae in view of Chen, in order for the tasks to comprise at least one natural language processing task, to improve robustness ([0040], Chen).

With respect to claims 3, 10, and 17, Chen further teaches wherein the tasks comprise at least one of a natural language inference task, a single sentence classification task, a sentiment classification task, a semantic text similarity task, and a relevance ranking task ([0069] As discussed above, in a natural language processing context, the output layers can perform functions such as single or pairwise classification, text similarity, relevance ranking, etc. Generally, the knowledge distillation process can involve training these output layers using objective functions defined by the soft targets alone, and/or a combination of the soft targets and the hard targets, i.e., the correct target labels.).
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify Bae in view of Chen, so that the tasks comprise at least one of a natural language inference task, a single sentence classification task, a sentiment classification task, a semantic text similarity task, and a relevance ranking task, to improve robustness ([0040], Chen).

With respect to claims 4, 11, and 18, Chen further teaches wherein at least one of the student model and the teacher model comprises a language representational model ([0059] Method 800 begins at block 802, where candidate teacher instances of a model can be trained. For example, the candidate teacher instances can be different instances of a multi-task machine learning model such as shown above in FIGS. 1 and/or 3. [0032] Multi-task natural language processing model 300 [300 can be a teacher model as explained in Fig. 3 and [0059]] can receive language input 302, which can include words, sentences, phrases, or other representations of language. The language inputs can be processed by shared layers 304, which include a lexicon encoder 304(1) and a transformer encoder 304(2). [0014] There are various types of machine learning frameworks that can be trained to perform a specific task. Support vector machines, decision trees, and neural networks are just a few examples of machine learning frameworks that have been used in a wide variety of applications, such as image processing and natural language processing.).
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify Bae in view of Chen, so that at least one of the student model and the teacher model comprises a language representational model, to improve robustness ([0040], Chen).

With respect to claims 5, 12, and 19, Chen further teaches wherein at least one of the student model and the teacher model comprises a transformer model natural language understanding neural network ([0059] Method 800 begins at block 802, where candidate teacher instances of a model can be trained. For example, the candidate teacher instances can be different instances of a multi-task machine learning model such as shown above in FIGS. 1 and/or 3. [0032] Multi-task natural language processing model 300 [300 can be a teacher model as explained in Fig. 3 and [0059]] can receive language input 302, which can include words, sentences, phrases, or other representations of language. The language inputs can be processed by shared layers 304, which include a lexicon encoder 304(1) and a transformer encoder 304(2). [0014] There are various types of machine learning frameworks that can be trained to perform a specific task. Support vector machines, decision trees, and neural networks are just a few examples of machine learning frameworks that have been used in a wide variety of applications, such as image processing and natural language processing.).
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify Bae in view of Chen, so that at least one of the student model and the teacher model comprises a transformer model natural language understanding neural network, to improve robustness ([0040], Chen).

With respect to claims 6, 13 and 20,  Chen further teaches wherein the student model comprises one or more shared layers and a plurality of task layers (Claim 15. A method performed on a computing device, the method comprising: evaluating candidate teacher instances of a multi-task machine learning model having one or more shared layers, a first task-specific layer that performs a first task, and a second task-specific layer that performs a second task; based at least on the evaluating, selecting one or more first teacher instances for the first task and one or more second teacher instances for the second task; and training a student instance of the multi-task machine learning model using first outputs of the one or more first teacher instances to train the first task-specific layer of the student instance and using second outputs of the one or more second teacher instances to train the second task-specific layer of the student instance.).  
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify Bae in view of Chen, so that the student model comprises one or more shared layers and a plurality of task layers, to improve robustness ([0040], Chen).
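For illustration only, a student model having one or more shared layers and a plurality of task layers, as recited in claims 6, 13, and 20, might be organized as sketched below. The class name MultiTaskModel, the layer sizes, the ReLU activation, and the task names are assumptions made for the example and are not taken from Chen or from the application.

```python
import random


class MultiTaskModel:
    """One shared layer feeding several task-specific output layers (illustrative)."""

    def __init__(self, n_inputs, n_hidden, task_classes, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            # Randomly initialized weights, one row per output unit.
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                    for _ in range(rows)]

        self.shared = rand_matrix(n_hidden, n_inputs)
        self.task_layers = {task: rand_matrix(n_classes, n_hidden)
                            for task, n_classes in task_classes.items()}

    @staticmethod
    def _matvec(matrix, vector):
        return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

    def encode(self, features):
        """Shared representation used by every task: linear map, then ReLU."""
        return [max(0.0, h) for h in self._matvec(self.shared, features)]

    def logits(self, task, features):
        """Task-specific logits computed on top of the shared representation."""
        return self._matvec(self.task_layers[task], self.encode(features))


# Hypothetical student with a two-class task layer and a single-score task layer.
student_model = MultiTaskModel(n_inputs=3, n_hidden=4,
                               task_classes={"classification": 2, "ranking": 1})
sample = [0.2, -0.1, 0.7]
classification_logits = student_model.logits("classification", sample)
ranking_logits = student_model.logits("ranking", sample)
```

Both task layers read the same output of encode, so only the final layer differs from task to task.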

Claims 7, 14, and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Bae, Chen, and Wang as applied to claim 1, in further view of Ng (US 20190355366 A1).

With respect to claims 7, 14, and 21, Bae, Chen, and Wang do not explicitly recite, but Ng teaches wherein updating the student model comprises minimizing a mean squared error between logits predicted by the student model against the predicted logits from the teacher model ([0056] In some examples, PLDA classifiers are trained based on the embedding vectors to derive speaker recognition power. The student network may be trained subject to a minimum mean square error between predicted targets generated by the student network and the embedding vectors output by the teacher network(s), e.g. S.U.V.(x.sub.p) or H.sub.rv(x.sub.p). [0057] Embedding vectors may have a higher entropy than softmax or final layer outputs. Embedding vectors are high-level quantitative representations of speaker characteristics. Training of the student network captures both the target speaker distribution and the generalization power. This is analogous to using logits before the softmax activation in a classification network.).
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify Bae, Chen, and Wang in view of Ng, so that updating the student model comprises minimizing a mean squared error between logits predicted by the student model against the predicted logits from the teacher model, to increase robustness of the trained student model ([0011], Ng).
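For illustration only, the mean squared error between student logits and teacher logits recited in claims 7, 14, and 21 can be written out as below. The function names mse_logit_loss and mse_logit_grad and the sample values are hypothetical; the sketch shows the general formula, not the specific training procedure of Ng or of the application.

```python
def mse_logit_loss(student_logits, teacher_logits):
    """Mean of squared differences between student and teacher logits."""
    if len(student_logits) != len(teacher_logits):
        raise ValueError("logit vectors must have the same length")
    n = len(student_logits)
    return sum((s - t) ** 2 for s, t in zip(student_logits, teacher_logits)) / n


def mse_logit_grad(student_logits, teacher_logits):
    """Gradient of the loss with respect to each student logit."""
    n = len(student_logits)
    return [2.0 * (s - t) / n for s, t in zip(student_logits, teacher_logits)]


# Hypothetical logits for a two-class task.
loss = mse_logit_loss([1.0, 3.0], [2.0, 1.0])  # ((-1)^2 + 2^2) / 2 = 2.5
grad = mse_logit_grad([1.0, 3.0], [2.0, 1.0])  # [-1.0, 2.0]
```

Minimizing this quantity drives each student logit toward the corresponding teacher logit, and the loss is zero only when the two logit vectors are identical.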

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ATHAR N PASHA, whose telephone number is (408)918-7675. The examiner can normally be reached Monday-Thursday and alternate Fridays, 7:30-4:30 PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel Washburn, can be reached at (571)272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.   Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/ATHAR N PASHA/Examiner, Art Unit 2657     


/DANIEL C WASHBURN/Supervisory Patent Examiner, Art Unit 2657