DETAILED ACTION
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-3, 8, 12-14, 18 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Kurata et al. (US 20210082399) in view of Tulyakov et al. (US Patent 10963748 B1).
Regarding claim 1, Kurata discloses a method for compressing a neural network model (abstract; ¶32-33 & ¶72-73), comprising: 
obtaining a set of training samples including a plurality of pairs of training samples, each pair of the training samples including source data and target data corresponding to the source data (¶35-36 training data store 120 that stores the collection of the training data, each of which includes speech data and a corresponding transcription; the training data is given as an input-output pair of an input sequence of observations and an output sequence of symbols where the observations are the acoustic features and the symbols are the phones or words);
training an original teacher model by using the source data as an input and using the target data as verification data (¶40-41 the guided CTC training module 110 can include a guiding model training submodule 112 for obtaining/generating the guiding CTC model 130);  
training one or more intermediate teacher models based on the set of training samples and the original teacher model, the one or more intermediate teacher models forming a set of teacher models (¶40-41 a guided model training submodule 114 for training one or more CTC models under the guidance of the guiding CTC model 130; the CTC model that has been trained under the guidance of the guiding CTC model 130 is called a guided CTC model 140); 
training multiple candidate student models based on the set of training samples, the original teacher model, and the set of teacher models, the multiple candidate student models forming a set of student models (¶76 there is further a knowledge distillation module 150 that performs the knowledge distillation to train a student CTC model 160 by combining results originating from the different guided CTC models 140-1~140-n).
While Kurata teaches training a student model as disclosed above, Kurata fails to specifically teach training multiple candidate student models based on the set of training samples, the original teacher model, and the set of teacher models, the multiple candidate student models forming a set of student models, and selecting a candidate student model of the multiple candidate student models as a target student model according to training results of the multiple candidate student models. 
Tulyakov teaches training multiple candidate student models based on the set of training samples, the original teacher model, and the set of teacher models, the multiple candidate student models forming a set of student models (col 11 lines 45-50 the training engine 615 trains a student generative neural network using a training network; col 13 lines 34-43 Each of the student neural networks may be trained to apply different image effects using the training network 900 discussed above); and 
selecting a candidate student model of the multiple candidate student models as a target student model according to training results of the multiple candidate student models (col 13 lines 55-59 At operation 1020, the activation engine 625 activates one of the student neural networks associated with the modification instruction received at operation 1015. For example, assuming the user 1100 selects "B1", a student GNN associated with the "B1" button is activated). 


Regarding claim 2, the combination of Kurata and Tulyakov teaches the method of claim 1, wherein a number of model parameters of any of the intermediate teacher models is less than that of the original teacher model (Kurata ¶41 guiding model training submodule 112 is configured to train a CTC model having an architecture with a set of training data in the training data store 120 to obtain/generate the guiding CTC model 130. The guided model training submodule 114 is configured to train one or more CTC models having respective architectures with respective sets of training data in the training data store 120 under the guidance of the guiding CTC model 130 to generate the guided CTC models 140-1~140-n; ¶53 guiding model training submodule 112 trains a CTC model having a target architecture to obtain the guiding CTC model 130 by minimizing the CTC Loss L_CTC).

Regarding claim 3, the combination of Kurata and Tulyakov teaches the method of claim 2, wherein the training one or more intermediate teacher models based on the set of training samples and the original teacher model comprises: training each of the intermediate teacher models to be trained by using the source data as the input and using pseudo target data output by a complex teacher model as the verification data (Kurata ¶35-36 training data store 120 that stores the collection of the training data; the training data is given as an input-output pair of an input sequence; ¶41-42 guiding model training submodule 112 is configured to train a CTC model having an architecture with a set of training data in the training data store 120 to obtain/generate the guiding CTC model 130), the complex teacher model being one of the original teacher model that has been trained (Kurata ¶41-42 the guided CTC training module 110 can include a guiding model training submodule 112 for obtaining/generating the guiding CTC model 130, and a guided model training submodule 114 for training one or more CTC models under the guidance of the guiding CTC model 130; guiding model training submodule 112 is configured to train a CTC model having an architecture with a set of training data in the training data store 120 to obtain/generate the guiding CTC model 130), or another intermediate teacher model that has been trained and of which a number of model parameters is greater than that of an intermediate teacher model currently under training.

Regarding claim 8, the combination of Kurata and Tulyakov teaches the method of claim 2, wherein the number of model parameters of any of the intermediate teacher models being less than that of the original teacher model comprises a number of model layers of any of the intermediate teacher models being less than that of the original teacher model (Kurata ¶44 Even if the architectures of the guiding CTC model 130 and the guided CTC models 140-1~140-n are the same, specific configurations of the neural network such as the number of hidden layers in the neural network and the number of units in each layer may be the same or different from each other; the size and/or complexity of the guiding CTC model 130 may be different from the guided CTC models 140-1~140-n).

Regarding claim(s) 12-14 and 18 (drawn to a device):               
The rejection/proposed combination of Kurata and Tulyakov, explained in the rejection of method claim(s) 1-3 and 8, anticipates/renders obvious the steps of the device of claim(s) 12-14 and 18 because these steps occur in the operation of the proposed combination as discussed above. Thus, the arguments similar to that presented above for claim(s) 1-3 and 8 is/are equally applicable to claim(s) 12-14 and 18.

Regarding claim(s) 20 (drawn to a CRM):               
The rejection/proposed combination of Kurata and Tulyakov, explained in the rejection of method claim(s) 1, anticipates/renders obvious the steps of the computer readable medium of claim(s) 20 because these steps occur in the operation of the proposed combination as discussed above. Thus, the arguments similar to that presented above for claim(s) 1 is/are equally applicable to claim(s) 20. See further Kurata ¶149. 

Claims 5, 7 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Kurata and Tulyakov as applied to claims 1-2 and 13 above, and further in view of Fukuda et al. (US 20200034703).
Regarding claim 5, the combination of Kurata and Tulyakov teaches the method of claim 2, but fails to specifically teach wherein the training multiple candidate student models based on the set of training samples, the original teacher model, and the set of teacher models comprises one of: determining multiple training paths, each of the multiple training paths corresponds to one of the multiple candidate student models and starts from the original teacher model and directly arrives at a corresponding candidate student model, and training the corresponding candidate student model on each of the training paths, in an order of models arranged on the training path; or determining multiple training paths, each of the multiple training paths corresponds to one of the multiple candidate student models and starts from the original teacher model, passes at least one of the intermediate teacher models and arrives at a corresponding candidate student model, and training the at least one of the intermediate teacher models and the corresponding candidate student model on each of the training paths, in an order of models arranged on the training path.
Fukuda teaches wherein the training multiple candidate student models based on the set of training samples, the original teacher model, and the set of teacher models comprises one of: determining multiple training paths, each of the multiple training paths corresponds to one of the multiple candidate student models and starts from the original teacher model and directly arrives at a corresponding candidate student model, and training the corresponding candidate student model on each of the training paths, in an order of models arranged on the training path (¶47-49 the student training section 150 may, at block 332, select a teacher neural network in a predetermined order in each iteration of block 320 to block 350. For example, the student training section 150, at block 332, may select a teacher neural network in an ascending order (such as Teacher Neural Network (or TNN) 1 at a first loop of block 332 to block 342, TNN2 at a second loop, TNN3 at a third loop) for each iteration of block 320 to block 350; a student training section 150 may train a student neural network with a teacher input data selected at the most recent iteration of block 320 and the corresponding soft label output of the teacher neural network selected at the most recent iteration of block 332. Thereby, the student training section 150 may train the student neural network, at block 340, with the teacher input data and each soft label output of the soft label outputs of the plurality of teacher neural networks during the iteration of block 320 to block 350); or (the examiner notes that "or" means one or the other; therefore, meeting one of the alternative limitations meets the entire claim language)
determining multiple training paths, each of the multiple training paths corresponds to one of the multiple candidate student models and starts from the original teacher model, passes at least one of the intermediate teacher models and arrives at a corresponding candidate student model, and training the at least one of the intermediate teacher models and the corresponding candidate student model on each of the training paths, in an order of models arranged on the training path.	


Regarding claim 7, the combination of Kurata and Tulyakov teaches the method of claim 1, but fails to teach wherein the selecting a candidate student model of the multiple candidate student models as the target student model according to training results of the multiple candidate student models comprises: testing accuracy of output results of the multiple candidate student models through a set of verification data, and selecting the target student model according to the accuracy.
	Fukuda teaches wherein the selecting a candidate student model of the multiple candidate student models as the target student model according to training results of the multiple candidate student models comprises: testing accuracy of output results of the multiple candidate student models through a set of verification data, and selecting the target student model according to the accuracy (¶53 the student training section 150 may train the student neural network, at block 340, with at least a correct test data corresponding to the input data in addition to the teacher input data and the soft label output from the selected teacher neural network; such that a sum of (A) the foregoing soft label errors and (B) hard label errors between (1) a soft label output from the student neural network in response to receiving the teacher input data (e.g., Input Data 1) and (2) the teacher correct data, is minimized; ¶55 student training section 150 may go back to block 332 when there is at least one teacher neural network that has not been selected in a pending iteration of block 320 to block 350).
	Therefore, it would have been obvious to one with ordinary skill in the art before the effective filing date of the invention to have implemented the teaching of wherein the selecting a candidate student model of the multiple candidate student models as the target student model according to training results of the multiple candidate student models comprises: testing accuracy of output results of the multiple candidate student models through a set of verification data, and selecting the target student model according to the accuracy from Fukuda into the method as disclosed by the combination of Kurata and Tulyakov. The motivation for doing this is to improve the accuracy of neural networks.
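
For illustration only, the limitation recited in claim 7 (testing the accuracy of each candidate student model on a set of verification data and selecting the target student model according to that accuracy) might be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Fukuda, Kurata, Tulyakov, or the application: the names `accuracy` and `select_target_student`, the exact-match accuracy measure, and the representation of a model as a plain callable are all hypothetical.

```python
from typing import Callable, Dict, Hashable, Sequence, Tuple

# Hypothetical representation: a "model" is any callable that maps
# source data to a predicted output.
Model = Callable[[Hashable], Hashable]
Pair = Tuple[Hashable, Hashable]


def accuracy(model: Model, verification: Sequence[Pair]) -> float:
    """Fraction of (source, target) verification pairs predicted exactly."""
    if not verification:
        raise ValueError("verification set must not be empty")
    correct = sum(1 for source, target in verification if model(source) == target)
    return correct / len(verification)


def select_target_student(
    candidates: Dict[str, Model], verification: Sequence[Pair]
) -> Tuple[str, float]:
    """Return the name and accuracy of the most accurate candidate."""
    if not candidates:
        raise ValueError("at least one candidate student model is required")
    scores = {name: accuracy(m, verification) for name, m in candidates.items()}
    best = max(scores, key=scores.get)  # ties resolve to the first-listed candidate
    return best, scores[best]


if __name__ == "__main__":
    # Toy verification data and toy candidates (assumptions for the demo).
    verification_data = [(1, 2), (2, 4), (3, 6)]
    students = {"student_a": lambda x: x * 2, "student_b": lambda x: x + 1}
    print(select_target_student(students, verification_data))
```

In this sketch the selection depends only on measured accuracy over the verification data; any other selection criterion (for example, model size) would be an additional assumption.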

	Regarding claim(s) 16 (drawn to a device):               
The rejection/proposed combination of Kurata, Tulyakov and Fukuda, explained in the rejection of method claim(s) 5, anticipates/renders obvious the steps of the device of claim(s) 16 because these steps occur in the operation of the proposed combination as discussed above. Thus, the arguments similar to that presented above for claim(s) 5 is/are equally applicable to claim(s) 16.
	
Claims 9-11 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Kurata and Tulyakov as applied to claims 1 and 12 above, and further in view of McCann et al. (US 20180349359).
Regarding claim 9, the combination of Kurata and Tulyakov teaches the method of claim 1, but fails to teach wherein the obtaining the set of training samples including the plurality of pairs of training samples comprises: obtaining a first language corpus as the source data, and obtaining a second language corpus having a same meaning as the first language corpus as the target data.
	McCann teaches wherein the obtaining the set of training samples including the plurality of pairs of training samples comprises: obtaining a first language corpus as the source data, and obtaining a second language corpus having a same meaning as the first language corpus as the target data (¶24 suppose an input word sequence provided to a computing device 100 includes the English word sequence "Let's go for a walk." The corresponding German word sequence is "Lass uns spazieren gehen." Computing device 100 uses this training data to generate and output context-specific word vectors or "context vectors" (CoVe) for the words or sequences of words in the first language).
Therefore, it would have been obvious to one with ordinary skill in the art before the effective filing date of the invention to have implemented the teaching of wherein the obtaining the set of training samples including the plurality of pairs of training samples comprises: obtaining a first language corpus as the source data, and obtaining a second language corpus having a same meaning as the first language corpus as the target data from McCann into the method as disclosed by the combination of Kurata and Tulyakov. The motivation for doing this is to improve the performance of natural language processing.

Regarding claim 10, the combination of Kurata, Tulyakov and McCann teaches the method of claim 9, wherein the training the original teacher model by using the source data as the input and using the target data as verification data comprises: segmenting the first language corpus and the second language corpus to obtain multiple first language words and multiple second language words, respectively (McCann ¶24 suppose an input word sequence provided to a computing device 100 includes the English word sequence "Let's go for a walk." The corresponding German word sequence is "Lass uns spazieren gehen." Computing device 100 uses this training data to generate and output context-specific word vectors or "context vectors" (CoVe) for the words or sequences of words in the first language); vectorizing the multiple first language words and the multiple second language words to correspond to multiple first language word vectors and multiple second language word vectors, respectively (McCann ¶37 the method 600 starts with a process 602. At process 602, word vectors 320a-c for a sequence of words in a first or source language w^x = [w^x_1, ..., w^x_n] (e.g., English--"Let's go for a walk") are input or provided to the encoder 310. And word vectors 540 for a sequence of words in a second or target language w^z = [w^z_1, ..., w^z_n] (e.g., German--"Lass uns spazieren gehen") are input or provided to the decoder 330); obtaining a first language corpus vector based on the first language word vectors through an encoder and an attention mechanism (McCann ¶39 the encoder processes the sequence of word vectors 320a-e to generate one or more new vectors 520a-e, each called a hidden vector; ¶41 an attention mechanism 560 looks back at the hidden vectors 520a-e in order to decide which word of the first language (e.g., English) sentence to translate next. The attention mechanism 560 computes a vector of attention weights a representing the relevance of each encoding time-step to the current decoder state); obtaining a second language corpus vector based on the second language word vectors through a decoder and the attention mechanism (McCann ¶37 And word vectors 540 for a sequence of words in a second or target language w^z = [w^z_1, ..., w^z_n] (e.g., German--"Lass uns spazieren gehen") are input or provided to the decoder 330; ¶42 the attention mechanism 560 uses the decoder state vector 550a to determine how important each hidden vector 520a-e is, and then produces the context-adjusted state 570 to record its observation); and training the original teacher model according to the first language corpus vector and the second language corpus vector (McCann ¶18 a neural network is taught how to understand words in context by training it on a first NLP task--e.g., teaching it how to translate from English to German. The trained network can then be reused in a new or other neural network that performs a second NLP task; pre-trained network's outputs--context-specific word vectors (CoVe)--are provided as inputs to new networks that learn other NLP tasks.).
Therefore, it would have been obvious to one with ordinary skill in the art before the effective filing date of the invention to have implemented the teaching of wherein the training the original teacher model by using the source data as the input and using the target data as verification data comprises: segmenting the first language corpus and the second language corpus to obtain multiple first language words and multiple second language words, respectively; vectorizing the multiple first language words and the multiple second language words to correspond to multiple first language word vectors and multiple second language word vectors, respectively; obtaining a first language corpus vector based on the first language word vectors through an encoder and an attention mechanism; obtaining a second language corpus vector based on the second language word vectors through a decoder and the attention mechanism; and training the original teacher model according to the first language corpus vector and the second language corpus vector from McCann into the method as disclosed by the combination of Kurata and Tulyakov. The motivation for doing this is to improve the performance of natural language processing.

Regarding claim 11, the combination of Kurata and Tulyakov teaches the neural network model of claim 1 but fails to teach a corpus translation method, comprising: obtaining a corpus; translating the corpus with a neural network model, and outputting a translation result, wherein the neural network model is a target student model obtained by the method for compressing the neural network model of claim 1.
McCann teaches a corpus translation method, comprising: obtaining a corpus (McCann ¶37 word vectors 320a-c for a sequence of words in a first or source language w^x = [w^x_1, ..., w^x_n] (e.g., English--"Let's go for a walk") are input or provided to the encoder 310); translating the corpus with a neural network model (McCann ¶43 generator 580 looks at the context-adjusted state 570 to determine the word in the second language (e.g., German) to output), and outputting a translation result, wherein the neural network model is a target student model obtained by the method for compressing the neural network model of claim 1.

Regarding claim(s) 19 (drawn to a device):               
The rejection/proposed combination of Kurata, Tulyakov and McCann, explained in the rejection of method claim(s) 10, anticipates/renders obvious the steps of the device of claim(s) 19 because these steps occur in the operation of the proposed combination as discussed above. Thus, the arguments similar to that presented above for claim(s) 10 is/are equally applicable to claim(s) 19.
	

Allowable Subject Matter
Claims 4, 6, 15 and 17 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Regarding claim 4, and similarly regarding claim 15, the prior art of record, alone or in combination, fails to teach at least “wherein the one or more intermediate teacher models in the set of teacher models are ranked in a descending order of numbers of model parameters thereof so that the number of model parameters of the intermediate teacher model in a subsequent rank is less than that of the intermediate teacher model in a preceding rank, and the complex teacher model used to train an 
Regarding claim 6, and similarly regarding claim 17, the prior art of record, alone or in combination, fails to teach at least “when the training path starts from the original teacher model and directly arrives at the corresponding candidate student model, training the corresponding candidate student model by using the source data as the input and pseudo target data output by the original teacher model that has been trained as the verification data for the corresponding candidate student model; and when the training path starts from the original teacher model, passes at least one of the intermediate teacher models and arrives at a corresponding candidate student model, training the respective intermediate teacher models by using the source data as the input and pseudo target data output by a preceding adjacent complex teacher model on the training path as the verification data, and training the corresponding candidate student model by using pseudo target data output by a preceding intermediate teacher model adjacent to the candidate student model on the training path as the verification data for the candidate student model, wherein the complex teacher model is one of the original teacher model that has been trained, or another intermediate teacher model that has been trained and has a number of model parameters that is greater than that of an intermediate teacher model currently under training.”
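
For illustration only, the training order recited in the quoted limitation (each model on a training path is trained on the source data paired with pseudo target data output by the already-trained model that precedes it on the path) might be sketched as follows. This is a minimal sketch under stated assumptions and is not drawn from the application or the prior art of record: `train_along_path` and `memorizing_trainer` are hypothetical names, a "trainer" is assumed to be any function that turns (source, target) pairs into a trained model, and the sketch does not model parameter counts.

```python
from typing import Callable, List, Sequence, Tuple

Model = Callable[[str], str]
Pair = Tuple[str, str]
# Hypothetical: a trainer consumes (source, target) pairs and returns a model.
Trainer = Callable[[Sequence[Pair]], Model]


def train_along_path(
    trained_teacher: Model, path_trainers: Sequence[Trainer], sources: Sequence[str]
) -> List[Model]:
    """Train every model on one path, in path order.

    path_trainers lists the models after the original teacher: zero or more
    intermediate teachers followed by the candidate student. A direct path
    therefore has exactly one trainer (the student).
    """
    if not path_trainers:
        raise ValueError("a training path must end in a candidate student")
    preceding = trained_teacher
    trained: List[Model] = []
    for trainer in path_trainers:
        # Pseudo target data: outputs of the preceding, already-trained model.
        pseudo_pairs = [(source, preceding(source)) for source in sources]
        model = trainer(pseudo_pairs)
        trained.append(model)
        preceding = model
    return trained


def memorizing_trainer(pairs: Sequence[Pair]) -> Model:
    """Toy trainer (assumption): the 'model' simply memorizes its training pairs."""
    table = dict(pairs)
    return lambda source: table[source]


if __name__ == "__main__":
    teacher = str.upper  # stand-in for a trained original teacher model
    data = ["a", "b"]
    direct = train_along_path(teacher, [memorizing_trainer], data)
    indirect = train_along_path(teacher, [memorizing_trainer, memorizing_trainer], data)
    print(direct[-1]("a"), indirect[-1]("b"))
```

The last element of the returned list is the candidate student for that path; the earlier elements, if any, are the intermediate teachers trained along the way.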

Response to Arguments
Applicant's arguments filed 12/08/2021 have been fully considered but they are not persuasive.
Regarding claim 1, the applicant argues that the prior art of record does not teach “training multiple candidate student models based on the set of training samples, the original teacher model, and the set of teacher models, the multiple candidate student models forming a set of student models and selecting a candidate student model of the multiple candidate student models as a target student model according to training results of the multiple candidate student models”.

That is, Tulyakov teaches training multiple student models (col 13 lines 34-43 Each of the student neural networks may be trained) based on the set of training samples (col 11 lines 28-50 student training data), the original teacher model (col 11 lines 28-50 the training engine 615), and the set of teacher models (col 11 lines 28-50 teacher GNN).
Tulyakov then teaches selecting a candidate student model of the multiple candidate student models as a target student model according to training results of the multiple candidate student models (col 13 lines 55-59 At operation 1020, the activation engine 625 activates one of the student neural networks associated with the modification instruction received at operation 1015. For example, assuming the user 1100 selects "B1", a student GNN associated with the "B1" button is activated). While the applicant argues that the activation of the student network is associated with the modification instruction received, the examiner notes that the claim does not preclude this. Rather, the claim requires the student model to be selected according to training results of the multiple candidate student models. Therefore, based on the training results, a student GNN is selected.
Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to KEVIN KY whose telephone number is (571)272-7648. The examiner can normally be reached Monday-Friday 9-5PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Chan Park, can be reached on 571-272-7409. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional 





/KEVIN KY/               Primary Examiner, Art Unit 2669