Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 102

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.


Claims 1-4, 6-8, and 12-20 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Kaushik et al (20210065699).

As per claim 1, Kaushik et al (20210065699) teaches a computing system that performs language model compression (as a phoneme recognizer model that is trained and updated for word spotting – para 0007; which is implemented for language modeling as well – para 0099), the computing system comprising: 
one or more processors; and one or more non-transitory computer-readable media that collectively store:
 a teacher language model, wherein a teacher vocabulary that contains a plurality of teacher sub-words is associated with the teacher language model (examiner notes that applicant's specification describes the teacher language model as the model having the larger vocabulary compared to the student language model – see applicant's specification, abstract, to start; Kaushik et al (20210065699), in turn, teaches the use of the TIMIT corpus, or an alternate corpus, as the 'base' dictionary to train the phoneme recognizer – para 0007), 
and wherein a plurality of teacher sub-word embeddings are respectively associated with the plurality of teacher sub-words (as the initial database is a phoneme (and hence sub-word) database – para 0007); 
a student language model, wherein a student vocabulary that contains a plurality of student sub-words is associated with the student language model (as, the phoneme recognizer model, fig. 2, subblock 212 being trainable by the input keywords, and then performing pronunciation augmentation/pruning – fig. 2, subblock 214, eventually leading to the keyword-adapted phoneme recognizer – fig. 2, subblock 220); 
wherein a plurality of student sub-word embeddings are respectively associated with the plurality of student sub-words, and wherein a number of student sub-words contained in the student vocabulary is less than a number of teacher sub-words contained in the teacher vocabulary (as, during the pruning process, the number of pronunciation sets is reduced and other phonemes are eliminated – para 0014 – hence, the number of phoneme pairs is less than that of the database/'teacher sub-words');
 and instructions that, when executed by the one or more processors, cause the computing system to perform operations, the operations comprising: obtaining a natural language training input (as using user input to train the phoneme recognizer model – para 0007); 
generating a first sub-word version of the natural language training input that comprises at least one of the teacher sub-word embeddings associated with the teacher vocabulary (as accessing the keyword dictionary, and hence the 'teacher' model, as mapped above – fig. 2, subblock 216) and at least one of the student sub-word embeddings associated with the student vocabulary (as the robust phoneme recognizer has a pruned subset – fig. 2, subblocks 220, 212); 
inputting the first sub-word version of the natural language training input into at least the teacher language model (as offline training of the larger/teacher model – para 0086; and, in the embodiment of handling negative samples, updating the teacher language model – para 0105; as well as updating the keyword dictionary from the prebuilt dictionary – fig. 8, subblocks 810, 814); 
receiving a teacher output generated by the teacher language model based on the first sub-word version of the natural language training input (as, the teacher output of the keyword dictionary into keyword adaptation – fig. 2, subblock 216 to 218); 
evaluating a loss function to determine a loss associated with the teacher output (as, in the first step of pruning, measuring a distance length to prune/remove pronunciations that would not fit, and removing from the keyword pronunciation database – (para 0102)); and modifying at least one of the plurality of student sub-word embeddings based at least in part on the loss associated with the teacher output (as, performing a second pruning function, -- para 0103, and pruning the list – para 0103 – this list is used in the smaller/student vocabulary; Kaushik et al (20210065699) further teaches a positive/negative(loss) calculation for keywords – para 0104-0106). 
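
For illustration only, the operations recited in claim 1 (a first sub-word version that mixes teacher and student sub-word embeddings, a teacher output, a loss, and a modification of student sub-word embeddings) can be sketched as below. This is a minimal sketch under stated assumptions, and is not the implementation of applicant's specification or of Kaushik: the vocabularies, the embedding values, the linear stand-in for the teacher model (TEACHER_WEIGHTS), the squared-error loss, the target value, and the learning rate are all hypothetical.

```python
from typing import Dict, List, Tuple

Vector = List[float]

# Hypothetical toy data: the student vocabulary is smaller than the teacher's.
TEACHER_EMB: Dict[str, Vector] = {
    "play": [1.0, 0.0], "##ing": [0.0, 1.0], "##ed": [1.0, 1.0],
}
STUDENT_EMB: Dict[str, Vector] = {"play": [0.2, 0.1], "##ing": [0.3, 0.4]}

# Hypothetical stand-in for the teacher model: a fixed linear scorer.
TEACHER_WEIGHTS: Vector = [0.5, -0.25]


def mixed_version(sub_words: List[str], from_student: List[bool]) -> List[Tuple[str, str]]:
    """Tag each sub-word with the vocabulary its embedding is drawn from."""
    return [(tok, "student" if s else "teacher")
            for tok, s in zip(sub_words, from_student)]


def teacher_output(mixed: List[Tuple[str, str]]) -> float:
    """Feed the mixed sub-word version through the toy teacher scorer."""
    total = 0.0
    for tok, source in mixed:
        emb = STUDENT_EMB[tok] if source == "student" else TEACHER_EMB[tok]
        total += sum(w * x for w, x in zip(TEACHER_WEIGHTS, emb))
    return total


def loss(output: float, target: float) -> float:
    """Squared error between the teacher output and a target value."""
    return (output - target) ** 2


def update_student(mixed: List[Tuple[str, str]], target: float, lr: float = 0.1) -> float:
    """One gradient step on the student embeddings only; returns the loss before the step."""
    out = teacher_output(mixed)
    grad_scale = 2.0 * (out - target)  # d(loss)/d(output)
    for tok, source in mixed:
        if source == "student":
            # d(output)/d(embedding) equals TEACHER_WEIGHTS for the linear scorer.
            STUDENT_EMB[tok] = [x - lr * grad_scale * w
                                for x, w in zip(STUDENT_EMB[tok], TEACHER_WEIGHTS)]
    return loss(out, target)


if __name__ == "__main__":
    mixed = mixed_version(["play", "##ing"], [False, True])
    before = update_student(mixed, target=1.0)
    after = loss(teacher_output(mixed), 1.0)
    print(f"loss before step: {before:.4f}, after step: {after:.4f}")
```

In this sketch only the student embeddings are changed by the loss on the teacher output; the teacher embeddings and the teacher scorer stay fixed.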

As per claim 2, Kaushik et al (20210065699) teaches the computing system of claim 1, wherein the operations further comprise: 
generating a second sub-word version of the natural language training input that comprises only student sub-word embeddings associated with the student vocabulary (as utterance input into the adapted phoneme recognition (smaller/student) engine – Fig. 3, subblocks 318/320 into subblock 316); 
inputting the second sub-word version of the natural language training input into at least the student language model (as inputting a version from the phoneme recognition unit – fig. 3, subblock 304, into subblock 316); 
receiving a student output generated by the student language model based on the second sub-word version of the natural language training input (as, generating an output from the adapted phoneme recognition unit – fig. 3, subblock 316);
 evaluating a second loss function to determine a second loss associated with the student output (as the pronunciation similarity measurement, fig. 3, subblock 322, calculates a loss function associated with the student output – para 0105; the other input to the pronunciation similarity measurement block is the modified/pruned word dictionary – fig. 3, subblock 312, as part of the word dictionary, fig. 3, subblock 300); 
and modifying, based at least in part on the second loss associated with the student output, one or both of: at least one of the plurality of student sub-word embeddings (as the feedback into the pronunciation similarity measurement is also fed into the phoneme recognition unit, fig. 3, subblock 304, to further prune/modify the phoneme mappings, and into the adapted phoneme recognizer – fig. 3, subblock 316); 
and at least one parameter value of at least one student parameter included in the student language model (as the word dictionary, fig. 3, subblock 300, shares the pronunciation/phoneme mappings with the adapted phoneme recognizer, fig. 3, subblock 316). 

As per claim 3, Kaushik et al (20210065699) teaches the computing system of claim 1, wherein the operations further comprise: generating a second sub-word version of the natural language training input that comprises both teacher sub-word embeddings associated with the teacher vocabulary and student sub-word embeddings associated with the student vocabulary (as utterance input into the adapted phoneme recognition (smaller/student) engine – Fig. 3, subblocks 318/320 into subblock 316; and as inputting a version from the phoneme recognition unit – fig. 3, subblock 304, into subblock 316);
 inputting the second sub-word version of the natural language training input into at least the student language model (as inputting a version from the phoneme recognition unit – fig. 3, subblock 304, into subblock 316); 
receiving a student output generated by the student language model based on the second sub-word version of the natural language training input (as, generating an output from the adapted phoneme recognition unit – fig. 3, subblock 316); 
evaluating a second loss function to determine a second loss associated with the student output; and modifying at least one of the plurality of student sub-word embeddings based at least in part on the second loss associated with the teacher output (as the pronunciation similarity measurement, fig. 3, subblock 322, calculates a loss function associated with the student output – para 0105; the other input to the pronunciation similarity measurement block is the modified/pruned word dictionary – fig. 3, subblock 312, as part of the word dictionary, fig. 3, subblock 300). 

As per claim 4, Kaushik et al (20210065699) teaches the computing system of claim 1, wherein: generating the first sub-word version of the natural language training input comprises masking at least one word of the natural language training input (as removing/processing on the side words that contain separate commands – latter half of para 0118, the example of phrases that are not able to activate the device); and the teacher output comprises a prediction of the at least one word of the natural language training input (as performing prediction of the spoken keywords – para 0084) that was masked within a pre-selected one of the teacher or student vocabularies (as the phrase can be side-processed if it is determined that the phrase does not match/wake the device – latter half of para 0118). 

As per claim 6, Kaushik et al (20210065699) teaches the computing system of claim 1, wherein each of the teacher language model and the student language model comprise one or more transformer layers (as performing the changes with a softmax function and layers – para 0117), and wherein the operations further comprise: modifying at least one parameter value of at least one transformer layer of the student language model to reduce a difference between the at least one transformer layer of the student language model and at least one transformer layer of the teacher language model when projected into a shared space (as using the changes in the phoneme recognizer model (smaller/student model) to replace/re-register the initial phoneme recognizer model (bigger/teacher model) with the updated information – para 0121; see also para 0115, which teaches the keyword pronunciation dictionary being updated by the smaller phoneme dictionary). 

As per claim 7, Kaushik et al (20210065699) teaches the computing system of claim 1, wherein the teacher language model and the student language model comprise an equal number of transformer layers (as softmax layers on the recognizer model – para 0117), and wherein the student language model has a smaller number of parameters than the teacher language model (where the teacher/larger dictionary model has the additional initial parameter of a probability of waking the actual device – para 0118 – these probabilities are not considered in the smaller model, to which, after pruning, only phoneme information is added – para 0120). 

As per claim 8, Kaushik et al (20210065699) teaches the computing system of claim 1, wherein the teacher language model applies two separate softmax layers to respectively make predictions over the student vocabulary and the teacher vocabulary (as updating the phoneme recognizer model, as noted above in claim 1; using multiple layers in a softmax function – para 0117). 

As per claim 12, Kaushik et al (20210065699) teaches the computing system of claim 1, wherein the operations further comprising: deploying the student language model to a mobile or edge device for on-device inference at the mobile or edge device (as deploying the smaller model to a client such as mobile devices – para 0036). 

Claims 13-18 are method claims whose steps are performed by the computer system of claims 1-8, 12 and, as such, claims 13-18 are similar in scope and content to claims 1-8, 12; therefore, claims 13-18 are rejected under similar rationale as presented against claims 1-8, 12 above.  Furthermore, as per claim 13, Kaushik et al (20210065699) teaches the modification of the phoneme pronunciations based on the probability/loss models – para 0103-0106.

Claims 19-20 are computer readable medium claims whose steps are performed by the system of claims 1-8, 12 and the method of claims 13-18 and, as such, claims 19-20 are similar in scope and content to claims 1-8, 12, and 13-18 above; therefore, claims 19-20 are rejected under similar rationale as presented against claims 1-8, 12, and 13-18 above.
 

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Kaushik et al (20210065699) in view of Toplyn (20190266250).

As per claim 5, Kaushik et al (20210065699) teaches the computing system of claim 1, wherein the teacher language model and the student language model comprise neural networks/machine learning applications, but does not explicitly teach Bidirectional Encoder Representations from Transformers (BERT) models; Toplyn (20190266250) teaches the use of BERT models for building language models (para 0220).  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to implement BERT models in the models of Kaushik et al (20210065699) because it would be advantageous to offer, on top of the pretrained model, a forward universal language such as BERT, to enhance the recognition process (Toplyn (20190266250), para 0220). 

Claims 9-10 are rejected under 35 U.S.C. 103 as being unpatentable over Kaushik et al (20210065699) in view of Lai (20210182662).

As per claims 9 and 10, Kaushik et al (20210065699) teaches a probability calculation for distance/loss (see mapping in claims 1-4, 6-8; as well as para 0103-0106) for the segmented inputs, but does not explicitly teach a probability hyperparameter, using tokens, or a ratio of tokens measured for the ramping of the probability hyperparameter; however, Lai (20210182662) teaches neural network functions for language modeling in distance determining (abstract, para 0010) using hyperparameters of the neural network model (para 0122), calculating probabilities of the tokens (para 0123), and increasing/ramping the probability based on a first and second probability vector (para 0124); wherein random parameters are utilized in the database models (including BERT – para 0038).  Therefore, it would have been obvious to one of ordinary skill in the art of language modeling and sub-word matching to include, in the loss models of Kaushik et al (20210065699), the hyperparameter additive probability models of Lai (20210182662) because it would advantageously improve the speed and accuracy in training the student model (Lai (20210182662), para 0112).

Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Kaushik et al (20210065699) in view of Prabhavalkar (20200027444).

As per claim 11, Kaushik et al (20210065699) teaches the computing system of claim 1, wherein each of the teacher vocabulary and the student vocabulary (as mapped above in claim 1) includes phoneme pronunciations and phonemes, but does not teach that the vocabularies comprise respective sets of WordPiece tokens; Prabhavalkar (20200027444) teaches token level measurements (as well as word level – para 0041) including wordpieces (para 0073).  Therefore, it would have been obvious to one of ordinary skill in the art of phoneme processing to modify the phoneme-to-text (grapheme) models of Kaushik et al (20210065699) with wordpieces as taught by Prabhavalkar (20200027444) because it would advantageously improve the softmax layer processing via wordpieces in addition to graphemes, thereby improving the beam search processing to determine the transcription (Prabhavalkar (20200027444), para 0073).

Response to Arguments

Applicant's arguments filed 05/24/2022 have been fully considered but they are not persuasive.

Regarding applicant's arguments from the bottom half of p. 10 through p. 12 of the response, examiner notes that “language model compression” appears only in the preamble of the claim and is not reinforced in the body of the claim (and therefore carries no patentable weight). Furthermore, and more relevant, applicant's arguments toward “language model compression” are directed to a result, and not to the explicit steps that lead to the result.  The examples in applicant's specification are not included in the claim scope; applicant is arguing the specification, and not the claims.  In addition, the models of Kaushik are created/trained (para 0007), then improved to be more efficient (para 0009).

As to applicant's arguments toward “embeddings”, from pp. 12-13 of the response, applicant's arguments fail to define the claimed word “embeddings”.  Looking at para 0035 of applicant's specification, embeddings are “trained”, and embeddings are also extracted from databases (para 0031, 0032).  Furthermore, Kaushik operates on the phoneme level, and hence the sub-word level.  Kaushik's trained model also coexists with the updated/improved model, which is a secondary offering while the original model is maintained.

On pp. 15-16 of the response, regarding applicant's arguments toward “evaluating a loss function”, applicant's explanation of a loss function as “how far off the mark the computed output is” is read upon by Kaushik's “not similar enough” in para 0102-0103; furthermore, the claims only recite “a loss function”, and examiner recommends more detailed claim limitations of the loss function to overcome Kaushik.
 

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
The following reference was found to explicitly show teacher and student models: Fukada et al (20190205748).

The following references were found to parallel applicant's disclosure toward BERT and LSTM with language models:

Battach (20190266236) teaches using BERT and other NLP models (para 0021), with two different size vocabulary models (para 0010-0011).
Griffiths (20200234694) teaches BERT as language modeling with various size vocabularies (para 0038-0041).
Steedman (20210141798) teaches natural language processing using layered networks, including BERT, with weight matrices on subword portions (para 0143-0149). 

The following references were found to parallel claim features, such as language models with layered structures and cross-entropy loss functions:

Cui (20200135174) teaches CTC and cross entropy loss using an attention model (para 0002, 0004, 0052, 0056 (LSTM), and 0058).
Shafran (20200111483) teaches layered language models with matrix calculation in cross entropy loss calculations (para 0133, 0136-0138).
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Michael Opsasnick, telephone number (571)272-7623, who is available Monday-Friday, 9am-5pm. 
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Mr. Richemond Dorvil, can be reached at (571)272-7602.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).

/Michael N Opsasnick/
Primary Examiner, Art Unit 2658
09/03/2022