Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Claims 26-50 are pending.  Claims 26, 38, and 50 are independent.  Claims 1-25 have been canceled.  This is an RCE after appeal.
This Application was published as U.S. 20180286386.
Pending Claims are allowed.
Note: “Touring Machine” is a typo which should be “Turing Machine.”  (As in Alan Turing.)
Allowable Subject Matter
Pending Claims 26-50 are allowed.
The following is an examiner’s statement of reasons for allowance: In view of each of the particular limitations of the independent Claims when considered in the order established by the Claim language and in the context of the language of the independent Claims when each Claim is considered as a whole, the independent Claims of this Application were not found in the prior art that was viewed.
In particular, the following combination, when considered in the context of the language of the independent Claims as a whole and including each and every limitation of these Claims, was not found in the prior art: (1) the manner in which pairs of acoustic and language models are selected according to a “disjointedness measure” and applied to the same input audio speech to generate speech or text recognition results in words; (2) the manner in which the disjointedness of the results of the two different recognizers is defined based on the number of words missed by one of the respective candidate model pairs but not the other one; and (3) obtaining the missed words, converting them into the proper text or acoustic version, and using the acoustic or text word for training the acoustic and language models in order to improve the acoustic and language models used in the pairs.
Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee. Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance.”
Close Art of Record
Note the art applied to the Claims during the prosecution of the instant Application, including Choi (U.S. 2017/0053652), Faisman (U.S. 2008/0319743), and particularly Wang (U.S. 2007/0219798) which pertains to the feature of “WER” (word error rate) and “boosting” used in the training of models and Matsuda (U.S. 2016/0260428) which applies the method of boosting to the training of acoustic models.
Choi is directed to “SPEECH RECOGNITION APPARATUS AND METHOD” and the audio signal includes speech.  In Figure 1, the "First Recognizer 110” is an acoustic model and the “Second Recognizer 120” is a language model which are "selected to recognize spoken words" in the input "audio signal."  The same “Audio Signal” is input to both.  See [0056] where the “recognition units” are taught to include “words.”  The “Selecting … a model-pair” of the Claim is taught in the embodiment shown in Figure 3 where there is more than one AM and there may be more than one LM according to the reference:  "[0077] ... Though only two acoustic models are shown in sequence, embodiments are not limited to the same, as there may be more than the two levels of acoustic modeling (or more than one level of language modeling), and there may be more than one utilized acoustic model (or language model) used in each level, e.g., in parallel and/or as selectively used such as for personalized or idiolect based models or models based on different dialects or languages. …” 
Faisman teaches the measure of accuracy as counting the errors:  “[0015] … An automatic measurement of the ASR accuracy may be performed either by a comparing the automatically generated transcript to the corrected transcript, or by counting the number of corrections made by proofreader 108….” 
See Wang: "[0040] As shown in Table II, the training data sets 1 and 2 include the speech utterance that contains the word "flight". In a first iteration, "flight" is misrecognized as Floyd and the adaptive language model LM1 includes Floyd corresponding to the training utterance. If this language model is used to recognize set 1 again, there is little hope that it will correct the mistake because the newer language model LM1 will increase the probability of seeing "floyd" in the same context. This error-reinforcement is less harmful if LM1 is used to recognize set 2 in a second iteration, where the word "flight" may occur in a different n-gram context and can be properly recognized, so the adaptive language model LM2 will boost the probability for "flight". [0041] Since recognition of set 1 is repeated using the adaptive language model LM2, with boosted probability for the word "flight", the "flight" utterance in data set 1 in iteration 3 is recognized as flight and not Floyd. Thus, the adaptive language model LM3 is generated based upon a correct recognition of the speech utterance "flight" to improve language model development.”  “[0051] In the illustrated embodiment of FIG. 6, the testing component 292 outputs the word-error-rate (WER) and the classification-error-rate (CER) 294 for the adaptive language model 242 and classification model 224 of the current iterations. The WER and/or CER of the current iterations are compared to the WER and/or CER of the previous iterations to evaluate development improvement which as previously described is used to determine whether to pursue further training iterations as previously described with respect to FIG. 4 and FIG. 5.”
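For reference, the word error rate mentioned in the Wang passage is conventionally computed as the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal sketch of that conventional definition only; the function name `word_error_rate` is hypothetical and the code is not taken from Wang or from the instant Application.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words consumed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five, echoing the "flight"/"Floyd" example.
wer = word_error_rate("book a flight to boston", "book a floyd to boston")
```

Comparing this value across training iterations corresponds to the stopping test that Wang describes at [0051].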
Matsuda is directed to statistical acoustic model adaptation which considers the error rate (severity is mapped to rate) in the next round of training:  "[0081] If recognition error decreased over the evaluation data, the learning rate was kept the same as in the previous repetition step (epoch). Otherwise, the learning rate was updated to half the last step, and the network parameters (weights and the like) were replaced with those produced the minimum error rate in the preceding training epoch, and the training for these replaced parameters was restarted using the updated learning rate.”  See also [0085] to [0088].  “[0078] … As the acoustic model, a context-dependent acoustic model trained with Boosted MMI (Maximum Mutual Information) was used….”
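The schedule quoted from Matsuda at [0081] can be illustrated with a short sketch. It is offered under stated assumptions and is not Matsuda's implementation: `train_epoch` and `evaluate` are hypothetical callables supplied by the caller, and improvement is judged against the best error seen so far, which simplifies the epoch-to-epoch comparison in the quoted paragraph.

```python
import copy


def train_with_halving(params, train_epoch, evaluate, learning_rate, epochs):
    """Keep the learning rate while the error decreases; otherwise halve it
    and restart from the parameters that gave the minimum error."""
    best_params = copy.deepcopy(params)
    best_error = evaluate(params)
    for _ in range(epochs):
        params = train_epoch(params, learning_rate)
        error = evaluate(params)
        if error < best_error:
            # Error decreased: keep the learning rate, remember these parameters.
            best_error = error
            best_params = copy.deepcopy(params)
        else:
            # No decrease: halve the rate and roll back to the best parameters.
            learning_rate /= 2
            params = copy.deepcopy(best_params)
    return best_params, best_error, learning_rate


# Toy usage: minimize (x - 3)^2 with gradient steps (assumed example problem).
result = train_with_halving(
    0.0,
    train_epoch=lambda x, lr: x - lr * 2 * (x - 3),
    evaluate=lambda x: (x - 3) ** 2,
    learning_rate=1.0,
    epochs=3,
)
```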
Regarding the feature of “wherein each of the first disjointedness measures is based on numbers of words missed by the respective candidate model pairs in word recognition processing;” which appeared as “wherein the degree of disjointedness is a function of a number of words in the speech that one model in the model-pair recognizes correctly and the second model in the model-pair recognizes incorrectly” in claim 8 prior to Appeal, Suendermann (U.S. 2010/0268536) was cited, but Suendermann does not do justice to either version of the claim language.

The Disclosure of the instant Application has one peculiarity:

    [Figure: media_image1.png]

	Speech data 502 is input to both the Acoustic Model 504 and the Language Model 506 to generate Speech Recognition Outputs 508/510.
	A normal language model cannot operate on speech as input.  It does not have the capability to process audio.  Rather, it receives the output of an acoustic model and generates recognized text.  Choi (U.S. 2017/0053652) provides an example:

    [Figure: media_image2.png]

	Note that the audio signal is input to the acoustic model only.
	
Accordingly, the models of the instant Application are interpreted in view of the supporting Specification and Drawings. The Written Description of the instant Application pertaining to the acoustic and language models provides:
[0094] The application configured neural Touring machine 402 such that neural Touring machine 402 can access models library 404. The application further configures neural Touring machine 402 such that neural Touring machine 402 operates to use inputs 406, 408, 410, and 412 as determining factors for selecting at least one acoustic model and at least one language model from models library 404. Language model 414 is an example of the selected at least one language model. Acoustic model 416 is an example of the selected at least one acoustic model.
…
[0100] Neural Touring machine 402 correlates inputs 406-412 with the parameters associated with the models in models library 404. Neural Touring [sic] machine 402 outputs language model 414 and acoustic model 416 as the model-pair that satisfies inputs 406-412.
…
[0102] The application supplies actual speech data 502 to acoustic model 504 and language model 506. Acoustic model 504 produces recognition output 508 and language model 506 produces recognition output 510. Speech recognition outputs 508 and 510 can each be in the form of text or speech signal.
[0103] Component 512, which is an implementation of component 306 in FIG. 3, accepts outputs 508 and 510 as inputs. Component 512 compares speech recognition output 508 from acoustic model 504 and speech recognition output 510 from language model 506 to identify the words that acoustic model 504 missed (A miss 514) and the words that language model 506 missed (L miss 516). 
[0104] Using A miss 514 and L miss 516, component 512 computes disjointedness measure D (518). In one example embodiment, D is a function of a portion of A miss 514 that is absent from L miss 516, a portion of L miss 516 that is absent from A miss 514, or both. For example, A miss 514 may include words [word1, word2, word3, word4] that acoustic model 504 missed from actual speech data 502. However, L miss 516 may include [word2, word4, word5, word6, word7] that language model 506 missed from actual speech data 502. Now D can be a function of [word1, word 3] which acoustic model 504 missed but language model 506 did not, [word5, word6, word7] which language model 506 missed but acoustic model 504 did not, or both of these subsets of words.
Accordingly, the outputs 508 and 510 of both acoustic and language models 504, 506 are “words” that may be in the form of text or speech.
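The example given at [0104] can be worked through with a short sketch. The Specification states only that D is "a function of" the two subsets of missed words; the use of a simple count below, and the function name `disjointedness`, are assumptions made for illustration.

```python
def disjointedness(a_miss, l_miss):
    """Return the words missed only by the acoustic model, the words missed
    only by the language model, and an assumed measure D (their total count)."""
    a_only = set(a_miss) - set(l_miss)  # A miss that is absent from L miss
    l_only = set(l_miss) - set(a_miss)  # L miss that is absent from A miss
    return a_only, l_only, len(a_only) + len(l_only)


# Word lists taken from the example at [0104].
a_only, l_only, d = disjointedness(
    ["word1", "word2", "word3", "word4"],
    ["word2", "word4", "word5", "word6", "word7"],
)
```

With these inputs the first subset is word1 and word3, the second is word5, word6, and word7, and the assumed count-based measure is 5.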
Further, based on the language of independent Claim 26, the acoustic model is trained with an acoustic version of the word (sound) and the language model is trained with a text version of the word (text).
26.  A method comprising: 
providing a selected acoustic model and a selected language model as a selected model pair for word recognition processing of a current speech signal,
wherein the selected model pair is provided from among candidate acoustic and language model pairs in a model library and is selected responsive to values of first disjointedness measures of the respective candidate model pairs, including a value of a first disjointedness measure of the candidate model pair that is selected as the selected model pair,
wherein each of the first disjointedness measures is based on numbers of words missed by the respective candidate model pairs in word recognition processing; 
obtaining, from an output of the selected language model, a text version of a first word missed in a first instance of word recognition processing of the current speech signal by the selected acoustic model but not missed in a first instance of word recognition processing of the current speech signal by the selected language model; 
converting the text version of the first word obtained from the language model text output to an acoustic version of the first word; 
training the selected acoustic model on acoustic model training data including the acoustic version of the first word at least by inputting the acoustic version of the first word to the selected acoustic model with the selected acoustic model in a training mode; 
obtaining, from an output of the selected acoustic model, an acoustic version of a second word missed in the first instance of the word recognition processing of the current speech signal by the selected language model but not missed in the first instance of word recognition processing of the current speech signal by the selected acoustic model; 
converting the acoustic version of the second word to a text version of the second word; and 
training the selected language model on language model training data including the text version of the second word at least by inputting the text version of the second word to the selected language model with the selected language model in a training mode. 
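The data flow recited in Claim 26 can be sketched as follows. This is an illustrative reading only, not the applicant's implementation: words are keyed by a hypothetical position index, and `text_to_speech` and `speech_to_text` are hypothetical converter callables supplied by the caller.

```python
def build_cross_training_data(a_miss, l_miss, lm_text, am_audio,
                              text_to_speech, speech_to_text):
    """Route each model's recognized words to the other model's training data.

    a_miss / l_miss: positions of words missed by the acoustic / language model.
    lm_text: position -> text output of the language model.
    am_audio: position -> acoustic output of the acoustic model.
    """
    a_only = sorted(set(a_miss) - set(l_miss))  # missed by the AM but not the LM
    l_only = sorted(set(l_miss) - set(a_miss))  # missed by the LM but not the AM
    # Text from the LM output is converted to audio for acoustic model training.
    am_training = [text_to_speech(lm_text[i]) for i in a_only]
    # Audio from the AM output is converted to text for language model training.
    lm_training = [speech_to_text(am_audio[i]) for i in l_only]
    return am_training, lm_training


# Stub converters, used only to make the routing visible.
am_training, lm_training = build_cross_training_data(
    a_miss={0, 1},
    l_miss={1, 2},
    lm_text={0: "flight"},
    am_audio={2: ("audio", "boston")},
    text_to_speech=lambda text: ("audio", text),
    speech_to_text=lambda audio: audio[1],
)
```

The word at position 1 is missed by both models and therefore contributes to neither training set, consistent with the "but not missed" language of the Claim.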
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Listing of Claims
26.  A method comprising: 
providing a selected acoustic model and a selected language model as a selected model pair for word recognition processing of a current speech signal,
wherein the selected model pair is provided from among candidate acoustic and language model pairs in a model library and is selected responsive to values of first disjointedness measures of the respective candidate model pairs, including a value of a first disjointedness measure of the candidate model pair that is selected as the selected model pair,
wherein each of the first disjointedness measures is based on numbers of words missed by the respective candidate model pairs in word recognition processing; 
obtaining, from an output of the selected language model, a text version of a first word missed in a first instance of word recognition processing of the current speech signal by the selected acoustic model but not missed in a first instance of word recognition processing of the current speech signal by the selected language model; 
converting the text version of the first word obtained from the language model text output to an acoustic version of the first word; 
training the selected acoustic model on acoustic model training data including the acoustic version of the first word at least by inputting the acoustic version of the first word to the selected acoustic model with the selected acoustic model in a training mode; 
obtaining, from an output of the selected acoustic model, an acoustic version of a second word missed in the first instance of the word recognition processing of the current speech signal by the selected language model but not missed in the first instance of word recognition processing of the current speech signal by the selected acoustic model; 
converting the acoustic version of the second word to a text version of the second word; and 
training the selected language model on language model training data including the text version of the second word at least by inputting the text version of the second word to the selected language model with the selected language model in a training mode. 
 
27.  The method of claim 26, further comprising: 
performing, by the selected model pair, respective second instances of word recognition processing of the current speech signal after the training of the selected acoustic model and after the training of the selected language model; 
calculating, after the training of the selected acoustic model and after the training of the selected language model, a value of a second disjointedness measure for the selected model pair based on words missed in the second instances of word recognition processing of the current speech signal; and 
storing the value of the second disjointedness measure in the model library in association with the selected acoustic model and selected language model of the selected model pair.  

28.  The method of claim 26, wherein training the selected acoustic model further comprises: 
determining a measure of error for the selected acoustic model, based on misidentification of the first word in the first instance of the selected acoustic model word recognition processing; 
modifying a number of occurrences of the acoustic version of the first word in the acoustic model training data based on a function of the measure of error.  

29.  The method of claim 28, wherein the measure of error comprises a ratio of a first number of times the selected acoustic model misidentifies the first word to a second number of times the selected acoustic model does not misidentify the first word.  

30.  The method of claim 29, wherein modifying the number of occurrences of the acoustic version of the first word comprises providing, in response to an increase in the ratio of the first number of times and the second number of times, additional occurrences of the acoustic version of the first word in acoustic model training data.  
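Note: the ratio of claim 29 and the adjustment of claim 30 can be illustrated with a minimal sketch. The function names, the handling of a zero denominator, and the fixed increment `extra` are assumptions made for illustration and are not recited in the Claims.

```python
def error_ratio(misidentified: int, not_misidentified: int) -> float:
    """Ratio of times a word is misidentified to times it is not."""
    if not_misidentified == 0:
        # Assumed convention for a word that is never identified correctly.
        return float("inf") if misidentified else 0.0
    return misidentified / not_misidentified


def updated_occurrences(current: int, previous_ratio: float,
                        new_ratio: float, extra: int = 1) -> int:
    """Provide additional occurrences of the word only when the ratio increases."""
    return current + extra if new_ratio > previous_ratio else current


ratio = error_ratio(3, 6)
occurrences = updated_occurrences(4, previous_ratio=0.25, new_ratio=ratio)
```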

31.  The method of claim 26, wherein training the selected language model further comprises: 
determining a measure of error for the selected language model based on misidentification of the second word in the first instance of the selected language model word recognition processing; 
modifying a number of occurrences of the text version of the second word in the language model training data based on a function of the measure of error.  

32.  The method of claim 31, wherein the measure of error comprises a ratio of a first number of times the selected language model misidentifies the second word to a second number of times the selected language model does not misidentify the second word.  

33.  The method of claim 32, wherein modifying the number of occurrences of the text version of the second word comprises providing, in response to an increase in the ratio of the first number and the second number, additional occurrences of the text version of the second word in the language model training data.  

34.  The method of claim 26, wherein the degree of disjointedness is calculated as a function of a number of words in the current speech signal that a first one of the models in the selected model pair does not misidentify, and a second one of the models in the selected model pair misidentifies.  

35.  The method of claim 26, wherein selecting the acoustic model and the language model to form the selected model pair comprises: 
configuring a neural Turing machine to correlate a set of inputs to the candidate model pairs in the model library, the set of inputs comprising a vector of words expected to be present in speech to be processed, and an acceptable disjointedness limit input; and 
outputting from the neural Turing machine the selected acoustic model and the selected language model of the selected model pair based on the set of inputs.  

36.  The method of claim 35, 
wherein the set of inputs further comprises a performance specification and a set of sound descriptors, 
wherein the performance specification specifies at least one of a minimum acceptable word recognition rate for a subject-matter domain of the speech to be processed or a maximum acceptable word error rate for a subject-matter domain of the speech input signal, and 
wherein the set of sound descriptors comprises at least one of a prosody of speech, an accent used by a speaker, or a dialect of a language used by a speaker of the speech.  

37.  The method of claim 26, wherein the value of the second disjointedness measure is lower than the value of the first disjointedness measure for the selected model pair prior to the training of the selected acoustic model and the training of the selected language model.  

Claim 38 is a computer usable program product device claim with limitations similar to the limitations of Claim 26.
Claim 39 is a computer usable program product device claim with limitations similar to the limitations of Claim 27.
Claim 40 is a computer usable program product device claim with limitations similar to the limitations of Claim 28.
Claim 41 is a computer usable program product device claim with limitations similar to the limitations of Claim 29.
Claim 42 is a computer usable program product device claim with limitations similar to the limitations of Claim 30.
Claim 43 is a computer usable program product device claim with limitations similar to the limitations of Claim 31.
Claim 44 is a computer usable program product device claim with limitations similar to the limitations of Claim 32.
Claim 45 is a computer usable program product device claim with limitations similar to the limitations of Claim 33.
Claim 46 is a computer usable program product device claim with limitations similar to the limitations of Claim 34.
Claim 47 is a computer usable program product device claim with limitations similar to the limitations of Claim 35.
Claim 48 is a computer usable program product device claim with limitations similar to the limitations of Claim 36.
Claim 49 is a computer usable program product device claim with limitations similar to the limitations of Claim 37.

Claim 50 is a computer usable program product device claim with limitations similar to the limitations of Claim 26.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to FARIBA SIRJANI whose telephone number is (571)270-1499.  The examiner can normally be reached on Monday through Thursday 9am to 4pm.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir, can be reached on 571-272-7799.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/FARIBA SIRJANI/
Primary Examiner, Art Unit 2659