Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114 
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 3/26/2021 has been entered.
Status of the Claims
Claims 1, 4, 6-11, 14, and 16-20 are pending. 
Response to Applicant’s Argument
In view of the amendments to the independent claims, the rejections under 35 USC 112(a) and 35 USC 103(a) set forth in the previous Office action have been withdrawn.
Upon further search and consideration, a new combination of references has been entered. Please see details below. 
Claim Rejections - 35 USC § 103
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 103 that form the basis for the rejections under this section made in this Office action:



Claims 1, 4, 6, 8-11, 14, 16, and 18-20 are rejected under 35 USC 103(a) as being unpatentable over Senior et al. (US 9786270 B2) in view of Sutskever et al. (“Towards Principled Unsupervised Learning”) and Ward et al. (US 10210860 B1).
Regarding Claims 1, 11, and 20, Senior discloses a device (Fig. 1, Col 4, Rows 40-49, computer system 110), comprising: 
at least one memory configured to store program code (Col 15, Rows 1-4, memory device); 
at least one processor configured to read the program code and operate as instructed by the program code (Col 14, Rows 57-67, computer program instruction; Col 15, Rows 36-44, programmable processors), the program code including: 
first obtaining code to obtain text information (Col 8, Rows 66-67, a transcription of the utterance can be accessed to train a second neural network); 
first determining code to determine a set of phoneme sequences associated with the text information (Col 8, Row 66 – Col 9, Row 2, a phonetic representation  for the transcription can be used to train the second neural network); 
second obtaining code to obtain speech waveform data (Col 6, Rows 63-67, access a set of audio data 140 that includes audio waveform data for training utterances for a first neural network); 
second determining code to determine a set of phoneme boundaries associated with the speech waveform data (Col 8, Rows 62-66 in view of Col 6, Rows 10-14 and Col 12, Row 62 – Col 13, Row 6, the first neural network generates respective labels of phones through forced alignment using an optimal boundary of distinct phones as output to train the second neural network) using a long short-term memory (LSTM) recurrent neural network (RNN) (Col 6, Rows 25-30, a recurrent neural network including long short-term memory layers); and 
generating code to generate an automatic speech recognition (ASR) model (Col 7, Rows 4-10, first neural network uses audio data features to generate output distributions for use as training targets for the second neural network) using unsupervised learning (Col 6, Rows 52-55 in view of Col 5, Rows 11-13, second neural network is trained to produce CTC-type outputs; with CTC, there is generally no time alignment supervision since the network is constantly integrated over all possible alignments) and using a loss / cost function technique (Col 9, Rows 4-11, second neural network may be trained using both output targets, and the loss functions for the output targets can be combined for training; Col 14, Rows 40-45, a loss function uses two or more different output targets and constrains the alignment of inputs and outputs) based on the first determining code determining the set of phoneme sequences associated with the text information (Col 8, Row 66 – Col 9, Row 2) and based on the second determining code determining the set of phoneme boundaries associated with the speech waveform data (Col 8, Rows 62-66 in view of Col 6, Rows 10-14 and Col 12, Rows 62-66). 
Senior does not disclose using an output distribution matching (ODM) technique based on the first determining code and based on the second determining code.
Sutskever teaches using an unsupervised Output Distribution Matching (ODM) cost / loss function in the setting of speech recognition (p. 1, Abstract and 1. Introduction; see also p. 3, 3.1 ODM Costs as generative models, equation (7)).
In particular, Sutskever teaches a method of unsupervised learning with an Output Distribution Matching cost function measuring a divergence between a distribution of predictions and distribution of labels while working on datasets using no (or almost no) labelled training cases (Abstract, p. 1) while operating on the assumption that there is a lack of access to labelled samples (x, y) ~ D and abundant access to unlabeled inputs and unlabeled outputs (p. 2, lines 1-4). 
Therefore, the method of unsupervised learning uses uncorrelated samples from x ~ D and y ~ D to impose a valid constraint on F: Distr [F(x)] = Distr [y] such that if F (xi) = yi is true for every possible training case, then Distr [F(x)] = Distr [y] is satisfied as well (p. 2, lines 5-9, equation (2)). 
Finally, the unsupervised constraint can be turned into the following cost function KL[Distr [y] || Distr [F(x)]] (p. 2, lines 11-14, equation (3)). This cost function may be used to evaluate the quality of a speech recognition system by measuring linguistic plausibility of its typical outputs (p. 1, Introduction, lines 11-15) by using a generative model for aligning two distributions without supervision (p. 2, lines 28-29 and p. 4, 3.2 The Dual Autoencoder “align x with y in the absence of a direct supervised signal”). It is also clear that this model can also be trained on labelled (x, y), whenever such data is available (p. 3, 3.1 ODM Costs as Generative Models).
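For illustration only, the cost function KL[Distr [y] || Distr [F(x)]] discussed above can be sketched for the simplest case of single discrete labels. The sketch below is not taken from Sutskever or Senior: the function names `empirical_distribution` and `odm_cost`, the smoothing constant `eps`, and the use of unigram (per-label) frequencies instead of distributions over whole sequences are all assumptions made to keep the example small. It shows that the cost needs only unpaired samples of labels y and predictions F(x).

```python
import math
from collections import Counter


def empirical_distribution(samples, support, eps=1e-9):
    """Smoothed relative frequency of each symbol in `support` (hypothetical helper)."""
    counts = Counter(samples)
    total = len(samples)
    # eps keeps every probability strictly positive so log() is defined.
    raw = {s: counts.get(s, 0) / total + eps for s in support}
    norm = sum(raw.values())
    return {s: v / norm for s, v in raw.items()}


def odm_cost(labels, predictions, eps=1e-9):
    """KL[Distr[y] || Distr[F(x)]] estimated from unpaired samples.

    `labels` are samples y ~ D; `predictions` are outputs F(x) for
    unrelated inputs x ~ D. No (x, y) pairing is used anywhere.
    """
    support = sorted(set(labels) | set(predictions))
    p = empirical_distribution(labels, support, eps)       # Distr[y]
    q = empirical_distribution(predictions, support, eps)  # Distr[F(x)]
    return sum(p[s] * math.log(p[s] / q[s]) for s in support)
```

Under these assumptions the cost is zero when the predicted labels occur with the same frequencies as the reference labels, regardless of order, and grows as the two frequency profiles diverge.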
Senior teaches training the second neural network using output distributions corresponding to input audio features from the first neural network as target outputs of the second neural network (Col 8, Rows 3-6) and learning to generate outputs corresponding to the same input audio features with added noise that would match the target outputs (Col 8, Rows 3-16 and Col 12, Row 62 – Col 13, Row 6, training second neural network by classifying frames through forced alignment using an optimal boundary of distinct phones in an inputted sequence of phones (i.e., output probabilities associated with output phones generated by the first neural network) to generate respective labels of the phone).
However, Sutskever suggests that the labels utilized as target outputs in such learning (output probabilities associated with output phones generated by a pre-trained first neural network) may not always be available (p. 2, “while we do not have access to many labelled samples (x, y) ~ D, we often have access to large quantities of unlabeled samples from x ~ D and y ~ D”; Senior, Col 8, Rows 61-67, second neural network can be trained using output targets apart from the output distributions provided by first neural network such as a transcription of utterance and the phonetic representation for the transcription). 
Therefore, assuming that output distributions / labels from the first neural network are not always available, it would have been obvious to one ordinarily skilled in the art before the effective filing date of the invention to modify Senior to train the second neural network to determine or predict output labels corresponding to input audio data / speech waveform data without labeled / annotated data of the speech waveform data provided by the first neural network (Senior, Col 6, Rows 52-56, i.e., predicting a set of phoneme boundaries associated with speech waveform data) by formulating the loss / cost function for output targets as an ODM cost / loss function as taught by Sutskever in order to generate an ASR model (Sutskever, p. 3, 3.1 ODM Costs as generative models; “the objective is to find a conditional Pθ(y|x) / “Fθ(x)” so that Pθ(y) matches the marginal distribution y ~ D”) that aligns output distributions of the second neural network (i.e., y) to corresponding input phoneme sequences associated with text information (i.e., x, Senior, Col 8, Row 65 – Col 9, Row 7, using phonetic representation / phoneme sequences associated with text transcription of the utterance as output targets) in the absence of a direct supervised signal (Sutskever, p. 4, 3.2 The Dual Autoencoder).
Senior does not disclose using activation signals of forget gate of the LSTM RNN to determine the set of phoneme boundaries associated with speech waveform data and determine that a number of phoneme boundary refinements satisfies a threshold based on generating the ASR model.
Ward teaches an end to end phoneme recognition system generating a sequence of phonemes from speech waveform data corresponding to spoken words (Col 13, Row 66 – Col 14, Row 6) using activation signals of a forget gate of an LSTM RNN to determine a set of phoneme boundaries associated with the speech waveform data (Col 14, Rows 25-31 in view of Col 10, Rows 20-24 and Rows 50-54, RNN 804 replacing RNN 204 where RNN 804 is identical to RNN 204 and RNN 204 including LSTM comprising forget gate layer with neural network layer and sigmoid activation function and pointwise multiplication gate for determining which elements of input hidden state to preserve; Col 14, Rows 55-61, outputting probability that the audio input corresponds to phonemes; Col 15, Rows 4-16, map audio features to appropriate text phoneme). Further, the end to end phoneme recognition system generates or refines an ASR model based on determining that a number of phoneme boundary refinements satisfies a threshold (Col 15, Rows 36-46, performing iterative beam search where mappings / alignments of audio phonemes to text phonemes from prior iteration are used as a starting point and then changed to create multiple new mappings or alignments “candidates”; in view of Col 15, Rows 62-67, the candidates are scored and the n-best scoring candidates are selected for expansion at the next level of iterative beam search until a stopping condition is reached (i.e., when the number of matching phonemes between the audio phonemes and text phonemes does not change at the next iteration)) and performing post-processing of the ASR model based on determining that the number of phoneme boundary refinements satisfies the threshold (Col 24, Rows 54-65 and Fig. 14, customizing / optimizing a trained neural network to a custom domain; Col 25, Rows 53-61 and Col 26, Rows 50-55, training a general neural network initially skilled on a training set; i.e., iteratively trained until stopping condition is reached). 
It would have been obvious to one ordinarily skilled in the art before the effective filing date of the invention to modify Senior to train / refine the second neural network / ASR model by determining that a number of phoneme refinements satisfies a threshold based on generating / training the ASR model neural network and thereafter performing post-processing of the ASR model in order to train and customize a general ASR phoneme recognition neural network model (Ward, Col 26, Rows 50-55).
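For illustration only, one way forget gate activation signals could be turned into candidate phoneme boundaries is sketched below. Neither Senior nor Ward is cited here for this particular procedure: the function name `boundaries_from_forget_gate`, the per-frame averaging of gate activations into a single value, and the fixed `threshold` of 0.5 are assumptions. The sketch treats a dip in the sigmoid forget gate output (the cell discarding its stored state) as a sign that a new phoneme is starting.

```python
def boundaries_from_forget_gate(activations, threshold=0.5):
    """Return frame indices where the forget-gate activation first drops below threshold.

    `activations` is a hypothetical list of per-frame values in [0, 1],
    e.g. the mean sigmoid output of an LSTM forget gate for each frame.
    Only the first frame of each dip is reported, so a dip spanning
    several consecutive frames yields a single boundary.
    """
    boundaries = []
    below = False  # whether the previous frame was already below threshold
    for frame, value in enumerate(activations):
        if value < threshold and not below:
            boundaries.append(frame)
        below = value < threshold
    return boundaries


# Toy activation trace with two dips, at frames 2-3 and at frame 6.
example_trace = [0.9, 0.8, 0.2, 0.3, 0.9, 0.95, 0.1, 0.9]
example_boundaries = boundaries_from_forget_gate(example_trace)
```

A fixed threshold is the simplest choice; a practical system would more likely tune it or look for local minima instead.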
Regarding Claim 20, Senior discloses a non-transitory computer-readable medium (Col 15, Rows 1-4, memory device) storing instructions, the instructions comprising: one or more instructions that, when executed by one or more processors of a device, cause the one or more processors to perform the operations addressed above (Col 14, Rows 57-67, computer program instruction; Col 15, Rows 36-44, programmable processors).
Regarding Claims 4 and 14, Ward discloses determining another set of phoneme boundaries associated with the speech waveform data based on generating the ASR model and determining that the number of phoneme boundary refinements satisfies the threshold based on determining the another set of phoneme boundaries associated with the speech waveform data (Col 15, Rows 36-67, mappings / alignments of audio phonemes to text phonemes from a prior iteration are used as a starting point and then changed to create multiple new mappings and alignments until the number of matching phonemes between audio phonemes and text phonemes does not change at the next iteration). 
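For illustration only, the stopping condition described above (iterate until the number of matching phonemes no longer changes) can be sketched as a generic refinement loop. This is not Ward's beam search: it keeps a single candidate per iteration, and the names `refine_until_stable`, `refine_step`, `count_matches`, and `max_iters`, as well as the toy data, are assumptions.

```python
def refine_until_stable(alignment, refine_step, count_matches, max_iters=100):
    """Refine an alignment until the match count stops changing.

    Returns the final alignment and the number of iterations run.
    `refine_step` and `count_matches` are caller-supplied (hypothetical).
    """
    previous = count_matches(alignment)
    for iteration in range(1, max_iters + 1):
        candidate = refine_step(alignment)
        current = count_matches(candidate)
        if current == previous:
            # Stopping condition: no change in the number of matching phonemes.
            return alignment, iteration
        alignment, previous = candidate, current
    return alignment, max_iters


# Toy usage: correct one mismatched phoneme per iteration.
text_phonemes = ["k", "ae", "t"]


def fix_first_mismatch(audio_phonemes):
    fixed = list(audio_phonemes)
    for i, (a, t) in enumerate(zip(fixed, text_phonemes)):
        if a != t:
            fixed[i] = t
            break
    return fixed


def matches(audio_phonemes):
    return sum(a == t for a, t in zip(audio_phonemes, text_phonemes))


final_alignment, iterations = refine_until_stable(
    ["g", "ae", "d"], fix_first_mismatch, matches
)
```

In the toy run the match count goes from 1 to 2 to 3 and is unchanged on the third iteration, at which point the loop stops.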
Regarding Claims 6 and 16, Senior discloses identifying, by the device, a set of word sequences associated with the text information (Col 8, Rows 66-67, a transcription of the utterance can be accessed); and 
wherein determining, by the device, the set of phoneme sequences associated with the text information comprises: determining, by the device, the set of phoneme sequences based on the set of word sequences (Col 8, Row 67 – Col 9, Row 1, a phonetic representation for the transcription can be used to determine the CTC output targets).
Regarding Claims 8 and 18, Senior discloses wherein the speech waveform data is unlabeled (Col 6, Rows 63-66 in view of Col 5, Rows 9-11, unsupervised alignment of phones with audio data means audio data is unlabeled). 
Regarding Claim 10, Senior does not disclose training a language model using the text information; and wherein generating, by the device, the ASR model comprises: generating, by the device, the ASR model using the language model. 
Sutskever teaches training a language model using text information (p. 3-4, 3.1 ODM Costs as generative models, equations (5)-(6), where language model P(h) can be trained on labelled (x, y), whenever such data is available).
It would have been obvious to one ordinarily skilled in the art before the effective filing date of the invention to modify Senior to train a language model in order to generate the transcription for training the ASR model / second neural network (Senior, Col 8, Rows 66-67).
Regarding Claims 9 and 19, Ward discloses the device performs a beam search technique based on generating the ASR model to generate a set of refined phoneme boundaries (Col 15, Rows 36-47). 
Claims 7 and 17 are rejected under 35 USC 103(a) as being unpatentable over Senior et al. (US 9786270 B2) in view of Sutskever et al. (“Towards Principled Unsupervised Learning”) and Ward et al. (US 10210860 B1) as applied to claims 1 and 11, in further view of McCuller (US 2007/0083369 A1).
Senior does not disclose comparing, by the device, a set of n-gram frequency values associated with the text information and a set of phoneme frequency values associated with the speech waveform data; and wherein generating, by the device, the ASR model using the ODM technique comprises: generating, by the device, the ASR model using the ODM technique in association with the set of n-gram frequency values and the set of phoneme frequency values. 
McCuller teaches a device for converting phonemes into graphemes (i.e., ASR device) (Fig. 8) by comparing a set of n-gram frequency values associated with text information and a set of phoneme frequency values associated with speech waveform data (¶7, decompose a corpus into a sequence of words, to generate a plurality of n-grams of phonemes and a plurality of frequencies of occurrence using the sequence of words and a dictionary, such that each frequency of occurrence of words / n-grams corresponds to a respective pair of phonemes (that indicates the frequency of second phoneme in the pair following the first phoneme in the pair); thereafter, generate a phoneme tree using the n-grams of phonemes and a processor retrieves the phoneme tree to perform a random walk on the phoneme tree using the frequencies of occurrence to generate a sequence of phonemes, and maps the sequence of phonemes into a sequence of graphemes using the phoneme-to-grapheme lookup table).
It would have been obvious to one ordinarily skilled in the art before the effective filing date of the invention to implement ODM generation of the ASR model (Sutskever, pp. 3-4, P(h) being speech recognition model, see equations 5-7; p. 4, train neural network ASR model until distribution of G(z) is indistinguishable from target distribution) using the set of n-gram frequency values (frequency of occurrence of words) and the set of phoneme frequency values (frequency of occurrence of phonemes) to map the sequence of phonemes into a sequence of graphemes (McCuller, Abstract) until their distributions converge (Sutskever, p. 4, eventually, if GAN training is successful, G converges to a model such that the distribution G(z) is indistinguishable from the target distribution).
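For illustration only, the step of deriving phoneme n-gram frequencies from a text corpus and a pronunciation dictionary can be sketched as follows. This is not McCuller's implementation: the function name `phoneme_bigram_frequencies`, the restriction to bigrams, the normalization to relative frequencies, and the toy dictionary are assumptions, and words missing from the dictionary are simply skipped.

```python
from collections import Counter


def phoneme_bigram_frequencies(corpus, dictionary):
    """Relative frequency of each adjacent phoneme pair in a corpus.

    `corpus` is whitespace-separated text; `dictionary` maps a lowercase
    word to its list of phonemes (both hypothetical inputs).
    """
    phonemes = []
    for word in corpus.lower().split():
        # Words without a dictionary entry contribute no phonemes.
        phonemes.extend(dictionary.get(word, []))
    pairs = Counter(zip(phonemes, phonemes[1:]))
    total = sum(pairs.values())
    if total == 0:
        return {}
    return {pair: count / total for pair, count in pairs.items()}


toy_dictionary = {"cat": ["k", "ae", "t"], "at": ["ae", "t"]}
toy_frequencies = phoneme_bigram_frequencies("cat at cat", toy_dictionary)
```

A frequency table of this kind, computed once from text and once from recognized phonemes, is the sort of pair of distributions that a distribution matching cost could compare.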
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to examiner Richard Z. Zhu whose telephone number is 571-270-1587 or examiner’s supervisor King Poon whose telephone number is 571-272-7440. Examiner Richard Zhu can normally be reached on M-Th, 0730-1700.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or 
/RICHARD Z ZHU/
Primary Examiner, Art Unit 2675
04/06/2021