Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Claims 1-20 are pending. Claims 1 and 11 are independent.
Subsequent to the Examiner’s Amendment below, Claims 1-5, 7-15, and 17-20 are pending and allowed.
This Application is published as U.S. 20210005184.
Apparent priority 31 December 2015.
This Application is a continuation of 16/258,309, issued as U.S. 10,803,855, which is a continuation of 15/397,327, issued as U.S. 10,229,672.  A Terminal Disclaimer over the terms of both parents was electronically filed on 1/13/2021.
Examiner’s Amendments
Authorization for this examiner’s amendment was granted in an interview with Mr. Brett Krueger on 1/13/2021.
Cancel Claims 6 and 16.
Amend independent Claims 1 and 11 as follows:
1. A method comprising: 
receiving, at data processing hardware of a speech recognition system, audio data for a portion of an utterance spoken by a user; 
dividing, by the data processing hardware, the audio data for the portion of the utterance into a sequence of fixed-length frames; 
for each frame in the sequence of fixed-length frames, determining, by the data processing hardware, a corresponding set of log-Mel frequency cepstral coefficients; 

generating, by the data processing hardware, using an acoustic model configured to receive the corresponding set of log-Mel frequency cepstral coefficients generated for each frame in the sequence of fixed-length frames as input, a sequence of context-dependent states representing the portion of the utterance; and 
generating, by the data processing hardware, using a language model trained separately from the acoustic model, a streaming speech recognition result for the portion of the utterance spoken by the user, 
wherein an acoustic model training process trains the acoustic model by:
obtaining acoustic model training data that comprises training audio data and word-level transcriptions for the training audio data; 
training a first neural network model on the acoustic model training data to generate phonetic sequences corresponding to the word-level transcriptions; 
generating, using the trained neural network model, a context-dependent state inventory from phonetic alignments between the training audio data and the phonetic sequences corresponding to word-level transcriptions; and 
training, using the context-dependent state inventory, a second neural network model to generate outputs corresponding to one or more context-dependent states, the second neural network model having a plurality of long short-term memory layers,
wherein the trained second neural network model comprises the acoustic model. 

11. A system comprising: 
data processing hardware; 
memory hardware in communication with the data processing hardware and storing instructions that when executed by the data processing hardware cause the data processing hardware to perform operations comprising: 
receiving audio data for a portion of an utterance spoken by a user; 
dividing the audio data for the portion of the utterance into a sequence of fixed-length frames; 
for each frame in the sequence of fixed-length frames, determining a corresponding set of log-Mel frequency cepstral coefficients; 
generating, using an acoustic model configured to receive the corresponding set of log-Mel frequency cepstral coefficients generated for each frame in the sequence of fixed-length frames as input, a sequence of context-dependent states representing the portion of the utterance, and 
generating, using a language model trained separately from the acoustic model, a streaming speech recognition result for the portion of the utterance spoken by the user, 
wherein an acoustic model training process trains the acoustic model by: 
obtaining acoustic model training data that comprises training audio data and word-level transcriptions for the training audio data; 
training a first neural network model on the acoustic model training data to generate phonetic sequences corresponding to the word-level transcriptions; 
generating, using the trained neural network model, a context-dependent state inventory from phonetic alignments between the training audio data and the phonetic sequences corresponding to word-level transcriptions; and 
training, using the context-dependent state inventory, a second neural network model to generate outputs corresponding to one or more context-dependent states, the second neural network model having a plurality of long short-term memory layers, 
wherein the trained second neural network model comprises the acoustic model. 
Allowable Subject Matter
Subject to the Examiner’s Amendment above, the pending Claims are allowed.
The following is an examiner’s statement of reasons for allowance: In view of each of the particular limitations of the independent Claims when considered in the order established by the Claim language and in the context of the language of the independent Claims when each Claim is considered as a whole, the independent Claims of this Application were not found in the prior art that was viewed.
In particular, the particular two-tier method of training of an acoustic model that is claimed, together with all its particulars such as using word-level transcriptions and an LSTM second-level model, and particularly the intermediate-level context-dependent alignment generated by the first model to be used for the training of the second model, is not found in the prior art:
wherein an acoustic model training process trains the acoustic model by: 
obtaining acoustic model training data that comprises training audio data and word-level transcriptions for the training audio data; 
training a first neural network model on the acoustic model training data to generate phonetic sequences corresponding to the word-level transcriptions; 
generating, using the trained neural network model, a context-dependent state inventory from phonetic alignments between the training audio data and the phonetic sequences corresponding to word-level transcriptions; and 
training, using the context-dependent state inventory, a second neural network model to generate outputs corresponding to one or more context-dependent states, the second neural network model having a plurality of long short-term memory layers,
wherein the trained second neural network model comprises the acoustic model.
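
For orientation only, and not as a construction of the claim language or a description of the applicant's implementation, the step of deriving a context-dependent state inventory from phonetic alignments can be pictured with the minimal sketch below. It assumes that each training utterance has already been aligned into (phone, frame count) pairs, that context is expressed as triphone labels, and that a simple frame-count threshold decides which labels are kept. The function names, the "sil" padding symbol, the label format, and the threshold are all hypothetical; practical systems commonly use decision-tree clustering instead of a threshold.

```python
from collections import Counter


def triphone_labels(phones):
    """Label each phone with its left and right neighbour ("sil" at the edges)."""
    padded = ["sil"] + list(phones) + ["sil"]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


def build_state_inventory(aligned_utterances, min_count=1):
    """Map each sufficiently frequent triphone label to an integer state id.

    aligned_utterances: list of alignments, each a list of (phone, n_frames).
    min_count: hypothetical threshold on the number of aligned frames.
    """
    counts = Counter()
    for alignment in aligned_utterances:
        phones = [phone for phone, _ in alignment]
        for label, (_, n_frames) in zip(triphone_labels(phones), alignment):
            counts[label] += n_frames
    kept = sorted(label for label, c in counts.items() if c >= min_count)
    return {label: idx for idx, label in enumerate(kept)}


# Toy alignments (invented for illustration only).
alignments = [
    [("k", 3), ("ae", 5), ("t", 2)],
    [("k", 2), ("ae", 4), ("b", 3)],
]
print(build_state_inventory(alignments, min_count=3))
```

In this sketch the resulting label-to-id map plays the role of the targets on which a second model would be trained; the training itself is outside the scope of the illustration.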

Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee. Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance.”
Close Art of Record
Graves (U.S. 9,263,036) (same assignee) and Graves et al., “Connectionist Temporal Classification: Labeling Unsegmented Sequence Data with Recurrent Neural Networks,” Proceedings of the 23rd International Conference on Machine Learning, June 25, 2006, provide the foundation upon which the method of the instant Application is based, i.e., labeling unsegmented sequence data, such as audio data, with an RNN without using pre-segmented and aligned training data.  The instant Application uses the result of this first round of training to generate labeled and aligned text and audio data and uses that data to train a second LSTM RNN acoustic model.  The instant Claims then use the second, better-trained acoustic model for speech recognition.
Van Kommer (U.S. 6,799,171) is the closest reference but is not express on generating an alignment that is used to train the next stage.  Deng (U.S. 8,972,253) is on point with respect to creating an alignment as a result of training a GMM-HMM model and using this alignment to train a DBN (neural network) model, but the first phase uses an HMM rather than a neural network and falls within the prior art rejected by Graves et al.
References that use a two-tier method of training acoustic models generally train a general acoustic model and then train that model to be applicable to the voice of a particular speaker, which is different from the training method that is claimed.
As an example of back to back acoustic model training, see Li (U.S. 20180254034) which is directed to the training of multiple acoustic models.  But Li trains a general acoustic model and then uses this model to train a speaker-dependent acoustic model trained for the voice of a specific speaker.  “A training method for multiple personalized acoustic models ….”  Abstract.  Li teaches “obtaining acoustic model training data that comprises training audio data and (word-level) transcriptions for the training audio data” in Figures 1-2 and 5-6.  Figure 1, S11:  “… The method comprises: training a reference acoustic model, based on first acoustic feature data of training voice data and first text annotation data corresponding to the training voice data (S11) …”  Abstract.  “[0038] In step S11, a reference acoustic model is trained based on first acoustic feature data of training speech data and first text annotation data corresponding to the training speech data.”  For the training data and transcriptions see [0039].  Li does not specify that its annotation in the training data is word-level but in the speech synthesis stages, the segmentation is word level segmentation which suggests that the acoustic models are trained at word-level.  See Figure 3, S302.  Li also teaches “training a first neural network model on the acoustic model training data to generate phonetic sequences corresponding to the (word-level) transcriptions,” in  Figure 1, S11: A reference acoustic model is trained based on the training data.  “[0042] After the first acoustic feature data of the training speech data and the first text annotation data corresponding to the training speech data are obtained, training can be performed on the first acoustic feature data and the first text annotation data via a neural network, and the reference acoustic model is generated according a result of the training.”  The “reference acoustic model” is a neural network model.  
Figures 5-6, “First Model Training Module 110.”  “[0098] In detail, the first model training module 110 is configured to train a reference acoustic model based on first acoustic feature data of training speech data and first text annotation data corresponding to the training speech data.”  Li also teaches “training, using the context-dependent state inventory, a second neural network model to generate outputs corresponding to one or more context-dependent states, the second neural network model having a plurality of long short-term memory layers” in Figures 5-6, the “second model training module 130” is trained from the “first model” / “reference acoustic model” and is an LSTM neural network model.  “[0104] In detail, after the obtaining module 120 obtains the speech data of the target user, based on the reference acoustic model, the second model training module 130 can train the first target user acoustic model using the speech data of the target user and via an adaptive technology (for example, via a long short-term memory (LSTM for short) neural network structure or a bidirectional LSTM neural network structure), such that the reference acoustic model is adaptively updated to the first target user acoustic model.”
Li does not teach “generating, using the trained neural network model, a context-dependent state inventory from phonetic alignments between the training audio data and the phonetic sequences corresponding to word-level transcriptions” or the use of this alignment for the training of the second acoustic model.

With respect to Obviousness Double Patenting, note that Senior (U.S. 9,786,270), Senior (U.S. 10,733,979), Bacchiani (U.S. 9,620,145), Dai (U.S. 10,528,866), and Schalkwyk (U.S. 9,728,185), all assigned to the same assignee, were also evaluated but ODP was not found.

Note application of art to the top portions of the Claim (the portion different from parents):
Kanda (U.S. 20180204566) teaches:
1. A method comprising: 
receiving, at data processing hardware of a speech recognition system, audio data for a portion of an utterance spoken by a user; [Kanda, Figure 7, “Input Speech 282.”  “[0050] Referring to FIG. 7, a speech recognition device 280 in accordance with the present embodiment has a function of performing speech recognition of an input speech 282 and outputting a text 284 of speech recognition….”]
dividing, by the data processing hardware, the audio data for the portion of the utterance into a sequence of fixed-length frames; [Kanda, Figure 7, “Framing Unit 302.”  “[0050] …a framing unit 302 for dividing the digitized speech signal output from A/D converter circuit 300 into frames with a prescribed length and prescribed shift length allowing partial overlapping ….”]
for each frame in the sequence of fixed-length frames, determining, by the data processing hardware, a corresponding set of log-Mel frequency cepstral coefficients; [Kanda, Figure 7, “Feature Extracting Unit 304.”  “[0050] …a feature extracting unit 304 performing a prescribed acoustic process on each of the frames output by framing unit 302, thereby extracting speech features of each frame and outputting a feature vector. Each frame and each feature vector have information such as relative time, for example, with respect to the head of input speech 282. The features used may include MFCCs (Mel-Frequency Cepstrum Coefficients), its first order differential, second order differential, power and so forth.”]
generating, by the data processing hardware, using an acoustic model configured to receive the corresponding set of log-Mel frequency cepstral coefficients generated for each frame in the sequence of fixed-length frames as input, a sequence of context-dependent states representing the portion of the utterance; and [Kanda, Figure 7, “Acoustic Model (RNN) 308.”  “[0051] …an acoustic model 308 implemented by a RNN, receiving as an input a feature vector stored in feature storage unit 306 and for outputting a vector representing for each phoneme posterior probabilities of each frame at each time point corresponding to the phonemes ….”  The sequence of phonemes in a frame teaches a sequence of context-dependent states:  “…an acoustic model 308 implemented by a RNN (recurrent neural network) for calculating, for each state sequence, the posterior probability of a state sequence in response to an observed sequence consisting of prescribed speech features obtained from a speech….”  Abstract.  For generating a sequence of states see also [0034].  For context-dependency see [0054] and [0055].]
generating, by the data processing hardware, using a language model trained separately from the acoustic model, a streaming speech recognition result for the portion of the utterance spoken by the user, [Kanda, Figure 7, “Decoder 310” is a “language model” because it generates the speech recognition result in the form of the “Text of Speech Recognition 284.”  “[0051] … a decoder 310 implemented by WFST (Weighted Finite-State Transducer), referred to as S.sup.-1HCLG in the present specification as will be described later, for outputting, using the vectors output from acoustic model 308, a word sequence having the highest probability as a text 284 of speech recognition corresponding to the input speech 282, by means of WFST….”  The acoustic and WFST models of Kanda are trained separately.  See [0055].]
wherein an acoustic model training process trains the acoustic model by: [Kanda teaches the use of pre-trained acoustic and language models and therefore does not teach the training process.]
obtaining acoustic model training data that comprises training audio data and word-level transcriptions for the training audio data; [Kanda uses pre-trained models and does not describe the process of training but does teach that its models generate word sequences (see claim 1 of Kanda) and therefore must be trained on word-level data.]
training a first neural network model on the acoustic model training data to generate phonetic sequences corresponding to the word-level transcriptions; [Kanda uses pre-trained models and does not describe the process of training but does teach that its models generate word-sequence transcriptions (see claim 1 of Kanda).  “… performing speech recognition of the speech signal based on a score calculated for each hypothesis of a word sequence corresponding to the speech signal…”  Abstract.]
generating, using the trained neural network model, a context-dependent state inventory from phonetic alignments between the training audio data and the phonetic sequences corresponding to word-level transcriptions; and [Kanda uses pre-trained models and does not describe the process of training but does teach that its models are context dependent with a triphone context level: “[0054] …Further, recently, a phoneme-based triphone HMM comes to be used for representing phoneme context, and it can also be represented by WFST….”]
training, using the context-dependent state inventory, a second neural network model to generate outputs corresponding to one or more context-dependent states, the second neural network model having a plurality of long short-term memory layers. [The second model of Kanda that generates the transcription is a WFST finite state automaton and is not taught to be a second neural network.]

Kanda teaches the use of pre-trained acoustic and language models and therefore does not teach the training process.
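
As a point of reference for the framing limitation discussed above (frames of a prescribed length taken at a prescribed shift, allowing partial overlap), the following is a minimal sketch and is not taken from Kanda or from the instant Application. The function name, the parameter names, and the choice to drop a trailing partial frame are assumptions made for illustration.

```python
def frame_signal(samples, frame_length, frame_shift):
    """Split samples into fixed-length frames taken every frame_shift samples.

    Frames overlap whenever frame_shift < frame_length. A trailing partial
    frame is dropped (an assumption; padding is another common choice).
    """
    if frame_length <= 0 or frame_shift <= 0:
        raise ValueError("frame_length and frame_shift must be positive")
    frames = []
    start = 0
    while start + frame_length <= len(samples):
        frames.append(list(samples[start:start + frame_length]))
        start += frame_shift
    return frames


# Ten toy samples, frames of four samples, shifted by two samples.
for frame in frame_signal(list(range(10)), frame_length=4, frame_shift=2):
    print(frame)
```

Each resulting frame would then be passed to a feature extraction stage; that stage is not sketched here.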

Goldenthal (U.S. 5625749) teaches:
1. A method comprising: 
receiving, at data processing hardware of a speech recognition system, audio data for a portion of an utterance spoken by a user; [Goldenthal, Figure 1, continuous “Speech Signal 12” being input to the “Signal Processing 14.”  Figure 2, “speech waveform 12a” input to the “signal processor 16.”]
dividing, by the data processing hardware, the audio data for the portion of the utterance into a sequence of fixed-length frames; [Goldenthal, Figure 1, “Signal Representation 18” output from the “Signal Processing 14” are “observation vectors 18” each of which is a “Frame of speech.”  Col. 1, lines 30-45:  “A block diagram of the major components of a typical ASR system 10 is shown in FIG. 1. Typically, the samples of the continuous speech signal 12 are first processed by a signal processor 14 to form a discreet sequence of observation vectors 18. … Each observation vector 18 is called a frame of speech ….”  Title:  “Segment-based apparatus and method for speech recognition by analyzing multiple speech unit frames and modeling both temporal and spatial correlation.”  “3. The digitized samples are blocked or rectangularly windowed into frames. The frames are typically on the order of 25 or 30 ms.”  Col. 6, lines 35-37.]
for each frame in the sequence of fixed-length frames, determining, by the data processing hardware, a corresponding set of log-Mel frequency cepstral coefficients; [Goldenthal, Figure 2 shows the inventive aspects of Goldenthal over Figure 1 and the “signal representation 18a” in Figure 2 (similar to observation vectors 18 of Figure 1) are generated as MFCCs: “Turning now to the particulars of the present operates in an automatic speech recognition system 40 such as that depicted in FIG. 2 (and similar to that of FIG. 1). As noted earlier, the continuous speech (input) signal is digitally sampled and then processed via a temporal and/or spectral analysis into a sequence of observation frames. In the preferred embodiment, the input signal 12a is preprocessed by signal preprocessor 16 (FIG. 2) as follows. The signal representation 18a to be generated and used throughout the present invention consists of the Mel-frequency cepstral coefficients (MFCC's) …. These coefficients are based on the short-time Fourier transform of the speech signal 12a. The cepstrals provide a high degree of data reduction over using values of the power spectral density directly, since the power spectrum at each frame is represented using relatively few parameters.”  Col. 6, lines 9-29.]
generating, by the data processing hardware, using an acoustic model configured to receive the corresponding set of log-Mel frequency cepstral coefficients generated for each frame in the sequence of fixed-length frames as input, a sequence of context-dependent states representing the portion of the utterance; and [Goldenthal, Figure 2, the “Acoustic Attributes (Signal Representations 18s)” which are in MFCC form and represent the frames are provided to the “Acoustic-Phonetic Models 30” and the acoustic model 30 generates a set of states that represent the phonetic recognition of the frame:  “Phonetic recognition methods tend to fall into two categories. The first, and most widely used, is "frame" based. … An example of a frame-based phonetic recognition method is the Hidden Markov Models (HMM's). HMM's consists of a set of states connected to each other via transition probabilities. …”  Col. 2, 17-31.  This is from the Background but the method of the invention of Goldenthal also generates context-dependent states from the phonemes of the speech:  “Tracks …  are computed from training data by mapping the training tokens for each phone to a sequence of M states. Each state is recorded as a vector, the sequence of vectors forming the track. The mapping function is known as a generation function f. When all the tokens in the training set for a particular phone have been mapped, the phone-dependent track is calculated from the maximum likelihood estimate of each state.” Col. 8, lines 17-24.]
generating, by the data processing hardware, using a language model trained separately from the acoustic model, a streaming speech recognition result for the portion of the utterance spoken by the user, [Goldenthal, Figure 1, “language model 22.”  “Acoustic and language models 20, 22 are then used to score the frame sequence O, search a lexicon and hypothesize word sequences.”  Col. 1, lines 41-44.  Figure 2 also shows the “language model” but does not discuss it because the focus in Goldenthal is its innovative acoustic model.  Goldenthal discusses the training of the acoustic models 30 but not the language model which is mentioned only once.  Accordingly, the “language model 22” or the language model of Figure 2 are trained separately from the “acoustic models 20, 30.”]
wherein an acoustic model training process trains the acoustic model by: [Goldenthal, “The acoustic models are generally trained to recognize some set of phones (the exact set being a design decision).”  Col. 2, 7-8.  Goldenthal refers to “Further, other methods for phonetic recognition include template-based approaches, statistical approaches and more recently approaches based on dynamic modeling and neural networks. A recursive error propagation neural network approach has been used with the TIMIT speech corpus. See T. Robinson, "Several Improvements to a Recurrent Error Propagation Phone Recognition System", Technical Report CUED/TINFENG/TR. 82, 1991. An inherent drawback of neural networks is a large amount of time needed to train the models.”  Col. 3, 40-50.  ]
obtaining acoustic model training data that comprises training audio data and word-level transcriptions for the training audio data; 
training a first neural network model on the acoustic model training data to generate phonetic sequences corresponding to the word-level transcriptions; 
generating, using the trained neural network model, a context-dependent state inventory from phonetic alignments between the training audio data and the phonetic sequences corresponding to word-level transcriptions; and 
training, using the context-dependent state inventory, a second neural network model to generate outputs corresponding to one or more context-dependent states, the second neural network model having a plurality of long short-term memory layers. 

Robinson, Figure 1, teaches that the “Recurrent Error Propagation Network” is a RNN for recognition of u(t) which consists of frames of pre-processed speech.  P. 2.

Regarding Claim 2, Kanda teaches that:  “[0005] … Conventionally, it is assumed that a word sequence 30 (word sequence W) is influenced by various noises and observed as an observed sequence 36 (observed sequence X), and a word sequence that is expected to have the highest likelihood of generating the finally observed sequence X is output as a result of speech recognition. …”  But it does not teach perturbation by noise during the training process.
Perez-Mendez (U.S. 5,754,978) teaches:
2. The method of claim 1, wherein the acoustic model training process further trains the acoustic model by synthetically distorting, using a room simulator, the training audio data with noise obtained from various noise sources. [Perez-Mendez teaches in its Background that applying different amounts of noise to the training speech in order to simulate various noise environments is known in the art as of the date of Perez Mendez which is 1995.  “Previously, others have employed certain aspects of parallel processing into speech recognition techniques, as opposed to rejection methods. Particularly, some have focused on improving speech recognition in adverse environments. For example, U.S. Pat. No. 5,182,765 (Ishii) discloses a speech registration/recognition system in which, as part of the registration process, the speech input signal is stored as recognition data. Subsequently, after registration and as part of the recognition process, the speech input signal is compared to the stored recognition data. In the embodiment shown in FIG. 8 of the '765 patent, and described at column 7, line 19 through column 8, line 42 thereof, a plurality of parallel speech recognition circuits each receive and store slightly altered versions of the input speech as part of the registration process. The intent is to improve the ability to recognize speech in adverse speech environments, such as high noise, etc. This is done by recording or registering speech or "training data" from many different environments (different types of background noise, for example). These slightly altered versions are created by different electrical characteristics in each of the variable characteristic circuits associated with each of the speech recognition circuits….”  Col. 2, 47 to Col. 3, 10.  Figures 2 and 6 of Perez-Mendez teach “signal perturbations” which is adding different amounts of noise to the training speech.  “A perturbation may be applied directly to any of the given speech representations. 
When the values in the speech representation refer to physical values, such as in the original digitized signal, a perturbation may be applied directly by adding small values to the signal values. In the original digitized signal this may take the form of adding a "one" to every sample value, or choosing a "one" or "zero" randomly to add to every sample value. This may be considered to represent to represent a small amount of noise. Higher levels of speech representation, those closer to the final output, are more difficult to perturb directly. The sequence of vector quantized values could be perturbed by selecting alternate prototype vectors, say the second nearest, for every frame or for every other frame, etc.”  Col. 7, 1-3.]
Perez-Mendez does not teach using room simulation to generate the noisy speech signals.
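
The direct perturbation quoted above from Perez-Mendez (adding a "one" to every sample value, or choosing a "one" or "zero" randomly to add to every sample value) can be pictured with the minimal sketch below. This is an illustration only; the function name, the mode names, and the seed parameter are hypothetical and do not come from the reference.

```python
import random


def perturb_samples(samples, mode="constant", seed=None):
    """Apply a small perturbation to integer sample values.

    mode="constant": add 1 to every sample.
    mode="random":   add 0 or 1, chosen at random, to every sample.
    """
    if mode == "constant":
        return [s + 1 for s in samples]
    if mode == "random":
        rng = random.Random(seed)  # seeded only so the sketch is repeatable
        return [s + rng.randint(0, 1) for s in samples]
    raise ValueError(f"unknown mode: {mode!r}")


digitized = [10, -3, 0, 7]
print(perturb_samples(digitized))                    # constant offset
print(perturb_samples(digitized, "random", seed=0))  # random 0/1 offsets
```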
Kern (U.S. 7,043,427) teaches:
wherein the acoustic model training process further trains the acoustic model by synthetically distorting, using a room simulator, the training audio data with noise obtained from various noise sources. [Kern Figure 1 shows the “correction unit 15, for example simulates room reverberation and/or sound reflections from nearby objects within the speech transmission path. Acoustic reflections of this sort can for example, originate from a desktop, a display screen, or from other objects.”  Col. 3, 6-10. And “In operation of the apparatus shown in FIG. 1, during a training speech samples are stored in the data processing device 17. Which could be used, for example, to construct a personal telephone directory.”  Col. 3, lines 31-35.]

For Claim 2, Room Simulators:
Rumsey (U.S. 20090238370) teaches the use of “Real or Simulated Room Acoustics” ([0142]-[0146]) or the combination of both approaches to obtain the various acoustics under which a listening device must operate.
Opitz (U.S. 5,544,249) discloses a method of simulating a room and/or sound impression.
Flanagan (U.S. 5,737,485) teaches that “In addition to the speech source, inputs to the room simulation may include a competing noise source of variable intensity, to produce different signal-to-competing-noise ratios (SCNR's).”
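
The notion of a competing noise source of variable intensity producing different signal-to-noise ratios can be illustrated, under stated assumptions, by scaling a noise signal so that the mixture reaches a target ratio in decibels. The sketch below is not drawn from Flanagan or any other cited reference; it omits reverberation and room geometry entirely, and the function name and parameters are hypothetical.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Return (mixture, gain) with noise scaled to the requested SNR in dB.

    Power is taken as the mean squared sample value. Both inputs must have
    the same length and non-zero power.
    """
    if len(speech) != len(noise) or not speech:
        raise ValueError("speech and noise must be non-empty and equal length")
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if speech_power == 0 or noise_power == 0:
        raise ValueError("speech and noise must have non-zero power")
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    mixture = [s + gain * n for s, n in zip(speech, noise)]
    return mixture, gain


speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
for snr_db in (0, 10, 20):
    _, gain = mix_at_snr(speech, noise, snr_db)
    print(snr_db, round(gain, 4))
```

Sweeping the target ratio over several values yields multiple distorted copies of the same utterance, which is the general idea behind multi-condition training data.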

Regarding Claim 3, Kanda teaches:
3. The method of claim 1, wherein the context-dependent states comprise labels for triphones or scores for labels for triphones. [Kanda teaches that its WFST model 320 is trained for triphones which teaches that the labels are generated for triphones.  “[0054] Various models are used in speech recognition. HMM, a word pronunciation dictionary and a language model can all be represented by WFST. Further, recently, a phoneme-based triphone HMM comes to be used for representing phoneme context, and it can also be represented by WFST….”  “[0055] WFST involves an operation referred to as "composition." Composition of two WFSTs enables processing of tasks that otherwise require application of two successive WFSTs, by one composed WFST. Therefore, it is possible to compose WFSTs for the HMM, the word pronunciation dictionary, the language model and the triphone HMM to one WFST. Decoder 310 uses such a pre-trained and composed WFST. The WFST used here is a graph built in advance by language knowledge, and it employs a knowledge source referred to as HCLG. HCLG stands for a composition of four WFSTs (H, C, L, G). H stands for HMM, C context, L lexicon and G grammar….”]

Regarding Claim 4, Erdogan (U.S. 20160111107) teaches:
4. The method of claim 1, wherein the acoustic model training process uses maximum mutual information (MMI) to train the second neural model to generate the outputs corresponding to the one or more context-dependent states. [Erdogan “[0032] The joint objective function is a weighted sum of enhancement and recognition task objective functions. For the enhancement task, the objective function can be mask approximation (MA), magnitude spectrum approximation (MSA) or phase-sensitive spectrum approximation (PSA). For the recognition task, the objective function can simply be a cross-entropy cost function using states or phones as the target classes or possibly a sequence discriminative objective function such as minimum phone error (MPE), boosted maximum mutual information (BMMI) that are calculated using a hypothesis lattice.”  “2. The method of claim 1, wherein the enhancement network is a Deep Recurrent Neural Network (DRNN).”]

Regarding Claim 5, Kanda teaches:
5. The method of claim 1, wherein the acoustic model training process uses alignment data indicating alignments between the training audio data and the word-level transcriptions for the training audio data when training the second neural network model. [Kanda mentions that the training is done by using an aligned corpus:  “[0035] … In Equation (6), P(xt) is common to each HMM state and, therefore, it is negligible in arg max operation. P(st) can be estimated by counting the number of each state in aligned training data….”  This pertains to the DNN of Figure 3 nevertheless teaches that training data includes aligned audio and text.  The recognition is word-level which suggests that the training data should be also.  See [0005].]
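
The statement quoted from Kanda [0035], that P(st) can be estimated by counting the number of each state in aligned training data, corresponds to a relative-frequency estimate. A minimal sketch follows, assuming the aligned data is available as one list of state ids per utterance with one id per frame; the function name and the data layout are hypothetical.

```python
from collections import Counter


def estimate_state_priors(aligned_state_sequences):
    """Estimate P(state) as the fraction of aligned frames carrying that state."""
    counts = Counter()
    for sequence in aligned_state_sequences:
        counts.update(sequence)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no aligned frames were provided")
    return {state: count / total for state, count in counts.items()}


# Two toy utterances, one state id per frame (invented for illustration).
aligned = [[0, 0, 1], [1, 2, 2, 2, 0]]
print(estimate_state_priors(aligned))
```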

Regarding Claim 7, Kanda teaches:
7. The method of claim 1, wherein generating the sequence of context-dependent states representing the portion of the utterance comprises generating output values indicating likelihoods corresponding to different context-dependent states. [Kanda, Background.  This is the definition of speech recognition:  “[0005] Basic concept of speech recognition of a conventional speech recognition device will be described with reference to FIG. 1. Conventionally, it is assumed that a word sequence 30 (word sequence W) is influenced by various noises and observed as an observed sequence 36 (observed sequence X), and a word sequence that is expected to have the highest likelihood of generating the finally observed sequence X is output as a result of speech recognition. Let P(W) represent the probability of a word sequence W being generated. Further, let P(S|W) represent the probability of a state sequence S (state sequence 34) of HMM being generated from the word sequence W through a phoneme sequence 32 as an intermediate product. Further, let P(X|S) represent the probability of observed X being obtained from the state sequence S.”]
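
The three probabilities named in the quoted passage are conventionally combined as shown below. The equation is the standard textbook formulation and is offered as a sketch only; the quoted portion of Kanda names the terms but the equation itself is not quoted here, and the factorization assumes that the observation X depends on the word sequence W only through the state sequence S.

```latex
\hat{W} = \arg\max_{W} P(W \mid X)
        = \arg\max_{W} P(X \mid W)\, P(W),
\qquad
P(X \mid W) = \sum_{S} P(X \mid S)\, P(S \mid W)
```

In practice the sum over S is often approximated by its largest term.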

Regarding Claim 8, Kanda teaches:
8. The method of claim 1, further comprising, while generating the streaming speech recognition result for the portion of the utterance, concurrently receiving, at the data processing hardware, audio data for an additional portion of the utterance spoken by the user. [Kanda, Figure 7: the process from “Input Speech 282” to “Text of Speech Recognition 284” is continuous and does not stop as long as speech is coming in, so additional audio data continues to be received while recognition proceeds.]

Regarding Claim 9, Kanda does not teach recording prior to recognition, other than whatever buffering is inherent in the process.
Li teaches:
9. The method of claim 1, wherein the audio data is recorded by a user device associated with the user. [Li uses pre-recorded data for training:  “[0039] In detail, in order to make the trained reference acoustic model have a well ability of phone coverage and prosody coverage and can describe a variety of speech phenomena, a certain number of recording text corpuses can be pre-designed. Then, appropriate speakers are selected to obtain larger-scale training speech data of a non-target speaker, the first acoustic feature data of training speech data is extracted, and the recording text corpuses corresponding to the training speech data are annotated to obtain first text annotation data corresponding to the training speech data.”]

Regarding Claim 10, Kanda teaches:
10. The method of claim 1, wherein training the first neural network model comprises training the first neural network model to recognize multiple different pronunciations for a word in the word-level transcriptions using a pronunciation model that indicates multiple different phonetic sequences as valid pronunciations of the word. [Kanda teaches the use of a “word pronunciation dictionary” which generally includes variations on the pronunciation of a word but does not expressly teach this.]
Speciner (U.S. 11,062,615):  “…In step 3895, optionally wherein the pronunciation dictionary supports multiple dialects.”  Col. 55, lines 56-58.
Claims 11-20 are system claims with limitations similar to the limitations of method Claims 1-10.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to FARIBA SIRJANI whose telephone number is (571)270-1499.  The examiner can normally be reached on 9 to 5, M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Desir, can be reached on 571-272-7799.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Fariba Sirjani/
Primary Examiner, Art Unit 2659