Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Claims 1-20 are pending. Claims 1 and 11 are independent.
This Application is published as U.S. 2022-0262350.
Apparent priority 31 December 2015.
This Application is a continuation of 17/022,224 issued as U.S. 11,341,958 which is a continuation of 16/258,309 issued as U.S. 10,803,855 which is a continuation of 15/397,327 issued as U.S. 10,229,672.
A Terminal Disclaimer over the terms of all of the parent patents is required.
Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP §§ 706.02(l)(1) - 706.02(l)(3) for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/process/file/efs/guidance/eTD-info-I.jsp.
Claims 1-20 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims of U.S. Patent No. 11,341,958 as shown below. A claim-by-claim comparison is not set out for the other parent patents, but obviousness-type double patenting over them can be readily observed.
Claim 1 of the Instant Application compared, limitation by limitation, with claim 1 of Reference U.S. 11341958:

Instant Application: 1. A computer-implemented method when executed on data processing hardware causes the data processing hardware to perform operations comprising:
Reference U.S. 11341958: 1. A method comprising:

Instant Application: for each frame in a sequence of frames divided from audio data for a portion of a spoken utterance, determining a corresponding set of log-Mel frequency cepstral coefficients;
Reference U.S. 11341958: receiving, at data processing hardware of a speech recognition system, audio data for a portion of an utterance spoken by a user; dividing, by the data processing hardware, the audio data for the portion of the utterance into a sequence of fixed-length frames; for each frame in the sequence of fixed-length frames, determining, by the data processing hardware, a corresponding set of log-Mel frequency cepstral coefficients;

Instant Application: generating, using an acoustic model configured to receive the corresponding set of log-Mel frequency cepstral coefficients generated for each frame in the sequence of frames as input, a sequence of context-dependent states representing the portion of the spoken utterance; and
Reference U.S. 11341958: generating, by the data processing hardware, using an acoustic model configured to receive the corresponding set of log-Mel frequency cepstral coefficients generated for each frame in the sequence of fixed-length frames as input, a sequence of context-dependent states representing the portion of the utterance; and

Instant Application: generating, using a language model trained separately from the acoustic model, a streaming speech recognition result for the portion of the spoken utterance,
Reference U.S. 11341958: generating, by the data processing hardware, using a language model trained separately from the acoustic model, a streaming speech recognition result for the portion of the utterance spoken by the user,

Instant Application: wherein an acoustic model training process trains the acoustic model by:
Reference U.S. 11341958: wherein an acoustic model training process trains the acoustic model by:

Instant Application: obtaining acoustic model training data that comprises training audio data and word-level transcriptions for the training audio data;
Reference U.S. 11341958: obtaining acoustic model training data that comprises training audio data and word-level transcriptions for the training audio data;

Instant Application: training a first neural network model on the acoustic model training data to generate phonetic sequences corresponding to the word-level transcriptions;
Reference U.S. 11341958: training a first neural network model on the acoustic model training data to generate phonetic sequences corresponding to the word-level transcriptions;

Instant Application: generating, using the trained first neural network model, alignment data indicating phonetic alignments between the training audio data and the phonetic sequences corresponding to word-level transcriptions; and
Reference U.S. 11341958: generating, using the trained neural network model, a context-dependent state inventory from phonetic alignments between the training audio data and the phonetic sequences corresponding to word-level transcriptions; and

Instant Application: training, using the alignment data, a second neural network model to generate outputs corresponding to one or more context-dependent states, the second neural network model having a plurality of long short-term memory layers.
Reference U.S. 11341958: training, using the context-dependent state inventory, a second neural network model to generate outputs corresponding to one or more context-dependent states, the second neural network model having a plurality of long short-term memory layers,

Instant Application: (no corresponding limitation in claim 1)
Reference U.S. 11341958: wherein the trained second neural network model comprises the acoustic model.

Claim 2 is obvious over claim 2 of U.S. 11,341,958.
Claim 3 is obvious over claim 3 of U.S. 11,341,958.
Claim 4 is obvious over claim 4 of U.S. 11,341,958.
Claim 5 is obvious over claim 5 of U.S. 11,341,958.
Claim 6 is obvious over the last limitation of claim 1 of U.S. 11,341,958.
Claim 7 is obvious over claim 6 of U.S. 11,341,958.
Claim 8 is obvious over claim 7 of U.S. 11,341,958.
Claim 9 is obvious over claim 8 of U.S. 11,341,958.
Claim 10 is obvious over claim 9 of U.S. 11,341,958.
These Claims have generally the same language as the corresponding claims of the reference.
Claims 11-20 are system counterparts of Claims 1-10 and are rejected under similar rationale.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 3, 5-8, 13, and 15-18 are rejected under 35 U.S.C. 103 as being unpatentable over Kanda (U.S. 20180204566) in view of Catanzaro (U.S. 20170148433).
Kanda teaches:
1. A computer-implemented method when executed on data processing hardware causes the data processing hardware to perform operations comprising:
for each frame in a sequence of frames divided from audio data for a portion of a spoken utterance, determining a corresponding set of log-Mel frequency cepstral coefficients; [Kanda, Figure 7, “Input Speech 282” and “Framing Unit 302” followed by “Feature Extracting Unit 304.”  “[0050] Referring to FIG. 7, a speech recognition device 280 in accordance with the present embodiment has a function of performing speech recognition of an input speech 282 and outputting a text 284 of speech recognition….a framing unit 302 for dividing the digitized speech signal output from A/D converter circuit 300 into frames with a prescribed length and prescribed shift length allowing partial overlapping  …a feature extracting unit 304 performing a prescribed acoustic process on each of the frames output by framing unit 302, thereby extracting speech features of each frame and outputting a feature vector. Each frame and each feature vector have information such as relative time, for example, with respect to the head of input speech 282. The features used may include MFCCs (Mel-Frequency Cepstrum Coefficients), its first order differential, second order differential, power and so forth.”]
generating, using an acoustic model configured to receive the corresponding set of log-Mel frequency cepstral coefficients generated for each frame in the sequence of frames as input, a sequence of context-dependent states representing the portion of the spoken utterance; and [Kanda, Figure 7, “Acoustic Model (RNN) 308.”  “[0051] …an acoustic model 308 implemented by a RNN, receiving as an input a feature vector stored in feature storage unit 306 and for outputting a vector representing for each phoneme posterior probabilities of each frame at each time point corresponding to the phonemes ….”  The sequence of phonemes in a frame teaches a sequence of context-dependent states:  “…an acoustic model 308 implemented by a RNN (recurrent neural network) for calculating, for each state sequence, the posterior probability of a state sequence in response to an observed sequence consisting of prescribed speech features obtained from a speech….”  Abstract.  For generating a sequence of states see also [0034].  For context-dependency see [0054] and [0055].]
generating, using a language model trained separately from the acoustic model, a streaming speech recognition result for the portion of the spoken utterance, [Kanda, Figure 7, “Decoder 310” is a “language model” because it generates the speech recognition result in the form of the “Text of Speech Recognition 284.”  “[0051] … a decoder 310 implemented by WFST (Weighted Finite-State Transducer), referred to as S.sup.-1HCLG in the present specification as will be described later, for outputting, using the vectors output from acoustic model 308, a word sequence having the highest probability as a text 284 of speech recognition corresponding to the input speech 282, by means of WFST….”  The acoustic and WFST models of Kanda are trained separately.  See [0055].]
wherein an acoustic model training process trains the acoustic model by: [Kanda teaches the use of pre-trained acoustic and language models and therefore does not teach the training process.]
obtaining acoustic model training data that comprises training audio data and word-level transcriptions for the training audio data; [Kanda uses pre-trained models and does not describe the process of training but does teach that its models generate word sequences (see claim 1 of Kanda) and therefore must be trained on word-level data.]
training a first neural network model on the acoustic model training data to generate phonetic sequences corresponding to the word-level transcriptions; [Kanda uses pre-trained models and does not describe the process of training but does teach that its models generate word-sequence transcriptions (see claim 1 of Kanda).  “… performing speech recognition of the speech signal based on a score calculated for each hypothesis of a word sequence corresponding to the speech signal…”  Abstract.]
generating, using the trained neural network model, a context-dependent state inventory from phonetic alignments between the training audio data and the phonetic sequences corresponding to word-level transcriptions; and [Kanda uses pre-trained models and does not describe the process of training but does teach that its models are context dependent with a triphone context level: “[0054] …Further, recently, a phoneme-based triphone HMM comes to be used for representing phoneme context, and it can also be represented by WFST….”]
training, using the context-dependent state inventory, a second neural network model to generate outputs corresponding to one or more context-dependent states, the second neural network model having a plurality of long short-term memory layers. [The second model of Kanda that generates the transcription is a WFST finite state automaton and is not taught to be a second neural network.]

Kanda teaches the use of pre-trained acoustic and language models and therefore does not teach the training process but does teach that its models generate word sequences (see claim 1 of Kanda) and therefore must be trained on word-level data.  The second model of Kanda that generates the transcription is a WFST finite state automaton and is not taught to be a second neural network.
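
For illustration only, the framing operation quoted from Kanda [0050] (frames with "a prescribed length and prescribed shift length allowing partial overlapping") can be sketched as follows. This is a minimal sketch and not Kanda's implementation: the function name `split_into_frames`, its parameters, and the choice to drop trailing samples that do not fill a whole frame are assumptions introduced here.

```python
from typing import List, Sequence


def split_into_frames(samples: Sequence[float], frame_len: int, shift: int) -> List[List[float]]:
    """Divide a signal into frames of frame_len samples, advancing by shift samples.

    A shift smaller than frame_len yields partially overlapping frames.
    Assumption: trailing samples that do not fill a whole frame are dropped.
    """
    if frame_len <= 0 or shift <= 0:
        raise ValueError("frame_len and shift must be positive")
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(list(samples[start:start + frame_len]))
        start += shift
    return frames


if __name__ == "__main__":
    # Ten samples, frames of four samples, shifted by two samples each time.
    signal = [float(i) for i in range(10)]
    for frame in split_into_frames(signal, frame_len=4, shift=2):
        print(frame)
```

Each frame produced this way would then be passed to a feature extraction step (for example, MFCC computation), which is not shown.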

Catanzaro teaches:
wherein an acoustic model training process trains the acoustic model by: [Catanzaro, “[0012] FIG. 2 depicts methods for training the deep learning model ….”  The model of Catanzaro is being trained for speech recognition and begins by receiving “utterances” and therefore must include acoustic models.  See [0043] discussing the use of CNNs and LSTMs as acoustic models:  “[0043] Feed-forward neural network acoustic models were explored more than 20 years ago. … Convolutional networks have also been found beneficial for acoustic models. Recurrent neural networks, typically LSTMs, are just beginning to be deployed in state-of-the art recognizers and work well together with convolutional layers for the feature extraction….”]
obtaining acoustic model training data that comprises training audio data and word-level transcriptions for the training audio data; [Catanzaro, Figure 2, step 205 where the utterance and its label are input from a “training set.”  “[0056] In embodiments, the utterance, x, comprising a time-series of spectrogram frames, x(t), is inputted (205) into a recurrent neural network (RNN) model, wherein the utterance, x, and an associated label, y, are sampled from a training set.”]
training a first neural network model on the acoustic model training data to generate phonetic sequences corresponding to the word-level transcriptions; [Catanzaro, Figure 2 shows the steps of training and ends with step 230 where the parameters of the RNN are updated.  “[0057] The RNN model outputs of graphemes of each language….”  Thus, the training is at the grapheme level, which is finer than the word level, and word-level transcriptions can be obtained from it.  (As support for this proposition see Nissan (U.S. 20160365090) Figure 3.)]
generating, using the trained neural network model, a context-dependent state inventory from phonetic alignments between the training audio data and the phonetic sequences corresponding to word-level transcriptions; and [Catanzaro, Figure 13, “use an existing bidirectional RNN model trained with CTC to align transcriptions to frames of an audio clip 1305.”  The alignment is done at the word-level transcriptions ([0165] below).  Because the alignment is generated from an aligned transcript of the audio, the data is context-dependent.  “[0162] Some of the internal English (3,600 hours) and Mandarin (1,400 hours) datasets were created from raw data captured as long audio clips with noisy transcriptions. The length of these clips ranged from several minutes to more than hour, making it impractical to unroll them in time in the RNN during training. To solve this problem, an alignment, segmentation, and filtering pipeline was developed that can generate a training set with shorter utterances and few erroneous transcriptions. FIG. 13 depicts a method of data acquisition for speech transcription training according to embodiments of the present disclosure.”  “[0165] In embodiments, following the alignment is a segmentation step 1310 that splices the audio and the corresponding aligned transcription whenever it encounters a long series of consecutive blank labels, since this usually denotes a stretch of silence. By tuning the number of consecutive blanks, the length of the utterances generated can be tuned. In embodiments, for the English speech data, a space token is required to be within the stretch of blanks in order to segment on word boundaries….”]
training, using the context-dependent state inventory, a second neural network model to generate outputs corresponding to one or more context-dependent states, the second neural network model having a plurality of long short-term memory layers. [Catanzaro, Figure 5 shows different epochs of training, each epoch using new training data and generating an updated model.  In Figure 1, the RNN whose parameters are updated at Figure 2, step 230, includes the “Recurrent or GRU (bidirectional) 115” layers, which can be LSTM layers.  See “[0061] … In embodiments, the function g(•) can also represent more complex recurrence operations, such as Long Short-Term Memory (LSTM) units and gated recurrent units (GRUs).” See also [0090] saying that the RNN can be an LSTM.]
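
For illustration only, the segmentation step quoted from Catanzaro [0165] (splitting the aligned audio "whenever it encounters a long series of consecutive blank labels," optionally requiring a space token within the stretch) can be sketched as follows. This is a minimal sketch under stated assumptions and not Catanzaro's implementation: the symbols `BLANK` and `SPACE`, the function name, the per-frame label list, and the exact cut rule are hypothetical.

```python
from typing import List, Sequence, Tuple

BLANK = "_"  # hypothetical symbol standing in for the CTC blank label
SPACE = " "  # hypothetical symbol standing in for the space token


def segment_on_blank_runs(labels: Sequence[str], min_blanks: int,
                          require_space: bool = True) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges of segments separated by long blank stretches.

    A stretch is a maximal run of BLANK or SPACE labels. It becomes a cut point
    when it holds at least min_blanks blanks and, if require_space is set,
    also contains a SPACE token (so that cuts fall on word boundaries).
    """
    segments = []
    seg_start = None  # index where the current segment began, or None
    i, n = 0, len(labels)
    while i < n:
        if labels[i] in (BLANK, SPACE):
            j = i
            while j < n and labels[j] in (BLANK, SPACE):
                j += 1
            stretch = list(labels[i:j])
            is_cut = (stretch.count(BLANK) >= min_blanks
                      and (not require_space or SPACE in stretch))
            if is_cut and seg_start is not None:
                segments.append((seg_start, i))  # close the segment before the stretch
                seg_start = None
            i = j
        else:
            if seg_start is None:
                seg_start = i
            i += 1
    if seg_start is not None:
        segments.append((seg_start, n))
    return segments


if __name__ == "__main__":
    frame_labels = ["h", "i", "_", "_", " ", "_", "y", "o", "_", "u"]
    print(segment_on_blank_runs(frame_labels, min_blanks=3))
```

Raising or lowering `min_blanks` corresponds to the tuning described in the quoted passage: a larger threshold yields fewer, longer utterances.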

Kanda and Catanzaro pertain to speech recognition and use of RNNs for speech recognition and it would have been obvious to combine the training steps of Catanzaro with the system of Kanda which uses a pre-trained model as a tandem combination of steps that follow one another.  This combination falls under combining prior art elements according to known methods to yield predictable results or simple substitution of one known element for another to obtain predictable results. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Regarding Claim 3, Kanda teaches:
3. The method of claim 1, wherein the context-dependent states comprise labels for triphones or scores for labels for triphones. [Kanda teaches that its WFST model 320 is trained for triphones which teaches that the labels are generated for triphones.  “[0054] Various models are used in speech recognition. HMM, a word pronunciation dictionary and a language model can all be represented by WFST. Further, recently, a phoneme-based triphone HMM comes to be used for representing phoneme context, and it can also be represented by WFST….”  “[0055] WFST involves an operation referred to as "composition." Composition of two WFSTs enables processing of tasks that otherwise require application of two successive WFSTs, by one composed WFST. Therefore, it is possible to compose WFSTs for the HMM, the word pronunciation dictionary, the language model and the triphone HMM to one WFST. Decoder 310 uses such a pre-trained and composed WFST. The WFST used here is a graph built in advance by language knowledge, and it employs a knowledge source referred to as HCLG. HCLG stands for a composition of four WFSTs (H, C, L, G). H stands for HMM, C context, L lexicon and G grammar….”]
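
For illustration only, the notion of triphone labels representing phoneme context can be sketched as follows. This is a minimal sketch, not taken from Kanda: the `left-center+right` label notation, the `sil` boundary symbol, and the function name `to_triphones` are assumptions introduced here.

```python
from typing import List, Sequence


def to_triphones(phonemes: Sequence[str], boundary: str = "sil") -> List[str]:
    """Label each phoneme with its left and right neighbors.

    Assumption: labels use a 'left-center+right' notation and the sequence
    is padded with a boundary symbol on both sides.
    """
    padded = [boundary] + list(phonemes) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


if __name__ == "__main__":
    # The same center phoneme receives a different label in a different context.
    print(to_triphones(["k", "ae", "t"]))
    print(to_triphones(["b", "ae", "t"]))
```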

Regarding Claim 5, Kanda teaches:
5. The method of claim 1, wherein the acoustic model training process uses alignment data indicating alignments between the training audio data and the word-level transcriptions for the training audio data when training the second neural network model. [Kanda mentions that the training is done by using an aligned corpus:  “[0035] … In Equation (6), P(xt) is common to each HMM state and, therefore, it is negligible in arg max operation. P(st) can be estimated by counting the number of each state in aligned training data….”  This passage pertains to the DNN of Figure 3 but nevertheless teaches that the training data includes aligned audio and text.  The recognition is word-level, which suggests that the training data should be word-level as well.  See [0005].]
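
For illustration only, estimating P(st) "by counting the number of each state in aligned training data," as quoted from Kanda [0035], can be sketched as follows. This is a minimal sketch: Equation (6) of Kanda is not reproduced in this action, so dividing a posterior by the counted prior to obtain a scaled likelihood is an assumption based on the common hybrid DNN-HMM practice, and all function and variable names are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence


def estimate_state_priors(aligned_states: Sequence[str]) -> Dict[str, float]:
    """Estimate P(s) as the relative frequency of each state in an alignment."""
    counts = Counter(aligned_states)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("alignment is empty")
    return {state: count / total for state, count in counts.items()}


def scaled_likelihoods(posteriors: Dict[str, float],
                       priors: Dict[str, float]) -> Dict[str, float]:
    """Assumed form: P(x|s) is proportional to P(s|x) / P(s); P(x) is dropped
    because it is the same for every state and does not affect the arg max."""
    return {state: posteriors[state] / priors[state] for state in posteriors}


if __name__ == "__main__":
    # One state label per frame, as produced by a frame-level alignment.
    alignment = ["a", "a", "b", "c"]
    priors = estimate_state_priors(alignment)
    print(priors)
    print(scaled_likelihoods({"a": 0.5, "b": 0.5}, priors))
```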

Regarding Claim 6, the second model of Kanda that generates the transcription is a WFST finite state automaton and is not taught to be a second neural network.
Catanzaro teaches:
6. The method of claim 1, wherein the trained second neural network model comprises the acoustic model. [Catanzaro, as discussed with respect to Claim 1, repeatedly retrains and updates its acoustic model, which is implemented as an RNN in which LSTM units may be used.  Figure 2, step 230; Figure 1, recurrent layers 115; Figure 5, step 510, additional training.]
Rationale for combination as provided for Claim 1.

Regarding Claim 7, Kanda teaches:
7. The method of claim 1, wherein generating the sequence of context-dependent states representing the portion of the utterance comprises generating output values indicating likelihoods corresponding to different context-dependent states. [Kanda, Background.  This is the definition of speech recognition:  “[0005] Basic concept of speech recognition of a conventional speech recognition device will be described with reference to FIG. 1. Conventionally, it is assumed that a word sequence 30 (word sequence W) is influenced by various noises and observed as an observed sequence 36 (observed sequence X), and a word sequence that is expected to have the highest likelihood of generating the finally observed sequence X is output as a result of speech recognition. Let P(W) represent the probability of a word sequence W being generated. Further, let P(S|W) represent the probability of a state sequence S (state sequence 34) of HMM being generated from the word sequence W through a phoneme sequence 32 as an intermediate product. Further, let P(X|S) represent the probability of observed X being obtained from the state sequence S.”]

Regarding Claim 8, Kanda teaches:
8. The method of claim 1, further comprising, while generating the streaming speech recognition result for the portion of the utterance, concurrently receiving, at the data processing hardware, audio data for an additional portion of the utterance spoken by the user. [Kanda, Figure 7, the process of “Input Speech 282” to “Text of Speech Recognition 284” is continuous as long as speech is coming in and does not stop such that additional audio data keeps coming in.]

Claim 13 is a system Claim with limitations similar to the limitations of Claim 3 and is rejected under similar rationale.
Claim 15 is a system Claim with limitations similar to the limitations of Claim 5 and is rejected under similar rationale.
Claim 16 is a system Claim with limitations similar to the limitations of Claim 6 and is rejected under similar rationale.
Claim 17 is a system Claim with limitations similar to the limitations of Claim 7 and is rejected under similar rationale.
Claim 18 is a system Claim with limitations similar to the limitations of Claim 8 and is rejected under similar rationale.

Claims 4 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Kanda and Catanzaro in view of Erdogan (U.S. 20160111107).
Regarding Claim 4, Kanda and Catanzaro do not teach the use of MMI.
Erdogan teaches:
4. The method of claim 1, wherein the acoustic model training process uses maximum mutual information (MMI) to train the second neural model to generate the outputs corresponding to the one or more context-dependent states. [Erdogan “[0032] The joint objective function is a weighted sum of enhancement and recognition task objective functions. For the enhancement task, the objective function can be mask approximation (MA), magnitude spectrum approximation (MSA) or phase-sensitive spectrum approximation (PSA). For the recognition task, the objective function can simply be a cross-entropy cost function using states or phones as the target classes or possibly a sequence discriminative objective function such as minimum phone error (MPE), boosted maximum mutual information (BMMI) that are calculated using a hypothesis lattice.”  “2. The method of claim 1, wherein the enhancement network is a Deep Recurrent Neural Network (DRNN).”]

Kanda/Catanzaro and Erdogan pertain to speech recognition and it would have been obvious to combine the use of the particular training objective of Erdogan with the system of the combination as an equivalent substitute.  This combination falls under simple substitution of one known element for another to obtain predictable results. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Claim 14 is a system Claim with limitations similar to the limitations of Claim 4 and is rejected under similar rationale.

Claims 9 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Kanda and Catanzaro in view of Li (U.S. 20180254034).
Regarding Claim 9, Kanda does not teach recording prior to recognition, other than whatever buffering is inherent in the process.
Catanzaro does not teach that the training set comes from the device of the user.
Li teaches:
9. The method of claim 1, wherein the audio data is recorded by a user device associated with the user. [Li uses pre-recorded data for training:  “[0039] In detail, in order to make the trained reference acoustic model have a well ability of phone coverage and prosody coverage and can describe a variety of speech phenomena, a certain number of recording text corpuses can be pre-designed. Then, appropriate speakers are selected to obtain larger-scale training speech data of a non-target speaker, the first acoustic feature data of training speech data is extracted, and the recording text corpuses corresponding to the training speech data are annotated to obtain first text annotation data corresponding to the training speech data.”]
Kanda/Catanzaro and Li pertain to speech recognition and it would have been obvious to combine the particular training set of Li with the system of the combination as one alternative among many.  This combination falls under combining prior art elements according to known methods to yield predictable results or simple substitution of one known element for another to obtain predictable results. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Claim 19 is a system Claim with limitations similar to the limitations of Claim 9 and is rejected under similar rationale.

Claims 10 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Kanda and Catanzaro in view of Speciner (U.S. 11,062,615).
Regarding Claim 10, Kanda suggests:
10. The method of claim 1, wherein training the first neural network model comprises training the first neural network model to recognize multiple different pronunciations for a word in the word-level transcriptions using a pronunciation model that indicates multiple different phonetic sequences as valid pronunciations of the word. [Kanda teaches the use of a “word pronunciation dictionary” which generally includes variations on the pronunciation of a word but does not expressly teach this.]
Catanzaro teaches it does not need to use pronunciation variations in its training set.  “[0124] The techniques described herein can be used to build an end-to-end Mandarin speech recognition system that outputs Chinese characters directly. This precludes the need to construct a pronunciation model, which is often a fairly involved component for porting speech systems to other languages. Direct output to characters also precludes the need to explicitly model-language specific pronunciation features. For example, Mandarin tones do not need to be modeled explicitly, as some speech systems must do.”
Speciner expressly teaches:  “…In step 3895, optionally wherein the pronunciation dictionary supports multiple dialects.”  Col. 55, lines 56-58.
Kanda/Catanzaro and Speciner pertain to speech recognition and it would have been obvious to combine the use of the dictionary of Speciner for training with the system of the combination as an equivalent substitute.  This combination falls under combining prior art elements according to known methods to yield predictable results or simple substitution of one known element for another to obtain predictable results. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Claim 20 is a system Claim with limitations similar to the limitations of Claim 10 and is rejected under similar rationale.

Claims 2 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Kanda and Catanzaro in view of Perez-Mendez (U.S. 5,754,978) and Kern (U.S. 7,043,427).
Regarding Claim 2, Kanda teaches that:  “[0005] … Conventionally, it is assumed that a word sequence 30 (word sequence W) is influenced by various noises and observed as an observed sequence 36 (observed sequence X), and a word sequence that is expected to have the highest likelihood of generating the finally observed sequence X is output as a result of speech recognition. …”  But it does not teach perturbation by noise during the training process.
Catanzaro, Figure 5, teaches that it uses minibatches of training data in a particular order.
Perez-Mendez teaches:
2. The method of claim 1, wherein the acoustic model training process further trains the acoustic model by synthetically distorting, using a room simulator, the training audio data with noise obtained from various noise sources. [Perez-Mendez teaches in its Background that applying different amounts of noise to the training speech in order to simulate various noise environments is known in the art as of the date of Perez Mendez which is 1995.  “Previously, others have employed certain aspects of parallel processing into speech recognition techniques, as opposed to rejection methods. Particularly, some have focused on improving speech recognition in adverse environments. For example, U.S. Pat. No. 5,182,765 (Ishii) discloses a speech registration/recognition system in which, as part of the registration process, the speech input signal is stored as recognition data. Subsequently, after registration and as part of the recognition process, the speech input signal is compared to the stored recognition data. In the embodiment shown in FIG. 8 of the '765 patent, and described at column 7, line 19 through column 8, line 42 thereof, a plurality of parallel speech recognition circuits each receive and store slightly altered versions of the input speech as part of the registration process. The intent is to improve the ability to recognize speech in adverse speech environments, such as high noise, etc. This is done by recording or registering speech or "training data" from many different environments (different types of background noise, for example). These slightly altered versions are created by different electrical characteristics in each of the variable characteristic circuits associated with each of the speech recognition circuits….”  Col. 2, 47 to Col. 3, 10.  Figures 2 and 6 of Perez-Mendez teach “signal perturbations” which is adding different amounts of noise to the training speech.  “A perturbation may be applied directly to any of the given speech representations. 
When the values in the speech representation refer to physical values, such as in the original digitized signal, a perturbation may be applied directly by adding small values to the signal values. In the original digitized signal this may take the form of adding a "one" to every sample value, or choosing a "one" or "zero" randomly to add to every sample value. This may be considered to represent to represent a small amount of noise. Higher levels of speech representation, those closer to the final output, are more difficult to perturb directly. The sequence of vector quantized values could be perturbed by selecting alternate prototype vectors, say the second nearest, for every frame or for every other frame, etc.”  Col. 7, 1-3.]
Kanda/Catanzaro and Perez-Mendez pertain to training of speech recognition systems, and it would have been obvious to combine the use of the particular training data of Perez-Mendez with the system of the combination as an equivalent substitute.  This combination falls under combining prior art elements according to known methods to yield predictable results, or simple substitution of one known element for another to obtain predictable results. See MPEP § 2141; KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Perez-Mendez does not teach using room simulation to generate the noisy speech signals.
Kern teaches:
wherein the acoustic model training process further trains the acoustic model by synthetically distorting, using a room simulator, the training audio data with noise obtained from various noise sources. [Kern Figure 1 shows the “correction unit 15, for example simulates room reverberation and/or sound reflections from nearby objects within the speech transmission path. Acoustic reflections of this sort can for example, originate from a desktop, a display screen, or from other objects.”  Col. 3, lines 6-10. And “In operation of the apparatus shown in FIG. 1, during a training speech samples are stored in the data processing device 17. Which could be used, for example, to construct a personal telephone directory.”  Col. 3, lines 31-35.]
Kanda/Catanzaro/Perez-Mendez and Kern pertain to training of speech recognition systems, and it would have been obvious to combine the use of the particular training data of Kern with the system of the combination as an equivalent substitute.  This combination falls under combining prior art elements according to known methods to yield predictable results, or simple substitution of one known element for another to obtain predictable results. See MPEP § 2141; KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Claim 12 is a system claim with limitations similar to the limitations of Claim 2 and is rejected under a similar rationale.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
For Claim 2, Room Simulators:
Rumsey (U.S. 20090238370) teaches the use of “Real or Simulated Room Acoustics” ([0142]-[0146]) or the combination of both approaches to obtain the various acoustics under which a listening device must operate.
Opitz (U.S. 5,544,249) discloses a method of simulating a room and/or sound impression.
Flanagan (U.S. 5,737,485) teaches that “In addition to the speech source, inputs to the room simulation may include a competing noise source of variable intensity, to produce different signal-to-competing-noise ratios (SCNR's).”
Any inquiry concerning this communication or earlier communications from the examiner should be directed to FARIBA SIRJANI whose telephone number is (571)270-1499.  The examiner can normally be reached on 9 to 5, M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Desir, can be reached at 571-272-7799.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Fariba Sirjani/
Primary Examiner, Art Unit 2659