DETAILED ACTION
This action is in response to the initial filing of Application no. 16/684,483 on 11/24/2019.
Claims 1 – 20 are still pending in this application, with claims 1, 19 and 20 being independent.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Allowable Subject Matter
Claims 4 and 14 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims since the prior art fails to teach or suggest in reasonable combination the recited limitations.
Aside from the non-prior art rejections, the prior art fails to teach or suggest in reasonable combination the limitations recited in claim 18.

Claim Objections
Claim 18 is objected to because of the following informalities: “using the using” should recite -- using the 

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):




Claim 18 is rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention. Claim 18 recites the following limitation: “embedding as input to a first layer of an encoder of the speech recognition model, as input to a first layer of a decoder of the speech recognition model, or as input to both a first layer of an encoder of the speech recognition model and a first layer of a decoder of the speech recognition model.” It is unclear how to interpret this limitation because terms or words appear to be missing. Clarification or correction is required.

Claim 12 is rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Claim 12 recites “the decoder.” There is insufficient antecedent basis for this limitation in the claims.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the 

Claims 1 – 3, 5 – 7, 9, 15, 17, 19 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Catanzaro et al. (US 2017/0148433) (“Catanzaro”) in view of Müller (Phonemic and Graphemic Multilingual CTC Based Speech Recognition), and further in view of Watanabe et al. (US 2019/0189111) (“Watanabe”).
For claims 1, 19 and 20, Catanzaro discloses a system (Abstract) comprising: one or more computers (Fig.18; [0249 – 0252]); and one or more computer readable media storing instructions that, when executed by the one or more computers, cause the one or more computers to perform the following operations ([0254] [0255]): receiving, by the one or more computers of the automated speech recognition system, audio data indicating audio characteristics of an utterance (receiving an input audio from a user, claim 1; [0252]); providing, by the one or more computers of the automated speech recognition system, input features (spectrogram, Fig.1, 105; [0052] [0055]) determined based on the audio data (generating a set of spectrogram frames for each utterance, claim 1, [0252]) to a speech recognition model (inputting the set of spectrogram frames to recurrent neural network model, claim 1, [0052] [0252]) that has been trained ([0055 – 0066]; claim 1) to output scores indicating the likelihood of linguistic units for a language (English or Mandarin, [0053]) or dialect (obtaining probability outputs for one or more predicted characters from the RNN model, claim 1), wherein the speech recognition model has been trained using adaptive training (the model is trained using a CTC loss function, wherein the derivative of the loss function is used to update network parameters through the backpropagation through time algorithm, [0066]); receiving, by the one or more computers of the automated speech recognition system, output that the speech recognition model generated in response to receiving the input features determined based on the audio data (performing a search 
However, Müller discloses a method for phonemic and graphemic multilingual CTC based speech recognition (Abstract), wherein a speech recognition model (RNN/CTC network architecture based on Baidu’s Deepspeech2, Figure 2 and 3.4 Network architecture) is trained to output information regarding a global set of graphemes using cluster adaptive training (multilingual bottleneck features and language feature vectors are used to train the RNN/CTC system, 3.1 Multilingual Systems, 3.2 Language Feature Vectors, 3.3 Input Features, 4.3 Input Features, 4.4 LFV Network Training, 4.5 CTC RNN  and Network Training, 5.Results, 5.3 Multilingual Grapheme Based System), with each of the multiple languages or dialects corresponding to a separate cluster (language feature vectors which encode properties of a language, wherein the language feature vectors are based on training a language feature vector network using the language, 2.3 Neural Network Adaptation, 3.2 Language Feature Vectors and 4.4 LFV Network Training), and configured to receive different identifiers (language feature vectors) as input to the speech recognition model to specify the different clusters corresponding 
Moreover, Watanabe discloses a method and apparatus for multi-lingual end-to-end speech recognition (Abstract), wherein a single speech recognition system (Fig.1, 117 and Fig.2, 202 – 206; [0035 – 0039]) is trained as a language independent recognition system using multiple languages as input ([0004 – 0006] [0032] [0068 – 0070]), wherein a union of grapheme sets of different languages is used as a set of output labels ([0063]) so that likelihoods of character sequences can be computed for any language ([0063]).
Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to improve Catanzaro’s automated speech recognition system in the same way that Müller’s and Watanabe’s systems have been improved to achieve the following predictable results for the purpose of increasing the ability of the end-to-end speech recognition system to recognize speech from multiple languages simultaneously (Müller, I. Introduction): the speech recognition model (RNN model) is additionally trained using cluster adaptive training (CTC loss training using multilingual BNFs and language feature vectors) and a global set of graphemes (a union of graphemes from multiple languages) to further output scores to indicate the likelihood of graphemes for each of multiple different languages or dialects, wherein the language feature vectors (identifiers) are features generated by a bottleneck layer of an additional language feature vector network (Müller, 3.2 Language Feature Vectors); and the speech recognition model is configured to receive the different language feature vectors (identifiers) as input to the speech recognition model, wherein the different language feature vectors specify the different clusters corresponding to the respective languages or dialects.
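For illustration only, the “union of grapheme sets of different languages” relied upon from Watanabe ([0063]) can be sketched as follows. This is a minimal sketch and is not taken from any cited reference: the function name build_global_graphemes, the language codes, and the toy transcripts are hypothetical assumptions.

```python
def build_global_graphemes(transcripts_by_language):
    """Return the sorted union of graphemes seen across all languages.

    Hypothetical helper: each transcript is treated as a plain string and
    every character in it is counted as one grapheme.
    """
    graphemes = set()
    for transcripts in transcripts_by_language.values():
        for text in transcripts:
            graphemes.update(text)
    return sorted(graphemes)


# Toy, made-up transcripts for two languages (assumption, not real data).
toy_transcripts = {"en": ["cat"], "de": ["tür"]}

# One shared label set serves as the output vocabulary for every language.
global_graphemes = build_global_graphemes(toy_transcripts)
label_index = {g: i for i, g in enumerate(global_graphemes)}
```

In this sketch a single output layer sized to len(global_graphemes) would score graphemes for either language, which is the sense in which the label set is “global.”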


 
For claim 3, Catanzaro, Müller and Watanabe further disclose wherein the linguistic units are graphemes (Catanzaro, [0045] [0057]) (Müller, 5. Results) (Watanabe, [0063]), and the speech recognition model is configured to provide output indicating a probability distribution over a predetermined set of graphemes (Catanzaro, claim 1) (Müller, 5.3 Multilingual Grapheme Based System) (Watanabe, [0063]).

	For claim 5, Catanzaro, Müller and Watanabe further disclose wherein the speech recognition model is trained to output scores indicative of labels representing different languages or dialects  (Catanzaro, end-to-end speech recognition to predict graphemes can be a RNN 

For claim 6, Watanabe further discloses, wherein the labels for the language or dialect are included in the output sequences (Watanabe, Fig.5; [0039]).

For claim 7, Catanzaro and Müller further disclose: determining a language or dialect of the utterance (Catanzaro, claim 1) (Müller, the language feature vectors generated by the language feature vector network indicate an identified language, 2.3 Neural Network Adaptation and 3.2 Language Feature Vectors); and providing, as input to the speech recognition model, data indicating the language or dialect (Müller, language feature vectors) as input to one 

For claim 9, Müller further discloses, wherein the data comprises an embedding corresponding to the language or dialect, wherein the embedding has been learned through training (Müller, 3.2 Language Feature Vectors and 4.4 LFV Network Training).
	
For claim 15, Catanzaro and Müller further disclose wherein the speech recognition model has been trained using cluster adaptive training (Catanzaro, the model is trained using a CTC loss function, wherein the derivative of the loss function is used to update network parameters through the backpropagation through time algorithm, [0066]) (Müller, multilingual bottleneck features and language feature vectors are used to train the RNN/CTC system, 3.1 Multilingual Systems, 3.2 Language Feature Vectors, 3.3 Input Features, 4.3 Input Features, 4.4 LFV Network Training, 4.5 CTC RNN and Network Training, 5. Results, 5.3 Multilingual Grapheme Based System) with each language or dialect corresponding to a separate cluster (Müller, language feature vectors which encode properties of a language, wherein the language feature vectors are based on training a language feature vector network using the language, 2.3 Neural Network Adaptation, 3.2 Language Feature Vectors and 4.4 LFV Network Training).

	
For claim 17, Catanzaro and Müller further disclose wherein the speech recognition model has been trained using cluster adaptive training (Catanzaro, the model is trained using a CTC loss function, wherein the derivative of the loss function is used to update network parameters through the backpropagation through time algorithm, [0066]) (Müller, multilingual bottleneck features and language feature vectors are used to train the RNN/CTC system, 3.1 Multilingual Systems, 3.2 Language Feature Vectors, 3.3 Input Features, 4.3 Input Features, 4.4 LFV Network Training, 4.5 CTC RNN and Network Training, 5. Results, 5.3 Multilingual Grapheme Based System) with each language or dialect corresponding to a separate cluster (Müller, language feature vectors which encode properties of a language, wherein the language feature vectors are based on training a language feature vector network using the language, 2.3 Neural Network Adaptation, 3.2 Language Feature Vectors and 4.4 LFV Network Training), and wherein language or dialect embedding vectors learned through training are used as weights to combine clusters (Müller, multilingual bottleneck features based on training a network to extract bottleneck features on all languages, 2.2 Multilingual Bottleneck Features, 3.3 Input Features and 4.3 Input Features).

Claims 8 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Catanzaro et al. (US 2017/0148433) (“Catanzaro”) in view of Müller (Phonemic and Graphemic Multilingual CTC Based Speech Recognition), and further in view of Watanabe et al. (US 2019/0189111) (“Watanabe”) and further in view of Waibel (“Using Language Adaptive Deep Neural Networks for Improved Multilingual Speech Recognition”).
For claim 8, the combination of Catanzaro, Müller and Watanabe fails to teach wherein providing data indicating the language or dialect comprises providing a 1-hot vector having a value corresponding to each of a predetermined set of languages or dialects.
	However, Waibel discloses a method for speech recognition (Abstract), wherein a 1-hot vector corresponding to each of a predetermined set of languages or dialects is provided as input to a speech recognition model (Figure 2; Language Adaptive Deep Neural Networks).
Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to modify the combined teachings of Catanzaro, Müller and Watanabe with Waibel’s teachings so that providing data indicating the language or dialect further comprises providing a 1-hot vector having a value corresponding to each of a predetermined set of languages or dialects for the purpose of increasing the ability of the end-to-end speech recognition system to recognize speech from multiple languages simultaneously (Müller, I. Introduction).

For claim 16, the combination of Catanzaro, Müller and Watanabe fails to teach wherein the language or dialect identifiers are 1-hot vectors.

Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to modify the combined teachings of Catanzaro, Müller and Watanabe with Waibel’s teachings so that the language or dialect identifiers are 1-hot vectors for the purpose of increasing the ability of the end-to-end speech recognition system to recognize speech from multiple languages simultaneously (Müller, I. Introduction).
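For illustration only, a 1-hot vector having one value per language in a predetermined set, as discussed for claims 8 and 16, can be sketched as follows. This is a minimal sketch under stated assumptions and is not code from Waibel or any other cited reference: the function name one_hot_language_vector and the language list are hypothetical.

```python
def one_hot_language_vector(language, languages):
    """Return a list with 1.0 at the position of `language` and 0.0 elsewhere.

    Hypothetical helper: `languages` is the predetermined, ordered set of
    supported languages or dialects.
    """
    if language not in languages:
        raise ValueError(f"unknown language: {language!r}")
    return [1.0 if candidate == language else 0.0 for candidate in languages]


# Assumed predetermined set of languages (illustrative only).
LANGUAGES = ["en", "de", "fr"]

language_vector = one_hot_language_vector("de", LANGUAGES)
```

The vector has exactly one non-zero entry, and its length equals the number of languages in the predetermined set.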

Claims 10 – 13 are rejected under 35 U.S.C. 103 as being unpatentable over Catanzaro et al. (US 2017/0148433) (“Catanzaro”) in view of Müller (Phonemic and Graphemic Multilingual CTC Based Speech Recognition), and further in view of Watanabe et al. (US 2019/0189111) (“Watanabe”) and further in view of Waibel (“Multilingual Adaptation of RNN Based ASR Systems”).
For claim 10, the combination of Catanzaro, Müller and Watanabe discloses wherein the speech recognition model has an encoder and decoder (Catanzaro, end-to-end speech recognition to predict graphemes can be an RNN encoder-decoder with attention or an RNN-CTC, [0044] [0045]) (Müller, the use of bottleneck features as input features is common for speech recognition systems so that the speech recognition systems can discriminate between phones, and neural networks can be adapted to various conditions such as different languages using parameters e.g. language feature vectors; 2.3 Neural Network Adaptation, 2.4 CTC Based ASR Systems, 3.3 Input Features and 3.4 Network Architecture) (Watanabe, [0035 – 0039] [0043]), yet fails to teach that data indicating the language or dialect is provided as input to one or more neural network layers of an encoder of the speech recognition model.
However, Waibel discloses a method for speech recognition (Abstract), wherein a language feature vector is appended to acoustic features used as input to a speech recognition model (Figure 1; 2. Language Adaptation).
Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to modify the combined teachings of Catanzaro, Müller and Watanabe with Waibel’s teachings so that the data indicating the language or dialect (language feature vectors) is provided as input (a part of the acoustic features) to one or more neural network layers of an encoder of the speech recognition model for the purpose of increasing the ability of the end-to-end speech recognition system to recognize speech from multiple languages simultaneously (Müller, I. Introduction).
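For illustration only, appending a language feature vector to the acoustic features so that it forms part of the encoder input, as relied upon from Waibel (Figure 1; 2. Language Adaptation), can be sketched as follows. This is a minimal sketch and does not reproduce Waibel’s implementation: the function name append_language_vector, the feature dimensions, and the numeric values are hypothetical assumptions.

```python
def append_language_vector(frames, language_vector):
    """Concatenate the same language vector onto every acoustic frame.

    Hypothetical helper: `frames` is a list of per-frame feature lists and
    `language_vector` is a fixed-length list describing the language.
    """
    return [list(frame) + list(language_vector) for frame in frames]


# Two made-up acoustic frames with two features each (assumption).
acoustic_frames = [[0.1, 0.2], [0.3, 0.4]]
# Made-up language feature vector with three entries (assumption).
language_features = [0.0, 1.0, 0.0]

# Each augmented frame is what the first encoder layer would receive.
encoder_input = append_language_vector(acoustic_frames, language_features)
```

Under these assumptions every frame grows from two to five values, and the language information is repeated unchanged across time steps.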

For claim 11, the combination of Catanzaro, Müller and Watanabe discloses wherein the speech recognition model has an encoder and decoder (Catanzaro, end-to-end speech recognition to predict graphemes can be an RNN encoder-decoder with attention or an RNN-CTC, [0044] [0045]) (Müller, the use of bottleneck features as input features is common for speech recognition systems so that the speech recognition systems can discriminate between phones, and neural networks can be adapted to various conditions such as different languages using parameters e.g. language feature vectors; 2.3 Neural Network Adaptation, 2.4 CTC Based ASR Systems, 3.3 Input Features and 3.4 Network Architecture) (Watanabe, [0035 – 0039] [0043]), yet fails to teach that data indicating the language or dialect is provided as input to one or more neural network layers of the decoder of the speech recognition model.

Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to modify the combined teachings of Catanzaro, Müller and Watanabe with Waibel’s teachings so that the data indicating the language or dialect (language feature vectors) is provided as input to one or more neural network layers including one or more neural network layers of the decoder by modulating the outputs of a previous network layer of the speech recognition model for the purpose of increasing the ability of the end-to-end speech recognition system to recognize speech from multiple languages simultaneously (Müller, I. Introduction).

For claim 12, the combination of Catanzaro, Müller and Watanabe discloses wherein the speech recognition model has an encoder and decoder (Catanzaro, end-to-end speech recognition to predict graphemes can be an RNN encoder-decoder with attention or an RNN-CTC, [0044] [0045]) (Müller, the use of bottleneck features as input features is common for speech recognition systems so that the speech recognition systems can discriminate between phones, and neural networks can be adapted to various conditions such as different languages using parameters e.g. language feature vectors; 2.3 Neural Network Adaptation, 2.4 CTC Based ASR Systems, 3.3 Input Features and 3.4 Network Architecture) (Watanabe, [0035 – 0039] [0043]), yet fails to teach that data indicating the language or dialect is provided as input to one or more neural network layers of an encoder of the speech recognition model and to one or more neural network layers of the decoder of the speech recognition model.

Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to modify the combined teachings of Catanzaro, Müller and Watanabe with Waibel’s teachings so that the data indicating the language or dialect (language feature vectors) is provided as input to one or more neural network layers including one or more neural network layers of the encoder and decoder by modulating the outputs of a previous network layer of the speech recognition model for the purpose of increasing the ability of the end-to-end speech recognition system to recognize speech from multiple languages simultaneously (Müller, I. Introduction).

For claim 13, the combination of Catanzaro, Müller and Watanabe discloses wherein the speech recognition model has an encoder and decoder (Catanzaro, end-to-end speech recognition to predict graphemes can be an RNN encoder-decoder with attention or an RNN-CTC, [0044] [0045]) (Müller, the use of bottleneck features as input features is common for speech recognition systems so that the speech recognition systems can discriminate between phones, and neural networks can be adapted to various conditions such as different languages using parameters e.g. language feature vectors; 2.3 Neural Network Adaptation, 2.4 CTC Based ASR Systems, 3.3 Input Features and 3.4 Network Architecture) (Watanabe, [0035 – 0039] [0043]), yet fails to teach that data indicating the language or dialect is provided as input to each neural network layer of the encoder and each neural network layer of the decoder of the speech recognition model.

Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to modify the combined teachings of Catanzaro, Müller and Watanabe with Waibel’s teachings so that the data indicating the language or dialect (language feature vectors) is provided as input to each neural network layer of the encoder and the decoder by appending it to the acoustic features as input and by modulating the outputs of a previous network layer of the speech recognition model for the purpose of increasing the ability of the end-to-end speech recognition system to recognize speech from multiple languages simultaneously (Müller, I. Introduction).

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SONIA L GAY whose telephone number is (571)270-1951.  The examiner can normally be reached on Monday-Friday 9-5 ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/SONIA L GAY/Primary Examiner, Art Unit 2657