DETAILED ACTION
This action is in response to the initial filing of Application no. 17/572,238 on 01/20/2022.
Claims 1 – 20 are still pending in this application, with claims 1 and 11 being independent.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/process/file/efs/guidance/eTD-info-I.jsp.

Claims 1 – 20 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1 – 20 of U.S. Patent No. 11,238,845. Although the claims at issue are not identical, they are not patentably distinct from each other.

The claim mapping is as follows.

Current Application

1. A computer-implemented method of performing speech recognition, the method when executed on data processing hardware causes the data processing hardware to perform operations comprising: receiving audio data indicating audio characteristics of an utterance; providing, as input to a speech recognition model, speech features determined based on the audio data, wherein the speech recognition model has been trained, using cluster adaptive training: to recognize linguistic units for each of multiple different languages or dialects, with each of the multiple languages or dialects corresponding to a separate cluster; and to receive, as input, different identifiers that specify the different clusters corresponding to the respective languages or dialects; based on the speech features provided as input to the speech recognition model, generating, as output from the speech recognition model at each of a plurality of time steps, an output vector at the corresponding time step indicating a probability distribution over a predetermined set of linguistic units for each of the multiple different languages or dialects the speech recognition model has been trained to recognize; and providing, as an output of the automated speech recognition system, a transcription of the utterance generated based on the output vectors generated as output from the speech recognition model at each of the plurality of time steps.

2. The method of claim 1, wherein: the speech recognition model comprises an encoder, a decoder, and an attention model that learns alignments between outputs of the encoder and the decoder; and the encoder, the decoder, and the attention model each comprise one or more neural network layers that have parameters learned through training using training examples representing speech in the multiple languages or dialects.

3. The method of claim 1, wherein the linguistic units are graphemes.

4. The method of claim 1, wherein the linguistic units are word pieces.

5. The method of claim 1, wherein the speech recognition model is further trained to: output scores indicative of labels representing the different languages or dialects; and generate output sequences that include one of the labels representing the different languages or dialects.

6. The method of claim 1, wherein the operations further comprise: determining a language or dialect of the utterance; and providing, as input to the speech recognition model, data indicating the language or dialect of the utterance, wherein the output vector generated as output from the speech recognition model at the corresponding time step is generated based on the speech features and the data indicating the language or dialect of the utterance provided as input to the speech recognition model.

7. The method of claim 6, wherein providing the data indicating the language or dialect comprises providing a 1-hot vector having a value corresponding to each of a predetermined set of languages or dialects.

8. The method of claim 6, wherein: the data comprises an embedding corresponding to the language or dialect; and the embedding has been learned through training.

9. The method of claim 6, wherein the data indicating the language or dialect is provided as input to one or more neural network layers of an encoder of the speech recognition model.

10. The method of claim 6, wherein the data indicating the language or dialect is provided as input to one or more neural network layers of a decoder of the speech recognition model.

11. A system comprising: data processing hardware; and memory hardware in communication with the data processing hardware and storing instructions that when executed on the data processing hardware causes the data processing hardware to perform operations comprising: receiving audio data indicating audio characteristics of an utterance; providing, as input to a speech recognition model, speech features determined based on the audio data, wherein the speech recognition model has been trained, using cluster adaptive training: to recognize linguistic units for each of multiple different languages or dialects, with each of the multiple languages or dialects corresponding to a separate cluster; and to receive, as input, different identifiers that specify the different clusters corresponding to the respective languages or dialects; based on the speech features provided as input to the speech recognition model, generating, as output from the speech recognition model at each of a plurality of time steps, an output vector at the corresponding time step indicating a probability distribution over a predetermined set of linguistic units for each of the multiple different languages or dialects the speech recognition model has been trained to recognize; and providing, as an output of the automated speech recognition system, a transcription of the utterance generated based on the output vectors generated as output from the speech recognition model at each of the plurality of time steps.

12. The system of claim 11, wherein: the speech recognition model comprises an encoder, a decoder, and an attention model that learns alignments between outputs of the encoder and the decoder; and the encoder, the decoder, and the attention model each comprise one or more neural network layers that have parameters learned through training using training examples representing speech in the multiple languages or dialects.

13. The system of claim 11, wherein the linguistic units are graphemes.

14. The system of claim 11, wherein the linguistic units are word pieces.

15. The system of claim 11, wherein the speech recognition model is further trained to: output scores indicative of labels representing the different languages or dialects; and generate output sequences that include one of the labels representing the different languages or dialects.

16. The system of claim 11, wherein the operations further comprise: determining a language or dialect of the utterance; and providing, as input to the speech recognition model, data indicating the language or dialect of the utterance, wherein the output vector generated as output from the speech recognition model at the corresponding time step is generated based on the speech features and the data indicating the language or dialect of the utterance provided as input to the speech recognition model.

17. The system of claim 16, wherein providing the data indicating the language or dialect comprises providing a 1-hot vector having a value corresponding to each of a predetermined set of languages or dialects.

18. The system of claim 16, wherein: the data comprises an embedding corresponding to the language or dialect; and the embedding has been learned through training.

19. The system of claim 16, wherein the data indicating the language or dialect is provided as input to one or more neural network layers of an encoder of the speech recognition model.

20. The system of claim 16, wherein the data indicating the language or dialect is provided as input to one or more neural network layers of a decoder of the speech recognition model.

US 11,238,845

1. A method of performing speech recognition using an automated speech recognition system comprising one or more computers, the method comprising: receiving, by the one or more computers of the automated speech recognition system, audio data indicating audio characteristics of an utterance; providing, by the one or more computers of the automated speech recognition system, input features determined based on the audio data to a speech recognition model that has been trained to output score indicating the likelihood of linguistic units for each of multiple different languages or dialects, wherein the speech recognition model has been trained using cluster adaptive training, with each of the multiple languages or dialects corresponding to a separate cluster, and wherein the speech recognition model is configured to receive different identifiers as input to the speech recognition model to specify the different clusters corresponding to the respective languages or dialects; receiving, by the one or more computers of the automated speech recognition system, output that the speech recognition model generated in response to receiving the input features determined based on the audio data; and providing, as an output of the automated speech recognition system, a transcription of the utterance generated based on the output of the speech recognition model, wherein the speech recognition model has been trained using multi-task learning using: a first objective function corresponding to grapheme prediction; and a second objective function corresponding to a language or dialect classification cost, the first objective function and second objective function being weighted such that the speech recognition model is trained to learn hidden representations that are effective for both language and dialect classification and grapheme prediction.

2. The method of claim 1, wherein: the speech recognition model comprises an encoder, a decoder, and an attention model that learns alignments between outputs of the encoder and the decoder; and the encoder, the decoder, and the attention model each comprise one or more neural network layers that have parameters learned through training using the using training examples representing speech in multiple languages or dialects.

3. The method of claim 1, wherein the linguistic units are graphemes, and the speech recognition model is configured to provide output indicating a probability distribution over a predetermined set of graphemes.

4. The method of claim 1, wherein the speech recognition model is trained to output scores indicative of labels representing different languages or dialects, and wherein the speech recognition model is trained to generate output sequences that include one of the labels representing the different languages or dialects.

5. The method of claim 4, wherein the labels for the language or dialect are included in the output sequences.

6. The method of claim 1, further comprising: determining a language or dialect of the utterance; and providing, as input to the speech recognition model, data indicating the language or dialect as input to one or more neural network layers of the speech recognition model, wherein the output of the speech recognition model is generated based on input features determined from the audio data for the utterance and the data indicating the language or dialect of the utterance.

7. The method of claim 6, wherein providing data indicating the language or dialect comprises providing a 1-hot vector having a value corresponding to each of a predetermined set of languages or dialects.

8. The method of claim 6, wherein the data comprises an embedding corresponding to the language or dialect, wherein the embedding has been learned through training.

9. The method of claim 6, wherein the data indicating the language or dialect is provided as input to one or more neural network layers of an encoder of the speech recognition model.

10. The method of claim 6, wherein the data indicating the language or dialect is provided as input to one or more neural network layers of the of a decoder of the speech recognition model.

11. The method of claim 6, wherein the data indicating the language or dialect is provided as input to one or more neural network layers of an encoder of the speech recognition model and to one or more neural network layers of the decoder of the speech recognition model.

12. The method of claim 11, wherein the data indicating the language or dialect is provided as input to each neural network layer of the encoder and to each neural network layer of the decoder.

13. The method of claim 12, wherein at each neural network layer of the encoder and the decoder, a vector indicative of the language or dialect is linearly transformed by the weight matrices of the neural network layer and added to the original hidden activations before a nonlinearity is applied.

14. The method of claim 1, wherein the speech recognition model has been trained using cluster adaptive training, with each language or dialect corresponding to a separate cluster, and wherein each language or dialect has a corresponding language or dialect identifier provided as input to the speech recognition model to specify the use of the language or dialect.

15. The method of claim 14, wherein the language or dialect identifiers are one-hot vectors.

16. The method of claim 1, wherein the speech recognition model has been trained using cluster adaptive training, with each language or dialect corresponding to a separate cluster, and wherein language or dialect embedding vectors learned through training are used as weights to combine clusters.

17. The method of claim 1, wherein: the speech recognition model comprises an encoder, a decoder, and an attention model that learns alignments between outputs of the encoder and the decoder; the encoder, the decoder, and the attention model each comprise one or more neural network layers that have parameters learned through training using the training examples representing speech in multiple languages or dialects; the speech recognition model has been trained using cluster adaptive training, with each language or dialect corresponding to a separate cluster; for each cluster, a single LSTM layer is used with output projection to match the dimension of a particular layer of the speech recognition model; a weighted sum of all the cluster adaptive trained bases using dialect vectors as interpolation weights is added back to the outputs of the particular layer to generate an aggregated output vector; and the aggregated output vector is provided as input to a last layer of the encoder of the speech recognition model.

18. A system comprising: one or more computers; and one or more computer-readable media storing instructions that, when executed by the one or more computers, cause the one or more computers to perform the following operations: receiving, by the one or more computers of the automated speech recognition system, audio data indicating audio characteristics of an utterance; providing, by the one or more computers of the automated speech recognition system, input features determined based on the audio data to a speech recognition model that has been trained to output score indicating the likelihood of linguistic units for each of multiple different languages or dialects, wherein the speech recognition model has been trained using cluster adaptive training, with each of the multiple languages or dialects corresponding to a separate cluster, and wherein the speech recognition model is configured to receive different identifiers as input to the speech recognition model to specify the different clusters corresponding to the respective languages or dialects; receiving, by the one or more computers of the automated speech recognition system, output that the speech recognition model generated in response to receiving the input features determined based on the audio data; and providing, as an output of the automated speech recognition system, a transcription of the utterance generated based on the output of the speech recognition model, wherein the speech recognition model has been trained using multi-task learning using: a first objective function corresponding to grapheme prediction; and a second objective function corresponding to a language or dialect classification cost, the first objective function and second objective function being weighted such that the speech recognition model is trained to learn hidden representations that are effective for both language and dialect classification and grapheme prediction.

19. One or more non-transitory computer-readable media storing instructions that, when executed by the one or more computers, cause the one or more computers to perform the following operations: receiving, by the one or more computers of the automated speech recognition system, audio data indicating audio characteristics of an utterance; providing, by the one or more computers of the automated speech recognition system, input features determined based on the audio data to a speech recognition model that has been trained to output score indicating the likelihood of linguistic units for each of multiple different languages or dialects, wherein the speech recognition model has been trained using cluster adaptive training, with each of the multiple languages or dialects corresponding to a separate cluster, and wherein the speech recognition model is configured to receive different identifiers as input to the speech recognition model to specify the different clusters corresponding to the respective languages or dialects; receiving, by the one or more computers of the automated speech recognition system, output that the speech recognition model generated in response to receiving the input features determined based on the audio data; and providing, as an output of the automated speech recognition system, a transcription of the utterance generated based on the output of the speech recognition model, wherein the speech recognition model has been trained using multi-task learning using: a first objective function corresponding to grapheme prediction; and a second objective function corresponding to a language or dialect classification cost, the first objective function and second objective function being weighted such that the speech recognition model is trained to learn hidden representations that are effective for both language and dialect classification and grapheme prediction.


As shown above, claims 1 – 20 of US 11,238,845 either anticipate or render obvious claims 1 – 20 of the currently pending application. Therefore, claims 1 – 20 of the currently pending application are obvious variants of claims 1 – 20 of US 11,238,845.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1 – 4, 6, 9 – 14, 16, 19 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Chan et al. (US 9,799,327) in view of Toshniwal et al. (“Multilingual Speech Recognition with a Single End-to-End Model”) and further in view of Xue (US 10,891,944).
For claims 1 and 11, Chan discloses a system performing a method (Abstract), comprising: data processing hardware (column 4 lines 1 – 30; column 12 lines 10 – 31) and memory hardware in communication with the data processing hardware and storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations (column 12 lines 10 – 31) comprising: receiving audio data indicating audio characteristics of an utterance (Fig.1, 102; column 4 lines 31 – 42); providing, as input to a speech recognition model (Fig.1, 100), speech features (filter bank spectra feature vectors) determined based on the audio data (column 4 lines 45 – 50; column 7 lines 15 – 30); based on the speech features provided as input to the speech recognition model, generating, as output from the speech recognition model at each of a plurality of time steps, scores for a predetermined set of linguistic units (substrings including a set of alphabetic letters which is used to write one or more natural languages) (column 4 lines 65 – column 5 line 22; column 6 lines 13 – column 7 line 6); and providing, as an output of the automated speech recognition system, a transcription of the utterance generated based on the scores generated as output from the speech recognition model at each of the plurality of time steps (the system generates a sequence of substrings that represent a transcription of the utterance (step 408); the generated sequence of substrings may begin with a start of sequence token <sos> and end with an end of sequence token <eos>; column 7 lines 53 – column 8 line 19). Yet, Chan fails to teach the following: the speech recognition model has been trained, using cluster adaptive training: to recognize linguistic units for each of multiple different languages or dialects, with each of the multiple languages or dialects corresponding to a separate cluster; and to receive, as input, different identifiers that specify the different clusters corresponding to the respective languages or dialects; and the output scores further comprise an output vector at the corresponding time step indicating a probability distribution over a predetermined set of linguistic units for each of the multiple different languages or dialects the speech recognition model has been trained to recognize.
However, Toshniwal discloses a method for performing multilingual speech recognition with a single end-to-end model (Abstract), wherein a speech recognition model (Listen, Attend and Spell attention-based sequence-to-sequence ASR model, 2.1 LAS Model, pg. 1 and 2) is trained to recognize linguistic units for each of multiple different languages (2.2 Multilingual Models, 2.2.1 Joint, 2.2.2 Multitask and 3.2 Model and Training, pg. 2 and 3) and to receive, as input, different identifiers that correspond to the respective languages (2.2.3 Conditioned, pg. 2). Furthermore, the speech recognition model outputs a vector at a corresponding time step indicating a probability distribution over a predetermined set of linguistic units (characters/graphemes) for each of the multiple different languages (2.1 LAS Model and 2.2 Multilingual Models, pg. 1 and 2).
Furthermore, Xue discloses an adaptive and compensatory speech recognition method (Abstract), wherein a speech recognition model is adaptively trained using clustered feature vectors (column 7 lines 44 – 49; column 9 lines 5 – 29; column 10 lines 35 - 44).
Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to improve Chan’s invention in the same way that Toshniwal’s invention has been improved to achieve the following predictable results for the purpose of enhancing the sequence-to-sequence speech recognition model (Chan, LAS, column 3 lines 15 – 45) (Toshniwal, 2.1 LAS Model, pg. 1 and 2) to recognize multiple languages (Toshniwal, 1. Introduction, pg. 1): the speech recognition model has been trained to recognize linguistic units for each of multiple different languages or dialects; the speech recognition model has been trained to receive, as input, different identifiers corresponding to respective languages or dialects; and the scores output by the speech recognition model further comprise an output vector at the corresponding time step indicating a probability distribution over a predetermined set of linguistic units for each of the multiple different languages or dialects the speech recognition model has been trained to recognize.
Furthermore, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to apply the cluster adaptive training technique disclosed by Xue to the speech recognition model disclosed by the combination of Chan and Toshniwal to achieve the predictable results of the speech recognition model being further trained using cluster adaptive training, wherein the speech features are clustered according to the type of speech recognition, e.g. language for language adaptation or dialect (column 7 lines 44 – 49) for the purpose of improving the performance of the speech recognition model by modifying it to efficiently and accurately recognize multilingual speech (Xue, column 6 lines 57 – 65).

For claims 2 and 12, Chan and Toshniwal further disclose wherein: the speech recognition model comprises an encoder (Chan, column 5 lines 15 – column 6 line 12) (Toshniwal, 2.1 LAS Model, pg. 1 and 2), a decoder (Chan, column 6 lines 13 – column 7 line 7) (Toshniwal, 2.1 LAS Model, pg. 1 and 2), and an attention model that learns alignments between outputs of the encoder and the decoder (Chan, column 6 lines 13 – column 7 line 7) (Toshniwal, 2.1 LAS Model, pg. 1 and 2); and the encoder, the decoder, and the attention model each comprise one or more neural network layers that have parameters learned through training using training examples representing speech in the multiple languages (Chan, column 5 lines 15 – column 7 line 7) (Toshniwal, 2.1 LAS Model, 2.2 Multilingual Models, 3.1 Data and 3.2 Model and Training Details, pg. 2 and 3) or dialects.

For claims 3 and 13, Chan and Toshniwal further disclose, wherein the linguistic units are graphemes (Chan, column 4 lines 31 – 40, 51 – 64) (Toshniwal, 2.1 LAS Model, pg. 1).
 
For claims 4 and 14, Chan further discloses wherein the linguistic units are word pieces (Chan, linguistic units are substrings comprising one or more characters in an alphabet, column 4 lines 31 – 40, 51 – 64).

For claims 6 and 16, Toshniwal further discloses, wherein the operations further comprise: determining a language (Toshniwal, 2.2.3 Conditioned, pg. 2) or dialect of the utterance; and providing, as input to the speech recognition model, data indicating the language (Toshniwal, 2.2.3 Conditioned, pg. 2) or dialect of the utterance, wherein the output vector generated as output from the speech recognition model at the corresponding time step is generated based on the speech features and the data indicating the language (Toshniwal, 2.1 LAS Model, 2.2.3 Conditioned, pg. 2) or dialect of the utterance provided as input to the speech recognition model.

For claims 9 and 19, Toshniwal further discloses, wherein the data indicating the language or dialect is provided as input to one or more neural network layers of an encoder of the speech recognition model (Toshniwal, 2.2.3 Conditioned, pg. 2).

For claims 10 and 20, Toshniwal further discloses, wherein the data indicating the language or dialect is provided as input to one or more neural network layers of a decoder of the speech recognition model (Toshniwal, 2.2.3 Conditioned, pg. 2).

Claim(s) 5 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Chan et al. (US 9,799,327) in view of Toshniwal et al. (“Multilingual Speech Recognition with a Single End-to-End Model”), and further in view of Xue (US 10,891,944) and further in view of Watanabe et al. (US 2019/0189111) (“Watanabe”).
For claims 5 and 15, the combination of Chan, Toshniwal and Xue fails to teach, wherein the speech recognition model is further trained to: output scores indicative of labels representing the different languages or dialects; and generate output sequences that include one of the labels representing the different languages or dialects.
However, Watanabe discloses a system and method for multi-lingual end-to-end speech recognition (Abstract), wherein a speech recognition model (End-to-end Speech Recognition Module, Fig.1, 200 and Fig.2, 200) is trained to output scores indicative of labels representing different languages (posterior probability distributions of labels which include language identifiers output from the attention decoder,Fig.2, 204;  [0037] [0038] [0051 – 0055] [0063] [0064]) and generate output sequences that include one of the labels representing the different languages (Fig.2, 206 and 207; [0038] [0039] [0063] [0064]).
Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to improve the invention disclosed by the combination of Chan, Toshniwal and Xue in the same way the Watanabe’s invention has been improved to achieve the following  predictable results for the purpose of  automatically recognizing multilingual utterances and jointly identifying the language of each utterance using a single end-to-end speech recognition model (Watanabe, [0004 – 0008]): the speech recognition model is further trained to: output scores indicative of labels representing the different languages or dialects; and generate output sequences that include one of the labels representing the different languages or dialects.
Claims 7 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Chan et al. (US 9,799,327) in view of Toshniwal et al. (“Multilingual Speech Recognition with a Single End-to-End Model”), and further in view of Xue (US 10,891,944), and further in view of Waibel (“Using Language Adaptive Deep Neural Networks for Improved Multilingual Speech Recognition”).
For claims 7 and 17, the combination of Chan, Toshniwal and Xue fails to teach, wherein providing the data indicating the language or dialect comprises providing a 1-hot vector having a value corresponding to each of a predetermined set of languages or dialects.
However, Waibel discloses a method for speech recognition (Abstract), wherein a 1-hot vector corresponding to each of a predetermined set of languages is provided as input to a speech recognition model (Figure 2; Language Adaptive Deep Neural Networks).
Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to modify the combined teachings of Chan, Toshniwal and Xue with Waibel’s teachings so that providing data indicating the language or dialect further comprises providing a 1-hot vector having a value corresponding to each of a predetermined set of languages or dialects for the purpose of improving the performance of the speech recognition model by modifying it to efficiently and accurately recognize multilingual speech (Xue, column 6 lines 57 – 65).

Claims 8 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Chan et al. (US 9,799,327) in view of Toshniwal et al. (“Multilingual Speech Recognition with a Single End-to-End Model”), and further in view of Xue (US 10,891,944), and further in view of Waibel (“Using Language Adaptive Deep Neural Networks for Improved Multilingual Speech Recognition”) and further in view of Garman et al. (US 2021/0256961) (“Garman”).
For claims 8 and 18, the combination of Chan, Toshniwal and Xue further discloses wherein the data comprises an embedding corresponding to the language (Toshniwal, 2.2.3 Conditioned, pg.3) or dialect. Yet, the combination of Chan, Toshniwal and Xue fails to teach that the embedding has been learned through training.
However, Garman discloses a method for synthesizing speech (Abstract), wherein language embeddings are learned through training ([0033] [0035]).
Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to modify the combined teachings of Chan, Toshniwal and Xue with Garman’s teachings so that the language embeddings are learned through training for the purpose of enabling a model to be accurately and explicitly conditioned on speech language to better allocate its capacity appropriately across languages (Toshniwal, 2.2.3 Conditioned, pg.2).

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SONIA L GAY whose telephone number is (571)270-1951. The examiner can normally be reached Monday-Friday 9-5 ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel Washburn, can be reached at 571-272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/SONIA L GAY/ Primary Examiner, Art Unit 2657