Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
2.	In response to the Office action mailed on 02/28/2022, applicant filed an amendment on 05/26/2022, amending claims 1 and 13. The pending claims are 1-24.


	EXAMINER’S AMENDMENT
3.	An examiner’s amendment to the record appears below. Should the changes and/or additions be unacceptable to applicant, an amendment may be filed as provided by 37 CFR 1.312. To ensure consideration of such an amendment, it MUST be submitted no later than the payment of the issue fee.
Authorization for this examiner’s amendment was given in an interview with Grant Griffith on 06/02/2022.
The application has been amended as follows:
In the claims:
1.	(Amended) A method comprising:
receiving, at data processing hardware, audio data for an utterance spoken in a particular native language;
obtaining, by the data processing hardware, a language vector identifying the particular native language;
processing, by the data processing hardware, using a multilingual end-to-end (E2E) speech recognition model that uses a recurrent neural network-transducer (RNN-T) architecture comprising an encoder network, the language vector and acoustic features derived from the audio data to generate a transcription for the utterance, the multilingual E2E speech recognition model comprising a plurality of language-specific adaptor modules that include one or more adaptor modules specific to the particular native language and one or more other adaptor modules specific to at least one other native language different than the particular native language, wherein processing the language vector causes the multilingual E2E speech recognition model to activate the one or more adaptor modules specific to the particular native language so that the multilingual E2E speech recognition model only applies the activated one or more adaptor modules specific to the particular native language without applying the one or more other adaptor modules specific to the at least one other native language when processing the acoustic features derived from the audio data to generate the transcription; and
providing, by the data processing hardware, the transcription for output, 
wherein the encoder network of the RNN-T architecture comprises:
a plurality of stacked Long Short-Term Memory (LSTM) layers; and
after each LSTM layer, a respective layer comprising a respective subset of the plurality of language-specific adaptor modules, each language-specific adaptor module in the respective layer specific to a different respective native language, wherein one of the language-specific adaptor modules in the respective layer is specific to the particular native language.

7.	(Amended) The method of claim 1, wherein the  comprises
[[an ]]the encoder network configured to generate, at each of a plurality of time steps, a higher-order feature representation from an input vector, the input vector comprising a concatenation of the language vector and the acoustic features derived from the audio data; 
a prediction network configured to process a sequence of previously output non-blank symbols into a dense representation; and
a joint network configured to predict, at each of the plurality of time steps, a probability distribution over possible output labels based on the higher-order feature representation output by the encoder network and the dense representation output by the prediction network.  

8.	(Canceled) 

13.	(Amended) A system comprising:
data processing hardware; and
memory hardware in communication with the data processing hardware, the memory hardware storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations comprising: 
receiving audio data for an utterance spoken in a particular native language;
obtaining a language vector identifying the particular native language;
processing, using a multilingual end-to-end (E2E) speech recognition model that uses a recurrent neural network-transducer (RNN-T) architecture comprising an encoder network, the language vector and acoustic features derived from the audio data to generate a transcription for the utterance, the multilingual E2E speech recognition model comprising a plurality of language-specific adaptor modules that include one or more adaptor modules specific to the particular native language and one or more other adaptor modules specific to at least one other native language different than the particular native language, wherein processing the language vector causes the multilingual E2E speech recognition model to activate the one or more adaptor modules specific to the particular native language so that the multilingual E2E speech recognition model only applies the activated one or more adaptor modules specific to the particular native language without applying the one or more other adaptor modules specific to the at least one other native language when processing the acoustic features derived from the audio data to generate the transcription; and
providing the transcription for output, 
wherein the encoder network of the RNN-T architecture comprises:
a plurality of stacked Long Short-Term Memory (LSTM) layers; and
after each LSTM layer, a respective layer comprising a respective subset of the plurality of language-specific adaptor modules, each language-specific adaptor module in the respective layer specific to a different respective native language, wherein one of the language-specific adaptor modules in the respective layer is specific to the particular native language.

19.	(Amended) The system of claim 13, wherein the  comprises
[[an ]]the encoder network configured to generate, at each of a plurality of time steps, a higher-order feature representation from an input vector, the input vector comprising a concatenation of the language vector and the acoustic features derived from the audio data; 
a prediction network configured to process a sequence of previously output non-blank symbols into a dense representation; and
a joint network configured to predict, at each of the plurality of time steps, a probability distribution over possible output labels based on the higher-order feature representation output by the encoder network and the dense representation output by the prediction network.  

20.	(Canceled) 
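
For illustration only, the input and output relationships recited in claims 7 and 19 can be sketched as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: the one-hot encoding of the language vector, the language inventory LANGUAGES, the additive combination inside joint, and all numeric values are hypothetical choices made here for the example.

```python
import math

# Hypothetical language inventory; the claims do not enumerate languages.
LANGUAGES = ["en", "fr", "hi"]


def language_vector(lang):
    """Return a one-hot vector identifying the language (assumed encoding)."""
    return [1.0 if lang == candidate else 0.0 for candidate in LANGUAGES]


def encoder_input(lang, acoustic_frame):
    """Concatenate the language vector with one frame of acoustic features."""
    return language_vector(lang) + list(acoustic_frame)


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def joint(encoder_out, prediction_out, weights):
    """Toy joint network: add the two representations, project, normalize.

    The additive combination and the projection matrix are assumptions;
    the claims only require a probability distribution over output labels.
    """
    combined = [e + p for e, p in zip(encoder_out, prediction_out)]
    logits = [sum(w * c for w, c in zip(row, combined)) for row in weights]
    return softmax(logits)


# One time step: a 4-dimensional acoustic frame for an utterance in "fr".
frame = [0.2, -0.1, 0.5, 0.0]
x = encoder_input("fr", frame)

# Placeholder encoder and prediction-network outputs for the same step.
probs = joint([0.1, 0.3], [0.2, -0.1], [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
```

In this sketch, x carries the language identity into the encoder at every time step, and probs is a distribution over three hypothetical output labels.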



Allowable Subject Matter
4.	Claims 1-7, 9-19, and 21-24 are allowed.
The following is an examiner’s statement of reasons for allowance: 
The prior art does not teach or suggest processing, using a multilingual end-to-end (E2E) speech recognition model that uses a recurrent neural network-transducer (RNN-T) architecture comprising an encoder network, the language vector and acoustic features derived from the audio data to generate a transcription for the utterance, the multilingual E2E speech recognition model comprising a plurality of language-specific adaptor modules that include one or more adaptor modules specific to the particular native language and one or more other adaptor modules specific to at least one other native language different than the particular native language, wherein processing the language vector causes the multilingual E2E speech recognition model to activate the one or more adaptor modules specific to the particular native language so that the multilingual E2E speech recognition model only applies the activated one or more adaptor modules specific to the particular native language without applying the one or more other adaptor modules specific to the at least one other native language when processing the acoustic features derived from the audio data to generate the transcription; and providing the transcription for output, wherein the encoder network of the RNN-T architecture comprises: a plurality of stacked Long Short-Term Memory (LSTM) layers; and after each LSTM layer, a respective layer comprising a respective subset of the plurality of language-specific adaptor modules, each language-specific adaptor module in the respective layer specific to a different respective native language, wherein one of the language-specific adaptor modules in the respective layer is specific to the particular native language, as claimed by independent claims 1 and 13.
Dependent claims 2-7, 9-12, 14-19, and 21-24 are allowed for being dependent and further limiting independent claims 1 and 13.
Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee, and to avoid processing delays, should preferably accompany the issue fee.  Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance.”
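
For illustration only, the allowed encoder arrangement (a language-specific adaptor layer after each stacked layer, with only the adaptor for the identified language applied) can be sketched as follows. This is a minimal sketch under stated assumptions: placeholder_lstm is a stand-in transform and not a real LSTM, the residual form of make_adapter is one possible adaptor design, and the names Encoder, LANGUAGES, and applied are hypothetical and do not appear in the record.

```python
# Hypothetical language inventory; the claims do not enumerate languages.
LANGUAGES = ["en", "fr", "hi"]


def make_adapter(scale):
    """Build a toy residual adaptor that adds a scaled copy of its input."""
    def adapter(hidden):
        return [value + scale * value for value in hidden]
    return adapter


def placeholder_lstm(hidden):
    """Stand-in for one LSTM layer; shifts each value by 1.0 (not a real LSTM)."""
    return [value + 1.0 for value in hidden]


class Encoder:
    """Stacked layers, each followed by one adaptor per language."""

    def __init__(self, num_layers):
        self.layers = [placeholder_lstm] * num_layers
        # After every layer: a subset of adaptors, one per language.
        self.adapters = [
            {lang: make_adapter(0.1 * (i + 1)) for i, lang in enumerate(LANGUAGES)}
            for _ in range(num_layers)
        ]
        self.applied = []  # records which adaptor ran after each layer

    def forward(self, hidden, lang_vec):
        # The language vector (assumed one-hot) selects the adaptor to activate.
        lang = LANGUAGES[lang_vec.index(1.0)]
        for layer, adapters in zip(self.layers, self.adapters):
            hidden = layer(hidden)
            # Only the adaptor for the identified language is applied;
            # adaptors for the other languages are never called.
            hidden = adapters[lang](hidden)
            self.applied.append(lang)
        return hidden


enc = Encoder(num_layers=2)
out = enc.forward([0.0], [0.0, 1.0, 0.0])  # language vector selecting "fr"
```

After the call, enc.applied lists one entry per layer, all naming the same language, which mirrors the limitation that the other adaptor modules are not applied.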

Conclusion
5.	The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. See form PTO-892.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ABDELALI SERROU, whose telephone number is (571)272-7638.  The examiner can normally be reached M-F, 9 AM - 5 PM.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir, can be reached at 571-272-7799.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

	/ABDELALI SERROU/            Primary Examiner, Art Unit 2659