Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.


Claim Rejections - 35 USC § 103
2.	In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103, which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-24 are rejected under 35 U.S.C. 103 as being unpatentable over Gong (US 2021/0233517) in view of Bo (MULTI-DIALECT SPEECH RECOGNITION WITH A SINGLE SEQUENCE-TO-SEQUENCE MODEL, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2018).
As per claim 1, Gong teaches receiving, at data processing hardware, audio data for an utterance spoken in a particular native language ([0041], the computing device 110 may capture a voice input 204 of the user);  
obtaining a language vector ([0023]-[0025], [0035], obtaining spectrograms or other alternative representations such as xyz-axis to identify a particular language);
processing, by the data processing hardware, using a multilingual end-to-end (E2E) speech recognition model, the language vector and acoustic features derived from the audio data to generate a transcription for the utterance ([0035]-[0036], [0025], [0029], processing acoustic features and alternative representations such as xyz-axis to identify a particular language and generating messages in the form of text in the identified language),
the multilingual E2E speech recognition model comprising a plurality of language-specific adaptor modules that include one or more adaptor modules specific to the particular native language and one or more other adaptor modules specific to at least one other native language different than the particular native language (training the multilingual E2E speech recognition system using a plurality of models corresponding to different languages, [0028], [0023]: each word or phrase of each language as represented in the spectrogram or an alternative representation (e.g., intensity, pitch and intensity, etc.) may correspond to a unique pattern, and the training of the model can lead to an accurate identification of the language of the speech based on the pattern analysis); and
providing, by the data processing hardware, the transcription for output (providing messages in the form of text in the identified language, [0035]-[0036], [0025], [0029]).
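Gong may not explicitly disclose obtaining a language vector. However, the prior art teaches this feature. Bo, in the same field of endeavor, teaches a multilingual E2E speech recognition system obtaining language/dialect information as a vector (Section 2.2, page 4750 and Section 4.2.2, page 4751). Therefore, it would have been obvious before the effective filing date of the claimed invention to use Bo’s vector feature with the multilingual E2E speech recognition system of Gong. This would ease the process of language identification and provide more realistic recognition results.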
As per claim 2, Gong may not explicitly disclose wherein the language vector comprises a one-hot vector. Bo, in the same field of endeavor, teaches a multilingual E2E speech recognition system obtaining language/dialect information as a vector, wherein the language vector comprises a one-hot vector (Section 2.2, page 4750 and Section 4.2.1, page 4751). Therefore, it would have been obvious before the effective filing date of the claimed invention to use Bo’s vector feature with the multilingual E2E speech recognition system of Gong. This would ease the process of language identification and provide more realistic recognition results.
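For illustration only, and not as a characterization of the implementation in either reference, the one-hot language vector discussed for claim 2 can be sketched as follows. The language inventory and the function name are hypothetical and are not taken from Gong, Bo, or the application.

```python
from typing import List

# Hypothetical language inventory; neither reference specifies this list.
LANGUAGES: List[str] = ["en", "hi", "ta", "bn"]


def one_hot_language_vector(language: str,
                            languages: List[str] = LANGUAGES) -> List[float]:
    """Return a vector with 1.0 at the index of `language` and 0.0 elsewhere."""
    if language not in languages:
        raise ValueError(f"unknown language: {language!r}")
    return [1.0 if candidate == language else 0.0 for candidate in languages]
```

Under these assumptions, exactly one element of the returned vector is non-zero, and its position identifies the language.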
As per claim 3, Gong teaches, prior to processing the language vector and the acoustic features, generating, by the data processing hardware, using the audio data, a vector representation of the utterance, the vector representation of the utterance comprising the acoustic features derived from the audio data ([0023]-[0025], [0035], obtaining spectrograms or other alternative representations such as xyz-axis to identify a particular language. Moreover, Bo, Section 2.2, page 4750 and Section 4.2.1, page 4751, uses dialect/language audio information of an utterance to generate corresponding vectors).
As per claim 4, Gong teaches concatenating, by the data processing hardware, the language vector and the vector representation of the utterance to generate an input vector, wherein processing the language vector and the acoustic features comprises processing, using the multilingual E2E speech recognition model, the input vector to generate the transcription for the utterance (necessarily disclosed within the process of generating a transcript in the identified language, as in [0035]-[0036], [0025], [0029], processing acoustic features and alternative representations such as xyz-axis to identify a particular language and generating messages in the form of text in the identified language).
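For illustration only, the concatenation recited in claim 4 can be sketched as appending the same language vector to each acoustic feature frame. The per-frame layout and the function name are assumptions made for this sketch; they are not drawn from Gong or Bo.

```python
from typing import List


def concatenate_language_vector(frames: List[List[float]],
                                language_vector: List[float]) -> List[List[float]]:
    """Append the same language vector to every acoustic feature frame."""
    # List concatenation builds a new list per frame; inputs are not modified.
    return [frame + language_vector for frame in frames]
```

Each resulting input vector has a length equal to the acoustic feature dimension plus the number of languages represented in the language vector.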
As per claim 5, Gong teaches wherein obtaining the language vector comprises: identifying the particular native language for the utterance based on a language preference for a user that spoke the utterance; and generating the language vector based on the identified particular native language (Gong, [0042]-[0043], determining the preferred language of the user based on a variety of extracted features associated with the input audio. The generating of the language vector is taught by Bo, as detailed above with respect to claim 1).
As per claim 6, Bo teaches wherein obtaining the language vector comprises executing a language identification system configured to identify the particular native language by processing the audio data; and generating the language vector based on the identified particular native language (Bo, Section 2.2, page 4750 and Section 4.2.1, page 4751, wherein dialect/language audio information of an utterance is used to generate corresponding vectors. See also Section 4.2).
As per claim 7, Gong teaches wherein the multilingual E2E speech recognition model uses a recurrent neural network-transducer (RNN-T) architecture, the RNN-T architecture comprising: an encoder network configured to generate, at each of a plurality of time steps, a higher-order feature representation from an input vector, the input vector comprising a concatenation of the language vector and the acoustic features derived from the audio data; a prediction network configured to process a sequence of previously output non-blank symbols into a dense representation; and a joint network configured to predict, at each of the plurality of time steps, a probability distribution over possible output labels based on the higher-order feature representation output by the encoder network and the dense representation output by the prediction network (Fig. 3, [0005], [0023], Gong teaches that the network model may comprise various examples, such as a convolutional neural network (CNN), a recurrent neural network, or a combination of both, wherein feature vectors representing acoustic and language information are derived and multiple layers of the network are applied to determine the language).
Gong may not explicitly show an encoder in its architecture. Bo, in the same field of endeavor, teaches a multilingual E2E speech recognition system, wherein a recurrent neural network comprises an encoder and a decoder (Section 2, page 4750). Therefore, it would have been obvious before the effective filing date of the claimed invention to use Bo’s above feature with the multilingual E2E speech recognition system of Gong. This would improve the performance of language identification models and provide more realistic recognition results.
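For illustration only, the joint network recited in claim 7 can be sketched as a function that combines one encoder output and one prediction network output into a probability distribution over output labels. The additive tanh combination, the weight layout, and all names below are assumptions reflecting one common RNN-T formulation; they are not taken from Gong or Bo.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def joint_network(encoder_out: List[float],
                  prediction_out: List[float],
                  weights: List[List[float]],
                  bias: List[float]) -> List[float]:
    """Combine encoder and prediction outputs into label probabilities.

    Assumes both inputs share one dimension and that `weights` holds one row
    per output label (hypothetical layout chosen for this sketch).
    """
    hidden = [math.tanh(e + p) for e, p in zip(encoder_out, prediction_out)]
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)
```

In this sketch the function is evaluated once per time step, with the encoder output for that step and the prediction network output for the non-blank symbols emitted so far.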
As per claim 8, Gong may not explicitly disclose wherein the encoder network comprises: a plurality of stacked Long Short-Term Memory (LSTM) layers; and after each LSTM layer, a respective layer comprising a respective subset of the plurality of language-specific adaptor modules, each language-specific adaptor module in the respective layer specific to a different respective native language, wherein one of the language-specific adaptor modules in the respective layer is specific to the particular native language. Bo, in the same field of endeavor, teaches a multilingual E2E speech recognition system, wherein the encoder network comprises a plurality of stacked Long Short-Term Memory (LSTM) layers; and after each LSTM layer, a respective layer comprising a respective subset of the plurality of language-specific adaptor modules, each language-specific adaptor module in the respective layer specific to a different respective native language, wherein one of the language-specific adaptor modules in the respective layer is specific to the particular native language (Section 2 - Section 2.3, page 4750, wherein the encoder uses 5 layers of unidirectional long short-term memory (LSTM), and a decoder acts as a neural language model, consisting of 2 LSTM layers with a language model specific to each language). Therefore, it would have been obvious before the effective filing date of the claimed invention to use Bo’s above encoder architecture with the multilingual E2E speech recognition system of Gong, in order to improve the performance of language identification models and provide more realistic recognition results.
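For illustration only, a layer of language-specific adaptor modules placed after an LSTM layer, as recited in claim 8, can be sketched as a lookup that routes the LSTM output through the adaptor matching the utterance's language. The residual bottleneck form (down-projection, ReLU, up-projection, skip connection) is an assumption chosen as a common adaptor design; it is not taken from Gong or Bo, and all names are hypothetical.

```python
from typing import Dict, List

Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: List[float]) -> List[float]:
    """Multiply a row-major matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class Adapter:
    """Residual bottleneck adaptor: x + up(relu(down(x)))."""

    def __init__(self, down: Matrix, up: Matrix) -> None:
        self.down = down  # shape: bottleneck x features
        self.up = up      # shape: features x bottleneck

    def __call__(self, x: List[float]) -> List[float]:
        hidden = [max(0.0, h) for h in matvec(self.down, x)]
        return [xi + ui for xi, ui in zip(x, matvec(self.up, hidden))]


def apply_language_adapter(layer_output: List[float],
                           language: str,
                           adapters: Dict[str, Adapter]) -> List[float]:
    """Route one LSTM layer output through the adaptor for `language` only."""
    return adapters[language](layer_output)
```

Under these assumptions, an adaptor whose up-projection is all zeros leaves the LSTM output unchanged, so adding adaptors to a trained model need not alter its behavior until the adaptor weights are learned.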
As per claim 9, Gong teaches wherein the generated transcription for the utterance is in a respective native script representing the particular native language ([0003], communicating a message, as a text, in the identified language).
As per claim 10, Gong teaches wherein the data processing hardware and the multilingual speech recognition model reside on a user device associated with a user that spoke the utterance ([0033], [0010]).
As per claim 11, Gong teaches wherein the multilingual E2E speech recognition model is trained by a training process, the training process comprising: obtaining a plurality of training data sets each associated with a respective native language that is different than the respective native languages of the other training data sets, each training data set comprising a plurality of respective training data samples, each training data sample comprising audio data for an utterance spoken in the respective native language, a language identifier identifying the respective native language, and a corresponding transcription of the utterance in a respective native script representing the respective native language (Gong, [0028], [0023]: each word or phrase of each language as represented in the spectrogram or an alternative representation (e.g., intensity, pitch and intensity, etc.) may correspond to a unique pattern, and the training of the model can lead to an accurate identification of the language of the speech based on the pattern analysis); and, during a first stage of the training process, training the multilingual E2E speech recognition model on a union of all of the training data sets using a stochastic optimization (Gong teaches, at Fig. 4, [0048]-[0050], that a plurality of speech samples are obtained, each speech sample comprising one or more words spoken in a language, and a neural network model may be trained with the speech samples to obtain a trained model for predicting the languages of speeches and providing speech transcripts). Furthermore, Bo, in the same field of endeavor, teaches training a multi-dialect/language end-to-end speech recognition system wherein a plurality of training data sets each associated with a respective native language is obtained (Section 2 - Section 2.3, page 4750).
Bo further teaches that all networks are trained with the cross-entropy criterion, using asynchronous stochastic gradient descent (ASGD) optimization (Section 3, page 4751); and modifying the multilingual E2E speech recognition model to include the plurality of language-specific adaptor modules; and for each of the one or more adaptor modules that are specific to the particular native language, learning values for a respective set of weights by training the multilingual E2E speech recognition model only on the training data set that is associated with the respective particular native language (Introduction and Sections 2.2, 2.3, wherein adaptive training is performed). Therefore, it would have been obvious before the effective filing date of the claimed invention to combine Bo’s above features with the multilingual E2E speech recognition system of Gong, in order to improve the performance of language identification models and provide more realistic recognition results.
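For illustration only, the two-stage training process recited in claim 11 can be sketched on a toy scalar model: a shared weight is first trained on the union of all language data sets with stochastic gradient descent, and a per-language adaptor weight is then trained only on that language's data while the shared weight stays frozen. The toy model, the squared-error loss, the hyperparameters, and all names are assumptions made for this sketch; they do not reproduce the training of Gong or Bo.

```python
import random
from typing import Dict, List, Tuple

Sample = Tuple[float, float]  # (input, target)


def train_two_stage(datasets: Dict[str, List[Sample]],
                    epochs: int = 200,
                    lr: float = 0.05,
                    seed: int = 0) -> Tuple[float, Dict[str, float]]:
    """Toy model: prediction = (shared + adapters[language]) * input."""
    rng = random.Random(seed)
    shared = 0.0
    adapters = {language: 0.0 for language in datasets}

    # Stage 1: train only the shared weight on the shuffled union of all sets.
    union = [sample for samples in datasets.values() for sample in samples]
    for _ in range(epochs):
        rng.shuffle(union)
        for x, y in union:
            error = shared * x - y
            shared -= lr * error * x

    # Stage 2: freeze the shared weight; train each adaptor on its own set only.
    for language, samples in datasets.items():
        for _ in range(epochs):
            for x, y in samples:
                error = (shared + adapters[language]) * x - y
                adapters[language] -= lr * error * x

    return shared, adapters
```

In this sketch the shared weight settles on a compromise across languages in the first stage, and each adaptor then learns only the language-specific correction from its own data set.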
As per claim 12, Gong teaches wherein the training process executes on a remote computing device in communication with the data processing hardware, the data processing hardware residing on a user device associated with a user that spoke the utterance and configured to execute the multilingual E2E speech recognition model after the training process is complete ([0031]-[0032]).
As per claims 13-24, Gong further teaches data processing hardware; and memory hardware in communication with the data processing hardware, the memory hardware storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations ([0027]-[0029]). The remaining limitations are rejected under the same rationale as applied above with respect to method claims 1-12, as system claims 13-24 and method claims 1-12 are related as an apparatus and the method of using same, with each claimed element's function corresponding to the claimed method step.

Conclusion
3.	The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. See PTO-892.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ABDELALI SERROU, whose telephone number is (571)272-7638. The examiner can normally be reached M-F, 9 AM - 5 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/ABDELALI SERROU/            Primary Examiner, Art Unit 2659