DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The information disclosure statement (IDS) was submitted on 06/14/2021.  The submission is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.

Claim Objections
Claim 13 is objected to because of the following informalities:
“The computer-implemented method of claim 11, wherein at least one of the plurality of user devices is further configured to transliterate the corresponding speech recognition result in the target script into a transliterated script.” should be a dependent of claim 12. Hence, read: “The computer-implemented method of claim 12, …”
Appropriate correction is required.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.



Claims 1-5, 7, 11, 14-18, 20, and 24 are rejected under 35 U.S.C. 103 as being unpatentable over J. Emond et al., "Transliteration Based Approaches to Improve Code-Switched Speech Recognition Performance," 2018 IEEE Spoken Language Technology Workshop (SLT), 2018, pp. 448-455, doi: 10.1109/SLT.2018.8639699 (https://ieeexplore.ieee.org/document/8639699; Emond et al.), and further in view of Li; Jinyu et al. (US 10964309 B2; Li et al.).


As to independent claim 1, Emond et al. teaches:
A computer-implemented method that when executed on data processing hardware causes the data processing hardware (see ¶ 1 of  3.1. General Transliteration Approach and ¶ 1 of 3.2. Optimizations: “3.1. General Transliteration Approach: In this paper, we propose the use of transliteration via a weighted finite state transducer as proposed for keyboard decoding in [19]. […] 3.2. Optimizations: In order to get performance of transliteration at a good operating point with respect to memory, speed and latency considerations for building large-scale language models, we explored several optimizations.”) to perform operations comprising:
obtaining a plurality of training data sets each associated with a respective native language that is different than the respective native language of the other training data sets, each training data set comprising a plurality of respective training data samples, each training data sample comprising audio spoken in the respective native language and a corresponding transcription of the audio in a respective native script representing the respective native language (see ¶ 1 of  4.1 Training Data: “4.1 Training Data: All our experiments were conducted on training and test sets that were anonymized and hand-transcribed utterances representative of Google’s voice [i.e., audio spoken data samples] search traffic in Indic languages. […] and Table 4 (different Indic languages [i.e., native languages])”);
for each respective training data sample of each training data set in the respective native language:
transliterating the corresponding transcription in the respective native script into corresponding transliterated text representing the respective native language of the corresponding audio in a target script (see ¶ 1 of 3.1. General Transliteration Approach and Figure 1: “3.1. General Transliteration Approach: Transcribers were asked to transcribe spoken utterances in the native writing script (Devanagari, in this case) [i.e., native script] with exceptions for certain commonly used English words to be written in Latin script [i.e., target script]. […] The transliteration transducer, T is a composition of three transducers: I ◦ P ◦ O, where I maps input unicode symbols to symbols in a pair language model, P is a bigram pair language model that maps between symbols in the two writing scripts, English and Devanagari [i.e., native script], and O maps the pair language model symbols to the target output Devanagari symbols (illustrated in Figure 1) [Latin script; i.e., target script].”); and
associating the corresponding transliterated text in the target script with the corresponding audio in the respective native language to generate a respective normalized training data sample, the respective normalized training data sample comprising the audio spoken in the respective native language and the corresponding transliterated text in the target script (see ¶ 1 of 4.3. Impact of transliteration on Language Modeling and ¶ 3 of 7. Summary: “4.3. Impact of transliteration on Language Modeling: It can be seen that retraining LMs with all the data transliterated to Devanagari provides a nice gain on the Voice Search and dictation test sets [i.e., association of transliterated text and native language audio] (row 5). Thus, building LMs by transliterating all the training data to Devanagari, thereby introducing consistent text normalization [i.e., normalized training data], results in gains of 3 to 8% relative improvements in WER on the two test sets. 7. Summary: Consistent normalization of training transcripts for both language and acoustic modeling with significant gains of up to 10% relative across several code-switched Indic languages using Google voice search and dictation traffic.”); 
training, using the normalized training data samples generated from each respective training data sample of each training data set and without providing any language information, a multilingual (see ¶ 1 of 4.3. Impact of transliteration on Language Modeling and ¶ 3 of 7. Summary, cited in the limitation above; Table 6 (row 5); 3.1. General Transliteration Approach (P is a bigram pair language model that maps between symbols in the two writing scripts, English and Devanagari (i.e., multilingual), and O maps the pair language model symbols to the target output Devanagari symbols (illustrated in Figure 1)); and Figure 1 (illustration of transliteration between Devanagari and Latin scripts for the English (i.e., multilingual) word “browser”). Here, “without providing any language information” is interpreted as the transliteration performed on all training data to Devanagari, i.e., no specific language information as in the other cases taught in Emond et al. (Table 6), such as Devanagari-only, Hindi and Latin, etc.).

However, Emond et al. does not explicitly teach, but Li et al. does teach:
training a multilingual end-to-end speech recognition model to predict speech recognition results in the target script for corresponding speech utterances spoken in any of the different native languages associated with the plurality of training data sets (see Col. 1, lines 33-48: “Systems, methods, and computer readable storage devices embodying instructions for providing an attention model trained as an end-to-end system for ASR are provided herein. A CS (code-switching) CTC (connectionist temporal classification) model may be initialed from a major language CTC model by keeping network hidden weights and replacing output tokens with a union of major and secondary language output tokens. The initialized CS CTC model may be trained by updating parameters with training data from both languages, and a Language Identification (“LID”) model may also be trained with the training data.”).
Emond et al. and Li et al. are both considered to be analogous to the claimed invention because they are in the same field of endeavor in speech processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Emond et al. to incorporate the teachings of Li et al. of training a multilingual end-to-end speech recognition model to predict speech recognition results in the target script for corresponding speech utterances spoken in any of the different native languages associated with the plurality of training data sets which provides the benefit of accurately and efficiently improving CS ASR via an E2E CTC model (Col. 13, lines 61-63 of Li et al.).
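For orientation only, the normalization steps recited in claim 1 (transliterate each transcription into the target script, then pair the transliterated text with the original audio) can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the application or of either reference: the character table TOY_MAP, the Sample type, and the utterance identifiers are all hypothetical, and a simple lookup table stands in for the transducer-based transliteration discussed above.

```python
from dataclasses import dataclass

# Hypothetical toy character table (Devanagari -> Latin); it is not taken
# from the application, Emond et al., or Li et al.
TOY_MAP = {"क": "ka", "म": "ma", "ल": "la"}


@dataclass(frozen=True)
class Sample:
    audio_id: str    # stand-in for the audio of one utterance
    transcript: str  # transcription in the native script
    language: str    # native-language label of the source data set


def transliterate(text, table):
    """Map each character through the table; unknown characters pass through."""
    return "".join(table.get(ch, ch) for ch in text)


def normalize(sample, table):
    """Pair the unchanged audio with the transliterated text.

    The language label is deliberately dropped, so the normalized sample
    carries no language information.
    """
    return (sample.audio_id, transliterate(sample.transcript, table))


samples = [Sample("utt-001", "कमल", "hi"), Sample("utt-002", "ok", "en")]
normalized = [normalize(s, TOY_MAP) for s in samples]
print(normalized)
```

The output pairs keep the audio reference intact and differ from the input only in the script of the text.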

As to independent claim 14, Emond et al. in combination with Li et al. teach all of the limitations as in claim 1, above.
Emond et al. further teaches:
A system comprising:
data processing hardware (see ¶ 1 of 3.2. Optimizations: “In order to get performance of transliteration at a good operating point with respect to memory [i.e., memory hardware], speed and latency [i.e., data processing hardware] considerations for building large-scale language models, we explored several optimizations.”); and
memory hardware in communication with the data processing hardware, the memory hardware storing instructions that when executed on the data processing hardware cause the data processing hardware (see ¶ 1 of 3.2. Optimizations citation in limitation above.) to perform operations comprising:
[limitations as in independent claim 1, above]
Regarding claims 2 and 15, Emond et al. in combination with Li et al. teach all of the limitations as in claims 1 and 14, above.
Emond et al. further teaches:
wherein transliterating the corresponding transcription in the respective native script comprises using a finite state transducer (FST) network to transliterate the corresponding transcription in the respective native script into the corresponding transliterated text (see ¶ 1 of 3.1. General Transliteration Approach: “In this paper, we propose the use of transliteration via a weighted finite state transducer as proposed for keyboard decoding in [19].”).

Regarding claims 3 and 16, Emond et al. in combination with Li et al. teach all of the limitations as in claims 1 and 14, above.
Emond et al. further teaches:
wherein transliterating the corresponding transcription in the respective native script into the corresponding transliterated text comprises using a respective transliteration transducer associated with the respective native script to transliterate the corresponding transcription in the respective native script into the corresponding transliterated text in the target script (see ¶ 1 of 3.1. General Transliteration Approach and Figure 1: “The transliteration transducer, T is a composition of three transducers: I ◦ P ◦ O, where I maps input unicode symbols to symbols in a pair language model, P is a bigram pair language model that maps between symbols in the two writing scripts, English and Devanagari [i.e., native script], and O maps the pair language model symbols to the target output [i.e., target script] Devanagari symbols (illustrated in Figure 1).”).

Regarding claims 4 and 17, Emond et al. in combination with Li et al. teach all of the limitations as in claims 3 and 16, above.
Emond et al. further teaches:
wherein the transliteration transducer associated with the respective native script comprises:
an input transducer configured to map input Unicode symbols in the respective native script to symbols in a pair language model (see ¶ 1 of 3.1. General Transliteration Approach: “… three transducers: I, P, O, where I maps input unicode symbols to symbols in a pair language model…”);
a bigram pair language model transducer configured to map between symbols in the respective native script and the target script (see ¶ 1 of 3.1. General Transliteration Approach: “… three transducers: I, P, O, … P is a bigram pair language model that maps between symbols in the two writing scripts…”); and
an output transducer configured to map the symbols in the pair language model to output symbols in the target script (see ¶ 1 of 3.1. General Transliteration Approach: “… three transducers: I, P, O, … O maps the pair language model symbols to the target output Devanagari symbols (illustrated in Figure 1).”).
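The three-stage composition I ◦ P ◦ O quoted above can be pictured with the following toy sketch, offered only as an aid to reading the mapping. It is an assumption-laden simplification: plain Python functions stand in for weighted finite state transducers, the PAIRS inventory is hypothetical, and the pair-model stage is an identity pass-through where a real bigram pair language model would score alternative symbol sequences. The direction shown (Latin input, Devanagari output) follows the passage quoted from Emond et al.

```python
# Hypothetical pair inventory: input symbol -> (input symbol, output symbol).
PAIRS = {"k": ("k", "क"), "m": ("m", "म"), "l": ("l", "ल")}


def input_transducer(symbols):
    """I: map input Unicode symbols to pair-language-model symbols."""
    return [PAIRS.get(s, (s, s)) for s in symbols]


def pair_model(pair_symbols):
    """P: placeholder for the bigram pair language model.

    A real model would weight competing pair sequences; this stand-in
    passes the sequence through unchanged.
    """
    return list(pair_symbols)


def output_transducer(pair_symbols):
    """O: map pair-language-model symbols to output-script symbols."""
    return [out for _, out in pair_symbols]


def compose(*stages):
    """Chain stages left to right, mirroring I, then P, then O."""
    def run(value):
        for stage in stages:
            value = stage(value)
        return value
    return run


transliterate = compose(input_transducer, pair_model, output_transducer)
result = "".join(transliterate(list("kml")))
print(result)
```

Symbols missing from the inventory pass through unchanged, which keeps the sketch total over arbitrary input.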

Regarding claims 5 and 18, Emond et al. in combination with Li et al. teach all of the limitations as in claims 3 and 16, above.
Emond et al. further teaches:
wherein the operations further comprise, prior to transliterating the corresponding transcription in the respective native language (see ¶ 2 and 5 of 6. Analysis: “… some of the errors caused by the transliteration process can be corrected by training a model [i.e., prior to transliteration] on matched data [i.e., agreement-based data pre-processing].”),
training, using agreement-based data pre-processing, each respective transliteration transducer to only process transliteration pairs that have at least one spelling in the target script of the transliterated text for a given native word that is common across each of the respective native languages associated with the training data sets (see ¶ 2 and 5 of 6. Analysis: “In the first example, the utterance in Latin reads as Tiger zinda hai full movie. The reference contained the first three words in Latin and the last two in Devanagari. As designed, the ASR hypothesis was in Devanagari. The result of transliterating both the reference and the hypothesis to a common Devanagari writing system, introduced an error Zinda vs Jinda. […] The improvement in grapheme error rate is a good indication that transliterated LMs are still useful. We hypothesize that some of the errors caused by the transliteration process can be corrected by training a model [i.e., prior to transliteration] on matched data [i.e., agreement-based data pre-processing].”).
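One possible reading of the agreement-based limitation recited above is a filter that keeps, for each word, only the target-script spellings attested in every language's data. The sketch below illustrates that reading only; the function name, the data layout, and the sample spellings (loosely echoing the Zinda/Jinda example quoted from Emond et al.) are hypothetical and are not drawn from the application or the references.

```python
def agreed_pairs(spellings_by_language):
    """Keep spellings of a word that are common to every language.

    spellings_by_language maps a language code to a dict of
    word -> set of candidate spellings in the target script.
    Returns word -> set of spellings shared by all languages; words with
    no shared spelling are dropped.
    """
    tables = list(spellings_by_language.values())
    if not tables:
        return {}
    shared_words = set.intersection(*(set(t) for t in tables))
    agreed = {}
    for word in shared_words:
        common = set.intersection(*(t[word] for t in tables))
        if common:
            agreed[word] = common
    return agreed


# Hypothetical candidate spellings for two languages.
data = {
    "hi": {"zinda": {"zinda", "jinda"}, "movie": {"movie"}},
    "bn": {"zinda": {"jinda"}, "movie": {"muvi"}},
}
kept = agreed_pairs(data)
print(kept)
```

Only the spelling on which both languages agree survives, and a word with no agreed spelling is excluded entirely.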

Regarding claims 7 and 20, Emond et al. in combination with Li et al. teach all of the limitations as in claims 1 and 14, above.
Emond et al. further teaches:
wherein transliterating the corresponding transcription in the respective native script into the corresponding transliterated text comprises using a language-independent transliteration transducer to transliterate the corresponding transcription in the respective native script into the corresponding transliterated text in the target script (see 1. Introduction and 3.1. General Transliteration Approach: “1. Introduction: In this paper, we propose an alternative strategy based on transliteration for improving ASR performance. WFSTs have been used extensively for speech recognition decoding, where WFSTs representing a context-dependent phone sequence model (C) […] 3.1 General Transliteration Approach: In this paper, we propose the use of transliteration via a weighted finite state transducer as proposed for keyboard decoding in [19] […] Thus, the context and range of input from the two writing systems was restricted to what was said in the utterance unlike unrestricted text entry via the keyboard. […] The transliteration transducer, T is a composition of three transducers: I ◦ P ◦ O, where I maps input unicode symbols to symbols in a pair language model [i.e., it is interpreted that any language can be mapped as Unicode, hence language-independent], P is a bigram pair language model that maps between symbols in the two writing scripts, English and Devanagari, and O maps the pair language model symbols to the target output Devanagari symbols (illustrated in Figure 1).”).

Regarding claims 11 and 24, Emond et al. in combination with Li et al. teach all of the limitations as in claims 1 and 14, above.
Emond et al. further teaches:
wherein the operations further comprise, prior to training the multilingual E2E ASR model, shuffling the normalized training data samples generated from each respective training data sample of each training data set (see ¶ 1 of 4.3. Impact of transliteration on Language Modeling: “[…] we explored normalization of training data for language models using transliteration. […] The normalized scripts in Devanagari were subsequently used to train 5-gram LMs for the first pass and class-based maximum entropy based models for the second pass. […] In order to compare with various writing systems as inputs to the language model, we define the Devanagari-only data based LM as an LM that was built with all utterances containing Devanagari script only. Any utterance containing bilingual text in Devanagari and Latin scripts was not used in the language model builds. As expected, this resulted in a loss of contextual modeling, lesser data and introduced mismatches between training and test set distributions.”
[Here, the step of excluding bilingual text (i.e., training data) is interpreted as a type of sorting out (i.e., shuffling) the normalized training data (i.e., normalized scripts in Devanagari).]).
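For reference, a shuffling step over pooled normalized samples can be sketched as below. The sketch assumes that shuffling means random reordering of the pooled samples, which is an assumption about the claim term and is not stated in Emond et al.; the function name, sample tuples, and fixed seed are hypothetical.

```python
import random


def pool_and_shuffle(datasets, seed=0):
    """Pool samples from every data set, then reorder them at random.

    A seeded generator is used only so the illustration is repeatable.
    """
    pooled = [sample for dataset in datasets for sample in dataset]
    rng = random.Random(seed)
    rng.shuffle(pooled)  # in-place random permutation
    return pooled


# Hypothetical normalized samples: (audio id, text in the target script).
hindi = [("hi-001", "kamala"), ("hi-002", "jinda")]
bengali = [("bn-001", "muvi")]
shuffled = pool_and_shuffle([hindi, bengali])
print(shuffled)
```

Shuffling changes only the order: every pooled sample is still present exactly once.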

Claims 6 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over J. Emond et al., "Transliteration Based Approaches to Improve Code-Switched Speech Recognition Performance," 2018 IEEE Spoken Language Technology Workshop (SLT), 2018, pp. 448-455, doi: 10.1109/SLT.2018.8639699 (https://ieeexplore.ieee.org/document/8639699; Emond et al.), further in view of Li; Jinyu et al. (US 10964309 B2; Li et al.) as applied to claims 3 and 16 above, and further in view of Saravanan, K., and A. Kumaran. "Some experiments in mining named entity transliteration pairs from comparable corpora." Proceedings of the 2nd workshop on Cross Lingual Information Access (CLIA) Addressing the Information Need of Multilingual Societies. 2008. (https://aclanthology.org/I08-6004.pdf; Saravanan et al.)

Regarding claims 6 and 19, Emond et al. in combination with Li et al. teach all of the limitations as in claims 3 and 16, above.
However, Emond et al. in combination with Li et al. do not explicitly teach, but Saravanan et al. teaches:
wherein the operations further comprise, prior to transliterating the corresponding transcription in the respective native language, training, using frequency-based data pre-processing, each respective transliteration transducer to only process transliteration pairs that have spellings in the target script of the transliterated text for a given native word that satisfy a frequency threshold (see 3.1 Mining of Transliteration Pairs: “We start with comparable corpora in English and Tamil, similar in size to that used in [Klementiev and Roth, 2006], and using the English side of this corpora, first, we extract all the NEs that occur more than a given threshold parameter, FE [i.e., frequency threshold], using a standard NER tool. The higher the threshold is, the more will be the evidence for legitimate transliteration pairs, in the comparable corpora, which may be captured by the mining methodology. […] Again, we consider only those signatures that have occurred more than a threshold parameter,  FT, in the Tamil side of the comparable corpora, in order to strengthen support for a meaningful similarity in their frequency of occurrence. […] 4.2 Classifier for Transliteration Pair Identification: This makes sense as the classifier for identifying potential transliterations is trained with sizable corpora and is hence accurate; but, as the thresholds increase, it has less data to work with, and possibly a fraction of legitimate transliterations also gets filtered with noise [i.e., transliteration performed after training]. […] and 4.4 Overall Performance of Transliteration Pairs Mining: In addition, these metrics were computed, corresponding to different frequency thresholds for the occurrence of a English  NE (FE) and a Tamil signature (FT).”).
Emond et al. in combination with Li et al. and Saravanan et al. are both considered to be analogous to the claimed invention because they are in the same field of endeavor in data processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Emond et al. in combination with Li et al. to incorporate the teachings of Saravanan et al. of training, using frequency-based data pre-processing, each respective transliteration transducer to only process transliteration pairs that have spellings in the target script of the transliterated text for a given native word that satisfy a frequency threshold, which provides the benefit of strengthening support for a meaningful similarity in their (i.e., transliteration pairs) frequency of occurrence (3.1 Mining of Transliteration Pairs of Saravanan et al.).
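The frequency-threshold filtering discussed above can be illustrated with a short sketch. It is hypothetical throughout: the function name, the pair data, and the threshold value are invented for illustration, and the strict "more than" comparison follows the wording quoted from Saravanan et al. rather than any particular claim construction.

```python
from collections import Counter


def filter_by_frequency(pairs, threshold):
    """Keep (native word, target spelling) pairs seen more than threshold times."""
    counts = Counter(pairs)
    return {pair for pair, count in counts.items() if count > threshold}


# Hypothetical observations of transliteration pairs in a corpus.
observed = [("w1", "jinda")] * 3 + [("w1", "zinda")]
frequent = filter_by_frequency(observed, threshold=2)
print(frequent)
```

Raising the threshold trades recall for confidence, which matches the trade-off described in the quoted passage.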

Claims 8 and 21 are rejected under 35 U.S.C. 103 as being unpatentable over J. Emond et al., "Transliteration Based Approaches to Improve Code-Switched Speech Recognition Performance," 2018 IEEE Spoken Language Technology Workshop (SLT), 2018, pp. 448-455, doi: 10.1109/SLT.2018.8639699 (https://ieeexplore.ieee.org/document/8639699; Emond et al.) further in view of Li; Jinyu et al. (US 10964309 B2; Li et al.) as applied to claims 1 and 14, above, and further in view of Zhou, Shiyu, Shuang Xu, and Bo Xu. "Multilingual end-to-end speech recognition with a single transformer on low-resource languages." arXiv preprint arXiv:1806.05059 (2018) (https://arxiv.org/abs/1806.05059; Zhou et al.). 

Regarding claims 8 and 21, Emond et al. in combination with Li et al. teach all of the limitations as in claims 1 and 14, above.
However, Emond et al. in combination with Li et al. do not explicitly teach, but Zhou et al. teaches:
wherein the multilingual E2E ASR model comprises a sequence-to-sequence neural network (see ¶ 3 of 1. Introduction and ¶ 1 of 3.1. ASR Transformer model architecture, and Figure 1: “1. Introduction: In this paper, we concentrate on multilingual ASR on low-resource languages. Building on our work [9], we employ sub-words generated by byte pair encoding (BPE) [11] as the multilingual modeling unit, which do not need any pronunciation lexicon. The ASR Transformer is chosen to be the basic architecture of sequence-to-sequence attention-based model [9, 12]. 3.1. ASR Transformer model architecture: The ASR Transformer architecture used in this work is the same as our work [9, 12] which is shown in Figure 1. It stacks multihead attention (MHA) [17] and position-wise, fully connected layers for both the encode and decoder. The encoder is composed of a stack of N identical layers. Each layer has two sublayers. The first is a MHA, and the second is a position-wise fully connected feed-forward network. […] and Figure 1 (architecture).”).
Emond et al. in combination with Li et al. and Zhou et al. are both considered to be analogous to the claimed invention because they are in the same field of endeavor in data processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Emond et al. in combination with Li et al. to incorporate the teachings of Zhou et al. of wherein the multilingual E2E ASR model comprises a sequence-to-sequence neural network which provides the benefit of performing well on low-resource languages (abstract of Zhou et al.).

Claims 9 and 22 are rejected under 35 U.S.C. 103 as being unpatentable over J. Emond et al., "Transliteration Based Approaches to Improve Code-Switched Speech Recognition Performance," 2018 IEEE Spoken Language Technology Workshop (SLT), 2018, pp. 448-455, doi: 10.1109/SLT.2018.8639699 (https://ieeexplore.ieee.org/document/8639699; Emond et al.) further in view of Li; Jinyu et al. (US 10964309 B2; Li et al.) as applied to claims 1 and 14, above, and further in view of Bates; James Stewart (US 20190214134 A1; Bates). 

Regarding claims 9 and 22, Emond et al. in combination with Li et al. teach all of the limitations as in claims 1 and 14, above.
However, Emond et al. in combination with Li et al. do not explicitly teach, but Bates teaches:
wherein the multilingual E2E ASR model comprises a recurrent neural network transducer (RNN-T) (see ¶ [0058]: “…The speech recognition module may comprise artificial neural network such as recurrent neural network (RNN) [i.e., speech recognition module as transducer, RNN], or deep neural network (DNN), end-to-end automatic speech recognition module, connectionist temporal classification (CTC) based speech recognition module, etc. In embodiments, the speech recognition module may further comprise one or more language model to enable speech recognition in one or more languages [i.e., multilingual].”).
Emond et al. in combination with Li et al. and Bates are both considered to be analogous to the claimed invention because they are in the same field of endeavor in data processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Emond et al. in combination with Li et al. to incorporate the teachings of Bates of wherein the multilingual E2E ASR model comprises a recurrent neural network transducer (RNN-T), which provides the benefit of improved efficiency and consistency for automated applications (abstract of Bates).

Claims 10 and 23 are rejected under 35 U.S.C. 103 as being unpatentable over J. Emond et al., "Transliteration Based Approaches to Improve Code-Switched Speech Recognition Performance," 2018 IEEE Spoken Language Technology Workshop (SLT), 2018, pp. 448-455, doi: 10.1109/SLT.2018.8639699 (https://ieeexplore.ieee.org/document/8639699; Emond et al.) further in view of Li; Jinyu et al. (US 10964309 B2; Li et al.) as applied to claims 1 and 14; and further in view of Catanzaro; Bryan et al. (US 20170148433 A1; Catanzaro).

Regarding claims 10 and 23, Emond et al. in combination with Li et al. teach all of the limitations as in claims 1 and 14, above.
However, Emond et al. in combination with Li et al. do not explicitly teach, but Catanzaro teaches:
wherein training the multilingual E2E ASR model comprises using a stochastic optimization algorithm to train the multilingual E2E ASR model (see ¶¶ [0035] and [0039]: “[0035] …Embodiments of the systems (which may be referred to generally as Deep Speech 2, Deep Speech 2 ASR, Deep Speech 2 ASR pipeline, or DS2) approach or exceed the accuracy of Amazon Mechanical Turk human workers on several benchmarks, work in multiple languages [i.e., multilingual] with little modification, and are deployable in a production setting. These embodiments represent a significant step towards a single ASR system that addresses the entire range of speech recognition contexts handled by humans. Since embodiments are built on end-to-end deep learning, a spectrum of deep learning techniques can be deployed. […] [0039] Training on large quantities of data usually requires the use of larger models. Indeed, embodiments presented herein have many more parameters than those used in some previous systems. Training a single model at these scales can involve tens of exaFLOPs, where 1 exaFLOPs = 10^18 Floating-point Operations, that would require 3-6 weeks to execute on a single graphics processing unit (GPU). This makes model exploration a very time-consuming exercise, so a highly optimized training system that uses 8 or 16 GPUs was built to train one model. In contrast to previous large-scale training approaches that use parameter servers and asynchronous updates, a synchronous stochastic gradient descent (SGD) was used because it was easier to debug while testing new ideas, and also converged faster for the same degree of data parallelism.”).
Emond et al. in combination with Li et al. and Catanzaro are both considered to be analogous to the claimed invention because they are in the same field of endeavor in speech processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Emond et al. in combination with Li et al. to incorporate the teachings of Catanzaro of wherein training the multilingual E2E ASR model comprises using a stochastic optimization algorithm to train the multilingual E2E ASR model, which provides the benefit of being easier to debug while testing new ideas ([0039] of Catanzaro).
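As background on the term, stochastic gradient descent updates model parameters from one sample (or a small batch) at a time, visiting the samples in random order. The toy sketch below fits a single weight to noise-free data and is purely illustrative: it is a single-process example with hypothetical names, data, learning rate, and seed, and it does not reproduce the synchronous multi-GPU training described in Catanzaro.

```python
import random


def sgd_fit(xs, ys, lr=0.01, epochs=50, seed=0):
    """Fit y = w * x by per-sample stochastic gradient descent.

    Loss per sample is 0.5 * (w * x - y) ** 2, so the gradient with
    respect to w is (w * x - y) * x.
    """
    rng = random.Random(seed)
    data = list(zip(xs, ys))
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit samples in a fresh random order each epoch
        for x, y in data:
            grad = (w * x - y) * x
            w -= lr * grad
    return w


# Hypothetical data generated from y = 2 * x.
w = sgd_fit([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
print(round(w, 4))
```

With these settings the weight converges close to the generating value of 2.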

Claims 12-13 and 25-26 are rejected under 35 U.S.C. 103 as being unpatentable over J. Emond et al., "Transliteration Based Approaches to Improve Code-Switched Speech Recognition Performance," 2018 IEEE Spoken Language Technology Workshop (SLT), 2018, pp. 448-455, doi: 10.1109/SLT.2018.8639699 (https://ieeexplore.ieee.org/document/8639699; Emond et al.) further in view of Li; Jinyu et al. (US 10964309 B2; Li et al.) as applied to claims 1 and 14; and further in view of Wu; Shuang et al. (US 10872599 B1; Wu).

Regarding claims 12 and 25, Emond et al. in combination with Li et al. teach all of the limitations as in claims 1 and 14, above.
However, Emond et al. in combination with Li et al. do not explicitly teach, but Wu teaches:
wherein the operations further comprise, after training the multilingual E2E ASR model, pushing the trained multilingual E2E ASR model to a plurality of user devices (see Col. 2, line 66 – Col. 3, line 20 of Wu: “(18) Embodiments of the present disclosure improve speech processing systems by reducing or eliminating false-positive and/or false-negative detection of wakewords. In various embodiments, false positives and/or false negatives are detected using one or more of the various techniques described herein, and a trained model is updated based on the detection of the false positive and/or false negative. The updated trained model may thus reduce the number of future false positives and/or false negatives. In various embodiments, the updating of the model is performed at the device by, for example, back-propagating differences between a stored, expected wakeword and a wakeword represented in captured audio. Each device may thus include a trained model updated one or more times to account for how a particular user or users speaks the wakeword, which may include differences due to an accent, speech impediment, background noise, or other such differences. In some embodiments, information related to the update to the trained model may be sent from one or more devices to one or more server devices, which may aggregate the update information, use it to update a trained model, and send the updated trained model to some or all other devices.”) 
each user device configured to: 
capture, using at least one microphone in communication with the user device, an utterance spoken by a respective user of the user device [in any combination of the respective native languages (disclosed by Emond et al., i.e., English and Devanagari [i.e., native scripts], in citations from claim 1: ¶ 1 of 3.1. General Transliteration Approach and Figure 1 of Emond et al.)] associated with the training data sets (see Col. 3, lines 30-38 of Wu: “(20) The device 110 captures (130) audio 11 corresponding to an utterance of the user 5 (or other source of sound or speech). The device 110 may include one or more microphones that are enabled to continuously receive the first audio 11. The device 110 generates (132) first audio data corresponding to the first audio 11.”); and
generate, using the trained multilingual E2E ASR model, a corresponding speech recognition result in the [target script (disclosed by Emond et al. [i.e., output script; e.g., Latin script] in citations from claim 1: ¶ 1 of 3.1. General Transliteration Approach and Figure 1 of Emond et al.)] for the captured utterance spoken by the respective user (see Col. 3, lines 30-38 and Col. 4, lines 8-17 of Wu: “(20) The device 110 captures (130) audio 11 corresponding to an utterance of the user 5 (or other source of sound or speech). The device 110 may include one or more microphones that are enabled to continuously receive the first audio 11. The device 110 generates (132) first audio data [i.e., speech recognition result] corresponding to the first audio [i.e., spoken utterance] 11. (23) In some embodiments, the server(s) 120 receives (140) first model-update data (e.g., error data) from a first device 110 and receives (142) second model-update data from a second device 110. As explained in greater detail below, any number of devices 110 may send any number of model-update data to the server(s) 120. The servers(s) 120 generates (144) an updated model using the received model-update, sends (146) the updated model to the first device 110, and sends (148) the updated model to the second device 110.”).
Emond et al. in combination with Li et al. and Wu are considered to be analogous to the claimed invention because they are in the same field of endeavor of speech processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Emond et al. in combination with Li et al. to incorporate the teachings of Wu, wherein the operations further comprise, after training the multilingual E2E ASR model, pushing the trained ASR model to a plurality of user devices, each user device configured to: capture, using at least one microphone in communication with the user device, an utterance spoken by a respective user of the user device associated with the training data sets; and generate, using the trained multilingual E2E ASR model, a corresponding speech recognition result for the captured utterance spoken by the respective user, which provides the benefit of improving speech processing systems by reducing or eliminating false-positive and/or false-negative detection of words (Col. 2, line 66 – Col. 3, line 20 of Wu).

Regarding claims 13 and 26, Emond et al. in combination with Li et al. and Wu teach all of the limitations as in claims 12 [note that (as noted in the claim objections) the correct dependency should be claim 12, instead of claim 11 as currently drafted in the claims] and 25, above.
Emond et al. further teaches:
wherein at least one of the plurality of user devices is further configured to transliterate the corresponding speech recognition result in the target script into a transliterated script (see ¶ 3.1. General Transliteration Approach and 3.2. Optimizations: “3.1. General Transliteration Approach: In this paper, we propose the use of transliteration via a weighted finite state transducer as proposed for keyboard decoding in [19]. […] The transliteration transducer, T is a composition of three transducers: I ◦ P ◦ O, where I maps input unicode symbols to symbols in a pair language model, P is a bigram pair language model that maps between symbols in the two writing scripts, English and Devanagari, and O maps the pair language model symbols to the target output Devanagari symbols (illustrated in Figure 1). 3.2. Optimizations: In order to get performance of transliteration at a good operating point with respect to memory, speed and latency considerations for building large-scale language models, we explored several optimizations.”).
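For illustration only, the composition I ◦ P ◦ O quoted above can be approximated with a small sketch. This is not the implementation of Emond et al., which uses weighted finite state transducers; the sketch below substitutes plain dictionaries and a Viterbi-style search. The tables I_MAP and P_COST, the cost values, and the function names are hypothetical and chosen only to make the three stages visible.

```python
from typing import Dict, List, Tuple

PairSymbol = Tuple[str, str]  # (input-script character, output-script string)
START: PairSymbol = ("<s>", "")

# I (assumed toy table): input Latin character -> candidate pair symbols.
I_MAP: Dict[str, List[PairSymbol]] = {
    "n": [("n", "न")],
    "a": [("a", "ा"), ("a", "")],
    "m": [("m", "म")],
}

# P (assumed toy table): bigram costs over pair symbols, lower is better.
P_COST: Dict[Tuple[PairSymbol, PairSymbol], float] = {
    (START, ("n", "न")): 0.1,
    (("n", "न"), ("a", "ा")): 0.2,
    (("n", "न"), ("a", "")): 1.5,
    (("a", "ा"), ("m", "म")): 0.1,
    (("a", ""), ("m", "म")): 0.3,
}
DEFAULT_COST = 5.0  # cost assigned to bigrams missing from P_COST


def project_output(symbol: PairSymbol) -> str:
    """O: keep only the output-script half of a pair symbol."""
    return symbol[1]


def transliterate(text: str) -> str:
    """Return the lowest-cost output string for text under the toy I, P, O."""
    # best maps the last pair symbol to (total cost, output written so far).
    best: Dict[PairSymbol, Tuple[float, str]] = {START: (0.0, "")}
    for ch in text:
        nxt: Dict[PairSymbol, Tuple[float, str]] = {}
        # Characters missing from I_MAP are passed through unchanged.
        for sym in I_MAP.get(ch, [(ch, ch)]):
            for prev, (cost, out) in best.items():
                total = cost + P_COST.get((prev, sym), DEFAULT_COST)
                if sym not in nxt or total < nxt[sym][0]:
                    nxt[sym] = (total, out + project_output(sym))
        best = nxt
    return min(best.values())[1]
```

In this sketch, the input "nam" has two candidate paths, and the bigram costs in P_COST favor the path that writes the vowel sign, so the search selects it over the path that drops the vowel.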

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Keisha Y Castillo-Torres whose telephone number is (571)272-3975. The examiner can normally be reached Monday - Friday, 9:00 am - 4:00 pm (EST).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir, can be reached on (571)272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

Keisha Y. Castillo-Torres
Examiner
Art Unit 2659



/Keisha Y. Castillo-Torres/Examiner, Art Unit 2659
/Paras D Shah/Primary Examiner, Art Unit 2659
07/28/2022