DETAILED ACTION
This Office Action is in response to the correspondence filed by the applicant on 07/26/2021.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Status of Claims
Claims 1-6, 8-16, and 18-20 are pending in this application.
Claims 7 and 17 are canceled.

Response to Arguments
Regarding Rejection under 35 U.S.C. 103
Applicant’s amendment and arguments with respect to rejections have been fully considered but are moot because the arguments do not apply to any of the references being used in the current rejection. 

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.


The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.
The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1, 6, 8-11, 16, and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Aleksic et al. (US Pub. 2016/0104482, hereinafter Aleksic) in view of Wirsching et al. (EP 1975923, hereinafter Wirsching).
Regarding claim 1, Aleksic discloses a method comprising: 
processing, by the data processing hardware (Fig. 5 and [0058][0059] processor and memory), using a speech recognition model trained on training utterances in the first language only, acoustic features derived from the audio data to generate, as output from the speech recognition model, speech recognition scores for both wordpieces and corresponding [phoneme sequences] in the first language (Fig. 4, [0029] “recordings and transcriptions are compiled into statistical representations of the acoustics that constitute words and phrases”; [0044] “generates a recognition lattice of the utterances by performing speech recognition on the audio data using a first pass speech recognizer (404). Generating the recognition lattice can include generating one or more text phrases that acoustically match the utterances according to a first language model. The recognition lattice can include a score for each of the text phrases”); 
rescoring, by the data processing hardware, the speech recognition scores for the phoneme sequences generated as output from the speech recognition model in the first language based on the one or more terms in the second language from the biasing term list (Fig. 4, [0045]-[0048] “determines that the recognition lattice defines a specific context (406). … Using the second pass speech recognizer can include selecting a context language model for the specific context and supplying the context language model, a general language model, and the audio data to an ASR engine”); and
executing, by the data processing hardware, using the speech recognition scores for the wordpieces and the rescored speech recognition scores for the phoneme sequences, a decoding graph to generate a transcription for the utterance (Fig. 4, [0045]-[0048] “generates a transcription of the utterances by performing speech recognition on the audio data using a second pass speech recognizer biased towards the specific context defined by the recognition lattice (408)”).

Aleksic does not explicitly teach, however Wirsching does teach, including the bracketed limitation:
receiving, at data processing hardware, audio data encoding an utterance spoken by a native speaker of a first language, the utterance comprising at least one word in the first language and at least one word in a second language different than the first language ([0036][0040]-[0044] a navigation system receives an utterance from a user who may be traveling in a country whose language differs from the user’s own language);
receiving, at the data processing hardware, a biasing term list comprising one or more terms in the second language ([0015][0018][0036][0042]-[0044] list elements, e.g., destination locations, in a language (i.e., German, French, or Italian) other than the user’s language, i.e., English);
speech recognition scores for both wordpieces and corresponding [phoneme sequences] in the first language ([0010] when the subword unit is a phoneme, a sequence of phonemes is determined as the string of subword units that best matches the speech input at the speech recognition system).
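For illustration only, the rescoring limitation of claim 1 can be pictured as boosting the score of any phoneme-sequence hypothesis that contains a sequence derived from the biasing term list. The sketch below is a minimal, hypothetical rendering of that idea; the function names, the additive `bonus`, the log-score values, and the phoneme symbols are all assumptions and are not taken from Aleksic, Wirsching, or the application.

```python
def contains(sequence, sub):
    """Return True if `sub` occurs as a contiguous run inside `sequence`."""
    n = len(sub)
    return any(tuple(sequence[i:i + n]) == tuple(sub)
               for i in range(len(sequence) - n + 1))


def rescore(hypotheses, biasing_sequences, bonus=1.0):
    """Add a fixed bonus (hypothetical) to each hypothesis per matched biasing sequence.

    hypotheses: list of (phoneme_tuple, log_score) pairs
    biasing_sequences: list of phoneme tuples derived from the biasing term list
    """
    rescored = []
    for phonemes, score in hypotheses:
        boost = sum(bonus for term in biasing_sequences if contains(phonemes, term))
        rescored.append((phonemes, score + boost))
    return rescored


# Hypothetical first-pass hypotheses and one biasing term (symbols are illustrative).
hyps = [(("k", "r", "e", "t", "e"), -3.0), (("k", "r", "a", "t"), -2.5)]
bias = [("k", "r", "e", "t", "e")]
best = max(rescore(hyps, bias), key=lambda pair: pair[1])
```

Under these assumed values, the biased hypothesis overtakes the one that scored higher before rescoring.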

Regarding claim 6, Aleksic in view of Wirsching discloses the method of claim 1, and Aleksic further discloses: 
wherein, during executing of the decoding graph, the decoding graph biases the transcription to favor any of the one or more terms in the biasing term list (see Aleksic [0005][0006][0030] which notes these and other implementations can each optionally include one or more of the following features. Generating/EXECUTING the recognition lattice/DECODING GRAPH comprises generating one or more text phrases that acoustically match the utterances according to a first language model).
Regarding claim 8, Aleksic in view of Wirsching discloses the method of claim 1. 
Aleksic does not explicitly teach, however Wirsching does teach:
wherein none of the terms in the biasing term list were used to train the speech recognition model (Wirsching, [0012] using a subword unit speech recognition unit trained to recognize subword units of a first language in order to recognize the speech input of a language other than the first language).
Regarding claim 9, Aleksic in view of Wirsching discloses the method of claim 1, and Aleksic further discloses:

Regarding claim 10, Aleksic in view of Wirsching discloses the method of claim 1, and Aleksic further discloses wherein: 
the data processing hardware and the speech recognition model reside on a remote computing device; and receiving the audio data encoding the utterance comprises receiving the audio data encoding the utterance from a user device in communication with the remote computing device (Aleksic, Fig. 1, [0016] audio data is received by a user device, and the processing hardware and speech recognition model reside on a remote computing system reached over a data communications network 112).
Regarding claim 11, Aleksic discloses a system comprising: 
data processing hardware; and memory hardware in communication with the data processing hardware, the memory hardware storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations comprising (Fig. 5 and [0058][0059][0064] processor and memory):
processing, using a speech recognition model trained on training utterances in the first language only, acoustic features derived from the audio data to generate, as output from the speech recognition model, speech recognition scores for both wordpieces and corresponding phoneme sequences in the first language (Fig. 4, [0029] “recordings and transcriptions are compiled into statistical representations of the 
rescoring the speech recognition scores for the phoneme sequences generated as output from the speech recognition model in the first language based on the one or more terms in the second language from the biasing term list (Fig. 4, [0045]-[0048] “determines that the recognition lattice defines a specific context (406). … Using the second pass speech recognizer can include selecting a context language model for the specific context and supplying the context language model, a general language model, and the audio data to an ASR engine”); and
executing, using the speech recognition scores for the wordpieces and the rescored speech recognition scores for the phoneme sequences, a decoding graph to generate a transcription for the utterance (Fig. 4, [0045]-[0048] “generates a transcription of the utterances by performing speech recognition on the audio data using a second pass speech recognizer biased towards the specific context defined by the recognition lattice (408)”).  
Aleksic does not explicitly teach, however Wirsching does teach, including the bracketed limitation:
receiving audio data encoding an utterance spoken by a native speaker of a first language, the utterance comprising at least one word in the first language and at least 
receiving a biasing term list comprising one or more terms in the second language ([0015][0018][0036][0042]-[0044] list elements, e.g., destination locations, in a language (i.e., German, French, or Italian) other than the user’s language, i.e., English);
speech recognition scores for both wordpieces and corresponding [phoneme sequences] in the first language ([0010] when the subword unit is a phoneme, a sequence of phonemes is determined as the string of subword units that best matches the speech input at the speech recognition system).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the method of recognizing speech using dynamically biasing language models as taught by Aleksic with the multilingual non-native speech recognition as taught by Wirsching in order to recognize speech input in a language other than the first language and to provide an appropriate response to the query (Wirsching, [0012][0020]).

Regarding claims 16 and 18-20, these are the system claims corresponding to method claims 6 and 8-10. Therefore, claims 16 and 18-20 are rejected using the same rationale as applied to claims 6 and 8-10 above.

Claims 2 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Aleksic et al. (US Pub. 2016/0104482, hereinafter Aleksic) in view of Wirsching et al. (EP 1975923, hereinafter Wirsching) and further in view of Corfield (US 9966066 B1).
	Regarding claims 2 and 12, Aleksic in view of Wirsching teaches all of the limitations of claims 1 and 11 above.
	Aleksic in view of Wirsching does not explicitly teach, however Corfield does teach:
wherein rescoring the speech recognition scores for the phoneme sequences comprises using a biasing finite-state transducer (FST) to rescore the speech recognition scores for the phoneme sequences (see Corfield, col. 4, lines 16-27, which notes since we are using computers rather than humans to interpret the audio and turn it into words, we can use a finer grained set of phonemes, which may have several thousand members, which the computer has been trained to identify. The actual number of phonemes, or sub-phonemes, does not matter, as the computer training algorithms will adapt themselves to whatever phoneme set is specified. Phonemes have variable durations, which means a recognition system has to consider many different ways a given sequence of words can be enunciated with varying lengths of individual phonemes. Additionally, it is not known, a priori, what words were said to produce a given audio signal; and see Corfield, col. 4, lines 28-47, which notes from all the possible word sequences and possible durations of phonemes, the speech recognition system searches for the word sequence(s) and phoneme-to-audio-frame alignment(s) which best matches the acoustic signal. A common technique in speech recognition is 
Therefore, it would have been obvious to one of ordinary skill before the effective filing date of the claimed invention to incorporate the method of recognizing speech as taught by Aleksic in view of Wirsching with the combining of language models as taught by Corfield in order to rapidly combine language models and deliver a real-time user experience (see Corfield, col. 8, lines 44-53, which notes the technology of the present application provides technological improvements to allow for the combination of the topic and user structures in a way which allows the recognition engine to recognize audio whose utterances draw freely upon (and intermingle) words and phrases from both the topic and the user specific words and phrases. Also, while generally described as combining two language models, the technology as described herein can combine more than two language models; and see Corfield, col. 9, lines 3-8, which notes the technology of the present application outlines two exemplary and novel approaches which allow topic and user recognizers to be combined rapidly and deliver a real-time 
The combination of Aleksic in view of Wirsching with Corfield yields predictable results, such as recognizing words from different language models.

Claims 3 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Aleksic et al. (US Pub. 2016/0104482, hereinafter Aleksic) in view of Wirsching et al. (EP 1975923, hereinafter Wirsching) and Corfield (US 9966066 B1), and further in view of Reinhard (US 20050197835 A1).
Regarding claims 3 and 13, Aleksic in view of Wirsching and further in view of Corfield teaches all of the limitations of claims 2 and 12 above. 
	Aleksic in view of Wirsching does not explicitly teach, however, Corfield does teach:
	tokenizing, by the data processing hardware, each term in the biasing term list into a corresponding phoneme sequence in the second language (see Corfield, col. 8, lines 13-29, which notes the goal of customization is to support the recognition of words and phrases from two different recognizers (decoders and rescorers as described herein, but in a generalized form any two different recognizers). The simplest example might be to extend a given recognizer to recognize one new word. At the other end of the spectrum would be a combination of two extensive recognizers, such as one for oncology with one for business e-mail. Without any loss of generality, we can frame up the usage scenario as a desire to combine a general purpose/FIRST LANGUAGE recognizer for a group of similarly situated users, with a user-specific recognizer which 
mapping, by the data processing hardware, each corresponding phoneme sequence in the second language; and generating, by the data processing hardware, the biasing FST based on each corresponding phoneme sequence in the first language (see Corfield, col. 5, lines 32-42, which notes the C.fst performs a one for one mapping from context dependent phonemes to their dictionary equivalents (it does not calculate any probabilities). One for one mapping means that every context dependent phoneme maps to one, and only one, dictionary phoneme. However, several context dependent 
Aleksic in view of Wirsching and further in view of Corfield does not explicitly teach, however Reinhard does teach: 
mapping, by the data processing hardware, each corresponding phoneme sequence in the second language to a corresponding phoneme sequence in the first language (see Reinhard Abstract, which notes Acoustic models for speech recognition are automatically generated utilizing trained acoustic models from a native language and a foreign language. A phoneme-to-phoneme mapping is utilized to enable the description of foreign/second language words with native/first language phonemes. The phoneme-to-phoneme mapping is used for training foreign language words, described by native language phonemes on foreign language speech material. A new phonetic lexicon is created containing foreign/second language words and native/first language words transcribed by native language phonemes. Robust native language acoustic models can be derived utilizing foreign language and native language training material. The mapping may be used for training a grapheme to phoneme transducer (i.e., foreign/second language to native/first language) to generate native language pronunciations for new foreign language words).
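As a purely illustrative sketch of the kind of phoneme-to-phoneme mapping the Reinhard Abstract describes, a second-language phoneme sequence can be rewritten symbol by symbol into first-language phonemes through a lookup table. The table entries, the phoneme symbols, and the function name below are hypothetical and are not taken from Reinhard or the application.

```python
# Hypothetical table: foreign (second-language) phoneme -> native (first-language) phoneme.
FOREIGN_TO_NATIVE = {
    "R": "r",  # assumed: uvular r approximated by the native r
    "y": "u",  # assumed: front rounded vowel approximated by a back rounded vowel
}


def map_to_native(foreign_phonemes, table=FOREIGN_TO_NATIVE):
    """Map each foreign phoneme to a native phoneme; unmapped symbols pass through unchanged."""
    return [table.get(phoneme, phoneme) for phoneme in foreign_phonemes]


native = map_to_native(["k", "R", "y"])
```

A real mapping would be derived from trained acoustic models, as the Abstract indicates; the fixed dictionary here only stands in for that step.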
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed inventions to modify the systems and methods as .

Claims 4-5 and 14-15 are rejected under 35 U.S.C. 103 as being unpatentable over Aleksic et al., (US Pub. 2016/0104482, hereinafter Aleksic) in view of Wirsching et al., (EP 1975923, hereinafter Wirsching) and further in view of Kanda (US 20190139540 A1).
Regarding claims 4 and 14, Aleksic in view of Wirsching teaches all of the limitations of claims 1 and 11 above. 
Aleksic in view of Wirsching does not explicitly teach, however, Kanda does teach:
wherein the speech recognition model comprises an end-to-end, wordpiece-phoneme model (see Kanda Abstract, which notes a speech recognition device includes: an acoustic model based on an End-to-End neural network responsive to an observed sequence formed of prescribed acoustic features obtained from a speech 

Regarding claims 5 and 15, Aleksic in view of Wirsching and further in view of Kanda teaches all of the limitations of claims 4 and 14 above. 
Aleksic in view of Wirsching does not explicitly teach, however, Kanda does teach:
wherein the end-to-end, wordpiece-phoneme model comprises a recurrent neural network-transducer (RNN-T) (see Kanda [0020], which notes an End-to-End RNN/transducer learns direct mapping from an input observed sequence X to a sub-word sequence/state sequence s. A model called Connectionist Temporal Classification (CTC) is a typical example of End-to-End RNN. In CTC, an observed sequence X is far longer than a sub-word sequence s and, therefore, in order to compensate for the difference of the length, a blank label ϕ is added to an output of RNN. Specifically, a node corresponding to the blank label ϕ is provided in the output layer. As a result, a frame-wise sub-word sequence c={c.sub.1, . . . , c.sub.T} (including blank label ϕ) is obtained at the output of RNN. This sub-word sequence c is converted to a sub-word sequence s that is independent of the number of frames, by introducing mapping function Φ. The mapping function Φ removes blank labels ϕ from the frame-wise sub-
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the systems and methods as taught by Aleksic in view of Wirsching with the End-to-End RNNs as taught by Kanda in order to realize highly accurate speech recognition (see Kanda [0059], which notes it is necessary to use a framework other than the DNN-HMM hybrid method to realize highly accurate speech recognition while making full use of the End-to-End RNN characteristics).
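For reference, the CTC mapping function Φ described in the Kanda [0020] passage cited for claims 5 and 15 can be sketched as follows. The passage as cited states that Φ removes blank labels from the frame-wise sequence; merging consecutive repeated labels before removing blanks is the conventional CTC collapse and is an assumption here, as are the function name and the example labels. The sketch illustrates only this mapping function, not the network that produces the frame-wise labels.

```python
from itertools import groupby

BLANK = "ϕ"  # blank label added to the RNN output layer


def ctc_collapse(frame_labels, blank=BLANK):
    """Convert a frame-wise label sequence c into a frame-independent sub-word sequence s.

    Assumed conventional CTC behavior: merge consecutive repeats, then drop blanks.
    """
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != blank]


# Eight frames collapse to four sub-word labels; the blank keeps the two "a" labels distinct.
s = ctc_collapse(["a", "a", "ϕ", "a", "b", "ϕ", "ϕ", "b"])
```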

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Please see attached form PTO-892.
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SEONG-AH A. SHIN whose telephone number is (571)272-5933.  The examiner can normally be reached on 9 AM-3 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir, can be reached on 571-272-7799.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access 


Seong-ah A. Shin
Primary Examiner
Art Unit 2659



/SEONG-AH A SHIN/           Primary Examiner, Art Unit 2659