DETAILED ACTION
Introduction
This office action is in response to Applicant’s submission filed on 01/21/2021. Claims 1-24 are pending in the application and have been examined.
	
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-7 and 13-19 are rejected under 35 U.S.C. 103 as being unpatentable over Watanabe et al., US Patent Application Publication 2019/0189111, in view of Boyer, F., & Rouas, J. L. (2019). End-to-end speech recognition: A review for the French language. arXiv preprint arXiv:1910.08502.
Regarding claim 1, Watanabe teaches a computer-implemented method that when executed on data processing hardware causes the data processing hardware to perform operations (see Watanabe, [0031]) comprising: receiving a training example for a listen-attend-spell (LAS) decoder of a two-pass streaming neural network model (see Watanabe, [0071], where X and Y are training data including acoustic feature sequences and label sequences; [0036], the end-to-end speech recognition module 200 includes an encoder network module 202, an attention decoder module 204); determining whether the training example corresponds to a supervised audio-text pair or an unpaired text sequence (see Watanabe, [0032], which indicates that the multi-lingual speech recognition module 100 constructs a language-independent network; the language label is interpreted as determining a supervised audio-text pair or an unpaired text sequence); and updating the LAS decoder and the context vector based on the determined cross entropy loss (see Watanabe, [0071], in the end-to-end network training module 117, Encoder Network Parameters 203, Decoder Network Parameters 205, and CTC Network Parameters 209 are jointly optimized so that the loss function is reduced (equation 40), where X and Y are training data including acoustic feature sequences and label sequences; interpreted as cross entropy loss).
However, Watanabe fails to teach, when the training example corresponds to an unpaired text sequence, determining a cross entropy loss based on a log probability associated with a context vector of the training example.
However, Boyer teaches, when the training example corresponds to an unpaired text sequence, determining a cross entropy loss based on a log probability associated with a context vector of the training example (see Boyer, pg. 3, sect. 2.4, The first one is a two-pass decoding process where the complete hypotheses from the attention model are computed and then rescored according to the following equation, where p_ctc(Y|x) is computed using the standard CTC forward-backward algorithm).
Watanabe and Boyer are considered to be analogous to the claimed invention because they relate to end-to-end speech recognition in which neural architectures are trained to directly model sequences of features as characters. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Watanabe on the language-independent neural network architecture that can recognize speech and identify language jointly in multiple different languages with the two-pass decoding process taught by Boyer to reduce the irregular alignments on the same frame in the attention-based model (see Boyer, pg. 3, sect. 2.4).
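For illustration only, the training branch recited in claim 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions and is not taken from the application, Watanabe, or Boyer: the names TrainingExample, is_unpaired, and cross_entropy are hypothetical, an example is assumed to be unpaired when it carries no audio, and the cross entropy is assumed to be the negative sum of the log probabilities that a decoder assigns to the target tokens.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class TrainingExample:
    """Hypothetical container: audio is None for an unpaired text sequence."""
    text: Sequence[int]                      # target token ids
    audio: Optional[Sequence[float]] = None  # acoustic features, if any


def is_unpaired(example: TrainingExample) -> bool:
    """Assumed test for the 'determining' step: no audio means unpaired text."""
    return example.audio is None


def cross_entropy(token_probs: Sequence[Sequence[float]],
                  targets: Sequence[int]) -> float:
    """Cross entropy as the negative sum of log probabilities of the targets.

    token_probs[i] is the (assumed) decoder distribution at output step i.
    """
    return -sum(math.log(dist[t]) for dist, t in zip(token_probs, targets))


# Toy data: one supervised audio-text pair and one unpaired text sequence.
paired = TrainingExample(text=[0, 1], audio=[0.1, 0.2, 0.3])
unpaired = TrainingExample(text=[0, 1])

# Loss for the unpaired example under made-up decoder distributions.
loss = cross_entropy([[0.5, 0.5], [0.25, 0.75]], unpaired.text)
```

In this sketch the loss would then drive the update of the decoder parameters; the update step itself is omitted because the cited passages do not specify an optimizer.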
Regarding claim 2, Watanabe in view of Boyer teaches the computer-implemented method of claim 1. Watanabe further teaches receiving a second training example for the LAS decoder of the two-pass streaming neural network (see Watanabe, [0071], where X and Y are training data including acoustic feature sequences and label sequences; interpreted as second training data for the LAS decoder); determining that the second training example corresponds to the supervised audio-text pair (see Watanabe, [0032], which indicates that the multi-lingual speech recognition module 100 constructs a language-independent network; the language label is interpreted as determining a supervised audio-text pair); and updating the LAS decoder and acoustic context vector parameters associated with an acoustic context vector based on a log probability for the acoustic context vector (see Watanabe, [0071], in the end-to-end network training module 117, Encoder Network Parameters 203, Decoder Network Parameters 205, and CTC Network Parameters 209 are jointly optimized so that the loss function is reduced (equation 40), where X and Y are training data including acoustic feature sequences and label sequences).
Regarding claim 3, Watanabe in view of Boyer teaches the computer-implemented method of claim 1. Watanabe further teaches wherein determining whether the training example corresponds to the supervised audio-text pair or the unpaired text sequence comprises identifying a domain identifier that indicates whether the training example corresponds to the supervised audio-text pair or the unpaired text sequence (see Watanabe, [0032], which indicates that the multi-lingual speech recognition module 100 constructs a language-independent network; the language label is interpreted as the domain identifier).
Regarding claim 4, Watanabe in view of Boyer teaches the computer-implemented method of claim 1. Boyer further teaches wherein updating the LAS decoder reduces a word error rate (WER) of the two-pass streaming neural network model with respect to long tail entities (see Boyer, pg. 8, sect. 6, For subword units, classic RNN-transducer, RNN-transducer with attention and joint CTC-attention show comparable performance on subword error rate and WER, with the first one being slightly better on WER (17.4%) and the last one having a lower error rate on subword (14.5%)).
Regarding claim 5, Watanabe in view of Boyer teaches the computer-implemented method of claim 1. Boyer further teaches wherein the log probability is defined by an interpolation of a first respective log probability generated from an acoustic context vector and a second respective log probability generated from a text context vector (see Boyer, pg. 3, sect. 2.4, which teaches a two-pass decoding process where the complete hypotheses from the attention model are computed and then rescored according to the following equation, where p_ctc(Y|x) is computed using the standard CTC forward-backward algorithm; interpreted as the interpolation of the log probability of the context vectors).
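For illustration only, an interpolation of two log probabilities and its use to rescore complete hypotheses can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the rescoring equation referenced in the cited passage is not reproduced in this action, so a linear interpolation with a single weight in [0, 1] is assumed here, and the names interpolate, rescore, and weight, as well as the toy hypotheses and scores, are hypothetical.

```python
import math
from typing import List, Tuple


def interpolate(log_p_first: float, log_p_second: float, weight: float) -> float:
    """Linear interpolation of two log probabilities.

    weight applies to the first score and (1 - weight) to the second.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * log_p_first + (1.0 - weight) * log_p_second


def rescore(hypotheses: List[Tuple[str, float, float]], weight: float) -> str:
    """Return the hypothesis text with the highest interpolated score.

    Each hypothesis is (text, first log probability, second log probability).
    """
    best = max(hypotheses, key=lambda h: interpolate(h[1], h[2], weight))
    return best[0]


# Made-up scores for two complete hypotheses.
hyps = [
    ("a cat", math.log(0.6), math.log(0.2)),
    ("a cap", math.log(0.3), math.log(0.5)),
]
best_equal = rescore(hyps, 0.5)       # both scores weighted equally
best_first_only = rescore(hyps, 1.0)  # only the first score counts
```

With equal weights the second hypothesis wins because 0.3 x 0.5 exceeds 0.6 x 0.2; with the full weight on the first score the first hypothesis wins. The choice of weight is a tuning decision that the cited passages, as quoted here, do not fix.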
Regarding claim 6, Watanabe in view of Boyer teaches the computer-implemented method of claim 1. Boyer further teaches wherein the LAS decoder operates in a beam search mode based on a hypothesis generated by a recurrent neural network transducer (RNN-T) decoder during a first pass of the two-pass streaming neural network model (see Boyer, pg. 3, sect. 2.4, The RNN transducer architecture augmented with attention mechanisms was first mentioned, to the best of our knowledge, in [14]).
Regarding claim 7, Watanabe in view of Boyer teaches the computer-implemented method of claim 1. Boyer further teaches wherein the operations further comprise generating the context vector of the training example with an attention mechanism configured to summarize encoder features from an encoded acoustic frame (see Boyer, pg. 2, sect. 2.2, The decoder output is conditioned by the previous output y_{l−1}, a hidden vector d_{l−1} and a context vector c_l).
Claim 13 is directed to a system claim corresponding to the method claim presented in claim 1 and is rejected on the same grounds stated above regarding claim 1.
Claim 14 is directed to a system claim corresponding to the method claim presented in claim 2 and is rejected on the same grounds stated above regarding claim 2.
Claim 15 is directed to a system claim corresponding to the method claim presented in claim 3 and is rejected on the same grounds stated above regarding claim 3.
Claim 16 is directed to a system claim corresponding to the method claim presented in claim 4 and is rejected on the same grounds stated above regarding claim 4.
Claim 17 is directed to a system claim corresponding to the method claim presented in claim 5 and is rejected on the same grounds stated above regarding claim 5.
Claim 18 is directed to a system claim corresponding to the method claim presented in claim 6 and is rejected on the same grounds stated above regarding claim 6.
Claim 19 is directed to a system claim corresponding to the method claim presented in claim 7 and is rejected on the same grounds stated above regarding claim 7.
Claims 8-12 and 20-24 are rejected under 35 U.S.C. 103 as being unpatentable over Watanabe et al., US Patent Application Publication 2019/0189111, in view of Renduchintala, A., Ding, S., Wiesner, M., & Watanabe, S. (2018). Multi-modal data augmentation for end-to-end ASR. arXiv preprint arXiv:1803.10299.
Regarding claim 8, Watanabe teaches a computer-implemented method that when executed on data processing hardware causes the data processing hardware to perform operations (see Watanabe, [0031]) comprising: receiving a training example for a listen-attend-spell (LAS) decoder of a two-pass streaming neural network model (see Watanabe, [0071], where X and Y are training data including acoustic feature sequences and label sequences; [0036], the end-to-end speech recognition module 200 includes an encoder network module 202, an attention decoder module 204); determining whether the training example corresponds to a supervised audio-text pair or an unpaired text sequence (see Watanabe, [0032], which indicates that the multi-lingual speech recognition module 100 constructs a language-independent network; the language label is interpreted as determining a supervised audio-text pair or an unpaired text sequence). However, Watanabe fails to teach, when the training example corresponds to the unpaired training data, generating a missing portion of the unpaired training data to form a generated audio-text pair; and updating the LAS decoder and a context vector associated with the unpaired data based on the generated audio-text pair.
However, Renduchintala teaches, when the training example corresponds to the unpaired training data, generating a missing portion of the unpaired training data to form a generated audio-text pair (see Renduchintala, pg. 2, sect. 2.2, A desirable synthetic input should be easy to construct from plain text corpora, and should be as similar as possible to acoustic input; interpreted as generating the missing portion of the unpaired training data); and updating the LAS decoder and a context vector associated with the unpaired data based on the generated audio-text pair (see Renduchintala, pg. 1, sect. 2.1, The attention mechanism takes the output of the encoder and generates a context vector (gray cross hatching) which is utilized by the decoder (red cross hatching) to generate each token in the output sequence {y0, y1, . . .}).
Watanabe and Renduchintala are considered to be analogous to the claimed invention because they relate to end-to-end speech recognition in neural machine translation. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Watanabe on the language-independent neural network architecture that can recognize speech and identify language jointly in multiple different languages with the Multi-modal Data Augmentation (MMDA) model teachings of Renduchintala to improve the performance of end-to-end models trained with limited data (see Renduchintala, pg. 1, sect. 1).
Regarding claim 9, Watanabe in view of Renduchintala teaches the computer-implemented method of claim 8. Renduchintala further teaches determining an acoustic context vector based on the generated audio-text pair (see Renduchintala, pg. 1, sect. 2.1, The attention mechanism takes the output of the encoder and generates a context vector (gray cross hatching) which is utilized by the decoder (red cross hatching) to generate each token in the output sequence {y0, y1, . . .}); and determining an interpolation of a first respective log probability generated from the acoustic context vector and a second respective log probability generated from a text context vector, wherein updating the LAS decoder is further based on the interpolation of the first respective log probability and the second respective log probability (see Renduchintala, pg. 2, sect. 2.3, Note that in both cases the attention and decoder parameters (denoted by θ_att and θ_dec, see equation 1) are shared, while the acoustic encoder parameters (θ_enc) and augmenting encoder parameters (θ_aug) are only updated in their respective training batches; interpreted as updating the LAS decoder based on the interpolation of the first and second log probabilities).
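For illustration only, the parameter-sharing arrangement described in the cited passage of Renduchintala (shared attention and decoder parameters, encoder parameters updated only in their respective batches) can be sketched in Python as follows. This is a minimal sketch, not Renduchintala's implementation: the function name params_to_update, the batch labels "acoustic" and "augmenting", and the string parameter names are hypothetical stand-ins for the parameter groups named in the passage.

```python
from typing import FrozenSet

# Parameter groups named in the cited passage, represented here as labels only.
SHARED: FrozenSet[str] = frozenset({"theta_att", "theta_dec"})


def params_to_update(batch_kind: str) -> FrozenSet[str]:
    """Return the parameter groups updated for a given (assumed) batch label."""
    if batch_kind == "acoustic":
        # Acoustic input: acoustic encoder plus the shared attention/decoder.
        return SHARED | {"theta_enc"}
    if batch_kind == "augmenting":
        # Synthetic (text-derived) input: augmenting encoder plus shared parts.
        return SHARED | {"theta_aug"}
    raise ValueError(f"unknown batch kind: {batch_kind!r}")


acoustic_groups = params_to_update("acoustic")
augmenting_groups = params_to_update("augmenting")
common_groups = acoustic_groups & augmenting_groups
```

In this sketch the groups common to both batch kinds are exactly the attention and decoder parameters, which is the sharing the passage describes.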
Regarding claim 10, Watanabe in view of Renduchintala teaches the computer-implemented method of claim 8. Watanabe further teaches wherein determining whether the training example corresponds to the supervised audio-text pair or the unpaired training data comprises identifying a domain identifier that indicates whether the training example corresponds to the supervised audio-text pair or the unpaired training data (see Watanabe, [0032], which indicates that the multi-lingual speech recognition module 100 constructs a language-independent network; the language label is interpreted as the domain identifier).
Regarding claim 11, Watanabe in view of Renduchintala teaches the computer-implemented method of claim 8. Renduchintala further teaches wherein updating the LAS decoder reduces a word error rate (WER) of the two-pass streaming neural network model with respect to long tail entities (see Renduchintala, pg. 4, sect. 5.4, The best performing synthetic input scheme was applied to Spanish and Italian, where a similar trend was observed. MMDA consistently achieved better WER and obtained small improvements in CER (see table 2, parts 2 and 3)).
Regarding claim 12, Watanabe in view of Renduchintala teaches the computer-implemented method of claim 8. Renduchintala further teaches wherein the operations further comprise generating the context vector of the training example with an attention mechanism configured to summarize encoder features from an encoded acoustic frame (see Renduchintala, Figure 1: Overview of our Multi-modal Data Augmentation (MMDA) model. Figure 1a highlights the network engaged when acoustic features are given as input to an acoustic encoder (shaded blue). Alternatively, when synthetic input is supplied the network (Figure 1b) uses an augmenting encoder (green). In both cases a shared attention mechanism and decoder are used to predict the output sequence. For simplicity we show 2 layers without down-sampling in the acoustic encoder and omit the input embedding layer in the augmenting encoder).
Claim 20 is directed to a system claim corresponding to the method claim presented in claim 8 and is rejected on the same grounds stated above regarding claim 8.
Claim 21 is directed to a system claim corresponding to the method claim presented in claim 9 and is rejected on the same grounds stated above regarding claim 9.
Claim 22 is directed to a system claim corresponding to the method claim presented in claim 10 and is rejected on the same grounds stated above regarding claim 10.
Claim 23 is directed to a system claim corresponding to the method claim presented in claim 11 and is rejected on the same grounds stated above regarding claim 11.
Claim 24 is directed to a system claim corresponding to the method claim presented in claim 12 and is rejected on the same grounds stated above regarding claim 12.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Ma, J., & Schwartz, R. (2008). Unsupervised versus supervised training of acoustic models. In Ninth Annual Conference of the International Speech Communication Association. Ma teaches comparisons of unsupervised with supervised training of acoustic models (see Ma, pg. 2, sect. 3).
Sainath, T. N., Pang, R., Rybach, D., He, Y., Prabhavalkar, R., Li, W., ... & Chiu, C. C. (2019). Two-pass end-to-end speech recognition. arXiv preprint arXiv:1908.10992. Sainath teaches a two-pass architecture in which an RNN-T decoder and a LAS decoder share an encoder network (see Sainath, pg. 1, sect. 1).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to NANDINI SUBRAMANI whose telephone number is (571)272-3916. The examiner can normally be reached Monday - Friday, 12:00 pm - 5:00 pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh M Mehta, can be reached at (571)272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/NANDINI SUBRAMANI/Examiner, Art Unit 2656     

/BHAVESH M MEHTA/Supervisory Patent Examiner, Art Unit 2656