Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statements (IDS) submitted on 8/1/2019 and 9/30/2019 are being considered by the examiner.

Claim Rejections-35 USC § 102
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
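(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.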

Claims 1-2, 4, 8-10, 15-18 and 20 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Le Roux et al. (US-20190318725-A1), hereinafter referred to as Le Roux.
With respect to claims 1, 16 and 20, Le Roux teaches a method/system/media storing instructions:
and one or more computer-readable media storing instructions that, when executed by the one or more computers, cause the one or more computers to perform operations ([0048] Step 115 of FIG. 1A includes inputting the received audio signal using a hardware processor into a pre-trained speech recognition network stored in a computer readable memory.) comprising:
([0086] A speech mixture 1305 includes speech by multiple speakers, for example two speakers, where utterance 1 1306 is an utterance spoken by speaker 1 with a first part 1307 in Japanese and a second part 1308 in English, and utterance 2 1309 is an utterance spoken by speaker 2 in Chinese.); 
generating, by the one or more computers, a sequence of feature vectors indicative of the acoustic characteristics of the utterance ([0076] A sequence of feature vectors obtained from the input mixture, for example the log magnitude of the short-time Fourier transform of the input mixture, is used as input to a mixture encoder 1110.); 
processing, by the one or more computers, the sequence of feature vectors ([0077] A sequence of feature vectors obtained from the input mixture, for example the log magnitude of the short-time Fourier transform of the input mixture, is used as input to a mixture encoder 1120.) using a speech recognition model ([0078] A sequence of feature vectors obtained from the input mixture, for example the log magnitude of the short-time Fourier transform of the input mixture, is used as input to a mixture encoder 1150…The mixture encoder 1150 is composed of multiple bidirectional long short-term memory (BLSTM) neural network layers, from the first BLSTM layer 1171 to the last BLSTM layer 1173. Each BLSTM layer is composed of a forward long short-term memory (LSTM) layer and a backward LSTM layer, whose outputs are combined and used as input by the next layer.) that has been trained using a loss function ([0124] Attention Decoder Network Parameters 1405, and CTC Network Parameters 1409 are jointly optimized so that the loss function is reduced.) that uses N-best lists of decoded hypotheses ([0133] However, it is difficult to enumerate all possible label sequences for Y and compute λ log pctc(Y|X) + (1−λ) log patt(Y|X), because the number of possible label sequences increases exponentially to the length of the sequence. Therefore, a beam search technique is usually used to find Ŷ, in which shorter label sequence hypotheses are generated first, and only a limited number of hypotheses, which have a higher score than others, are extended to obtain longer hypotheses. Finally, the best label sequence hypothesis is selected in the complete hypotheses that reached the end of the sequence.), the speech recognition model comprising an encoder, an attention ([0037] … FIG. 17B includes a speech separation network along with a hybrid CTC/Attention-based speech recognition ASR network, in accordance with some embodiments of the present disclosure) module, and a decoder, wherein the encoder and decoder each comprise one or more recurrent neural network layers ([0008] For example, learned through experimentation are end-to-end automatic speech recognition (ASR) systems used with encoder-decoder recurrent neural networks (RNNs) to directly convert sequences of input speech features to sequences of output labels without any explicit intermediate representation of phonetic/linguistic constructs. Implementing the entire recognition system as a monolithic neural network can remove the dependence on ad-hoc linguistic resources.);
obtaining, by the one or more computers as a result of the processing with the speech recognition model, a sequence of output vectors representing distributions over a predetermined set of linguistic units ([0091] The label sequence search module 1406 finds the label sequence with the highest sequence probability using the first and second posterior probability distributions provided from the attention decoder network module 1404 and the CTC module 1408. The first and second posterior probabilities of label sequence computed by the attention decoder network module 1404 and the CTC module 1408 are combined into one probability); 
determining, by the one or more computers, a transcription for the utterance based on the sequence of output vectors ([0026] FIG. 9A is a flow diagram…wherein the multi-speaker ASR network includes a speaker separation network outputting…and a decoder network outputting a text for each target speaker from the recognition encoding for that target speaker, according to embodiments of the present disclosure, and [0006] As an additional benefit, among many benefits, is the joint training framework can train on more realistic data that contains only mixed signals and their transcriptions, and thus can be suited to large scale training on existing transcribed data.); and
providing, by the one or more computers, data indicating the transcription of the utterance ([0026] FIG. 9A is a flow diagram…wherein the multi-speaker ASR network includes a speaker separation network outputting…and a decoder network outputting a text for each target speaker from the recognition encoding for that target speaker, according to embodiments of the present disclosure, and [0121] The selected utterances (and their transcripts) are then concatenated and considered as a single utterance in the generated corpus.).
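For illustration only, the feature extraction quoted above from Le Roux [0076] (the log magnitude of the short-time Fourier transform) can be sketched as follows. This is a minimal standard-library sketch, not code from Le Roux or from the application: the function name log_magnitude_stft, the frame length, the hop size, the Hann window, and the synthetic test tone are all assumptions chosen for readability.

```python
import cmath
import math

def log_magnitude_stft(signal, frame_len=64, hop=32, eps=1e-8):
    """Return one log-magnitude spectrum (feature vector) per frame."""
    # Hann window (an assumed choice) reduces leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1  # non-negative frequencies only
    features = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        # Direct DFT of the windowed frame (slow but dependency-free).
        spectrum = [
            sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n, x in enumerate(frame))
            for k in range(n_bins)
        ]
        features.append([math.log(abs(c) + eps) for c in spectrum])
    return features

# Synthetic stand-in for speech: a 1 kHz tone sampled at 8 kHz.
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE) for n in range(512)]
features = log_magnitude_stft(tone)
```

Each row of features is one feature vector of the sequence; a practical system would use an FFT library rather than the direct DFT shown here.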
With respect to claims 2 and 18, Le Roux teaches wherein the speech recognition model has been trained such that the loss function distributes probability weight over items in the N-best lists (probability mass distributed over best hypotheses as in eq. 43, and [0133] Therefore, a beam search technique is usually used to find Ŷ, in which shorter label sequence hypotheses are generated first, and only a limited number of hypotheses, which have a higher score than others, are extended to obtain longer hypotheses.).
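The beam search quoted from Le Roux [0133] can be pictured with the following minimal sketch, in which each hypothesis is scored by λ log pctc(Y|X) + (1−λ) log patt(Y|X) and only a limited number of partial hypotheses are extended at each step. Everything specific here is an assumption made for illustration: the bigram probability tables stand in for the CTC and attention networks, and the names make_scorer and beam_search, the value of λ, the beam width, and the maximum length are hypothetical.

```python
import math

def make_scorer(table):
    """Build a toy sequence scorer from a bigram probability table."""
    def score(hyp):
        total, prev = 0.0, "<s>"
        for label in hyp:
            total += math.log(table[prev][label])
            prev = label
        return total
    return score

def beam_search(score_ctc, score_att, labels, eos, lam=0.3, beam=2, max_len=4):
    """Extend only the `beam` best hypotheses; return finished ones, best first."""
    def joint(hyp):
        return lam * score_ctc(hyp) + (1 - lam) * score_att(hyp)

    active, complete = [()], []
    for _ in range(max_len):
        candidates = [hyp + (y,) for hyp in active for y in labels]
        candidates.sort(key=joint, reverse=True)
        active = []
        for hyp in candidates[:beam]:
            # Hypotheses that reached the end-of-sequence label are finished.
            (complete if hyp[-1] == eos else active).append(hyp)
        if not active:
            break
    return sorted(complete, key=joint, reverse=True)

# Made-up probabilities; a real system would query trained networks.
ctc_table = {"<s>": {"a": 0.6, "b": 0.3, "</s>": 0.1},
             "a":   {"a": 0.2, "b": 0.5, "</s>": 0.3},
             "b":   {"a": 0.1, "b": 0.2, "</s>": 0.7}}
att_table = {"<s>": {"a": 0.7, "b": 0.2, "</s>": 0.1},
             "a":   {"a": 0.1, "b": 0.6, "</s>": 0.3},
             "b":   {"a": 0.1, "b": 0.1, "</s>": 0.8}}

nbest = beam_search(make_scorer(ctc_table), make_scorer(att_table),
                    labels=["a", "b", "</s>"], eos="</s>")
```

The returned list is ordered by joint score, so its first entry corresponds to the single best hypothesis that the cited paragraph selects.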
With respect to claim 4, Le Roux teaches wherein the speech recognition model is configured to output a probability distribution over a predetermined set of grapheme symbols ([0093] End-to-end speech recognition is generally defined as a problem to find the most probable label sequence Y given input acoustic feature sequence X… where U* denotes a set of possible label sequences given a set of pre-defined labels U. A label may be a character or a word. The label sequence probability p(Y|X) can be computed using a pre-trained neural network.).
With respect to claim 8 Le Roux teaches wherein the one or more recurrent neural network layers comprise long short-term memory (LSTM) cells ([0096] RNN may be implemented as a Long Short-Term Memory…)
([0105] After that, decoder state vector ql−1 is updated to ql using an LSTM…, and [0096] An encoder module 1402…RNN may be implemented as a Long Short-Term Memory (LSTM)…Another RNN may be a bidirectional RNN (BRNNs) or a bidirectional LSTM (BLSTM).)
With respect to claim 10, Le Roux recites wherein the encoder comprises a plurality of bidirectional LSTM layers ([0077] The mixture encoder 1120 is composed of multiple bidirectional long short-term memory (BLSTM) neural network layers, from the first BLSTM layer 1101 to the last BLSTM layer 1103.).
With respect to claim 15 Le Roux recites wherein the speech recognition model is configured to provide streaming speech recognition results that include substantially real-time transcriptions of a portion of an utterance while a speaker of the utterance continues to speak the utterance. 
With respect to claim 17 Le Roux recites wherein the speech recognition model is an end-to-end neural network model ([0226] In this invention, all the components of end-to-end multichannel speech recognition can be implemented with differentiable functions including multiple neural networks).  

Claim Rejections-35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
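A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.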

Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Le Roux in view of Graves et al. (Graves, A. & Jaitly, N. (2014). Towards End-To-End Speech Recognition with Recurrent Neural Networks. Proceedings of the 31st International Conference on Machine Learning, in PMLR 32(2):1764-1772) hereinafter referred to as Graves.
With respect to claim 3, Le Roux does not teach wherein the speech recognition model has been trained to directly minimize expected word error rate.
Graves teaches wherein the speech recognition model has been trained to directly minimize expected word error rate (Abstract: A modification to the objective function is introduced that trains the network to minimise the expectation of an arbitrary transcription loss function. This allows a direct optimisation of the word error rate, even in the absence of a lexicon or language model. The system achieves a word error rate of 27.3%).
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify the teachings of Le Roux to include the teachings of Graves, the motivation being to allow direct optimization of the word error rate even in the absence of a lexicon or language model (Graves: Abstract).
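As a hedged illustration of what an expected word error rate is, the sketch below computes the word-level edit distance between a reference and each candidate transcription and then averages it under a given probability distribution. It is not the training procedure of Graves, which estimates the expectation from samples of the network output; the function names, the example sentences, and the probabilities are assumptions.

```python
def word_errors(reference, hypothesis):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]

def expected_wer(reference, hypotheses):
    """hypotheses: list of (text, probability) pairs whose probabilities sum to 1."""
    n_ref = len(reference.split())
    return sum(p * word_errors(reference, text) for text, p in hypotheses) / n_ref

# Made-up candidate transcriptions and probabilities.
candidates = [("the cat sat", 0.5), ("the bat sat", 0.3), ("the cat", 0.2)]
ewer = expected_wer("the cat sat", candidates)
```

Minimizing this quantity with respect to the model parameters, rather than a per-label likelihood, is what "directly" optimizing the word error rate refers to.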

Claims 5, 7 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Le Roux in view of Hayashi et al. (Hayashi, T. et al. “Multi-Head Decoder for End-to-End Speech Recognition.” ArXiv abs/1804.08050 (2018): n. pag.) hereinafter referred to as Hayashi.
With respect to claims 5 and 19, Le Roux does not teach wherein the attention module provides multi-headed attention in which multiple different sets of weighting parameters are used to process different segments of output from the encoder.
Hayashi teaches wherein the attention module provides multi-headed attention in which multiple different sets of weighting parameters are used to process different segments of output from the encoder (p1 Col2 para3 ll 1-8: …in this study we present a new network architecture called multi-head decoder for end-to-end speech recognition as an extension of a multi-head attention model. Instead of the integration in the attention level, our proposed method uses multiple decoders for each attention and integrates their outputs to generate a final output. Furthermore, in order to make each head to capture the different modalities, different attention functions are used for each head, and Fig 3 shows different weights for each attention head).
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify the teachings of Le Roux to include the teachings of Hayashi, the motivation being to jointly focus on information from different representation subspaces at different positions (Hayashi p1 col 2 para 2).
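To make the notion of multi-headed attention concrete, the following simplified sketch splits each encoder output vector into equal segments and lets every head compute its own attention weights over its segment. It is an assumption-laden illustration and not the architecture of Hayashi or of the application: the learned projection matrices of a real multi-head attention layer are omitted, scaled dot-product scoring is an assumed choice, and the function names and toy vectors are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def multi_head_context(encoder_out, query, n_heads):
    """Attend separately over each segment of the encoder vectors."""
    seg = len(query) // n_heads
    context = []
    for h in range(n_heads):
        lo, hi = h * seg, (h + 1) * seg
        # Scaled dot-product score of this head's query slice against each frame.
        scores = [sum(q * k for q, k in zip(query[lo:hi], frame[lo:hi]))
                  / math.sqrt(seg) for frame in encoder_out]
        weights = softmax(scores)  # one weight per encoder frame, per head
        context += [sum(w * frame[d] for w, frame in zip(weights, encoder_out))
                    for d in range(lo, hi)]
    return context

# Two encoder frames of dimension 4 and one decoder query (made-up values).
encoder_out = [[1.0, 0.0, 0.0, 1.0],
               [0.0, 1.0, 1.0, 0.0]]
query = [5.0, 0.0, 5.0, 0.0]
context = multi_head_context(encoder_out, query, n_heads=2)
```

With these toy values the first head weights the first frame most heavily while the second head weights the second frame, which is the sense in which different heads can focus on different positions.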
With respect to claim 7, Le Roux does not teach wherein the attention module comprises at least four attention heads.
Hayashi teaches wherein the attention module comprises at least four attention heads (Table 1 page 4, #heads in MHA is 4).
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify the teachings of Le Roux to include the teachings of Hayashi, the motivation being to jointly focus on information from different representation subspaces at different positions (Hayashi p1 col 2 para 2).
Allowable Subject Matter
The following is an examiner’s statement of reasons for allowance:
Claim 6 recites “wherein the attention module comprises a plurality of neural networks that are separately trained to generate output to the decoder from different segments of output from the 
Claim 11 recites “wherein the speech recognition model has been trained by performing, for each training example of multiple training examples, operations including: determining a plurality of speech recognition hypotheses using the speech recognition model being trained; ranking the plurality of speech recognition hypotheses; identifying N highest-ranking speech recognition hypotheses in the plurality of speech recognition hypotheses, where N is an integer of a predetermined value; distributing probability mass concentrated entirely on the N highest-ranking speech recognition hypotheses; and approximating a loss function for training according to the distributed probability mass” which is allowable over the prior art. The closest teachings to the indicated allowable subject matter are the references that are cited in the current office action. However, none of the cited references teach and/or suggest distributing probability mass concentrated entirely on the N highest-ranking speech recognition hypotheses; and approximating a loss function for training according to the distributed probability mass.
Claims 12-14 depend from claim 11 and are allowable for substantially similar reasons.
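One possible reading of the claim 11 language (ranking hypotheses, keeping the N highest-ranking ones, concentrating the probability mass entirely on them, and approximating the loss from that distribution) is sketched below. It is an illustrative interpretation only, not the applicant's implementation: the function name, the input format of (log score, word error count) pairs, and the numbers are assumptions.

```python
import math

def nbest_expected_loss(hypotheses, n):
    """hypotheses: list of (log_score, num_word_errors) pairs."""
    # Rank by model score and keep only the N highest-ranking hypotheses.
    top = sorted(hypotheses, key=lambda h: h[0], reverse=True)[:n]
    # Renormalize so that all probability mass lies on the N-best list.
    m = max(score for score, _ in top)
    weights = [math.exp(score - m) for score, _ in top]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Approximate the expected number of word errors under that distribution.
    return sum(p * errors for p, (_, errors) in zip(probs, top))

# Made-up hypotheses: (log probability, word errors against the reference).
hyps = [(math.log(0.50), 0), (math.log(0.25), 2),
        (math.log(0.15), 1), (math.log(0.10), 3)]
loss_top2 = nbest_expected_loss(hyps, n=2)
loss_all = nbest_expected_loss(hyps, n=4)
```

Restricting the sum to the N-best list makes the expectation tractable, at the cost of ignoring hypotheses outside the list.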
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ATHAR N PASHA whose telephone number is (408)918-7675. The examiner can normally be reached Monday-Thursday and alternate Fridays, 7:30-4:30 PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/A.N.P./               Examiner, Art Unit 2657                                                                                                                                                                                         
/Paras D Shah/               Primary Examiner, Art Unit 2659                                                                                                                                                                                         
03/26/20201