Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-3, 6, 7, 11-13, 16 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Le Roux (US-20190318725-A1), in view of Graves et al. (Graves, A. & Jaitly, N. (2014). Towards End-To-End Speech Recognition with Recurrent Neural Networks. Proceedings of the 31st International Conference on Machine Learning, in PMLR 32(2):1764-1772) hereinafter referred to as Graves.

With respect to claims 1 and 11, Le Roux teaches [A computer-implemented method when executed on data processing hardware causes the data processing hardware to perform operations comprising - Claim 1], and [A system comprising: data processing hardware; and memory hardware in communication with the data processing hardware and storing instructions that when executed on the data processing hardware causes the data processing hardware to perform operations comprising - Claim 11] ([0011] According to an embodiment of the present disclosure, a speech recognition system for recognizing speech including overlapping speech by multiple speakers. The system includes a hardware processor. A computer storage memory to store data along with having computer-executable instructions stored thereon that, when executed by the hardware processor is to implement a stored speech recognition network.):
receiving a sequence of feature vectors indicative of acoustic characteristics of a training utterance ([0076] A sequence of feature vectors obtained from the input mixture, for example the log magnitude of the short-time Fourier transform [acoustic characteristic] of the input mixture, is used as input to a mixture encoder 1110; and [0086] A speech mixture 1305 includes speech by multiple speakers, for example two speakers, where utterance 1 1306 is an utterance spoken by speaker 1 with a first part 1307 in Japanese and a second part 1308 in English, and utterance 2 1309 is an utterance spoken by speaker 2 in Chinese.);
receiving a ground-truth label sequence corresponding to the training utterance ([0176] The weight in the combination can be determined through experiments, based on the performance on a held out validation set. where R is a ground truth reference label sequence and Loss.sub.att is the cross-entropy loss function.); and 
training a speech recognition model to minimize word error rate by performing operations comprising: 
processing, using the speech recognition model (Le Roux [0059] The ASR model using a speech recognition network can process the inputted data to determine computer-usable information.), the sequence of feature vectors to obtain an N-best list of speech recognition hypotheses ([0133] Therefore, a beam search technique is usually used to find Ŷ, in which shorter label sequence hypotheses are generated first, and only a limited number of hypotheses, which have a higher score than others, are extended to obtain longer hypotheses. Finally, the best label sequence hypothesis is selected in the complete hypotheses that reached the end of the sequence.) for the training utterance;
Le Roux does not explicitly disclose, but Graves teaches, for each speech recognition hypothesis in the N-best list of speech recognition hypotheses, identifying a respective number of word errors relative to the ground-truth label sequence corresponding to the training utterance (Graves, p7 col 1 para 2: For the 81 hour training set, the oracle error rates for the monogram, bigram and trigram candidates were 8.9%, 2% and 1.4% respectively, while the anti-oracle (rank 300) error rates varied from 45.5% for monograms and 33% for trigrams. Using larger N-best lists (up to N=1000) did not yield significant performance improvements, from which we concluded that the list was large enough to approximate the true decoding performance of the RNN); and
approximating a loss function based on the respective number of word errors identified for each speech recognition hypothesis in the N-best list of speech recognition hypotheses (Graves: p4 col 2 ll 1-11: In speech recognition, for example, the standard measure is the word error rate (WER), defined as the edit distance between the true word sequence and the most probable word sequence emitted by the transcriber. We would therefore prefer transcriptions with high WER to be more probable than those with low WER. In the interest of reducing the gap between the objective function and the test criteria, this section proposes a method that allows an RNN to be trained to optimise the expected value of an arbitrary loss function defined over output transcriptions (such as WER).). 
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify Le Roux in view of Graves so as to identify, for each speech recognition hypothesis in the N-best list of speech recognition hypotheses, a respective number of word errors relative to the ground-truth label sequence corresponding to the training utterance, in order to allow a direct optimisation of word error rate even in the absence of a language model (Abstract, Graves).
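For illustration only, the two limitations mapped to Graves above (counting word errors per hypothesis, then approximating a loss from those counts) can be sketched as follows. This is a minimal sketch under stated assumptions, not the method of the application, of Le Roux, or of Graves: the function names `word_errors` and `nbest_expected_errors` are hypothetical, the hypothesis probabilities are assumed to be renormalized over the N-best list, and Graves itself approximates the expectation by sampling rather than over an N-best list.

```python
import math

def word_errors(hyp, ref):
    """Word-level edit distance (substitutions, insertions, deletions)."""
    h, r = hyp.split(), ref.split()
    prev = list(range(len(r) + 1))  # distance from empty hypothesis
    for i, hw in enumerate(h, start=1):
        curr = [i]
        for j, rw in enumerate(r, start=1):
            cost = 0 if hw == rw else 1
            curr.append(min(prev[j] + 1,          # extra hypothesis word
                            curr[j - 1] + 1,      # missing reference word
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def nbest_expected_errors(nbest, ref):
    """nbest: list of (hypothesis_text, log_prob) pairs (assumed input shape).

    Returns the per-hypothesis word-error counts and the expected number of
    word errors, with probabilities renormalized over the N-best list only.
    """
    errors = [word_errors(h, ref) for h, _ in nbest]
    m = max(lp for _, lp in nbest)               # for numerical stability
    weights = [math.exp(lp - m) for _, lp in nbest]
    z = sum(weights)
    loss = sum(w / z * e for w, e in zip(weights, errors))
    return errors, loss
```

With a three-item list whose renormalized probabilities are 0.5, 0.25 and 0.25 and whose error counts are 0, 1 and 2, the sketch yields an expected error count of 0.75.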
With respect to claims 2 and 12, Le Roux teaches wherein processing the sequence of feature vectors to obtain the N-best list of speech recognition hypotheses comprises:
determining, based on the sequence of feature vectors received as input to the speech recognition model, a plurality of speech recognition hypotheses for the training utterance (Le Roux [0065] The outputs of the CTC module 316 and the attention decoder 318 for the pipeline starting with the speaker-differentiating encoder 1 312 are combined to output a set of hypotheses 1 320 [plurality of hypotheses], and the outputs of the CTC module 316 and the attention decoder 318 for the pipeline starting with the speaker-differentiating encoder 2 322 are combined to output a set of hypotheses 2 330. Text 1 345 is output from the set of hypotheses 1 320. Text 2 347 is output from the set of hypotheses 2 330.);
ranking the plurality of speech recognition hypotheses ([0133] Therefore, a beam search technique is usually used to find Ŷ, in which shorter label sequence hypotheses are generated first, and only a limited number of hypotheses, which have a higher score [ranking] than others, are extended to obtain longer hypotheses); and
identifying the N-best list of speech recognition hypotheses as the N highest-ranking speech recognition hypotheses in the plurality of speech recognition hypotheses ([0133] Therefore, a beam search technique is usually used to find Ŷ, in which shorter label sequence hypotheses are generated first, and only a limited number of hypotheses, which have a higher score than others, are extended to obtain longer hypotheses.).

With respect to claims 3 and 13, Le Roux teaches wherein determining the plurality of speech recognition hypotheses comprises using beam search to determine the plurality of speech recognition hypotheses ([0133] Therefore, a beam search technique is usually used to find Ŷ, in which shorter label sequence hypotheses are generated first, and only a limited number of hypotheses, which have a higher score than others, are extended to obtain longer hypotheses.).
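For illustration only, the beam search behaviour quoted from Le Roux [0133] (shorter hypotheses generated first, only a limited number of higher-scoring hypotheses extended) and the selection of the N highest-ranking hypotheses can be sketched as follows. This is a toy sketch, not the implementation of Le Roux or of the application: the function name `beam_search`, the parameters `beam_size` and `n_best`, and the use of a fixed per-step table of log-probabilities in place of a neural decoder are all assumptions.

```python
def beam_search(step_log_probs, beam_size, n_best):
    """Toy beam search over a fixed table of per-step token log-probabilities.

    step_log_probs: list of dicts mapping token -> log-probability, one dict
    per output step (an assumption; a real decoder would condition each step
    on the partial hypothesis).
    Returns the n_best highest-scoring (token_sequence, score) pairs.
    """
    beams = [((), 0.0)]  # start from the empty hypothesis
    for dist in step_log_probs:
        # Extend every surviving hypothesis by every candidate token.
        candidates = [(seq + (tok,), score + lp)
                      for seq, score in beams
                      for tok, lp in dist.items()]
        # Rank by score and keep only the beam_size best for further extension.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    # The N-best list is the N highest-ranking complete hypotheses.
    return beams[:n_best]
```

In this sketch the ranking step and the pruning step are the same sort-and-truncate operation, so the N-best list can never contain more items than the beam size.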

With respect to claims 6 and 16, Le Roux teaches wherein the speech recognition model comprises a sequence-to-sequence speech recognition model comprising an encoder and a decoder, wherein the encoder and the decoder each comprise one or more recurrent neural network layers ([0008] For example, learned through experimentation are end-to-end automatic speech recognition (ASR) systems used with encoder-decoder recurrent neural networks (RNNs) to directly convert sequences of input speech features to sequences of output labels without any explicit intermediate representation of phonetic/linguistic constructs. Implementing the entire recognition system as a monolithic neural network can remove the dependence on ad-hoc linguistic resources).

With respect to claims 7 and 17, Le Roux teaches wherein the one or more recurrent neural network layers comprise long short-term memory (LSTM) cells ([0076] Each BLSTM layer is composed of a forward long short-term memory (LSTM) layer and a backward LSTM layer, whose outputs are combined and use as input by the next layer.).

Claims 4 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Le Roux, in view of Graves as applied to claims 3 and 13, in further view of Marcu (US 7454326 B2).

With respect to claims 4 and 14, Le Roux does not explicitly disclose, but Marcu teaches, wherein the speech recognition model is trained using a same predetermined beam size and a same predetermined value of N (Marcu, Claim 23: The method of claim 19, further comprising: discarding the updated hypothesis if the updated hypothesis has a higher cost than n-best updated hypotheses in the stack, where n corresponds to a predetermined beam size.).
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify Le Roux and Graves in view of Marcu, such that training uses a same predetermined beam size and a same predetermined value of N, in order to reduce the number of hypotheses stored (Col 10 ll 24-29, Marcu).

Claims 5 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Le Roux, in view of Graves as applied to claims 3 and 13 in further view of Grupen (US 7454326 B2).

With respect to claims 5 and 15, Le Roux does not explicitly disclose, but Grupen teaches, wherein N is an integer of a predetermined value (Grupen [0225] In some examples, the structured query (or queries) for the m-best (e.g., m highest ranked) candidate actionable intents are provided to task flow processing module 736, where m is a predetermined integer greater than zero.).
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify Le Roux and Graves in view of Grupen, such that N is an integer of a predetermined value, in order to rank the candidate actionable intents based on the intent confidence scores ([0225], Grupen).

Claims 9 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Le Roux, in view of Graves as applied to claims 1 and 11, in further view of Senior (US 20170011738 A1).
With respect to claims 9 and 19, Le Roux and Graves do not explicitly disclose, but Senior teaches, wherein after training the speech recognition model, the trained speech recognition model is configured to provide streaming speech recognition results that include substantially real-time transcriptions of a portion of an utterance while a speaker of the utterance continues to speak the utterance (Senior: [0054] For example, the computing system 110 or another server system may receive audio data 184 over a network 183 from a user device 182 of a user 180, then use the trained second neural network 130 along with a language model and other speech recognition techniques to provide a transcription 190 of the user's utterance to the user device 182, where can be displayed, provided to an application or otherwise used. [0057] The model can be used in a streaming or continuous speech recognition system, where the automated speech recognizer provides transcription information as the user continues [real-time] to speak.).
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify Le Roux and Graves in view of Senior, such that, after training the speech recognition model, the trained speech recognition model is configured to provide streaming speech recognition results that include substantially real-time transcriptions of a portion of an utterance while a speaker of the utterance continues to speak the utterance, in order to provide a transcription of the user's utterance to the user device ([0054], Senior).

Claims 10 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Le Roux, in view of Graves as applied to claims 1 and 11, in further view of Hori (US 20170221474 A1).
With respect to claims 10 and 20, Le Roux and Graves do not explicitly disclose, but Hori teaches, wherein performing the operations for training the speech recognition model to minimize word error rate further comprises using the loss function to distribute probability weight over items in the N-best list of speech recognition hypotheses for the training utterance (Hori [0038] We use a set of N-best lists and obtain a loss function…where W.sub.k,n={w.sub.k,n,1, . . . , w.sub.k,n,T.sub.k,n} is a word sequence of an n-th hypothesis in the N-best list for the k-th utterance, and T.sub.k,n denotes a number of words in hypothesis W.sub.k,n, and the posterior probability of W.sub.k,n is determined… [Eq. 10 shows probability distributed and summed over the N-best list in the loss function, which is minimized over a set that contains the N-best list]).
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify Le Roux and Graves in view of Hori, such that performing the operations for training the speech recognition model to minimize word error rate further comprises using the loss function to distribute probability weight over items in the N-best list of speech recognition hypotheses for the training utterance, in order to reduce recognition errors ([0016], Hori).


Allowable Subject Matter
Claims 8 and 18 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all the limitations of the base claim and any intervening claims.
Claims 8 and 18 recite “wherein performing the operations for training the speech recognition model to minimize word error rate further comprises reducing variance by adjusting for an average number of word errors over the N-best list of speech recognition hypotheses for the training utterance”. The closest teaching comes from the cited art Graves, which at p4 col 2, Equation 20, shows a WER loss function L that is approximated by averaging over Monte Carlo samples, and at p5 col 1 ll 8-11 recites “The advantage of reusing the alignment samples (as opposed to picking separate alignments for every k, t) is that the noise due to the loss variance largely cancels out, and only the difference in loss due to altering individual labels is added to the gradient.” However, neither Graves nor any other cited art teaches reducing variance by adjusting for an average number of word errors over the N-best list.
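For illustration only, one possible reading of the quoted limitation of claims 8 and 18 is that the average number of word errors over the N-best list is subtracted from each hypothesis's error count before the counts are weighted by the hypothesis probabilities. The sketch below shows that reading under stated assumptions; it is not asserted to be the applicant's implementation, and the function name `mean_adjusted_loss` and its input shapes are hypothetical.

```python
import math

def mean_adjusted_loss(errors, log_probs):
    """Probability-weighted word errors, adjusted by the N-best average.

    errors: word-error count per hypothesis in the N-best list.
    log_probs: model log-probability per hypothesis (same order).
    Probabilities are renormalized over the N-best list (an assumption).
    """
    m = max(log_probs)                        # for numerical stability
    weights = [math.exp(lp - m) for lp in log_probs]
    z = sum(weights)
    probs = [w / z for w in weights]
    avg = sum(errors) / len(errors)           # average errors over the list
    # Hypotheses with fewer errors than average contribute negatively,
    # those with more errors than average contribute positively.
    return sum(p * (e - avg) for p, e in zip(probs, errors))
```

Because the renormalized probabilities sum to one, subtracting the average shifts the loss value by a constant relative to the unadjusted expected error count while centring the per-hypothesis terms around zero.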

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ATHAR N PASHA whose telephone number is (408)918-7675. The examiner can normally be reached Monday-Thursday and alternate Fridays, 7:30-4:30 PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel Washburn can be reached on (571)272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/ATHAR N PASHA/Examiner, Art Unit 2657   

/DANIEL C WASHBURN/Supervisory Patent Examiner, Art Unit 2657