DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Status of Claims
Claims 1-19 are pending in this application.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.
The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:

1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1, 2, 7, 12, 16 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Harsham et al. (US Pub. 2014/0372120) in view of Chen et al. (US Pub. 2021/0350786, filed on 2020-05-07).
Regarding claim 1, Harsham discloses an end-to-end automatic speech recognition (ASR) system comprising: 
interface configured to acquire acoustic feature sequence including utterances; 
a memory configured to store computer-executable ASR program modules including a context-expanded transformer network including [an encoder network and a decoder network, a beam search decoder and a speech-segment update module]; 
a processor, in connection with the memory, configured to repeatedly decode the utterances by performing steps of: 
 speech and text segments using the acoustic feature sequence and a token sequence provided from the [beam search decoder] ([0035]-[0042] receiving speech signal and determining a set of interpretations/recognition results that resembles the sequence of words represented by the speech; Fig. 2B, examples of the set of interpretations of the speech 240); 
updating the speech segment by appending the acoustic feature sequence to a last of the speech segment and updating the text segment by appending a token sequence of the recognition result for a previous utterance to a last of the text segment ([0035]-[0042] receiving continuous speech signal and determining text segment according to the received speech);
receiving the updated speech segment, the updated text segment and a partial token sequence from the [beam search decoder] ([0035]-[0042] receiving speech segments and determining a set of interpretations/recognition results that resembles the sequence of words represented by the speech);
estimating token probabilities for the [beam search decoder] based on the speech and text segments ([0035]-[0042] and Fig. 2B, estimating probabilities of sequences of acoustic features by using an acoustic model and probability of a sequence of words by using a language model); and 
finding a most probable token sequence as a speech recognition result from the estimated token probabilities using the beam search decoder ([0035]-[0042] Each recognition result/interpretation is “associated with a recognition confidence value, e.g., a score representing correctness of an interpretation in representing the sequence of words. … For each input speech segment, the speech recognition module can determine the recognition result, e.g., a word, with the largest recognition confidence value, yielding a sequence of words that is considered to represent the input speech sequence”).
Harsham does not explicitly teach, however, Chen does explicitly teach including the bracketed limitation:
ASR program modules including a context-expanded transformer network including [an encoder network and a decoder network, a beam search decoder and a speech-segment update module]; and [beam search decoder] (Chen, Figs. 2A and 2B, [0034]-[0045][0082] ASR model 200 may include an end-to-end (E2E) sequence-to-sequence model and includes Recurrent Neural Network which includes an encoder 211, decoder/softmax 240; “the set of output labels can include wordpieces and/or entire words, in addition to or instead of graphemes. The output distribution of the joint network 230 can include a posterior probability value for each of the different output labels”);
receiving the updated speech segment, the updated text segment and a partial token sequence from the [beam search decoder] (Chen, [0041]-[0045] decoder 231 “takes the attention context output by the attender 221, as well as an embedding of the previous prediction”); and
estimating token probabilities for the [beam search decoder] based on the speech and text segments (Chen, [0038] “The probability distribution can then be used to select and assign scores to candidate orthographic elements (e.g., graphemes, wordpieces, and/or words) in a beam search process (e.g., by the Softmax layer 240)”).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the system and method for recognizing speech including a sequence of words, as taught by Harsham, with the end-to-end (E2E) sequence-to-sequence model including a Recurrent Neural Network, as taught by Chen, in order to improve the accuracy of the ASR model and lower the word error rate by training models on larger training datasets (Chen, [0002]).
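For illustration only, the claimed step of finding a most probable token sequence from estimated token probabilities with a beam search decoder can be sketched as follows. This is a minimal sketch and is not taken from the application, Harsham, or Chen; the names `beam_search` and `toy_log_probs`, the token vocabulary, and the probability values are hypothetical.

```python
import math
from typing import Callable, Dict, List, Tuple

# Hypothetical scorer type: maps a partial token sequence to the
# log-probability of each candidate next token.
Scorer = Callable[[Tuple[str, ...]], Dict[str, float]]
Hypothesis = Tuple[float, Tuple[str, ...]]


def beam_search(score_next: Scorer, beam_size: int = 2,
                max_len: int = 10, eos: str = "<eos>") -> Tuple[str, ...]:
    """Return the highest-scoring token sequence found with a fixed-width beam."""
    beams: List[Hypothesis] = [(0.0, ())]
    finished: List[Hypothesis] = []
    for _ in range(max_len):
        # Extend every surviving partial sequence by every candidate token.
        candidates: List[Hypothesis] = []
        for logp, prefix in beams:
            for token, token_logp in score_next(prefix).items():
                candidates.append((logp + token_logp, prefix + (token,)))
        candidates.sort(key=lambda c: c[0], reverse=True)
        # Keep only the best `beam_size` candidates; set aside ended ones.
        beams = []
        for cand in candidates[:beam_size]:
            (finished if cand[1][-1] == eos else beams).append(cand)
        if not beams:
            break
    # Prefer completed hypotheses; fall back to partial ones at max_len.
    pool = finished or beams
    return max(pool, key=lambda c: c[0])[1]


def toy_log_probs(prefix: Tuple[str, ...]) -> Dict[str, float]:
    """Hypothetical stand-in for the network's token probability estimates."""
    table = {
        (): {"hello": 0.6, "yellow": 0.4},
        ("hello",): {"world": 0.7, "<eos>": 0.3},
        ("yellow",): {"world": 0.9, "<eos>": 0.1},
    }
    probs = table.get(prefix, {"<eos>": 1.0})
    return {t: math.log(p) for t, p in probs.items()}


best = beam_search(toy_log_probs, beam_size=2)
```

In this toy setting the scorer plays the role of the network that estimates token probabilities, and the search keeps the two best partial token sequences at each step.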
Regarding claim 2, Harsham in view of Chen discloses the ASR system of claim 1, and Harsham further discloses:
an output interface configured to generate text data according to the most probable token sequence ([0086] recognition result may be output on a display).
Regarding claim 7, Harsham in view of Chen discloses the ASR system of claim 1, and Harsham further discloses:
wherein the updating the speech segment is performed by appending the acoustic feature sequence, wherein the updating text segment(s) is performed by appending a token sequence of the recognition result for a previous utterance to a last of the text segment ([0035]-[0042] receiving continuous speech signal and determining text segment according to the received speech).
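For illustration only, the segment update recited in claim 7 (appending the new acoustic feature sequence to the end of the speech segment, and appending the previous utterance's recognized tokens to the end of the text segment) can be sketched as follows. This is a minimal sketch under stated assumptions: the class `ContextSegments`, its field names, and the feature values are hypothetical and are not taken from the application or the cited references.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ContextSegments:
    """Hypothetical container for the speech and text segments."""
    speech: List[List[float]] = field(default_factory=list)  # feature frames
    text: List[str] = field(default_factory=list)            # prior result tokens

    def update(self, features: List[List[float]],
               previous_tokens: List[str]) -> None:
        # Append the current utterance's frames to the end of the speech segment.
        self.speech.extend(features)
        # Append the previous utterance's recognized tokens to the end of the
        # text segment.
        self.text.extend(previous_tokens)


segments = ContextSegments()
# First utterance: two feature frames, no previous recognition result yet.
segments.update([[0.1, 0.2], [0.3, 0.4]], [])
# Second utterance: one frame, plus the tokens recognized for the first one.
segments.update([[0.5, 0.6]], ["hello", "world"])
```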
Regarding claim 12, Harsham in view of Chen discloses the ASR system of claim 1, and Chen further discloses:
wherein the transformer is configured to accept multiple utterances at once and predict output tokens for a last utterance using previous utterances (Chen, [0042], [0043] predicting the next output).
Regarding claims 16 and 19, claims 16 and 19 are the method claim and the medium claim, respectively, corresponding to system claim 1. Claims 16 and 19 are rejected under the same rationale as applied to claim 1 above.
Claims 4-6, 9, 10, 13-15, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Harsham et al. (US Pub. 2014/0372120) in view of Chen et al. (US Pub. 2021/0350786, filed on 2020-05-07) and further in view of Thomson et al. (US Pat. 10,573,312).
Regarding claim 4, Harsham in view of Chen discloses the ASR system of claim 1. Harsham in view of Chen does not explicitly teach, however, Thomson does explicitly teach:
wherein the processor is configured to detect when each of the utterances includes a speaker-dependent (SD) context (Thomson, Fig. 43, step 4306 and Col. 8, lines 56-67, speaker-dependent ASR systems).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the system and method for recognizing speech including a sequence of words, as taught by Harsham in view of Chen, with the method of adapting a speaker-dependent ASR system, as taught by Thomson, to provide transcription accuracy that may be improved as compared to the accuracy of transcription of audio from other people (Thomson, Col. 11, lines 49-51).
Regarding claim 5, Harsham in view of Chen discloses the ASR system of claim 1. Harsham in view of Chen does not explicitly teach, however, Thomson does explicitly teach:
wherein the SD context is determined by a speaker identify data (ID) (Thomson, Fig. 43, step 4306 and Col. 8, lines 56-67, Col. 11, lines 21-33, Col. 12, lines 1-13, “a speaker-dependent speech model may be specific to a particular person” and based on client profile).
Regarding claim 6, Harsham in view of Chen and further in view of Thomson discloses the ASR system of claim 5, and Thomson further discloses:
wherein the speaker ID determined by a channel associated with a recording device storing the utterances of the speaker or a microphone arranged to the speaker (Thomson, Col. 11, lines 33-51, the speaker-dependent model is trained using speech pattern of the client).
Regarding claim 9, Harsham in view of Chen discloses the ASR system of claim 1, and Thomson further discloses:
wherein each of the encoder and decoder networks includes a deep feed-forward architecture having repeated blocks of self-attention and feed- forward layers (Thomson, Col. 111, lines 33-49, deep neural networks may include feed-forward and attention network).
Regarding claim 10, Harsham in view of Chen and further in view of Thomson discloses the ASR system of claim 9, and Chen further discloses:
wherein the decoder network features a source attention layer in each of the repeated blocks to read the output from the encoder (Chen, Fig. 2B, [0041]-[0043] attention layer 221 to read the output from the encoder layer 211).
Regarding claim 13, Harsham in view of Chen discloses the ASR system of claim 1, and Thomson further discloses:
wherein the segment size is determined by truncating oldest utterances in the segment if a segment duration exceeds a pre-defined constant length (Thomson, Col. 214, Lines 16-54, “In response to the n-gram count not being greater than q, the n-gram may be maintained in the n-gram table but may not be used to train the language model. In some embodiments, the variable q may be a minimum occurrence threshold and may depend on n (the length of the n-gram) and other factors”).
Regarding claim 14, Harsham in view of Chen and further in view of Thomson discloses the ASR system of claim 13, and Harsham further discloses:
wherein when an input segment is truncated, an output context corresponding to the input segment is truncated ([0035]-[0042] receiving continuous speech signal and determining text segment according to the received speech).
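For illustration only, the truncation recited in claims 13 and 14 (dropping the oldest utterances when the segment duration exceeds a pre-defined constant length, and dropping the corresponding output context with them) can be sketched as follows. This is a minimal sketch under stated assumptions: the function `truncate_oldest`, the pairing of each utterance's duration with its tokens, and the 10-second limit are hypothetical and are not taken from the application or the cited references.

```python
from collections import deque
from typing import Deque, List, Tuple

# Hypothetical entry: (duration in seconds, tokens recognized for the utterance).
Utterance = Tuple[float, List[str]]


def truncate_oldest(segment: Deque[Utterance], max_duration: float) -> None:
    """Drop the oldest utterances in place until the total duration fits.

    The newest utterance is always kept, even if it alone exceeds the limit.
    """
    total = sum(duration for duration, _ in segment)
    while len(segment) > 1 and total > max_duration:
        # Removing an entry removes both its speech duration and its tokens,
        # so the output context is truncated together with the input segment.
        duration, _tokens = segment.popleft()
        total -= duration


segment: Deque[Utterance] = deque([
    (4.0, ["good", "morning"]),
    (5.0, ["how", "are", "you"]),
    (3.0, ["fine"]),
])
truncate_oldest(segment, max_duration=10.0)
remaining_tokens = [tok for _, toks in segment for tok in toks]
```

Storing each utterance's tokens alongside its duration is one simple way to keep the input and output contexts aligned; other bookkeeping schemes would serve equally well.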
Regarding claim 15, Harsham in view of Chen discloses the ASR system of claim 1, and Thomson further discloses:
wherein the transformer includes self-attention, source attention and feed forward layers in the decoder (Thomson, Col. 111, lines 33-49, attention network and Feed-forward network).
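For illustration only, the ordering of self-attention, source attention, and feed-forward layers within a decoder block, as recited in claim 15, can be sketched as follows. This is a deliberately simplified sketch and not the architecture of the application, Chen, or Thomson: learned projection weights, multiple heads, and layer normalization are omitted, the feed-forward layer is replaced by a weight-free ReLU stand-in, and all function names and input values are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def softmax(xs: Vector) -> Vector:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries: Matrix, keys: Matrix, values: Matrix,
              causal: bool = False) -> Matrix:
    """Scaled dot-product attention; `causal` hides positions after the query."""
    scale = math.sqrt(len(keys[0]))
    out: Matrix = []
    for i, q in enumerate(queries):
        visible = keys[: i + 1] if causal else keys
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale
                           for k in visible])
        # zip stops at the shorter list, so hidden values are never used.
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out


def feed_forward(x: Matrix) -> Matrix:
    """Weight-free ReLU stand-in for a position-wise feed-forward layer."""
    return [[max(0.0, v) for v in row] for row in x]


def add(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum used for the residual connections."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def decoder_block(tokens: Matrix, encoder_out: Matrix) -> Matrix:
    # 1. Self-attention over the output tokens produced so far.
    h = add(tokens, attention(tokens, tokens, tokens, causal=True))
    # 2. Source attention: token states query the encoder output frames.
    h = add(h, attention(h, encoder_out, encoder_out))
    # 3. Position-wise feed-forward layer.
    return add(h, feed_forward(h))


token_states = [[1.0, 0.0], [0.0, 1.0]]                  # two output tokens
encoder_frames = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]   # three input frames
block_out = decoder_block(token_states, encoder_frames)
```

The source attention step is where each output token position reads from every encoder frame, which is the dependence between input frames and output tokens that the attention layers provide.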
Regarding claim 18, claim 18 is the method claim corresponding to system claim 4. Claim 18 is rejected under the same rationale as applied to claim 4 above.
Claims 3, 8 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Harsham et al. (US Pub. 2014/0372120) in view of Chen et al. (US Pub. 2021/0350786, filed on 2020-05-07) and further in view of Norouzi et al. (US Pub. 2019/0362229).
Regarding claim 3, Harsham in view of Chen discloses the ASR system of claim 1. Harsham in view of Chen does not explicitly teach, however, Norouzi does explicitly teach:
wherein the processor is configured to stop decoding when no more acoustic feature sequence is provided (Norouzi, [0042] “To generate the new output sequence, the system samples from the likelihood distributions generated by the sequence generation neural network, e.g., until a pre-determined end-of-sequence output token is sampled or until the sequence reaches a pre-determined maximum length”).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the system and method for recognizing speech including a sequence of words, as taught by Harsham in view of Chen, with the method of decoding until the sequence reaches a pre-determined maximum length, as taught by Norouzi, to allow a neural network to be trained to have state-of-the-art performance without excessive consumption of computational resources (Norouzi, [0008]).
Regarding claim 8, Harsham in view of Chen discloses the ASR system of claim 1. Harsham in view of Chen does not explicitly teach, however, Norouzi does explicitly teach:
wherein the text segment is an empty sequence at an initial stage (Norouzi, [0055] “For the first position in the output sequence, the prefix is the empty set”).
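For illustration only, the behavior recited in claims 3 and 8 (a text segment that is an empty sequence at the initial stage, and decoding that stops when no more acoustic feature sequence is provided) can be sketched as follows. This is a minimal sketch under stated assumptions: the function `decode_stream`, the `Recognizer` callable, and the toy recognizer are hypothetical and are not taken from the application, Harsham, Chen, or Norouzi.

```python
from typing import Callable, Iterable, List

Features = List[List[float]]
# Hypothetical recognizer: (utterance features, text context) -> tokens.
Recognizer = Callable[[Features, List[str]], List[str]]


def decode_stream(utterances: Iterable[Features],
                  recognize: Recognizer) -> List[List[str]]:
    """Decode utterances one by one, carrying earlier results as text context."""
    text_segment: List[str] = []          # empty sequence at the initial stage
    results: List[List[str]] = []
    # The loop ends on its own once the source yields no more feature sequences.
    for features in utterances:
        tokens = recognize(features, text_segment)
        results.append(tokens)
        # Build a new list so earlier context snapshots are never mutated.
        text_segment = text_segment + tokens
    return results


def toy_recognizer(features: Features, context: List[str]) -> List[str]:
    """Stand-in that reports how many context tokens it was given."""
    return [f"tok{len(context)}"]


outputs = decode_stream([[[0.1]], [[0.2]], [[0.3]]], toy_recognizer)
```

With three input utterances, the toy recognizer sees zero, one, and two context tokens in turn, and the loop terminates after the third because the input is exhausted.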
Regarding claim 17, claim 17 is the method claim corresponding to system claim 3. Claim 17 is rejected under the same rationale as applied to claim 3 above.
Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Harsham et al. (US Pub. 2014/0372120) in view of Chen et al. (US Pub. 2021/0350786, filed on 2020-05-07) and further in view of Chang et al. (WO 2022/030805, priority date: 2020-08-03).
Regarding claim 11, Harsham in view of Chen discloses the ASR system of claim 1. Harsham in view of Chen does not explicitly teach, however, Chang does explicitly teach:
wherein the self-attention and source attention mechanisms utilize interdependence between the input frames and the output tokens (Chang, page 5/11, 6th paragraph, Fig. 4, “the step of performing confidence-based filtering (S110) to find the occurrence position of an incorrect label in the time series speech data is a label transitioned between decoder time steps. Calculating the reliability using the transition probability between the labels (S111), calculating the reliability using the self-attention probability expressing the correlation between the labels (S112), and the method may include calculating reliability using a source-attention probability in consideration of the degree of correlation between labels (S113)”).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the system and method for recognizing speech including a sequence of words, as taught by Harsham in view of Chen, with the method of adapting the self-attention and source attention mechanisms, as taught by Chang, to improve the performance of the transformer-based speech recognition model (Chang, page 5/11, 4th paragraph).
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Please see attached form PTO-892.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SEONG-AH A. SHIN whose telephone number is (571) 272-5933. The examiner can normally be reached 9 AM to 3 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir can be reached on 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

Seong-ah A. Shin
Primary Examiner
Art Unit 2659



/SEONG-AH A SHIN/           Primary Examiner, Art Unit 2659