DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This Office Action is in response to the submission filed January 23, 2020.  Claims 1-20 are pending.

Information Disclosure Statement
The information disclosure statements (IDS) submitted on January 23, 2020; May 27, 2020; and November 4, 2020, are being considered by the examiner.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-7, 9, 12-17, 19, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Chang et al. (US Patent Application Publication No. 2020/0335091).
Chang discloses joint endpointing and automatic speech recognition.  Regarding claims 1, 12, and 20, Chang teaches a method [Fig 1/Fig 2; para 0037], an electronic device comprising a memory storing a program comprising one or more instructions, and a processor configured to execute the one or more instructions to control the electronic device [Fig 1/Fig 2; para 0037], and a non-transitory computer-readable recording medium having recorded thereon a computer program which, when executed by a computer [Fig 1/Fig 2; para 0037], performs a method comprising: obtaining an audio signal based on a speech input [para 0038-0039]; obtaining an output value of a first speech recognition model that outputs a character string at a first level based on the audio signal being input [(260); para 0038-0041 -- Output of the joint model 140 is evaluated using a beam search process 145 or another process… model can output scores over a distribution of output labels that includes orthographic elements (e.g., graphemes, wordpieces, or words); 0042-0048].  Chang fails to specifically teach a first and a second recognition model in a single embodiment so as to provide for obtaining an output value of a second speech recognition model that outputs a character string at a second level corresponding to the audio signal based on the output value of the first speech recognition model based on the audio signal being input.  However, Chang teaches that the output of the first model is evaluated using “beam search or another process” [para 0038-0041] and additionally teaches that transcriptions produced by the recognizer can be re-scored with an additional speech recognition model [para 0048].  One having ordinary skill in the art at the time of the invention would have recognized 
Regarding claims 2 and 13, Chang teaches that the character string at the second level comprises sub-sets of a set including at least one character within the character string at the first level [para 0041-0042 -- As the joint model determines a set of outputs (e.g., output labels) for each of a plurality of output steps, the beam search process 145 can prune away unlikely search paths and maintain only the most probable paths. Often, this can include maintaining only a limited number of search beams].
Regarding claim 3, Chang teaches that the character string at the second level comprises sub-strings that are more similar to a semantically-completed word than sub-strings within the character string at the first level [para 0041-0042 -- As the joint model determines a set of outputs (e.g., output labels) for each of a plurality of output steps, the beam search process 145 can prune away unlikely search paths and maintain only the most probable paths. Often, this can include maintaining only a limited number of search beams].
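For illustration only, the pruning behavior quoted from para 0041-0042 of Chang can be sketched as follows. This is a minimal sketch and not code from Chang or from the application: the function name beam_search, the beam_width parameter, and the toy per-step label distributions are all hypothetical, and the sketch assumes for simplicity that each step's label scores do not depend on the labels already chosen.

```python
import math


def beam_search(step_log_probs, beam_width=2):
    """Keep only the `beam_width` most probable label sequences at each step.

    step_log_probs: list of dicts, one per output step, mapping an output
    label to its log-probability (hypothetical toy input; assumed to be
    independent of the labels chosen at earlier steps).
    """
    beams = [((), 0.0)]  # (label sequence so far, accumulated log-probability)
    for dist in step_log_probs:
        # Extend every surviving path with every label for this step.
        candidates = [
            (labels + (label,), score + log_prob)
            for labels, score in beams
            for label, log_prob in dist.items()
        ]
        # Prune away unlikely paths; maintain only the most probable ones.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


steps = [
    {"a": math.log(0.6), "b": math.log(0.3), "c": math.log(0.1)},
    {"a": math.log(0.7), "b": math.log(0.3)},
]
best_paths = beam_search(steps, beam_width=2)
```

With a beam width of 2, the path beginning with label "c" is discarded after the first step and is never extended, which corresponds to maintaining only a limited number of search beams.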
Regarding claims 4 and 14, Chang teaches splitting the audio signal into frames [para 0040 -- the feature extraction module 130 produces audio feature vectors for different time windows of audio, often referred to as frames. The series of feature vectors can then serve as input to various models. The audio feature vectors contain information on the characteristics of the audio data 125, such as mel-frequency cepstral coefficients (MFCCs). The audio features may indicate any of various factors, such as the pitch, loudness, frequency, and energy of audio].
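For illustration only, splitting an audio signal into frames and deriving a per-frame feature can be sketched as follows. This is a minimal sketch under stated assumptions, not the feature extraction module 130 of Chang: the names split_into_frames and frame_energy, the frame_length and hop_length parameters, and the toy sample values are hypothetical, and mean energy is used as a stand-in for richer features such as MFCCs.

```python
def split_into_frames(samples, frame_length, hop_length):
    """Split a sequence of audio samples into overlapping fixed-size frames.

    Trailing samples that do not fill a whole frame are dropped.
    """
    frames = []
    for start in range(0, len(samples) - frame_length + 1, hop_length):
        frames.append(samples[start:start + frame_length])
    return frames


def frame_energy(frame):
    """Mean squared amplitude of one frame (a simple one-value feature)."""
    return sum(x * x for x in frame) / len(frame)


samples = list(range(10))  # hypothetical toy signal
frames = split_into_frames(samples, frame_length=4, hop_length=2)
features = [frame_energy(f) for f in frames]  # one feature value per frame
```

Each frame covers a different time window of the signal, and the resulting series of per-frame features is what a downstream model would take as input.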
Regarding claims 5 and 15, Chang teaches obtaining an output value of a first encoder included in the first speech recognition model, wherein the first encoder is configured to encode the audio signal input to the electronic device such that the character string at the first level is output [Fig. 4; para 0058-0059; 0068-0070]; and obtaining the obtained output value of the first encoder as the output value of the first speech recognition model [Fig. 4; para 0058-0059; 0068-0070].
Regarding claims 6 and 16, Chang teaches obtaining, from the output value of the first encoder, an output value of a first decoder that is included in the first speech recognition model and is used to determine the character string at the first level corresponding to the audio signal [Fig. 4; para 0058-0059; 0068-0070]; and obtaining the output value of the first encoder and the output value of the first decoder as the output value of the first speech recognition model [Fig. 4; para 0058-0059; 0068-0070].
Regarding claims 7 and 17, Chang teaches obtaining an output value of a second encoder that is included in the second speech recognition model and is used to encode the audio signal, based on the output value of the first speech recognition model, such that the character string at the second level is output [where 
Regarding claims 9 and 19, Chang teaches a plurality of stacked long short-term memory (LSTM) layers, wherein the output value of the first encoder comprises a sequence of hidden layer vectors respectively output by LSTM layers selected from the plurality of stacked LSTM layers included in the first encoder [para 0058-0059; 0068-0070], and the output value of the second encoder comprises a sequence of hidden layer vectors output by LSTM layers selected from the plurality of stacked LSTM layers included in the second encoder [where para 0048 provides for a second recognition model, and Fig. 4; para 0058-0059; 0068-0070 provide for the encoder and the LSTM layers; implementing the encoder processing within the second recognition model requires only routine skill in the art and would have been obvious so as to improve the recognition accuracy of the system].



Claims 8, 10, 11, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Chang in view of Zeyer et al. (“Improved training of end-to-end attention models for speech recognition,” Proc. Interspeech 2018, pages 7-11).
Regarding claims 8 and 18, Chang fails to teach applying an attention in the recognition process.  In a similar field of endeavor, Zeyer teaches attention models for speech recognition, implementing an LSTM encoder network and an LSTM decoder network utilizing attention weights, attention context vectors, and beam search decoding [section 3, Model; section 4, Sub-word Units; section 5, Language Model combination] for improving recognition accuracy.  One having ordinary skill in the art at the time of the invention would have recognized the advantages of implementing the attention weight processing techniques, suggested by Zeyer, in the system of Chang for the purpose of improving recognition accuracy, as taught by Zeyer.
Regarding claim 10, Chang fails to teach, but Zeyer teaches, a plurality of stacked LSTM layers and an attention layer, wherein the attention layer is configured to apply an attention to the output value of the first encoder based on an output value of the first decoder at a previous time, and the output value of the first decoder comprises a sequence of context vectors generated by weighted summing the output value of the first encoder based on the attention [section 3, Model; section 4, Sub-word Units; section 5, Language Model combination] for improving recognition accuracy.  One having ordinary skill in the art at the time of the invention would have recognized the advantages of implementing the attention weight processing techniques, suggested by Zeyer, in the system of Chang for the purpose of improving recognition accuracy, as taught by Zeyer.
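For illustration only, generating a context vector by weighted summing encoder outputs based on an attention can be sketched as follows. This is a minimal sketch under stated assumptions and is not the model of Zeyer or of the application: the function name attention_context and the toy vectors are hypothetical, and dot-product scoring followed by a softmax is assumed here purely for brevity in place of whatever scoring function a given attention layer actually uses.

```python
import math


def attention_context(encoder_outputs, decoder_state):
    """Return (attention weights, context vector).

    encoder_outputs: list of encoder hidden vectors, one per input frame.
    decoder_state: decoder vector from the previous output step.
    Scoring is a plain dot product (an assumption made for this sketch).
    """
    scores = [
        sum(h_d * s_d for h_d, s_d in zip(h, decoder_state))
        for h in encoder_outputs
    ]
    # Softmax over the scores, shifted by the maximum for numerical stability.
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: weighted sum of the encoder outputs.
    dim = len(encoder_outputs[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, encoder_outputs))
        for d in range(dim)
    ]
    return weights, context


encoder_outputs = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical toy values
uniform_weights, uniform_context = attention_context(encoder_outputs, [0.0, 0.0])
peaked_weights, peaked_context = attention_context(encoder_outputs, [10.0, 0.0])
```

A decoder state that scores every encoder output equally yields uniform weights, while a state aligned with the first encoder output concentrates nearly all of the weight on that output; repeating the computation at each decoder step yields a sequence of context vectors.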
Regarding claim 11, the combination of Chang and Zeyer teaches that, based on training of the first speech recognition model for outputting the character string at the first level being completed, the second speech recognition model is trained to output the character string at the second level, based on the output value of the first speech recognition model [Fig. 4; para 0050-0051 -- training process; 0058-0059; 0068-0070].


Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to ANGELA A ARMSTRONG whose telephone number is (571)272-7598.  The examiner can normally be reached on M,T,TH,F 11:30-8:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Desir, can be reached on 571-272-7799.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  


ANGELA A. ARMSTRONG
Primary Examiner
Art Unit 2659



/ANGELA A ARMSTRONG/Primary Examiner, Art Unit 2659