DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

This Office Action is in response to correspondence filed 03 September 2020 in reference to application 17/011,809.  Claims 1-13 are pending and have been examined.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claim 13 is rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter.  The claim does not fall within at least one of the four categories of patent eligible subject matter because it is directed to a computer readable storage medium that has not been limited to non-transitory embodiments by the claim or the specification.  Thus the claim scope includes transitory media such as carrier waves, which have been held to be non-statutory subject matter.  Therefore claim 13 is rejected as being non-statutory.

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1, 3-5, 7, 9-11, and 13 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Bui et al. (US PAP 2020/0327884).

Consider claim 1, Bui teaches a method for generating a speech recognition model (abstract), wherein the speech recognition model comprises an encoder and a decoder (0041, encoder-decoder), and the method comprises: 
obtaining training samples, wherein each of the training samples comprises a speech frame sequence and a corresponding labeled text sequence (0048, speech based training data, 0061, ground truth is corresponding known labels); 
training the encoder by using the speech frame sequence as an input feature of the encoder and using speech encoded frames of the speech frame sequence as an output feature of the encoder (0057, 0061-62, training the network, including the encoder, encoder encodes speech frames); 
training the decoder by using the speech encoded frames as a first input feature of the decoder and using the labeled text sequence as a first output feature of the decoder, and obtaining a current prediction text sequence (0060-62, decoder receives encoded speech frames and outputs predicted text labels, training); and 
training the decoder again by using the speech encoded frames as a second input feature of the decoder and using a sequence as a second output feature of the decoder, wherein the sequence is obtained by sampling the labeled text sequence and the current prediction text sequence based on a preset probability (0074-75, 0079-80, 0069, encoder features may be frozen, while decoder is retrained with training data, using cross-entropy loss feature which compares predicted text to ground truth from training data).
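For illustration only, the sampling limitation recited in the last step of claim 1 can be sketched as follows. The sketch assumes a per-position choice between the labeled token and the currently predicted token; neither the claim nor the cited paragraphs of Bui specify that granularity, and the names sample_mixed_sequence and p_predicted are hypothetical.

```python
import random


def sample_mixed_sequence(labeled, predicted, p_predicted, rng=None):
    """Build a target sequence from a labeled and a predicted token sequence.

    Assumption (not taken from the claim or from Bui): the choice is made
    independently at each position. With probability p_predicted the
    predicted token is used; otherwise the labeled token is kept.
    """
    if not 0.0 <= p_predicted <= 1.0:
        raise ValueError("p_predicted must lie in [0, 1]")
    rng = rng or random.Random()
    mixed = []
    for i, label_tok in enumerate(labeled):
        # Fall back to the label when the prediction is shorter than the label.
        use_pred = i < len(predicted) and rng.random() < p_predicted
        mixed.append(predicted[i] if use_pred else label_tok)
    return mixed
```

With p_predicted set to 0 the function returns the labeled sequence unchanged, and with p_predicted set to 1 it returns the predicted tokens wherever a prediction exists.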

Consider claim 3, Bui teaches the method of claim 1, wherein the preset probability is determined based on an accuracy of the current prediction text sequence output by the decoder (0079-80, 0069, encoder features may be frozen, while decoder is retrained with training data, using cross-entropy loss feature which compares predicted text to ground truth from training data, i.e. accuracy measure).

Consider claim 4, Bui teaches the method of claim 3, wherein the preset probability is determined by: 
determining the preset probability of sampling the current prediction text sequence in a direct proportion to the accuracy of the current prediction text sequence (0069, cross entropy rises as prediction gets closer to ground truth); 
determining the preset probability of sampling the labeled text sequence in an inverse proportion to the accuracy of the current prediction text sequence (0069, cross entropy falls as prediction wavers from ground truth).
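For illustration only, the relationship recited in claim 4 between the two sampling probabilities and the accuracy of the current prediction can be sketched as follows. The sketch assumes that accuracy is expressed as a fraction in [0, 1] and that the two probabilities are complementary; the claim does not state either, and the function name and the scale parameter are hypothetical.

```python
def sampling_probabilities(accuracy, scale=1.0):
    """Return (p_predicted, p_labeled) for a given prediction accuracy.

    Assumptions (not taken from the claim or from Bui): accuracy is a
    fraction in [0, 1], p_predicted grows linearly with accuracy, and
    p_labeled is its complement, so it shrinks as accuracy grows.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    # Clip so that a scale above 1 cannot push the probability past 1.
    p_predicted = min(1.0, max(0.0, scale * accuracy))
    p_labeled = 1.0 - p_predicted
    return p_predicted, p_labeled
```

Under these assumptions an accuracy of 0.25 yields a probability of 0.25 for the predicted sequence and 0.75 for the labeled sequence.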

Consider claim 5, Bui teaches the method of claim 1, further comprising: terminating training the speech recognition model in response to that a proximity between the current prediction text sequence and the labeled text sequence satisfies a preset value and that a character error rate in the current prediction text sequence satisfies a preset value, wherein the labeled text sequence corresponds to the current prediction text sequence (0069-71, 0079-80, retraining until cross entropy loss function indicates accuracy of predictions compared to ground truth).
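For illustration only, the two-part termination condition recited in claim 5 can be sketched as follows. The sketch assumes that the proximity is measured by a loss value and that the character error rate is the edit distance divided by the length of the labeled sequence; both are assumptions, and the function names and thresholds are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ref, hyp):
    """Edit distance normalized by the length of the reference (labeled) text."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def should_stop(loss, cer, max_loss, max_cer):
    """Stop only when BOTH conditions hold, mirroring the 'and' in the claim.

    Assumption: 'proximity' is represented by a loss value, lower is closer.
    """
    return loss <= max_loss and cer <= max_cer
```

Under these assumptions, training continues if either the loss or the character error rate remains above its threshold.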

Consider claim 7, Bui teaches a device for generating a speech recognition model (abstract), wherein the speech recognition model comprises an encoder and a decoder (0041, encoder-decoder), and the device comprises: 
a processor (0156, processors); and
a memory configured to store instructions executable by the processor (0156, memory); 
wherein the processor is configured to execute the instructions to: 
obtain training samples, wherein each of the training samples comprises a speech frame sequence and a corresponding labeled text sequence (0048, speech based training data, 0061, ground truth is corresponding known labels); 
train the encoder by using the speech frame sequence as an input feature of the encoder and using speech encoded frames of the speech frame sequence as an output feature of the encoder (0057, 0061-62, training the network, including the encoder, encoder encodes speech frames); 
train the decoder by using the speech encoded frames as a first input feature of the decoder and using the labeled text sequence as a first output feature of the decoder, and obtaining a current prediction text sequence (0060-62, decoder receives encoded speech frames and outputs predicted text labels, training); and 
train the decoder again by using the speech encoded frames as a second input feature of the decoder and using a sequence as a second output feature of the decoder, wherein the sequence is obtained by sampling the labeled text sequence and the current prediction text sequence based on a preset probability (0074-75, 0079-80, 0069, encoder features may be frozen, while decoder is retrained with training data, using cross-entropy loss feature which compares predicted text to ground truth from training data).

Claim 9 contains similar limitations as claim 3 and therefore is rejected for the same reasons.

Claim 10 contains similar limitations as claim 4 and therefore is rejected for the same reasons.

Claim 11 contains similar limitations as claim 5 and therefore is rejected for the same reasons.

Consider claim 13, Bui teaches a computer readable storage medium storing computer programs that, when executed by a processor (0157, computer readable media), cause the processor to perform the operation of: 
obtaining training samples, wherein each of the training samples comprises a speech frame sequence and a corresponding labeled text sequence (0048, speech based training data, 0061, ground truth is corresponding known labels); 
training the encoder by using the speech frame sequence as an input feature of the encoder and using speech encoded frames of the speech frame sequence as an output feature of the encoder (0057, 0061-62, training the network, including the encoder, encoder encodes speech frames); 
training the decoder by using the speech encoded frames as a first input feature of the decoder and using the labeled text sequence as a first output feature of the decoder, and obtaining a current prediction text sequence (0060-62, decoder receives encoded speech frames and outputs predicted text labels, training); and 
training the decoder again by using the speech encoded frames as a second input feature of the decoder and using a sequence as a second output feature of the decoder, wherein the sequence is obtained by sampling the labeled text sequence and the current prediction text sequence based on a preset probability (0074-75, 0079-80, 0069, encoder features may be frozen, while decoder is retrained with training data, using cross-entropy loss feature which compares predicted text to ground truth from training data).

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 2 and 8 are rejected under 35 U.S.C. 103 as being unpatentable over Bui in view of Prabhavalkar et al. (US PAP 2020/0027444).

Consider claim 2, Bui teaches the method of claim 1, wherein said obtaining training samples comprises: 
obtaining a speech signal (0048, speech training data); 
obtaining an initial speech frame sequence by extracting speech features from the speech signal (0056, speech feature vectors from training data).
Bui does not specifically teach
obtaining spliced speech frames by splicing speech frames in the initial speech frame sequence; and 
obtaining the speech frame sequence by down-sampling the spliced speech frames.
In the same field of speech recognition using encoder-decoder networks, Prabhavalkar teaches 
obtaining spliced speech frames by splicing speech frames in the initial speech frame sequence (0103, stacking frames); and 
obtaining the speech frame sequence by down-sampling the spliced speech frames (0103, down sampling frames).
Therefore it would have been obvious to one of ordinary skill in the art at the time of effective filing to stack and down sample input features as taught by Prabhavalkar in the system of Bui in order to reduce the size of the input vectors and thereby reduce complexity (Prabhavalkar 0103).
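For illustration only, the splicing and down-sampling limitations of claim 2 can be sketched as follows. The sketch assumes that splicing means concatenating each frame with the frames that immediately follow it and that down-sampling means keeping every stride-th spliced frame; the window size, the stride, and the function name are hypothetical and are not taken from the claim or from Prabhavalkar.

```python
def splice_and_downsample(frames, stack=3, stride=3):
    """Splice neighboring feature frames, then keep every stride-th result.

    frames: list of per-frame feature vectors (lists of numbers).
    Assumption: each spliced frame concatenates `stack` consecutive frames.
    """
    if stack < 1 or stride < 1:
        raise ValueError("stack and stride must be positive")
    # Step 1: splice. One spliced frame per window of `stack` frames.
    spliced = []
    for start in range(len(frames) - stack + 1):
        window = []
        for frame in frames[start:start + stack]:
            window.extend(frame)
        spliced.append(window)
    # Step 2: down-sample. Keep every `stride`-th spliced frame.
    return spliced[::stride]
```

Under these assumptions, six one-dimensional frames with a window of three and a stride of three reduce to two three-dimensional frames, which shortens the sequence the encoder must process.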

Claim 8 contains similar limitations as claim 2 and therefore is rejected for the same reasons.

Claims 6 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Bui in view of Hori et al. (US PAP 2019/0189115).

Consider claim 6, Bui teaches the method of claim 1, but does not specifically teach wherein the labeled text sequence is a labeled syllable sequence, and the prediction text sequence is a predicted syllable sequence.
In the same field of encoder-decoder speech recognition, Hori teaches wherein the labeled text sequence is a labeled syllable sequence, and the prediction text sequence is a predicted syllable sequence (0026, output label sequences may be syllables).
Therefore it would have been obvious to one of ordinary skill in the art at the time of effective filing to use syllables as labels as taught by Hori in the system of Bui in order to use a well-known linguistic unit that can be easily further processed.

Claim 12 contains similar limitations as claim 6 and therefore is rejected for the same reasons.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Karita et al. (US PAP 2021/0056954) also teaches training decoder features based on loss functions.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DOUGLAS C GODBOLD whose telephone number is (571)270-1451. The examiner can normally be reached 6:30am-5pm Monday-Thursday.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders, can be reached at (571)272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

DOUGLAS GODBOLD
Examiner
Art Unit 2655



/DOUGLAS GODBOLD/           Primary Examiner, Art Unit 2655