DETAILED ACTION
This action is in response to the initial filing of Application No. 17/316,856 on 05/11/2021.
Claims 1-20 are pending in this application, with claims 1, 8 and 15 being independent.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Allowable Subject Matter
Aside from the non-prior art rejections, it has been determined that the prior art fails to teach or suggest, in reasonable combination, the limitations recited in claims 3 (with dependent claims 4 and 5), 6 (with dependent claim 7), 10 (with dependent claims 11 and 12), 13 (with dependent claim 14), 17 (with dependent claims 18 and 19) and 20.

Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/process/file/efs/guidance/eTD-info-I.jsp.

Claims 1-20 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1-17 of U.S. Patent No. 11,037,547. Although the claims at issue are not identical, they are not patentably distinct from each other.

The claim mapping is as follows. 

Current Application
1. A method of attention-based end-to-end (A-E2E) automatic speech recognition (ASR) training, the method comprising: performing cross-entropy training of a model, based on one or more input features of a speech signal; determining a posterior probability vector at a time of a first wrong token among one or more output tokens of the model of which the cross-entropy training is performed; determining a loss of the first wrong token at the time, based on the determined posterior probability vector; determining a total loss of a training set of the model of which the cross-entropy training is performed, based on the determined loss of the first wrong token of each of hypothesis-reference pairs in the training set with respect to a reference token at the time; and updating the model of which the cross-entropy training is performed, based on the determined total loss of the training set.

2. The method of claim 1, wherein the posterior probability vector at the time is determined as follows: 

$$p_t = \mathrm{Decoder}(s_{t-1} \in \{r_{t-1}, y_{t-1}\}, H^{enc})$$

where $t$ denotes the time, $p_t$ denotes the posterior probability vector at the time $t$, $H^{enc}$ denotes the one or more features that are encoded, $y_{t-1}$ denotes an output token at a previous time $t-1$, $r_{t-1}$ denotes a reference token at the previous time $t-1$, and $s_{t-1}$ denotes a token randomly selected from $\{r_{t-1}, y_{t-1}\}$.

3. The method of claim 1, wherein the total loss of the training set is determined as follows:

[Equation image: media_image2.png]

where $L(\theta)$ denotes the total loss of the training set, $(Y,R)$ denotes the hypothesis-reference pairs in the training set, $t^{\omega}$ denotes the time, $y_{t_\omega}$ denotes the first wrong token at the time, $r_{t_\omega}$ denotes the reference token at the time, and $l_\theta(y_{t_\omega}, r_{t_\omega})$ denotes the loss of the first wrong token.

4. The method of claim 3, wherein the loss of the first wrong token is determined as follows: 

[Equation image: media_image3.png]

where $p_{t_\omega, r_{t_\omega}}$ denotes a posterior probability of the reference token at the time.

5. The method of claim 3, wherein the loss of the first wrong token is determined as follows:

[Equation image: media_image4.png]

where $p_{t_\omega, r_{t_\omega}}$ denotes a posterior probability of the reference token at the time, and $p_{t_\omega, y_{t_\omega}}$ denotes a posterior probability of the first wrong token at the time.

6. The method of claim 1, further comprising selecting a hypothesis with a longest correct prefix, from a plurality of hypotheses of the model of which the cross-entropy training is performed, wherein the determining posterior probability vector at the time comprises determining the posterior probability vector at the time of the first wrong token included in the selected hypothesis.

7. The method of claim 6, wherein the total loss of the training set is determined as follows:

[Equation image: media_image5.png]

where $L(\theta)$ denotes the total loss of the training set, $(Y,R)$ denotes hypothesis-reference pairs in the training set, $t^{jl,\omega}$ denotes the time, $y^{jl}_{t_{jl,\omega}}$ denotes the first wrong token at the time, $r_{t_{jl,\omega}}$ denotes a reference token at the time, and $l_\theta(y^{jl}_{t_{jl,\omega}}, r_{t_{jl,\omega}})$ denotes the loss of the first wrong token.

8. An apparatus for attention-based end-to-end (A-E2E) automatic speech recognition (ASR) training, the apparatus comprising: at least one memory configured to store program code; and at least one processor configured to read the program code and operate as instructed by the program code, the program code including: performing code configured to cause the at least one processor to perform cross-entropy training of a model, based on one or more input features of a speech signal; first determining code configured to cause the at least one processor to determine a posterior probability vector at a time of a first wrong token among one or more output tokens of the model of which the cross-entropy training is performed; second determining code configured to cause the at least one processor to determine a loss of the first wrong token at the time, based on the determined posterior probability vector; third determining code configured to cause the at least one processor to determine a total loss of a training set of the model of which the cross-entropy training is performed, based on the determined loss of the first wrong token of each of hypothesis-reference pairs in the training set with respect to a reference token at the time; and updating code configured to cause the at least one processor to update the model of which the cross-entropy training is performed, based on the determined total loss of the training set.

9. The apparatus of claim 8, wherein the posterior probability vector at the time is determined as follows:

$$p_t = \mathrm{Decoder}(s_{t-1} \in \{r_{t-1}, y_{t-1}\}, H^{enc})$$

where $t$ denotes the time, $p_t$ denotes the posterior probability vector at the time $t$, $H^{enc}$ denotes the one or more features that are encoded, $y_{t-1}$ denotes an output token at a previous time $t-1$, $r_{t-1}$ denotes a reference token at the previous time $t-1$, and $s_{t-1}$ denotes a token randomly selected from $\{r_{t-1}, y_{t-1}\}$.

10. The apparatus of claim 8, wherein the total loss of the training set is determined as follows:

[Equation image: media_image2.png]

where $L(\theta)$ denotes the total loss of the training set, $(Y,R)$ denotes the hypothesis-reference pairs in the training set, $t^{\omega}$ denotes the time, $y_{t_\omega}$ denotes the first wrong token at the time, $r_{t_\omega}$ denotes the reference token at the time, and $l_\theta(y_{t_\omega}, r_{t_\omega})$ denotes the loss of the first wrong token.

11. The apparatus of claim 10, wherein the loss of the first wrong token is determined as follows:

[Equation image: media_image7.png]

where $p_{t_\omega, r_{t_\omega}}$ denotes a posterior probability of the reference token at the time.

12. The apparatus of claim 10, wherein the loss of the first wrong token is determined as follows:

[Equation image: media_image8.png]

where $p_{t_\omega, r_{t_\omega}}$ denotes a posterior probability of the reference token at the time, and $p_{t_\omega, y_{t_\omega}}$ denotes a posterior probability of the first wrong token at the time.

13. The apparatus of claim 8, further comprising selecting code configured to cause the at least one processor to select a hypothesis with a longest correct prefix, from a plurality of hypotheses of the model of which the cross-entropy training is performed, wherein the first determining code is further configured to cause the at least one processor to determine the posterior probability vector at the time of the first wrong token included in the selected hypothesis.

14. The apparatus of claim 13, wherein the total loss of the training set is determined as follows:

[Equation image: media_image9.png]

where $L(\theta)$ denotes the total loss of the training set, $(Y,R)$ denotes hypothesis-reference pairs in the training set, $t^{jl,\omega}$ denotes the time, $y^{jl}_{t_{jl,\omega}}$ denotes the first wrong token at the time, $r_{t_{jl,\omega}}$ denotes a reference token at the time, and $l_\theta(y^{jl}_{t_{jl,\omega}}, r_{t_{jl,\omega}})$ denotes the loss of the first wrong token.

15. A non-transitory computer-readable medium storing instructions that, when executed by at least one processor of a device, cause the at least one processor to: perform cross-entropy training of a model, based on one or more input features of a speech signal; determine a posterior probability vector at a time of a first wrong token among one or more output tokens of the model of which the cross-entropy training is performed; determine a loss of the first wrong token at the time, based on the determined posterior probability vector; determine a total loss of a training set of the model of which the cross-entropy training is performed, based on the determined loss of the first wrong token of each of hypothesis-reference pairs in the training set with respect to a reference token at the time; and update the model of which the cross-entropy training is performed, based on the determined total loss of the training set.

16. The non-transitory computer-readable medium of claim 15, wherein the posterior probability vector at the time is determined as follows:

$$p_t = \mathrm{Decoder}(s_{t-1} \in \{r_{t-1}, y_{t-1}\}, H^{enc})$$

where $t$ denotes the time, $p_t$ denotes the posterior probability vector at the time $t$, $H^{enc}$ denotes the one or more features that are encoded, $y_{t-1}$ denotes an output token at a previous time $t-1$, $r_{t-1}$ denotes a reference token at the previous time $t-1$, and $s_{t-1}$ denotes a token randomly selected from $\{r_{t-1}, y_{t-1}\}$.

17. The non-transitory computer-readable medium of claim 15, wherein the total loss of the training set is determined as follows: 

[Equation image: media_image11.png]

where $L(\theta)$ denotes the total loss of the training set, $(Y,R)$ denotes the hypothesis-reference pairs in the training set, $t^{\omega}$ denotes the time, $y_{t_\omega}$ denotes the first wrong token at the time, $r_{t_\omega}$ denotes the reference token at the time, and $l_\theta(y_{t_\omega}, r_{t_\omega})$ denotes the loss of the first wrong token.

18. The non-transitory computer-readable medium of claim 17, wherein the loss of the first wrong token is determined as follows:

[Equation image: media_image12.png]

where $p_{t_\omega, r_{t_\omega}}$ denotes a posterior probability of the reference token at the time.

19. The non-transitory computer-readable medium of claim 17, wherein the loss of the first wrong token is determined as follows:

[Equation image: media_image13.png]

where $p_{t_\omega, r_{t_\omega}}$ denotes a posterior probability of the reference token at the time, and $p_{t_\omega, y_{t_\omega}}$ denotes a posterior probability of the first wrong token at the time.

20. The non-transitory computer-readable medium of claim 15, wherein the instructions further cause the at least one processor to: select a hypothesis with a longest correct prefix, from a plurality of hypotheses of the model of which the cross-entropy training is performed; and determine the posterior probability vector at the time of the first wrong token included in the selected hypothesis.

US 11,037,547
1. A method of attention-based end-to-end (A-E2E) automatic speech recognition (ASR) training, the method comprising: performing cross-entropy training of a model, based on one or more input features of a speech signal; selecting a hypothesis with a longest correct prefix, from a plurality of hypotheses of the model of which the cross-entropy training is performed; determining a posterior probability vector at a time of a first wrong token included in the selected hypotheses, among one or more output tokens of the model of which the cross-entropy training is performed; determining a loss of the first wrong token at the time, based on the determined posterior probability vector; determining a total loss of a training set of the model of which the cross-entropy training is performed, based on the determined loss of the first wrong token; and updating the model of which the cross-entropy training is performed, based on the determined total loss of the training set.

2. The method of claim 1, wherein the posterior probability vector at the time is determined as follows:

p.sub.t=Decoder(s.sub.t−1∈{r.sub.t−1,y.sub.t−1},H.sup.enc), where t denotes the time, p.sub.t denotes the posterior probability vector at the time t, H.sup.enc denotes the one or more features that are encoded, y.sub.t−1 denotes an output token at a previous time t−1, r.sub.t−1 denotes a reference token at the previous time t−1, and s.sub.t−1 denotes a token randomly selected from {r.sub.t−1,y.sub.t−1}.

3. The method of claim 1, wherein the total loss of the training set is determined as follows: 

[Equation image: media_image2.png]

where L(θ) denotes the total loss of the training set, (Y,R) denotes hypothesis-reference pairs in the training set, t.sup.ω denotes the time, y.sub.t.sub.ω denotes the first wrong token at the time, r.sub.t.sub.ω denotes a reference token at the time, and l.sub.θ(y.sub.t.sub.ω, r.sub.t.sub.ω) denotes the loss of the first wrong token.

4. The method of claim 3, wherein the loss of the first wrong token is determined as follows:

[Equation image: media_image3.png]

, where p.sub.t.sub.ω.sub.,r.sub.t.sub.ω denotes a posterior probability of the reference token at the time.

5. The method of claim 3, wherein the loss of the first wrong token is determined as follows:

[Equation image: media_image4.png]
, where p.sub.t.sub.ω.sub.,r.sub.t.sub.ω denotes a posterior probability of the reference token at the time, and p.sub.t.sub.ω.sub.,y.sub.t.sub.ω denotes a posterior probability of the first wrong token at the time.

6. The method of claim 1, wherein the total loss of the training set is determined as follows: 
[Equation image: media_image5.png]
, where L(θ) denotes the total loss of the training set, (Y,R) denotes hypothesis-reference pairs in the training set, t.sup.jl,ω denotes the time, y.sub.t.sub.jl,ω.sup.jl denotes the first wrong token at the time, r.sub.t.sub.jl,ω denotes a reference token at the time, and l.sub.θ(y.sub.t.sub.jl,ω.sup.jl,r.sub.t.sub.jl,ω) denotes the loss of the first wrong token.

7. An apparatus for attention-based end-to-end (A-E2E) automatic speech recognition (ASR) training, the apparatus comprising: at least one memory configured to store program code; and at least one processor configured to read the program code and operate as instructed by the program code, the program code including: performing code configured to cause the at least one processor to perform cross-entropy training of a model, based on one or more input features of a speech signal; selecting code configured to cause the at least one processor to select a hypothesis with a longest correct prefix, from a plurality of hypotheses of the model of which the cross-entropy training is preformed; first determining code configured to cause the at least one processor to determine a posterior probability vector at a time of a first wrong token included in the selected hypothesis, among one or more output tokens of the model of which the cross-entropy training is performed; second determining code configured to cause the at least one processor to determine a loss of the first wrong token at the time, based on the determined posterior probability vector; third determining code configured to cause the at least one processor to determine a total loss of a training set of the model of which the cross-entropy training is performed, based on the determined loss of the first wrong token; and updating code configured to cause the at least one processor to update the model of which the cross-entropy training is performed, based on the determined total loss of the training set.

8. The apparatus of claim 7, wherein the posterior probability vector at the time is determined as follows:

p.sub.t=Decoder(s.sub.t−1∈{r.sub.t−1,y.sub.t−1},H.sup.enc)
, where t denotes the time, p.sub.t denotes the posterior probability vector at the time t, H.sup.enc denotes the one or more features that are encoded, y.sub.t−1 denotes an output token at a previous time t−1, r.sub.t−1 denotes a reference token at the previous time t−1, and s.sub.t−1 denotes a token randomly selected from {r.sub.t−1,y.sub.t−1}.

9. The apparatus of claim 7, wherein the total loss of the training set is determined as follows: 
[Equation image: media_image2.png]
, where L(θ) denotes the total loss of the training set, (Y,R) denotes hypothesis-reference pairs in the training set, t.sup.ω denotes the time, y.sub.t.sub.ω denotes the first wrong token at the time, r.sub.t.sub.ω denotes a reference token at the time, and l.sub.θ(y.sub.t.sub.ω, r.sub.t.sub.ω) denotes the loss of the first wrong token.

10. The apparatus of claim 9, wherein the loss of the first wrong token is determined as follows: 
[Equation image: media_image7.png]
, where p.sub.t.sub.ω.sub.,r.sub.t.sub.ω denotes a posterior probability of the reference token at the time.

11. The apparatus of claim 9, wherein the loss of the first wrong token is determined as follows:

[Equation image: media_image8.png]
, where p.sub.t.sub.ω.sub.,r.sub.t.sub.ω denotes a posterior probability of the reference token at the time, and p.sub.t.sub.ω.sub.,y.sub.t.sub.ω denotes a posterior probability of the first wrong token at the time.

12. The apparatus of claim 7, wherein the total loss of the training set is determined as follows: 
[Equation image: media_image2.png]
, where L(θ) denotes the total loss of the training set, (Y,R) denotes hypothesis-reference pairs in the training set, t.sup.jl,ω denotes the time, y.sub.t.sub.jl,ω.sup.jl denotes the first wrong token at the time, r.sub.t.sub.jl,ω denotes a reference token at the time, and l.sub.θ(y.sub.t.sub.jl,ω.sup.jl, r.sub.t.sub.jl,ω) denotes the loss of the first wrong token.
13. A non-transitory computer-readable medium storing instructions that, when executed by at least one processor of a device, cause the at least one processor to: perform cross-entropy training of a model, based on one or more input features of a speech signal; select a hypotheses with a longest correct prefix, from a plurality of hypotheses of the model of which the cross-entropy training is performed; determine a posterior probability vector at a time of a first wrong token included in the selected hypothesis, among one or more output tokens of the model of which the cross-entropy training is performed; determine a loss of the first wrong token at the time, based on the determined posterior probability vector; determine a total loss of a training set of the model of which the cross-entropy training is performed, based on the determined loss of the first wrong token; and update the model of which the cross-entropy training is performed, based on the determined total loss of the training set.

14. The non-transitory computer-readable medium of claim 13, wherein the posterior probability vector at the time is determined as follows:

p.sub.t=Decoder(s.sub.t−1∈{r.sub.t−1,y.sub.t−1},H.sup.enc)
, where t denotes the time, p.sub.t denotes the posterior probability vector at the time t, H.sup.enc denotes the one or more features that are encoded, y.sub.t−1 denotes an output token at a previous time t−1, r.sub.t−1 denotes a reference token at the previous time t−1, and s.sub.t−1 denotes a token randomly selected from {r.sub.t−1,y.sub.t−1}.

15. The non-transitory computer-readable medium of claim 13, wherein the total loss of the training set is determined as follows: 
[Equation image: media_image11.png]
 , where L(θ) denotes the total loss of the training set, (Y,R) denotes hypothesis-reference pairs in the training set, t.sup.ω denotes the time, y.sub.t.sub.ω denotes the first wrong token at the time, r.sub.t.sub.ω denotes a reference token at the time, and l.sub.θ(y.sub.t.sub.ω, r.sub.t.sub.ω) denotes the loss of the first wrong token.

16. The non-transitory computer-readable medium of claim 15, wherein the loss of the first wrong token is determined as follows: 
[Equation image: media_image7.png]
 where p.sub.t.sub.ω.sub.,r.sub.t.sub.ω denotes a posterior probability of the reference token at the time.

17. The non-transitory computer-readable medium of claim 15, wherein the loss of the first wrong token is determined as follows: 
[Equation image: media_image13.png]
, where p.sub.t.sub.ω.sub.,r.sub.t.sub.ω denotes a posterior probability of the reference token at the time, and p.sub.t.sub.ω.sub.,y.sub.t.sub.ω denotes a posterior probability of the first wrong token at the time.


As shown above, claims 1-17 of US 11,037,547, in combination, recite the limitations of claims 1-20 of the currently pending application. Therefore, claims 1-20 of the currently pending application are obvious variants of claims 1-17 of US 11,037,547.
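
For illustration only, the two steps that recur throughout the claim mapping above (locating the first wrong token of a hypothesis, and selecting the hypothesis with the longest correct prefix) can be sketched as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: tokens are plain strings, a posterior probability vector is a dictionary from token to probability, all function names are hypothetical, and the negative log of the reference token's posterior is assumed as the per-token loss because the claimed loss formulas appear only as equation images in this action.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def first_wrong_index(hyp: Sequence[str], ref: Sequence[str]) -> Optional[int]:
    """Return the position of the first token where hyp disagrees with ref.

    Returns None when the two sequences are identical. A length mismatch is
    treated as an error at the first position past the shorter sequence
    (an assumption of this sketch).
    """
    for i, (h, r) in enumerate(zip(hyp, ref)):
        if h != r:
            return i
    if len(hyp) != len(ref):
        return min(len(hyp), len(ref))
    return None


def longest_correct_prefix(hyps: Sequence[Sequence[str]], ref: Sequence[str]) -> int:
    """Return the index of the hypothesis whose correct prefix is longest."""
    def prefix_len(hyp: Sequence[str]) -> int:
        idx = first_wrong_index(hyp, ref)
        return len(ref) if idx is None else idx

    return max(range(len(hyps)), key=lambda j: prefix_len(hyps[j]))


def first_wrong_token_loss(posterior: Dict[str, float], ref_token: str) -> float:
    """Assumed loss: negative log posterior of the reference token."""
    return -math.log(posterior[ref_token])


def total_loss(pairs: List[Tuple[Dict[str, float], str]]) -> float:
    """Sum the assumed per-pair loss over (posterior, reference token) pairs."""
    return sum(first_wrong_token_loss(p, r) for p, r in pairs)
```

In this sketch, a hypothesis "the cat mat" scored against the reference "the cat sat" has its first wrong token at position 2, and only the posterior at that position contributes to the loss for that hypothesis-reference pair.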


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 8 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Prabhavalkar et al. (US 2020/0043483) (“Prabhavalkar”) in view of Norouzi et al. (US 2019/0362229) (“Norouzi”).
For claims 1 and 15, Prabhavalkar discloses a method, and a non-transitory computer-readable medium storing instructions that, when executed by at least one processor of a device, cause the at least one processor to perform the method, of attention-based end-to-end (A-E2E) automatic speech recognition (ASR) training (Abstract; [0080 – 0083]), the method comprising: performing cross-entropy training of a model (attention-based model, Fig. 1) ([0044]), based on one or more input features of a speech signal ([0037] [0038] [0044]); determining a posterior probability vector at a time of a token among one or more output tokens of the model of which the cross-entropy training is performed (a decoder network models an output distribution over the next target conditioned on the sequence of previous predictions, $P(y_u \mid y^*_{u-1}, y^*_{u-2}, \ldots, y^*_0, x)$, wherein the next target is associated with a time step, [0028] [0038] [0071 – 0073]); determining a loss of the token at the time, based on the determined posterior probability vector ($-\log P(y^*_u \mid y^*_{u-1}, \ldots, y^*_0 = \langle sos \rangle, x)$, wherein $u$ correlates to a time of the token, [0028] [0038] [0044]); determining a total loss of a training set of the model of which the cross-entropy training is performed, based on the determined loss of the token of each of hypothesis-reference pairs in the training set with respect to a reference token (ground truth labeled sequence and individual tokens, e.g., characters, in the sequence as hypothesis-reference pair, [0037] [0073]) at the time (ground-truth labeled sequence is used as input during training, [0037] [0044]); and updating the model of which the cross-entropy training is performed, based on the determined total loss of the training set ([0044]). Yet, Prabhavalkar fails to teach that the token is a first wrong token.
However, Norouzi discloses a method for training a sequence generation (speech recognition) neural network (Abstract; [0020] [0026 – 0028]), wherein an output (grapheme, character or word, [0020]) of a decoder ([0026] [0027]) at a time step is either right (both the target sequence and the generated sequence generate “talks”, Fig. 4) or wrong (instead of “he”, the neural network produces “ee”, Fig. 4; [0027] [0065]).
Therefore, it would have been obvious to modify Prabhavalkar’s teachings with Norouzi’s teachings so that the output tokens (labels) of the model are right or wrong with respect to ground truth label sequences (Prabhavalkar, [0037]), wherein an output token can be either a first right or first wrong token, for the purpose of performing automatic speech recognition using sequence-to-sequence models such as attention-based models, wherein error is an inherent feature of these models (Prabhavalkar, [0006]).

For claim 8, Prabhavalkar discloses an apparatus for attention-based end-to-end (A-E2E) automatic speech recognition (ASR) training (Abstract), the apparatus comprising: at least one memory configured to store program code ([0080]); and at least one processor configured to read the program code and operate as instructed by the program code ([0081 – 0083]), the program code including: performing code configured to cause the at least one processor to perform cross-entropy training of a model (attention-based model, Fig. 1) ([0044]), based on one or more input features of a speech signal ([0037] [0038] [0044]); first determining code configured to cause the at least one processor to determine a posterior probability vector at a time of a token among one or more output tokens of the model of which the cross-entropy training is performed (a decoder network models an output distribution over the next target conditioned on the sequence of previous predictions, $P(y_u \mid y^*_{u-1}, y^*_{u-2}, \ldots, y^*_0, x)$, wherein the next target is associated with a time step, [0028] [0038]); second determining code configured to cause the at least one processor to determine a loss of the token at the time, based on the determined posterior probability vector ($-\log P(y^*_u \mid y^*_{u-1}, \ldots, y^*_0 = \langle sos \rangle, x)$, wherein $u$ correlates to a time of the token, [0028] [0038] [0044]); third determining code configured to cause the at least one processor to determine a total loss of a training set of the model of which the cross-entropy training is performed, based on the determined loss of the token of each of hypothesis-reference pairs in the training set (ground truth labeled sequence and individual tokens, e.g., characters, in the sequence as hypothesis-reference pair, [0037] [0073]) with respect to a reference token (token in ground truth label sequence is used as input during training) at the time (ground-truth labeled sequence is used as input during training, [0037] [0044]); and updating code configured to cause the at least one processor to update the model of which the cross-entropy training is performed, based on the determined total loss of the training set ([0044]). Yet, this embodiment of Prabhavalkar fails to teach that the token is a first wrong token.
However, Norouzi discloses a method for training a sequence generation (speech recognition) neural network (Abstract; [0020] [0026 – 0028]), wherein an output (grapheme, character or word, [0020]) of a decoder ([0026] [0027]) at a time step is either right (both the target sequence and the generated sequence generate “talks”, Fig. 4) or wrong (instead of “he”, the neural network produces “ee”, Fig. 4; [0027] [0065]).
Therefore, it would have been obvious to modify Prabhavalkar’s teachings with Norouzi’s teachings so that the output tokens (labels) of the model are right or wrong with respect to ground truth label sequences (Prabhavalkar, [0037]), wherein an output token can be either a first right or first wrong token, for the purpose of performing automatic speech recognition using sequence-to-sequence models such as attention-based models, wherein error is an inherent feature of these models (Prabhavalkar, [0006]).
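
For illustration only, the per-token cross-entropy loss cited from Prabhavalkar in the rejections above, and its accumulation over a training set, can be sketched as follows. This is a minimal sketch under stated assumptions: the posteriors are taken to have been computed with the ground-truth labels fed to the decoder, each posterior is a dictionary from token to probability, and the function names are hypothetical rather than drawn from any cited reference.

```python
import math
from typing import Dict, List, Sequence, Tuple

Posterior = Dict[str, float]


def sequence_cross_entropy(posteriors: Sequence[Posterior],
                           reference: Sequence[str]) -> float:
    """Sum of -log P(reference token) over every decoder step.

    posteriors[u] is assumed to be the decoder output distribution at step u,
    conditioned on the ground-truth tokens before u.
    """
    if len(posteriors) != len(reference):
        raise ValueError("one posterior is required per reference token")
    return sum(-math.log(p[tok]) for p, tok in zip(posteriors, reference))


def training_set_loss(examples: List[Tuple[Sequence[Posterior], Sequence[str]]]) -> float:
    """Total loss: the per-sequence losses summed over the training set."""
    return sum(sequence_cross_entropy(p, r) for p, r in examples)
```

Unlike a loss restricted to the first wrong token, this sketch accumulates a term for every token position of every reference sequence.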

Claims 2, 9 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Prabhavalkar et al. (US 2020/0043483) (“Prabhavalkar”) in view of Norouzi et al. (US 2019/0362229) (“Norouzi”), and further in view of Prabhavalkar et al. (US 2020/0027444) (“Prabhavalkar1”).
For claims 2, 9 and 16, the combination of Prabhavalkar and Norouzi fails to teach that the posterior probability vector at the time is determined as follows:

$$p_t = \mathrm{Decoder}(s_{t-1} \in \{r_{t-1}, y_{t-1}\}, H^{enc})$$

where $t$ denotes the time, $p_t$ denotes the posterior probability vector at the time $t$, $H^{enc}$ denotes the one or more features that are encoded, $y_{t-1}$ denotes an output token at a previous time $t-1$, $r_{t-1}$ denotes a reference token at the previous time $t-1$, and $s_{t-1}$ denotes a token randomly selected from $\{r_{t-1}, y_{t-1}\}$.
However, Prabhavalkar1 discloses a method and system for performing speech recognition (Abstract), wherein an input to the decoder during training is selected between input data based on a known value for an element in a training example and the output of the decoder for the element in the training example ([0094]).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combined teachings of Prabhavalkar and Norouzi with Prabhavalkar1’s teachings so that scheduled sampling is used to train the decoder, wherein an input to the decoder during training is selected between input data based on a known value for an element in a training example and the output of the decoder for the element in the training example, for the purpose of reducing the gap between training and inference behavior (Prabhavalkar1, [0093]), and further wherein the posterior probability vector at the time is determined as follows:

$$p_t = \mathrm{Decoder}(s_{t-1} \in \{r_{t-1}, y_{t-1}\}, H^{enc})$$

where $t$ denotes the time, $p_t$ denotes the posterior probability vector at the time $t$, $H^{enc}$ denotes the one or more features that are encoded, $y_{t-1}$ denotes an output token at a previous time $t-1$, $r_{t-1}$ denotes a reference token at the previous time $t-1$, and $s_{t-1}$ denotes a token randomly selected from $\{r_{t-1}, y_{t-1}\}$.
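
For illustration only, the scheduled sampling discussed above, in which the decoder input at each step is randomly selected between the previous reference token and the previous output token, can be sketched as follows. This is a minimal sketch under stated assumptions: `decoder_step` is a hypothetical callable standing in for the decoder with the encoded features already bound, the output token is taken as the most probable entry of the posterior, and `sample_prob` is an assumed name for the probability of feeding back the model's own output.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Posterior = Dict[str, float]


def run_scheduled_sampling(
    decoder_step: Callable[[str], Posterior],
    reference: Sequence[str],
    sample_prob: float,
    rng: random.Random,
    sos: str = "<sos>",
) -> Tuple[List[str], List[str]]:
    """Decode one training sequence, mixing reference and model tokens as input.

    Returns (outputs, fed_inputs): the token emitted at each step, and the
    token that was fed to the decoder at each step.
    """
    outputs: List[str] = []
    fed_inputs: List[str] = []
    prev = sos  # start-of-sequence token fed at the first step
    for ref_tok in reference:
        fed_inputs.append(prev)
        posterior = decoder_step(prev)            # posterior vector at this step
        y_t = max(posterior, key=posterior.get)   # most probable output token
        outputs.append(y_t)
        # Next input: the model's own output with probability sample_prob,
        # otherwise the reference token for this step.
        prev = y_t if rng.random() < sample_prob else ref_tok
    return outputs, fed_inputs
```

With `sample_prob` set to 0.0 the sketch reduces to feeding the reference token at every step, and with 1.0 the decoder always consumes its own previous output.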

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SONIA L GAY whose telephone number is (571)270-1951. The examiner can normally be reached Monday-Friday 9-5 ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel Washburn, can be reached at 571-272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/SONIA L GAY/
Primary Examiner, Art Unit 2657