DETAILED ACTION
This action is in response to the claims filed 08/29/2018. Claims 1-25 are pending and have been examined.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.
Claims 13-16 and 24-25 are rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter. The claims do not fall within at least one of the four categories of patent eligible subject matter because “one or more computer readable storage mediums” is being interpreted in light of the specification (para. 0081) to include both transitory media as well as non-transitory media. The examiner recommends an amendment to replace “computer readable storage medium” with “non-transitory computer-readable storage medium.”

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 5-6, 9-10, 13, 15, 17, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Kim et al., “Improved training for online end-to-end speech recognition systems,” hereinafter Kim, further in view of Yoon Kim et al., “Sequence-Level Knowledge Distillation,” hereinafter Yoon.
Regarding claim 1
Kim teaches, A computer-implemented method comprising obtaining a first output sequence from a bidirectional Recurrent Neural Network (RNN) model for an input sequence; (Figure 1 and Section 3.1.1 ¶01 “The first step is to build an offline end-to-end model as our teacher model. Since there is no latency restriction, we use a deep bidirectional RNN with LSTM units (BLSTM) to predict the correct label sequence y given the entire utterance x” the BLSTM takes in an input sequence x_t to produce a predicted label sequence y_t) obtaining a second output sequence from a unidirectional RNN model for the input sequence; (Figure 1 Section 3.1.2 ¶01 “a model that can operate in an online manner, without access to the future input frames…. 
    [image: media_image1.png]
” the unidirectional LSTM depicted on the right side of the figure takes as input the very same input sequence x and outputs a second sequence Q.) training the unidirectional RNN model to increase the similarity between the at least one first output and the second output. (Section 3.1 “The student network is trained to minimize the difference between its own output distributions and those of the teacher network.” minimizing the difference between the output distributions in turn minimizes the difference between the first and second outputs, and minimizing the difference corresponds to increasing the similarity.)
Kim does not explicitly teach, selecting at least one first output from the first output sequence based on a similarity between the at least one first output and a second output from the second output sequence;
Yoon, however, when addressing issues related to knowledge distillation, teaches, selecting at least one first output from the first output sequence based on a similarity between the at least one first output and a second output from the second output sequence; (Page 5 column 1 “Local updating suggests selecting a training sequence which is close to y and has high probability under the teacher model” the teacher model outputs a plurality of sequences, corresponding to the first output sequence. The sequence that is most similar to the output of the student model is selected to be used as the training sequence, or the first output. This is depicted in Figure 1 (right).)
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate a teacher-student model in which the teacher provides additional candidate outputs to the student model, as taught by Yoon, into the disclosed invention of Kim.
One of ordinary skill in the art would have been motivated to make this modification in order to implement “the sequence-to-sequence framework [which] has been successfully applied to… speech recognition… methods described in this paper can be used to similarly train smaller models” (Yoon Conclusion).

Regarding claim 5
	Kim/Yoon teach claim 1.
Further, Kim teaches, wherein training the unidirectional RNN model includes training the unidirectional RNN model to increase the similarity between a first distribution of the at least one first output and a second distribution of the second output. (Section 3.1.2 ¶02 “Pt(k|x; θBLSTM) be the output distribution at time step t generated from the BLSTM-CTC teacher model. Let Qt(k|x; θLSTM) be the output distribution at time step t generated from LSTM-KL student model. The goal is to find the parameters θLSTM that minimizes the KL divergence….between these distributions” the parameters of the unidirectional LSTM that minimize the KL divergence are found, which is equivalent to increasing similarity. Further, these distributions are derived from the first output and second output, as they are associated with label distributions for the given sequence x.)

Regarding claim 6
	Kim/Yoon teach claim 1.
Further, Kim teaches, wherein training the unidirectional RNN model includes training the unidirectional RNN model to decrease Kullback-Leibler (KL) divergence between the first distribution and the second distribution. (Section 3.1.2 ¶02 “Pt(k|x; θBLSTM) be the output distribution at time step t generated from the BLSTM-CTC teacher model. Let Qt(k|x; θLSTM) be the output distribution at time step t generated from LSTM-KL student model. The goal is to find the parameters θLSTM that minimizes the KL divergence… between these distributions”)
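For illustration only, the per-time-step KL divergence objective quoted from Kim above can be sketched as follows. This is a minimal sketch and not code from Kim: the function names, the example distributions, the smoothing constant `eps`, and the choice of KL(P_t || Q_t) with the teacher distribution as the first argument are all assumptions made for the example.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same label set.

    eps is a hypothetical smoothing constant that guards against log(0).
    """
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def frame_level_kl(teacher_frames, student_frames):
    """Sum of KL(P_t || Q_t) over all time steps t of one utterance."""
    return sum(kl_divergence(p, q)
               for p, q in zip(teacher_frames, student_frames))

# Hypothetical per-frame label distributions (two frames, three labels).
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]           # P_t, bidirectional model
far_student = [[0.3, 0.3, 0.4], [0.4, 0.3, 0.3]]       # Q_t, poorly matched
near_student = [[0.65, 0.25, 0.10], [0.15, 0.75, 0.10]]  # Q_t, closely matched

loss_far = frame_level_kl(teacher, far_student)
loss_near = frame_level_kl(teacher, near_student)
print(loss_near < loss_far)
```

In this sketch a student whose distributions are closer to the teacher's yields a smaller summed divergence, which is the sense in which decreasing the KL divergence increases similarity between the two distributions.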

Regarding claim 9
	Kim/Yoon teach claim 1.
Further, Kim teaches, comprising training the bidirectional RNN model before training the unidirectional RNN model (Section 3.1.2 “Once the BLSTM-CTC model is trained, the next step is to transfer the predictive “knowledge” of this offline model to a model that can operate in an online manner, without access to the future input frames. To do so, we adopt a teacher-student approach in order to train the LSTM model [unidirectional RNN] to minimize the Kullback-Leibler (KL) divergence between the output distributions of the offline BLSTM-CTC model and the online LSTM model”)

Regarding claim 10
	Kim/Yoon teach claim 1.
	Further, Kim teaches, wherein the bidirectional RNN model is a bidirectional Long Short-Term Memory (LSTM) and the unidirectional RNN model is a unidirectional LSTM. (Section 3.1.2 “Once the BLSTM-CTC model is trained, the next step is to transfer the predictive “knowledge” of this offline model to a model that can operate in an online manner, without access to the future input frames. To do so, we adopt a teacher-student approach in order to train the LSTM model [unidirectional RNN]” as shown in Figure 1, the model on the left is the BLSTM, which has access to both future and past context. The model on the right corresponds to the unidirectional LSTM, which only has access to past context.)

Regarding claim 13
	Claim 13 is rejected by Kim/Yoon for the reason set forth in claim 1.
Further, Kim teaches, A computer program product including one or more computer readable storage mediums collectively storing program instructions that are executable by a processor or programmable circuitry to cause the processor or programmable circuitry to perform operations (Section 4.1 “We investigated the performance of our proposed training strategies on Microsoft’s U.S. English Cortana personal assistant task. The training set has approximately 3,400 hours of utterances, the validation set has 10 hours of utterances, and the test set has 10 hours of utterances” in order to handle such a large amount of data, Kim necessarily employs a computer product to store the training set and model definitions, which are used to execute training on a processor and then perform the experiments.)
Regarding claim 15
	Claim 15 is rejected for the reasons set forth in claim 5 and claim 13.
Regarding claim 17
	Claim 17 is rejected by Kim/Yoon for the reason set forth in claim 13.
Regarding claim 19
	Claim 19 is rejected for the reasons set forth in claim 5 and claim 17.

Claims 2, 3, 4, 14, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Kim/Yoon, further in view of Hori et al., WIPO Document ID WO 2019116604 A1, hereinafter Hori.

Regarding claim 2
	Kim/Yoon teach claim 1.
	Kim/Yoon do not explicitly teach, wherein the at least one first output includes an output that appears sequentially earlier in the first output sequence than the second output appears in the second output sequence.
	Hori, when addressing selecting a prior output from a joint model for use with another model, teaches, wherein the at least one first output includes an output that appears sequentially earlier in the first output sequence than the second output appears in the second output sequence. (¶025 “The attention decoder network module 204 receives the hidden vector sequence from the encoder network module 202 and a previous label from the label sequence search module 206, and then computes first posterior probability distributions of the next label for the previous label using the decoder network” the attention decoder network produces a second output which is part of the second output sequence. The posterior probability is computed based on a sequentially earlier label, or previous label, from the first output label sequence.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate a multi-model sequence prediction method by searching for a prior output label sequence generated by a module with the highest sequence probability to be used to predict a future label sequence, as taught by Hori, into the disclosed invention of Kim/Yoon.
One of ordinary skill in the art would have been motivated to make this modification in order to implement a model which “exploits the benefits of both character and word level architectures, and enables high-accuracy open-vocabulary end-to-end ASR [automatic speech recognition]” (¶0007 Hori).

Regarding claim 3
	Kim/Yoon teach claim 1.
	Kim/Yoon do not explicitly teach, wherein selecting the at least one first output includes searching for the at least one first output within a predetermined range in the first output sequence, wherein the predetermined range is determined from an index of the second output in the second output sequence.
	Hori, when addressing selecting a prior output from a joint model for use with another model, teaches, wherein selecting the at least one first output includes searching for the at least one first output within a predetermined range in the first output sequence, wherein the predetermined range is determined from an index of the second output in the second output sequence. (¶025 “The attention decoder network module 204 receives the hidden vector sequence from the encoder network module 202 and a previous label from the label sequence search module 206, and then computes first posterior probability distributions of the next label for the previous label using the decoder network” similar to claim 2, the sequence search module performs a search on the previous label corresponding to the first output. The predetermined range is k-1, where k is the index of the current step in the sequence to be labeled by the decoder network module.)
	For the reasons to combine Hori with Kim/Yoon, see the rejection of claim 2.
Regarding claim 4
	Kim/Yoon teach claim 1.
	Kim/Yoon do not explicitly teach, wherein selecting the at least one first output includes selecting, for each second output from the second output sequence, at least one first corresponding output from the first output sequence, wherein each second output appears in a same relative sequential order in the second output sequence as each of the at least one corresponding first output in the first output sequence.
	Hori, when addressing selecting a prior output from a joint model for use with another model, teaches, wherein selecting the at least one first output includes selecting, for each second output from the second output sequence, at least one first corresponding output from the first output sequence, wherein each second output appears in a same relative sequential order in the second output sequence as each of the at least one corresponding first output in the first output sequence. (¶025 and Figure 3 “The attention decoder network module 204 receives the hidden vector sequence from the encoder network module 202 and a previous label from the label sequence search module 206, and then computes first posterior probability distributions of the next label for the previous label using the decoder network” as is apparent from the schematic in Figure 3, the decoder network module produces the output sequence, where each label is based on the one previous label from the first output sequence. Therefore each second output is one step ahead of the first output, corresponding to “in a same relative sequential order” in the second output sequence produced by the decoder network module, where the previously generated first output labels make up a first sequence. A prediction by the decoder network module is made for each second output in the sequence.)
For the reasons to combine Hori with Kim/Yoon, see the rejection of claim 2.

Regarding claim 14
	Claim 14 is rejected for the reasons set forth in claim 3 and claim 13.
Regarding claim 18
	Claim 18 is rejected for the reasons set forth in claim 3 and claim 17.


Claims 7, 8, 16, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Kim/Yoon, further in view of Ruder et al., “Knowledge Adaptation: Teaching to Adapt,” hereinafter Ruder.

Regarding claim 7
	Kim/Yoon teach claim 5.
	Kim/Yoon do not explicitly teach, wherein the first distribution is a weighted sum of distributions of each of the at least one first output.
Ruder, however, when addressing utilizing multiple teachers for knowledge transfer to a student model, teaches, wherein the first distribution is a weighted sum of distributions of each of the at least one first output. (Section 3.3 ¶02-¶03 “To this end, we consider three measures of domain similarity… based on Kullback-Leibler divergence and are computed with regard to the domains’ term distributions… The student model with multiple teachers is then trained to imitate the sum of the teacher’s individual predictions weighted with the normalized similarity” the multiple teachers together represent a single teacher, each generating its own output distribution D. The weighted sum of these output distributions corresponds to the first distribution used to train the student model.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate a teacher-student model in which the teacher is made up of multiple models, each presenting a prediction for a student model to account for when making its own predictions, as taught by Ruder, into the disclosed invention of Kim/Yoon.
One of ordinary skill in the art would have been motivated to make this modification in order to implement a model in which “a student model that takes into account the predictions of multiple teachers and their domain similarities is able to outperform the state-of-the-art for multi-source unsupervised domain adaptation… and showed how this measure can be used to achieve state of-the-art results” (Conclusion Ruder).

Regarding claim 8
	Kim/Yoon/Ruder teach claim 7.
Further, Ruder teaches, determining weights of the weighted sum of distributions of each of the at least one first output based on similarity of distribution of each of the at least one first output individually compared with the second output. (Section 3.3 ¶03 “The student model with multiple teachers is then trained to imitate the sum of the teacher’s individual predictions weighted with the normalized similarity sim(DS, DT ) of their respective source domain DS to the target domain DT”
    [image: media_image2.png]
 the corresponding weight in the summation for each distribution is based on the similarity of the distribution of the target DT and each of the teacher distributions DS individually.)
For the reasons to combine Ruder with Kim/Yoon, see the rejection of claim 7.
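For illustration only, the similarity-weighted sum of teacher predictions described in the passage quoted from Ruder can be sketched as follows. This is a minimal sketch and not code from Ruder: the function names, the example distributions, and the raw similarity scores are hypothetical, and normalization is assumed here to mean dividing each score by the sum of all scores.

```python
def normalize(similarities):
    """Scale non-negative similarity scores so that they sum to 1."""
    total = sum(similarities)
    if total <= 0:
        raise ValueError("similarity scores must have a positive sum")
    return [s / total for s in similarities]

def weighted_teacher_distribution(teacher_dists, similarities):
    """Weighted sum of per-teacher label distributions.

    teacher_dists[i][k] is teacher i's probability for label k;
    similarities[i] is the (hypothetical) similarity score for teacher i.
    """
    weights = normalize(similarities)
    num_labels = len(teacher_dists[0])
    return [sum(w * dist[k] for w, dist in zip(weights, teacher_dists))
            for k in range(num_labels)]

# Two hypothetical teachers over two labels; the first is three times as similar.
teacher_dists = [[0.8, 0.2], [0.4, 0.6]]
similarities = [3.0, 1.0]

target = weighted_teacher_distribution(teacher_dists, similarities)
print(target)
```

Because the normalized weights sum to 1 and each teacher output is itself a distribution, the weighted sum in this sketch is again a valid distribution that a student model could be trained to imitate.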
Regarding claim 16
	Claim 16 is rejected for the reasons set forth in claim 7 and claim 13.
Regarding claim 20
	Claim 20 is rejected for the reasons set forth in claim 7 and claim 17.


Claims 11 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Kim/Yoon, further in view of Sak et al., “Fast and Accurate Recurrent Neural Network Acoustic Models for Speech Recognition,” hereinafter Sak.

Regarding claim 11
	Kim/Yoon teach claim 10.
	Kim/Yoon do not explicitly teach, training the unidirectional LSTM model by using Connectionist Temporal Classification (CTC) training.
However, Sak, when addressing issues related to training context dependent LSTM RNNs, teaches, training the unidirectional LSTM model by using Connectionist Temporal Classification (CTC) training (Section 3.1 ¶01 “We train and evaluate LSTM RNN acoustic models on hand transcribed, anonymized utterances taken from real 16kHz Google voice search traffic…. Noise is taken from the audio of YouTube videos…. This “Multi-style training” also alleviates overfitting of CTC models to training data” ¶04 “For CTC models, we obtained the best results with depth 5… Unidirectional models used 500 memory cells in each layer” the quoted passage generally describes the training of LSTM RNN models, and particular considerations for CTC models. Further, Figure 3(a) depicts the labels derived after training a “unidirectional CD phone CTC,” which corresponds to a unidirectional LSTM trained with CTC.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to train a unidirectional LSTM network using CTC methods and some modifications, as taught by Sak, in the disclosed invention of Kim/Yoon.
One of ordinary skill in the art would have been motivated to make this modification in order to improve “Performance of the blank-symbol acoustic models…by the introduction of context-dependent phonetic units, [showing] that we can train word level acoustic models to achieve reasonable accuracy on medium vocabulary speech recognition without using a language model.” (Sak Conclusion)

Regarding claim 12
	Kim/Yoon teach claim 1.
Further, Kim teaches, wherein the input sequence is an audio sequence of a speech, (Section 4.1 “As acoustic input features, we used 80-dimensional log mel filter bank coefficients extracted from 25 ms frames of audio every 10 ms.”)
Kim/Yoon do not explicitly teach, and the first output sequence and the second output sequence are phoneme sequences.
However, Sak, when addressing issues related to training context dependent LSTM RNNs, teaches, and the first output sequence and the second output sequence are phoneme sequences (Section 2.4 “it was shown that it is possible to build context dependent whole-phone models… We use three frames of 40-dimensional logmel filterbanks to represent each whole-phone instance” phones or phonemes are used as the whole-phone instance labels for the CTC models) (Section 3.2 ¶03 “Figure 3 shows label posteriors estimated by various CTC phone and CD phone models” Figure 3 shows the results of both ULSTM and BLSTM models whose output sequences are “phones” rather than letters.)
For the reasons to combine Sak with Kim/Yoon, see the rejection of claim 11.

Claims 21, 22, 24, and 25 are rejected under 35 U.S.C. 103 as being unpatentable over Kim, further in view of Hori et al., WIPO Document ID WO 2019116604 A1, hereinafter Hori.

Regarding claim 21
Kim teaches, A computer-implemented method comprising obtaining a first output sequence from a bidirectional Recurrent Neural Network (RNN) model for an input sequence; (Figure 1 and Section 3.1.1 ¶01 “The first step is to build an offline end-to-end model as our teacher model. Since there is no latency restriction, we use a deep bidirectional RNN with LSTM units (BLSTM) to predict the correct label sequence y given the entire utterance x” the BLSTM takes in an input sequence x_t to produce a predicted label sequence y_t) obtaining a second output sequence from a unidirectional RNN model for the input sequence; (Figure 1 Section 3.1.2 ¶01 “a model that can operate in an online manner, without access to the future input frames…. 
    [image: media_image1.png]
” the unidirectional LSTM depicted on the right side of the figure takes as input the very same input sequence x and outputs a second sequence Q.) training the unidirectional RNN model to increase the similarity between the at least one first output and the second output. (Section 3.1 “The student network is trained to minimize the difference between its own output distributions and those of the teacher network.” minimizing the difference between the output distributions in turn minimizes the difference between the first and second outputs, and minimizing the difference corresponds to increasing the similarity.)
Kim does not explicitly teach, selecting at least one first output from the first output sequence, where the at least one first output includes a first output that appears sequentially earlier in the first output sequence than a second output appears in the second sequence;
Hori, when addressing selecting a prior output from a joint model for use with another model, teaches, selecting at least one first output from the first output sequence, where the at least one first output includes a first output that appears sequentially earlier in the first output sequence than a second output appears in the second sequence; (¶025 “The attention decoder network module 204 receives the hidden vector sequence from the encoder network module 202 and a previous label from the label sequence search module 206, and then computes first posterior probability distributions of the next label for the previous label using the decoder network” the attention decoder network produces a second output which is part of the second output sequence. The posterior probability is computed based on a sequentially earlier label from the first output label sequence.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate a multi-model sequence prediction method by searching for a prior output label sequence generated by a module with the highest sequence probability to be used to predict a future label sequence, as taught by Hori, into the disclosed invention of Kim.
One of ordinary skill in the art would have been motivated to make this modification in order to implement a model which “exploits the benefits of both character and word level architectures, and enables high-accuracy open-vocabulary end-to-end ASR [automatic speech recognition]” (¶0007 Hori).

Regarding claim 22
	Kim/Hori teach claim 21.
Further, Kim teaches, wherein training the unidirectional RNN model includes training the unidirectional RNN model to increase the similarity between a first distribution of the at least one first output and a second distribution of the second output. (Section 3.1.2 ¶02 “Pt(k|x; θBLSTM) be the output distribution at time step t generated from the BLSTM-CTC teacher model. Let Qt(k|x; θLSTM) be the output distribution at time step t generated from LSTM-KL student model. The goal is to find the parameters θLSTM that minimizes the KL divergence….between these distributions” the parameters of the unidirectional LSTM that minimize the KL divergence are found, which is equivalent to increasing similarity. Further, these distributions are derived from the first output and second output, as they are label distributions for the given sequence x.)

Regarding claim 24
	Claim 24 is rejected by Kim/Hori for the reason set forth in claim 21.
Further, Kim teaches, A computer program product including one or more computer readable storage mediums collectively storing program instructions that are executable by a processor or programmable circuitry to cause the processor or programmable circuitry to perform operations (Section 4.1 “We investigated the performance of our proposed training strategies on Microsoft’s U.S. English Cortana personal assistant task. The training set has approximately 3,400 hours of utterances, the validation set has 10 hours of utterances, and the test set has 10 hours of utterances” in order to handle such a large amount of data, Kim necessarily employs a computer product to store the training set and execute training on a processor and then perform the experiments.)

Regarding claim 25
	Kim/Hori teach claim 24.
Further, Kim teaches, wherein training the unidirectional RNN model includes training the unidirectional RNN model to increase the similarity between a first distribution of the at least one first output and a second distribution of the second output. (Section 3.1.2 ¶02 “Pt(k|x; θBLSTM) be the output distribution at time step t generated from the BLSTM-CTC teacher model. Let Qt(k|x; θLSTM) be the output distribution at time step t generated from LSTM-KL student model. The goal is to find the parameters θLSTM that minimizes the KL divergence….between these distributions” the parameters of the unidirectional LSTM that minimize the KL divergence are found, which is equivalent to increasing similarity. Further, these distributions are derived from the first output and second output, as they are label distributions for the given sequence x.)


Claim 23 is rejected under 35 U.S.C. 103 as being unpatentable over Kim/Hori, further in view of Ruder et al., “Knowledge Adaptation: Teaching to Adapt,” hereinafter Ruder.

Regarding claim 23
	Kim/Hori teach claim 22.
	Kim/Hori do not explicitly teach, wherein the first distribution is a weighted sum of distributions of each of the at least one first output.
Ruder, however, when addressing utilizing multiple teachers for knowledge transfer to a student model, teaches, wherein the first distribution is a weighted sum of distributions of each of the at least one first output. (Section 3.3 ¶02-¶03 “To this end, we consider three measures of domain similarity… based on Kullback-Leibler divergence and are computed with regard to the domains’ term distributions… The student model with multiple teachers is then trained to imitate the sum of the teacher’s individual predictions weighted with the normalized similarity” the multiple teachers together represent a single teacher, each generating its own output distribution D. The weighted sum of these output distributions corresponds to the first distribution used to train the student model.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate a teacher-student model in which the teacher is made up of multiple models, each presenting a prediction for a student model to account for when making its own predictions, as taught by Ruder, into the disclosed invention of Kim/Hori.
One of ordinary skill in the art would have been motivated to make this modification in order to implement a model in which “a student model that takes into account the predictions of multiple teachers and their domain similarities is able to outperform the state-of-the-art for multi-source unsupervised domain adaptation… and showed how this measure can be used to achieve state of-the-art results” (Conclusion Ruder).

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JOHNATHAN R GERMICK whose telephone number is (571)272-8363. The examiner can normally be reached on Monday-Friday 7:30 am – 4:00 pm (EST).
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki, can be reached at telephone number (571)272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://portal.uspto.gov/external/portal. Should you have questions about access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

/J.R.G./Examiner, Art Unit 2122

/KAKALI CHAKI/Supervisory Patent Examiner, Art Unit 2122