DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
Applicant’s claim for the benefit of prior-filed U.S. Provisional Application Number 62/641,2061, filed 03/09/2018, is acknowledged.

Drawings
The drawings were received on 03/11/2019.  These drawings are acceptable.

Information Disclosure Statement
The information disclosure statement (IDS) submitted on 05/07/2020 is being considered by the examiner.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.


Claims 1-20 are rejected under 35 U.S.C. 102(a)(1) and 102(a)(2) as being anticipated by Parthasarathi et al. (US Pub. No. 2017/0270919, hereinafter ‘Hari’).

Regarding independent claim 1 limitations, Hari teaches: a method performed by one or more computers, the method comprising:
receiving an input observation; (Hari teaches the input as natural language speech audio data, as shown in Fig. 17 and in 0032-0035: As shown in FIG. 1, a device 110 receives an audio input 11 corresponding to a spoken utterance from a desired user 10…. Further details of performing speech recognition using the present improvements are discussed below, following a discussion of the overall speech processing system of FIG. 2…; and as depicted in Fig. 14:

[Image: Hari, Fig. 14 (media_image1.png)]

In 0107: Once determined, the reference audio data (including feature vectors x′1 . . . x′m) may be encoded by an encoder to result in encoded reference audio data E(x′1 . . . x′m). This encoded reference audio data (which may be an encoded feature vector) may then be used for speech detection and/or speech recognition. For example, as shown in FIG. 14, the audio features vectors …)
generating, from the input observation, an output label distribution over possible labels for the input observation at a final time, (the claimed output label distribution corresponds to the speech recognized for the input audio, in 0051: … The speech recognition engine 258 attempts to match received audio feature vectors to language phonemes 253 and language words as models known 254.)
the generating comprising: processing the input observation using a first neural network configured to process the input observation to generate a distribution over possible values for an intermediate indicator at a first time earlier than the final time; (in 0086-0087: Encoding is a general technique for projecting a sequence of features into a vector space. One goal of encoding is to project data points into a multi-dimensional vector space so that various operations can be performed on the vector combinations [claimed process the input observation to generate a distribution over possible values for an intermediate indicator at a first time earlier than the final time] to determine how they (or the data they contain) related to each other [claimed processing the input observation using a first neural network configured to process the input observation to generate a distribution over possible values for an intermediate indicator at a first time earlier than the final time]… The encoder E may be implemented as a recurrent neural network (RNN) [claimed processing the input observation using a first neural network configured to process the input observation to generate a distribution over possible values for an intermediate indicator at a first time earlier than the final time], for example as a long short-term memory RNN (LSTM-RNN) or as a gated recurrent unit RNN (GRU-RNN). An RNN is a tool whereby a network of nodes may be represented numerically and where each node representation includes information about the preceding portions of the network. For example, the RNN performs a linear transformation of the sequence of feature vectors [claimed process the input observation to generate a distribution over possible values for an intermediate indicator at a first time earlier than the final time] which converts the sequence into a fixed size vector. The resulting vector maintains features of the sequence in reduced vector space that can otherwise be arbitrarily long…)
generating, from the distribution over possible values for the intermediate indicator at the first time, an input value for the intermediate indicator; and (in 0107: Once determined, the reference audio data (including feature vectors x′1 . . . x′m) may be encoded by an encoder to result in encoded reference audio data E(x′1 . . . x′m). This encoded reference audio data (which may be an encoded feature vector) may then be used for speech detection and/or speech recognition. For example, as shown in FIG. 14, the audio features vectors for the reference audio data may include audio feature vector x′1 1402 through audio feature vector x′m 1404. In the example of the reference audio data corresponding to the wakeword, audio feature vector x′1 1402 may correspond to the wakeword start time 1032 [claimed generating, from the distribution over possible values for the intermediate indicator at the first time, an input value for the intermediate indicator] and audio feature vector x′m 1404 may correspond to the wakeword end time 1034. The audio feature vectors may be processed by RNN encoder 1450 to create encoded reference feature vector yreference 1410 [claimed generating, from the distribution over possible values for the intermediate indicator at the first time, an input value for the intermediate indicator], which by virtue of the RNN encoding represents the entire reference audio data from audio feature vector x′1 1402 to audio feature vector x′m 1404 in a single feature vector…)
processing the input value for the intermediate indicator using a second neural network configured to process the input value for the intermediate indicator to determine the output label distribution over possible values for the input observation at the final time; (as depicted in Fig. 15:

[Image: Hari, Fig. 15 (media_image2.png)]

The claimed second neural network corresponds to the network that processes the encoder output, as depicted in Fig. 15 and described in 0096: A classifier is a known machine learning based tool to classify inputs into certain configured classes. A classifier may be trained in a manner to use the RNN encoded vectors discussed above… To configure a classifier to operate on RNN encoded data a DNN with a softmax layer and an RNN-encoder may be used. Depending on the output size a hierarchical softmax layer can be used as known in the art. The DNN [claimed processing the input value for the intermediate indicator using a second neural network configured to process the input value for the intermediate indicator to determine the output label distribution over possible values for the input observation at the final time] takes the RNN-encoder output as input and produces a probability distribution over all classes where the highest scoring class may be selected…)
and providing an output derived from the output label distribution. (the labels depicted in Fig. 15 correspond to the claimed output, in 0111-0113: A frame-wise speech detector may have the form H(n; x1 . . . xn+d) and may predict the probability of Pr(n-th frame is “desired speech”|x1 . . . xn+d). H can be implemented in different ways, a common state-of-the-art choice is to implement H as a (deep) neural network (DNN) or recurrent neural network (RNN). … The output of the classifier H may include different scores 1530 for each desired label, for example a first score that the particular audio data frame corresponds to desired speech, a second score that the particular audio data frame corresponds to undesired speech, and a third score that the particular audio data frame corresponds to non-speech. Alternatively, the classifier H may simply a label 1540 for the particular audio frame as to which category the particular frame corresponds to (e.g., desired speech) [claimed providing an output derived from the output label distribution] along with a particular score…)
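
For orientation only, the data flow recited in claim 1 can be sketched as below. This is a minimal illustration under stated assumptions, not the applicant's implementation and not code from Hari: each "network" is a single linear layer followed by a softmax, and every name and weight value (`first_network`, `second_network`, `select_input_value`, `W1`, `W2`) is hypothetical and chosen only for the example.

```python
import math

def softmax(scores):
    """Turn raw scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, vector):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]

# Hypothetical fixed weights; a real system would learn these.
W1 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # observation (2) -> 3 indicator values
W2 = [[2.0, 0.0, 0.0], [0.0, 2.0, 1.0]]    # one-hot indicator (3) -> 2 labels

def first_network(observation):
    """Distribution over intermediate-indicator values at the earlier time."""
    return softmax(linear(W1, observation))

def select_input_value(distribution):
    """Pick the highest-scoring indicator value and one-hot encode it."""
    best = max(range(len(distribution)), key=distribution.__getitem__)
    return [1.0 if i == best else 0.0 for i in range(len(distribution))]

def second_network(indicator_one_hot):
    """Output label distribution at the final time."""
    return softmax(linear(W2, indicator_one_hot))

def predict(observation):
    indicator_dist = first_network(observation)
    indicator_value = select_input_value(indicator_dist)
    label_dist = second_network(indicator_value)
    # The provided output here is the index of the most probable label.
    best_label = max(range(len(label_dist)), key=label_dist.__getitem__)
    return label_dist, best_label

label_dist, label = predict([3.0, 1.0])
```

The point of the sketch is the ordering of the steps: the second stage sees only the selected indicator value, never the raw observation.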

Regarding claim 2, the rejection of claim 1 is incorporated and Hari further teaches the method of claim 1 wherein the first neural network is configured to apply a softmax transform to the distribution for the intermediate indicator. (Hari teaches in 0096: … To configure a classifier to operate on RNN encoded data a DNN with a softmax layer and an RNN-encoder may be used [claimed wherein the first neural network is configured to apply a softmax transform to the distribution for the intermediate indicator]. Depending on the output size a hierarchical softmax layer [claimed wherein the first neural network is configured to apply a softmax transform to the distribution for the intermediate indicator] can be used as known in the art…)

Regarding claim 3, the rejection of claim 1 is incorporated and Hari further teaches the method of claim 1 wherein the second neural network is configured to receive the input value of the intermediate indicator as a one-hot encoded input value. (Hari teaches using one-hot encoded input to process the output to the claimed second neural net, in 0093-0096: … A word sequence is usually represented as a series of one-hot vectors [claimed …the intermediate indicator as a one-hot encoded input value] (i.e., a Z-sized vector representing the Z available words in a lexicon, with one bit high to represent the particular word in the sequence). The one-hot vector is often augmented with information from other models,… A classifier may be trained in a manner to use the RNN encoded vectors [claimed …the intermediate indicator as a one-hot encoded input value] discussed above. Thus, a classifier may be trained to classify an input set of features x1, ... xN into a fixed number of classes 1 ... C (where C may be two, and the classifier may be configured to simply classify an input feature vector into one category or the other)…. The DNN [claimed wherein the second neural network is configured to receive the input value of the intermediate indicator as a one-hot encoded input value] takes the RNN-encoder output as input and produces a probability distribution over all classes where the highest scoring class may be selected. In mathematical notation, given a sequence x1, ... xN and an encoder E….)
	
	
Regarding claim 4, the rejection of claim 1 is incorporated and Hari further teaches the method of claim 1 further comprising: processing one or more of the input observation, the input value of the intermediate indicator using a third neural network to generate a correction to the output label distribution for the final time. (Hari teaches using detected energy level to correct label as claimed, in 0121-0122: … For the task of recognizing speech from the desired talker, this constraint is advantageous. The reference audio data may be used as an example of the desired talker's speech [claimed processing one or more of the input observation, the input value of the intermediate indicator], and then by subtracting the LAMS, the system may shift the features corresponding to the desired speaker closer to being zero-mean. This allows the system to train a classifier, e.g., a DNN, to better classify [claimed further comprising: processing one or more of the input observation, the input value of the intermediate indicator using a third neural network to generate a correction to the output label distribution for the final time] a desired talker's speech…. The energy level difference (which is normalized due to the subtraction) may then be fed into a feed-forward deep neural network (DNN) [claimed processing one or more of the input observation, the input value of the intermediate indicator using a third neural network to generate…] or other machine learning trained model for classification. The model may be configured to classify energy level differences as representing speech belonging to the desired user (who spoke the reference audio data) or as representing non-speech or speech belonging to a different person [claimed further comprising: processing one or more of the input observation, the input value of the intermediate indicator using a third neural network to generate a correction to the output label distribution for the final time]; and in 0125-0127: As above with speech detection, the encoded reference audio data vector E(x′1 . . . x′m) may be provided as an additional input to “guide” the speech recognition system towards the desired word sequence…. One implementation is to make the computation of the frame-wise state probability during ASR dependent on E(x′1 . . . x′m): p(sn | x1 . . . xn+d, E(x′1 . . . x′m))… Here, p may be implemented either as a DNN [claimed processing one or more of the input observation, the input value of the intermediate indicator using a third neural network to generate a correction to the output label distribution for the final time] or RNN (can be an LSTM-RNN or GRU-RNN or any other RNN variant) and p and E are jointly trained as described above.)

Regarding claim 5, the rejection of claim 4 is incorporated and Hari further teaches the method of claim 4 further comprising: generating a corrected output label distribution for the final time based on the determined output label distribution for the final time and the determined correction. (Hari teaches in 0125-0127: … As above with speech detection, the encoded reference audio data vector E(x′1 . . . x′m) may be provided as an additional input to “guide” the speech recognition system towards the desired word sequence…. One implementation is to make the computation of the frame-wise state probability during ASR dependent on E(x′1 . . . x′m): p(sn | x1 . . . xn+d, E(x′1 . . . x′m))… Here, p may be implemented either as a DNN or RNN (can be an LSTM-RNN or GRU-RNN or any other RNN variant) and p and E are jointly trained as described above. One difference between speech detection is that in speech recognition the decision is not only made between (desired) speech and non-speech, but also between the units of speech (phones, senons, etc.). If p and E are trained on training data for which undesired speech is mapped to an existing non-speech class, or a newly defined undesired-speech class [claimed generating a corrected output label distribution for the final time based on the determined output label distribution for the final time and the determined correction], then the approach can learn both ignoring undesired speech and improving the distinction between the units of speech and between speech and noise. If the training data does not contain any non-desired speech, then the approach is likely to learn a speaker and/or acoustic condition adaptation [claimed generating a corrected output label distribution for the final time based on the determined output label distribution for the final time and the determined correction], i.e., improve the distinction between the units of speech and between speech and noise.)
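
As a purely illustrative aid to reading the limitation of claim 5, the sketch below shows one possible way a determined output label distribution and a determined correction could be combined. Neither the claim nor Hari specifies the arithmetic: the additive log-domain correction followed by renormalization, the function name `apply_correction`, and all numeric values are assumptions made for the example.

```python
import math

def apply_correction(label_dist, correction):
    """Add a per-label correction to log-probabilities, then renormalize.

    Assumes every probability in label_dist is strictly positive.
    """
    logits = [math.log(p) + c for p, c in zip(label_dist, correction)]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up values standing in for the outputs of the second and third networks.
label_dist = [0.5, 0.3, 0.2]
correction = [0.0, math.log(2.0), 0.0]  # doubles the relative weight of label 1

corrected = apply_correction(label_dist, correction)
```

With these values the correction moves the most probable label from index 0 to index 1 while the result remains a valid distribution.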

Regarding claim 6, the rejection of claim 5 is incorporated and Hari further teaches the method of claim 5, wherein the provided output is the corrected output label distribution or data identifying one or more highest-scoring labels according to the corrected output label distribution. (Hari teaches a classifier that uses a max function to select the highest-probability class, in 0096: The DNN takes the RNN-encoder output as input and produces a probability distribution over all classes where the highest scoring class may be selected. In mathematical notation, given a sequence x1, ... xN and an encoder E, the classifier H may be expressed as:
[Equation image: classifier H (media_image3.png)]
…. where p(c|y) is implemented as a DNN [claimed wherein the provided output is the corrected output label distribution or data identifying one or more highest-scoring labels according to the corrected output label distribution]; where the claimed correction is computed in 0126-0127: One implementation is to make the computation of the frame-wise state probability during ASR dependent on E(x′1 . . . x′m):
p(sn | x1 . . . xn+d, E(x′1 . . . x′m))… Here, p may be implemented either as a DNN [claimed wherein the provided output is the corrected output label distribution or data identifying one or more highest-scoring labels according to the corrected output label distribution] or RNN (can be an LSTM-RNN or GRU-RNN or any other RNN variant) and p and E are jointly trained as described above. One difference between speech detection is that in speech recognition the decision is not only made between (desired) speech and non-speech, but also between the units of speech (phones, senons, etc.). If p and E are trained on training data for which undesired speech is mapped to an existing non-speech class, or a newly defined undesired-speech class, then the approach can learn both ignoring undesired speech and improving the distinction between the units of speech and between speech and noise. If the training data does not contain any non-desired speech, then the approach is likely to learn a speaker and/or acoustic condition adaptation, i.e., improve the distinction between the units of speech and between speech and noise; where the assigned p score is the highest, as claimed, in 0113: The output of the classifier H may include different scores 1530 for each desired label, for example a first score that the particular audio data frame corresponds to desired speech, a second score that the particular audio data frame corresponds to undesired speech, and a third score that the particular audio data frame corresponds to non-speech. Alternatively, the classifier H may simply a label 1540 for the particular audio frame as to which category the particular frame corresponds to (e.g., desired speech) along with a particular score. This implementation may be considered to be giving the particular audio frame a first probability of 1, a second probability of 0 and a third probability of 0…)
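
To make "data identifying one or more highest-scoring labels" concrete, the following short sketch ranks labels by score and keeps the top k. The three label names are the categories quoted from Hari 0113; the helper name `top_labels` and the score values are hypothetical and are not taken from either document.

```python
def top_labels(distribution, labels, k=1):
    """Return the k (label, score) pairs with the highest scores."""
    ranked = sorted(zip(labels, distribution), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

labels = ["desired speech", "undesired speech", "non-speech"]
scores = [0.7, 0.2, 0.1]  # made-up per-frame scores

best = top_labels(scores, labels, k=2)
```

Setting `k=1` reduces this to selecting the single highest-scoring class.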

Regarding claim 7, the rejection of claim 4 is incorporated and Hari further teaches the method of claim 4, wherein the third neural network is configured to receive the input value of the intermediate indicator as a one-hot encoded input value. (Hari teaches using one-hot encoded input to process the output to the claimed third neural net, in 0093-0096: … A word sequence is usually represented as a series of one-hot vectors [claimed …the intermediate indicator as a one-hot encoded input value] (i.e., a Z-sized vector representing the Z available words in a lexicon, with one bit high to represent the particular word in the sequence). The one-hot vector is often augmented with information from other models,… A classifier may be trained in a manner to use the RNN encoded vectors [claimed …the intermediate indicator as a one-hot encoded input value] discussed above. Thus, a classifier may be trained to classify an input set of features x1, ... xN into a fixed number of classes 1 ... C (where C may be two, and the classifier may be configured to simply classify an input feature vector into one category or the other)…. The DNN [claimed … neural network is configured to receive the input value of the intermediate indicator as a one-hot encoded input value] takes the RNN-encoder output as input and produces a probability distribution over all classes where the highest scoring class may be selected. In mathematical notation, given a sequence x1, ... xN and an encoder E….; and the claimed third neural network in 0126-0127: One implementation is to make the computation of the frame-wise state probability during ASR dependent on E(x′1 . . . x′m): p(sn | x1 . . . xn+d, E(x′1 . . . x′m))… Here, p may be implemented either as a DNN [claimed wherein the third neural network is configured to receive the input value of the intermediate indicator as a one-hot encoded input value] or RNN (can be an LSTM-RNN or GRU-RNN or any other RNN variant) and p and E are jointly trained as described above. One difference between speech detection is that in speech recognition the decision is not only made between (desired) speech and non-speech, but also between the units of speech (phones, senons, etc.)…)

Regarding claim 8, the rejection of claim 1 is incorporated and Hari further teaches the method of claim 1, wherein generating, from the distribution over possible values for the intermediate indicator at the first time, an input value for the intermediate indicator comprises: sampling an input value from the distribution over possible values or selecting a possible value having the highest score in the distribution. (Hari teaches using one-hot encoded input to process as claimed selecting a possible value having the highest score in the distribution, in 0093-0096: … A word sequence is usually represented as a series of one-hot vectors [claimed selecting a possible value having the highest score in the distribution] (i.e., a Z-sized vector representing the Z available words in a lexicon, with one bit high to represent the particular word in the sequence [claimed selecting a possible value having the highest score in the distribution]). The one-hot vector is often augmented with information from other models,…)
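
The two alternatives recited in claim 8 (sampling from the distribution, or selecting the highest-scoring value) can be contrasted with a brief sketch. It is offered only as an aid to reading the limitation; the function names, the example distribution, and the fixed random seed are assumptions, and the one-hot step simply mirrors the "one bit high" encoding described in the quoted passage.

```python
import random

def select_highest(distribution):
    """Index of the value with the highest score (deterministic)."""
    return max(range(len(distribution)), key=distribution.__getitem__)

def sample_value(distribution, rng):
    """Index drawn at random in proportion to the scores (stochastic)."""
    return rng.choices(range(len(distribution)), weights=distribution, k=1)[0]

def one_hot(index, size):
    """Vector of length `size` with a single 1 at position `index`."""
    return [1 if i == index else 0 for i in range(size)]

dist = [0.1, 0.6, 0.3]   # made-up distribution over three indicator values
rng = random.Random(0)   # seeded only so the example is repeatable

greedy = one_hot(select_highest(dist), len(dist))
sampled = one_hot(sample_value(dist, rng), len(dist))
```

Selection always returns the same value for a given distribution, whereas sampling may return any value with nonzero probability.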

Regarding independent claims 9 and 17 limitations, Hari teaches:
a system comprising one or more computers and one or more storage devices storing instructions that when executed by one or more computers cause the one or more computers to perform operations comprising: and one or more non-transitory computer-readable storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations comprising: (Hari teaches in 0153-0154: .... Persons having ordinary skill in the field of computers and speech processing should recognize that components and process steps described herein may be interchangeable with other components or steps, or combinations of components or steps, and still achieve the benefits and advantages of the present disclosure… Aspects of the disclosed system may be implemented as a computer method or as an article of manufacture such as a memory device or non-transitory computer readable storage medium. The computer readable storage medium may be readable by a computer and may comprise instructions for causing a computer or other device to perform processes described in the present disclosure [claimed one or more non-transitory computer-readable storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations]. The computer readable storage media may be implemented by a volatile computer memory, non-volatile computer memory, hard drive, solid-state memory, flash drive, removable disk and/or other media. In addition, components of one or more of the modules and engines may be implemented as in firmware or hardware, such as the acoustic front end 256, which comprise among other things, analog and/or digital filters (e.g., filters configured as firmware to a digital signal processor (DSP)) [claimed system comprising one or more computers and one or more storage devices storing instructions that when executed by one or more computers cause the one or more computers to perform operations])
The limitations of claims 9 and 17 are similar to those rejected in claim 1 and are rejected under the same rationale.

Regarding claims 10-12, the rejection of claim 9 is incorporated. The limitations are similar to the limitations of claims 2-4 and are therefore rejected under that same rationale.
Regarding claim 13, the rejection of claim 12 is incorporated. The limitations are similar to claim 5 limitations and are therefore rejected under that same rationale.
Regarding claim 14, the rejection of claim 13 is incorporated. The limitations are similar to claim 6 limitations and are therefore rejected under that same rationale.
Regarding claim 15, the rejection of claim 12 is incorporated. The limitations are similar to claim 7 limitations and are therefore rejected under that same rationale.
Regarding claim 16, the rejection of claim 9 is incorporated. The limitations are similar to claim 8 limitations and are therefore rejected under that same rationale.
Regarding claim 18, the rejection of claim 17 is incorporated. The limitations are similar to claim 3 limitations and are therefore rejected under that same rationale.
Regarding claim 19, the rejection of claim 17 is incorporated. The limitations are similar to claim 4 limitations and are therefore rejected under that same rationale.
Regarding claim 20, the rejection of claim 19 is incorporated. The limitations are similar to claim 5 limitations and are therefore rejected under that same rationale.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure and is listed below:
Lu et al. (US Pub. No. 2018/0144248), teaching the use of a plurality of neural networks for developing language processing methods in an image captioning task.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to OLUWATOSIN ALABI whose telephone number is (571)272-0516. The examiner can normally be reached Monday-Friday, 8:00am-5:00pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael Huntley, can be reached on (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/OLUWATOSIN O ALABI/              Examiner, Art Unit 2129