Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Remarks
This Office Action is responsive to Applicants' Amendment filed on September 9, 2022, in which claims 1, 13, and 21 are currently amended. Claims 15-20 are canceled. Claims 1-14 and 21-26 are currently pending.

Response to Arguments
Applicant’s arguments with respect to the rejection of claims 1-14 and 21-26 under 35 U.S.C. 103, based on the amendment, have been considered. The arguments are moot in view of the new ground of rejection set forth below.

Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

The following is a quotation of the first paragraph of pre-AIA  35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.

Claims 1-14 and 21-26 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement. The claims contain subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention.

Regarding claims 1, 13, and 21, the limitation "to process the feedforward input to generate, by using a single forward pass of the feedforward generative neural network, a feedforward output that defines, for each of the plurality of generation time steps, a respective likelihood distribution over possible values for the output audio speech waveform at the generation time step" lacks support in the published instant specification. While the specification mentions generating an output example in a single forward pass, there is no mention of generating a plurality of respective likelihood distributions for a plurality of generation time steps in a single forward pass. Similarly, the published instant specification does not describe “processing a training feedforward input comprising the training context input using a single forward pass of the feedforward generative neural network in accordance with current values of the feedforward parameters to generate a training feedforward output”, which appears to imply that the training occurs in a single pass of the feedforward neural network.

The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 1-14 and 21-26 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention.

Regarding claim 1, the claim recites "input to generate, by using a single forward pass of the feedforward generative neural network, a feedforward output that defines, for each of the plurality of generation time steps, a respective likelihood distribution over possible values for the output audio speech waveform", which seems to imply that a likelihood distribution over all of the generation time steps is generated in a single forward pass. However, claim 1 also recites "a first divergence from...the likelihood distribution for the generation time step defined by the training feedforward output generated by the single forward pass of the feedforward generative neural network", which seems to contradict the previous limitation by suggesting that the single forward pass applies to only a single generation time step rather than to the plurality of generation time steps. In the interest of further examination, these limitations are interpreted as performing a forward pass for each of the generation time steps.
This rejection also applies to independent claims 13 and 21 which recite similar limitations.

The remaining claims are rejected with respect to their dependence on the rejected claims. 

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.


	Claims 1, 3, 9-10, 12-13, 21, and 23 are rejected under 35 U.S.C. § 103 as being unpatentable over Wang (“Investigation of Using Continuous Representation of Various Linguistic Units in Neural Network Based Text-to-Speech Synthesis”, 2016) in view of Paine (“FAST WAVENET GENERATION ALGORITHM”, 2016) and further in view of Jinyu Li (“Learning Small Size DNN with Output Distribution Based Criteria”, 2014).

	 Regarding claim 1, Wang teaches A computer-implemented method of training a feedforward generative neural network having a plurality of feedforward parameters and configured to generate an output audio speech waveform conditioned on features of an input text segment([p. 2471 Summary] "Recently, the continuous representation of raw word inputs, called “word embedding”, has been successfully used in various natural language processing tasks. It has also been used as the additional or alternative linguistic input features to a neural-network-based acoustic model for TTS systems...In this paper, we further investigate the use of this embedding technique to represent phonemes, syllables and phrases for the acoustic model based on the recurrent and feed-forward neural network." [p. 2471 §1] "Text-to-speech (TTS) synthesis converts text strings into speech waveforms")
	wherein the feedforward generative neural network is configured to receive a feedforward input comprising the features of the input text segment and to process the feedforward input to generate, by using a single forward pass of the feedforward generative neural network, ([p. 2473 §3.2] "For acoustic modeling, the text of an utterance is converted into a sequence of frames {I1, I2, ··· , It, ··· , IT }, wherein T is the total number of frames of the training data utterance. The linguistic vector It at time t consists of the embedded vector of the word and the phonemic context at that time. Together with a sequence of acoustic feature vectors{O1, ··· , OT }, the acoustic model can be trained. During the synthesis time, the linguistic vector of the test data can be fed into the acoustic model and acoustic features can be predicted." [p. 2476 §5.3] "The model structure for all RNN systems contained two normal feed-forward layers with the sigmoid activation function and two bi-directional LSTM layers. Except the first feed-forward layer, the number of hidden nodes of the following layers was fixed at (512, 256, 256), respectively. For systems using more than one kind of embedded vector, e.g. Res, the size of the first hidden layer was 1024. Otherwise, it was 512. The DNN systems adopted similar structure, except the LSTM layer was replaced by a normal feed-forward layer with 512 nodes" Wang explicitly teaches using both a fully feed-forward network and a network with feed-forward layers being fed into recurrent layers to generate the probability distribution over respective time steps.  Inference in a feed-forward network occurs in a single pass by definition such that a feedforward network performing inference in a single pass would lead to an obvious and expected outcome.)
	a feedforward output that defines, for each of the plurality of generation time steps, a respective likelihood distribution over possible values for the output audio speech waveform at the generation time step, and([p. 2474 §4.2] "The input projection layer maps the one-hot vector of context word w into mI(w). Because the input vector is one-hot, the projected mI(w) actually corresponds to the I(w)-th row of the projection matrix M, where I(w) is the index of w. Then, the hidden representation h is calculated as the average of u...This h will be further transformed by another projection matrix M into u = M' h. The dimension of u is the same as that of the input one-hot vector. Then, on the basis of the softmax function, the ‘probability’ to generate the word wi can be written as [Eqn. 2]")
	wherein the training comprises: obtaining a training context input;([p. 2472 §2] "a specific data corpus must be prepared to train each module. For example, prosodic models are usually trained on the Boston University News Radio Corpus [17] and syntactic parsers are usually trained using the Penn Treebank corpus" Preparing an appropriate database interpreted as synonymous with obtaining a training context input.)
	processing a training feedforward input comprising the training context input using a single forward pass of the feedforward generative neural network in accordance with current values of the feedforward parameters to generate a training feedforward output comprising output values for the plurality of generation time steps;([p. 2473 §3.2] "For acoustic modeling, the text of an utterance is converted into a sequence of frames {I1, I2, ··· , It, ··· , IT }, wherein T is the total number of frames of the training data utterance. The linguistic vector It at time t consists of the embedded vector of the word and the phonemic context at that time. Together with a sequence of acoustic feature vectors{O1, ··· , OT }, the acoustic model can be trained. During the synthesis time, the linguistic vector of the test data can be fed into the acoustic model and acoustic features can be predicted." [p. 2476 §5.3] "The model structure for all RNN systems contained two normal feed-forward layers with the sigmoid activation function and two bi-directional LSTM layers. Except the first feed-forward layer, the number of hidden nodes of the following layers was fixed at (512, 256, 256), respectively. For systems using more than one kind of embedded vector, e.g. Res, the size of the first hidden layer was 1024. Otherwise, it was 512. The DNN systems adopted similar structure, except the LSTM layer was replaced by a normal feed-forward layer with 512 nodes" [p. 2474 §4.3] "For training the neural network model, the input and output features need to be normalized" Wang explicitly teaches using both a fully feed-forward network and a network with feed-forward layers being fed into recurrent layers to generate the probability distribution over respective time steps.).
	However, Wang does not explicitly teach wherein each output example includes a respective output sample at each of a plurality of generation time steps,
	processing the training context input using a trained autoregressive generative neural network,
	wherein the trained autoregressive generative neural network has been trained to autoregressively generate, by using a plurality of forward passes of the autoregressive generative neural network, a plurality of autoregressive outputs,
	wherein for each of the plurality of generation time steps, the trained autoregressive generative neural network generates, using one of the plurality of forward passes of the autoregressive generative neural network, an autoregressive output that defines a likelihood distribution over possible values for an output audio waveform of the text segment being spoken at the generation time step conditioned on output samples at one or more preceding generation time steps;
	determining a first gradient with respect to the feedforward parameters to minimize a divergence loss that depends on, for each of the plurality of generation time steps, a first divergence from the likelihood distribution defined by the autoregressive output  generated by the respective one of the plurality of forward passes of the autoregressive generative neural network for the generation time step and the likelihood distribution for the generation time step defined by the training feedforward output generated by the single forward pass of the feedforward generative neural network; and
	determining an update to the current values of the feedforward parameters based at least in part on the first gradient.

	Paine, in the same field of endeavor, teaches wherein each output example includes a respective output sample at each of a plurality of generation time steps,([p. 1 §1] "when generating audio using a trained model, the predictions are sequential. Every time an output value is predicted, the prediction is then fed back to the input of the network to predict the next sample" [p. 2 §2.1] "This graph, like the one in Figure 1, shows how a single output sample is generated except now it is in terms of the pre-computed (”recurrent”) states. In fact, upon closer inspection, the reader will notice that the graph shown in Figure 2 looks exactly like a single step of a multi-layer RNN. For some given time t, the incoming input sample (h0e) can be thought of as the ”embedding” input and is given the subscript ’e’...it should be noted that, due to the dilated convolutions, outputs at each layer will depend on the stored recurrent states computed several time steps back" Paine explicitly teaches that each output sample is generated for a particular time step t.)
	processing the training context input using a trained autoregressive generative neural network,([p. 1 §1] "when generating audio using a trained model, the predictions are sequential. Every time an output value is predicted, the prediction is then fed back to the input of the network to predict the next sample." [p. 2 §1] "While we present this fast generation scheme for Wavenet, the same scheme can be applied anytime one wants to perform auto-regressive generation or online prediction using a model with dilated convolution layers. For example, the decoder in ByteNet performs auto-regressive generation using dilated convolution layers, therefore our fast generation scheme can be applied.")
	wherein the trained autoregressive generative neural network has been trained to autoregressively generate, by using a plurality of forward passes of the autoregressive generative neural network, a plurality of autoregressive outputs,([Abstract] "While this method is presented for Wavenet, the same scheme can be applied anytime one wants to perform auto regressive generation or online prediction using a model with dilated convolution layers." [p. 1 §1] "Wavenet models the conditional probability via a stack of dilated causal convolutional layers for next-sample audio generation given all of the previous samples.  At training time, since the audio samples for all timestamps are known, the conditional predictions can be naturally made in parallel. However, when generating audio using a trained model, the predictions are sequential. Every time an output value is predicted, the prediction is then fed back to the input of the network to predict the next sample" Every time an output value is predicted is interpreted as synonymous with for each of the plurality of forward passes. This is also interpreted as synonymous with the dilation of the network as taught by Paine.)
	wherein for each of the plurality of generation time steps, the trained autoregressive generative neural network generates, using one of the plurality of forward passes of the autoregressive generative neural network, an autoregressive output that defines a likelihood distribution over possible values for an output audio waveform of the text segment being spoken at the generation time step conditioned on output samples at one or more preceding generation time steps;([Abstract] "While this method is presented for Wavenet, the same scheme can be applied anytime one wants to perform auto regressive generation or online prediction using a model with dilated convolution layers." [p. 1 §1] "Wavenet models the conditional probability via a stack of dilated causal convolutional layers for next-sample audio generation given all of the previous samples.  At training time, since the audio samples for all timestamps are known, the conditional predictions can be naturally made in parallel. However, when generating audio using a trained model, the predictions are sequential. Every time an output value is predicted, the prediction is then fed back to the input of the network to predict the next sample" [p. 2 §2.1] "due to the dilated convolutions, outputs at each layer will depend on the stored recurrent states computed several time steps back" output value interpreted as synonymous with output sample.).
the likelihood distribution defined by the autoregressive output  generated by the respective one of the plurality of forward passes of the autoregressive generative neural network for the generation time step and the likelihood distribution for the generation time step defined by the training feedforward output generated by the single forward pass of the feedforward generative neural network ([p. 1 §1] "Wavenet models the conditional probability via a stack of dilated causal convolutional layers for next-sample audio generation given all of the previous samples. At training time, since the audio samples for all timestamps are known, the conditional predictions can be naturally made in parallel. However, when generating audio using a trained model, the predictions are sequential. Every time an output value is predicted, the prediction is then fed back to the input of the network to predict the next sample." Conditional probability interpreted as directly proportional to likelihood distribution and achieving same outcome.).

	Wang as well as Paine are directed towards neural networks for text to speech generation.  Therefore, Wang as well as Paine are analogous art in the same field of endeavor.  It would have been obvious before the effective filing date of the claimed invention to combine the teachings of Wang with the teachings of Paine by using an autoregressive network to generate speech waveforms from context input probability distributions.  Paine provides as additional motivation for combination ([Abstract] "our proposed approach removes redundant convolution operations by caching previous calculations, thereby reducing the complexity to O(L) time. Timing experiments show significant advantages of our fast implementation over a naive one. While this method is presented for Wavenet, the same scheme can be applied anytime one wants to perform autoregressive generation or online prediction using a model with dilated convolution layers.").  This motivation for combination also applies to the remaining claims which depend on this combination.
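
As an illustrative sketch only, the distinction relied upon above between autoregressive generation (one forward pass per generation time step, with each prediction fed back as input, as quoted from Paine) and feedforward generation (a single forward pass) can be pictured in Python as follows. The toy models, the function names, the constant NUM_VALUES, and the pass counter are hypothetical and are not taken from Wang, Paine, or the instant application.

```python
import math
import random

NUM_VALUES = 4  # hypothetical number of quantized waveform values


def softmax(logits):
    """Turn a list of logits into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def autoregressive_step(context, previous_samples, counter):
    """One forward pass of a toy autoregressive model.

    Returns a distribution for the next time step only, conditioned on
    the samples generated at the preceding time steps.
    """
    counter["passes"] += 1
    last = previous_samples[-1] if previous_samples else 0
    center = (context + last) % NUM_VALUES
    return softmax([-abs(v - center) for v in range(NUM_VALUES)])


def generate_autoregressively(context, num_steps, rng, counter):
    """Sample step by step, feeding each sample back as input."""
    samples, distributions = [], []
    for _ in range(num_steps):
        dist = autoregressive_step(context, samples, counter)
        samples.append(rng.choices(range(NUM_VALUES), weights=dist)[0])
        distributions.append(dist)
    return samples, distributions


def feedforward_pass(context, num_steps, counter):
    """One forward pass of a toy feedforward model.

    Returns a distribution for every time step at once; no output is
    fed back as input.
    """
    counter["passes"] += 1
    return [
        softmax([-abs(v - (context + t) % NUM_VALUES) for v in range(NUM_VALUES)])
        for t in range(num_steps)
    ]


ar_counter = {"passes": 0}
ff_counter = {"passes": 0}
_, ar_dists = generate_autoregressively(2, 5, random.Random(0), ar_counter)
ff_dists = feedforward_pass(2, 5, ff_counter)
print(ar_counter["passes"], ff_counter["passes"])  # prints "5 1"
```

In this sketch both models yield one likelihood distribution per generation time step; they differ only in the number of forward passes needed to obtain them.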

	However, the combination of Wang and Paine does not explicitly teach determining a first gradient with respect to the feedforward parameters to minimize a divergence loss that depends on, for each of the plurality of generation time steps, a first divergence from the likelihood distribution defined by the autoregressive output
	determining an update to the current values of the feedforward parameters based at least in part on the first gradient.

	Jinyu Li, in the same field of endeavor, teaches determining a first gradient with respect to the feedforward parameters to minimize a divergence loss that depends on, for each of the plurality of generation time steps, a first divergence from the likelihood distribution defined by the [autoregressive] output ([p. 1 Col. 2 Sec. 2] "The training criterion is to minimize cross entropy which is reduced to minimize the negative log likelihood because every frame has only one target label st (4) ...The DNN parameters are optimized with back propagation using stochastic gradient descent." See also Eqn. 5 and 6.  The claim language is interpreted as having a single pass of the generative network corresponding to each time step.)
	determining an update to the current values of the feedforward parameters based at least in part on the first gradient.([p. 1 Col. 2 Sec. 2] "The DNN parameters are optimized with back propagation using stochastic gradient descent.").

	Wang, Paine, and Jinyu Li are all directed towards generative neural networks for speech synthesis.  Therefore, Wang, Paine, and Jinyu Li are analogous art in the same field of endeavor.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of Wang, Paine, and Jinyu Li by using a KL divergence as the loss function that is minimized when training the neural network.  Jinyu Li teaches as a motivation for combination ([p. 2 Col. 2] “Without the need for transcriptions, the small-size DNN trained based on optimizing Eq. (6) can use much more training data than trained with Eq. (4) for standard DNN training…This training criterion is particularly useful for the industry scenario, where the amount of un-transcribed data is much larger than the amount of transcribed data due to the deployment feed-back loop.”).  This motivation for combination also applies to the remaining claims which depend on this combination.
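
For illustration only, the divergence-based training step discussed above (a divergence between a teacher distribution and a student distribution at each time step, a gradient, and a gradient-descent update) can be sketched in Python as follows. The sketch makes several assumptions that are not found in Jinyu Li or the claims: the student's logits are treated directly as the trainable parameters, the divergence is taken in the direction KL(teacher || student), and all function names and numeric values are hypothetical.

```python
import math


def softmax(logits):
    """Turn a list of logits into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same values."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def divergence_loss_and_gradient(teacher_dists, student_logits):
    """Sum of per-time-step KL(teacher_t || student_t) and its gradient.

    For a softmax student, the gradient of KL(p || softmax(z)) with
    respect to the logits z is softmax(z) - p.
    """
    loss, grads = 0.0, []
    for p, logits in zip(teacher_dists, student_logits):
        q = softmax(logits)
        loss += kl_divergence(p, q)
        grads.append([qi - pi for qi, pi in zip(q, p)])
    return loss, grads


def sgd_update(student_logits, grads, learning_rate):
    """One plain gradient-descent update of the student logits."""
    return [
        [z - learning_rate * g for z, g in zip(step_logits, step_grads)]
        for step_logits, step_grads in zip(student_logits, grads)
    ]


# Hypothetical teacher distributions for two time steps.
teacher = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
student_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

loss_before, grads = divergence_loss_and_gradient(teacher, student_logits)
student_logits = sgd_update(student_logits, grads, 0.5)
loss_after, _ = divergence_loss_and_gradient(teacher, student_logits)
print(loss_after < loss_before)  # prints "True"
```

The loss here is a sum of the per-time-step divergences, so the same sketch also illustrates a divergence loss that depends on a sum over time steps.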

	Regarding claim 3, the combination of Wang, Paine, and Jinyu Li teaches The method of claim 1, wherein the training further comprises: obtaining a ground-truth output example for the training context input; and (Wang [p. 2473 §3.2] "For acoustic modeling, the text of an utterance is converted into a sequence of frames {I1, I2, · · · , It, · · · , IT }, wherein T is the total number of frames of the training data utterance." Training data interpreted as synonymous with ground-truth output example. Jinyu Li [p. 3 §3.2] "As every senone is represented by a Gaussian distribution, for any pair of senones and , we use the symmetric KL divergence as their distance measure, which is the metric to cluster a large set of senones into a small set.")
	generating, from the training feedforward output, a predicted output example by sampling from the probability distributions.(Jinyu Li [p. 4 Col. 1 Sec. 4.2] "The first one uses the standard decision-tree-based process to generate a 1k senone set...the standard method which splits the decision tree by using the likelihood from single Gaussians." See also section 3.2.  Smaller senone set is sampled from gaussian probability distribution.).
	
	 Regarding claim 9, the combination of Wang, Paine, and Jinyu Li teaches The method of claim 1, wherein the training further comprises: obtaining a different context input;(Wang [p. 2472 §2] "a specific data corpus must be prepared to train each module. For example, prosodic models are usually trained on the Boston University News Radio Corpus [17] and syntactic parsers are usually trained using the Penn Treebank corpus" Preparing an appropriate database interpreted as synonymous with obtaining a training context input.  Boston University News Radio Corpus interpreted as different corpus from the Penn Treebank Corpus and vice versa.)
	processing the different context input using the trained autoregressive generative neural network to obtain, for each of the plurality of generation time steps, a respective different autoregressive output; and(Jinyu Li [p. 2 Col. 1 Sec. 3.1] "In this paper, we propose to directly minimize the KL divergence between the output distribution of the small-size DNN and the large-size DNN by leveraging large amounts of un-transcribed data to get a better small-size DNN than using the standard training method with only transcribed data." Eqn. 5, 6.  See Eqn. 2, 3 for relationship between output and state 's'. Equations 5 and 6 describe the process of generating a likelihood distribution over a range of time steps conditioned on a context input.)
	determining a fourth gradient with respect to the feedforward parameters to maximize a contrastive loss that depends at least in part on, for each of the generation time steps, a second divergence from the likelihood distribution defined by the different autoregressive output for the generation time step and the likelihood distribution for the generation time step defined by the training feedforward output, and wherein determining the update to the current values of the feedforward parameters comprises determining the update based at least in part on the fourth gradient.(Jinyu Li [p. 1 Col. 2 Sec. 2] "The training criterion is to minimize cross entropy which is reduced to minimize the negative log likelihood because every frame has only one target label st (4) ...The DNN parameters are optimized with back propagation using stochastic gradient descent." See also Eqn. 5 and 6).
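
As a minimal sketch under stated assumptions, a contrastive objective of the kind recited in claim 9 can be pictured as a first divergence (against the teacher output for the same context) combined with a second divergence (against the teacher output for a different context). The sign convention used here (minimizing `first - gamma * second`, which pushes the second divergence up), the weight `gamma`, the KL direction, and all names and values are hypothetical and are not taken from the cited references.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same values."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def contrastive_objective(student, teacher_same, teacher_diff, gamma=0.3):
    """Per-time-step divergences combined into one value to minimize.

    first:  divergence from the teacher output for the same context.
    second: divergence from the teacher output for a different context.
    """
    first = sum(kl_divergence(p, q) for p, q in zip(teacher_same, student))
    second = sum(kl_divergence(p, q) for p, q in zip(teacher_diff, student))
    return first - gamma * second


# Hypothetical single-time-step distributions.
teacher_same = [[0.7, 0.2, 0.1]]
teacher_diff = [[0.1, 0.2, 0.7]]

matched = contrastive_objective(teacher_same, teacher_same, teacher_diff)
mismatched = contrastive_objective(teacher_diff, teacher_same, teacher_diff)
print(matched < mismatched)  # prints "True"
```

A student that matches the same-context teacher scores lower (better) than one that matches the different-context teacher, which is the behavior the contrastive term is meant to encourage in this sketch.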
	
	 Regarding claim 10, the combination of Wang, Paine, and Jinyu Li teaches The method of claim 1, wherein the first divergence is a KL divergence.(Jinyu Li [p. 2 Col. 1 Sec. 3.1] "In this paper, we propose to directly minimize the KL divergence between the output distribution of the small-size DNN and the large-size DNN by leveraging large amounts of un-transcribed data to get a better small-size DNN than using the standard training method with only transcribed data." Eqn. 5, 6.  See Eqn. 2, 3 for relationship between output and state 's'. Equations 5 and 6 describe the process of generating a likelihood distribution over a range of time steps conditioned on a context input.).
	
	 Regarding claim 12, the combination of Wang, Paine, and Jinyu Li teaches The method of claim 1, wherein the divergence loss depends at least in part on a sum of the first divergences at each of the time steps.(Jinyu Li [p. 2 Col. 2 Sec. 3.1] "For that mini-batch, calculate the error signal of Eq. (6), and then do back propagation for the small-size DNN." Eqn. 6 shows calculation of divergence as a summation at each time step.).	

	Regarding claim 13, claim 13 is substantially similar to claim 1.  Therefore, the rejection applied to claim 1 also applies to claim 13. 

	Regarding claims 21 and 23, claims 21 and 23 are directed towards a system comprising computers and storage devices storing instructions that, when executed by the one or more computers, cause the computers to perform the method of claims 1 and 3, respectively.  Therefore, the rejections applied to claims 1 and 3 also apply to claims 21 and 23.  Paine explicitly teaches that the generation is performed using a computer graphics processing unit ([p. 5 §3.2] "When L is small, the naive implementation performs better than expected due to GPU parallelization of the convolution operations. However, when L is large, our efficient implementation starts to significantly outperform the naive method.") and further provides the code used to execute the method on a computer ([p. 1 §1 footnote 3]).

	Claims 2, 14, and 22 are rejected under 35 U.S.C. § 103 as being unpatentable over the combination of Wang, Paine, Jinyu Li, and Kan Li (“The Kernel Adaptive Autoregressive-Moving-Average Algorithm”, 2015).

	 Regarding claim 2, the combination of Wang, Paine, and Jinyu Li teaches The method of claim 1.
	However, the combination of Wang, Paine, and Jinyu Li does not explicitly teach that the feedforward input further comprises a respective noise input at each of the generation time steps.

	Kan Li, in the same field of endeavor, teaches the feedforward input further comprises a respective noise input at each of the generation time steps. ([p. 335 Col. 2] "The extended KRLS (Ex-KRLS) algorithm [6] is the kernelized extended RLS algorithm [27] and can only model a random walk where wi is the state or process noise." See Eqn. 7 where noise is added to the input.).

	The combination of Wang, Paine, and Jinyu Li as well as Kan Li are directed towards using neural networks for text to speech generation.  Therefore, the combination of Wang, Paine, and Jinyu Li as well as Kan Li are analogous art in the same field of endeavor.  It would have been obvious before the effective filing date of the claimed invention to combine the teachings of the combination of Wang, Paine, and Jinyu Li with the teachings of Kan Li by adding a noise input at each of the time steps. The addition of noise is well known in signal processing and more specifically speech synthesis and would be obvious to one of ordinary skill in the art.  Kan Li provides as motivation for combination ([p. 335 §I] “We demonstrate the computational power of the KAARMA algorithm by solving a set of benchmark grammatical inference problems and comparing its performance with RNNs operating on equivalent recurrent architectures in the input space. Furthermore, we show that KAARMA-based DFA can outperform LSMs on spike data, which opens the door for many novel neuroscience applications”).
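
By way of a hypothetical sketch only, a feedforward input that comprises a respective noise input at each generation time step could be assembled as shown below. The choice of a standard Gaussian noise source, the concatenation of one noise value onto each time step's feature vector, and the function name are assumptions made for illustration and are not drawn from Kan Li or the other cited references.

```python
import random


def build_feedforward_input(step_features, rng):
    """Append one freshly drawn noise value to each time step's features.

    step_features: list of per-time-step feature vectors (lists of floats).
    rng: a random.Random instance, passed in so results are reproducible.
    """
    return [features + [rng.gauss(0.0, 1.0)] for features in step_features]


# Hypothetical linguistic features for three generation time steps.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ff_input = build_feedforward_input(features, random.Random(0))
print(len(ff_input), len(ff_input[0]))  # prints "3 3"
```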

Regarding claim 14, claim 14 is substantially similar to claim 2.  Therefore, the rejection applied to claim 2 also applies to claim 14.

Regarding claim 22, claim 22 is directed towards a system comprising computers and storage devices storing instructions that when executed by the one or more computers, cause the computers to perform the method of claim 2.  Therefore, the rejection applied to claim 2 also applies to claim 22.  Paine explicitly teaches that the generation is performed using a computer graphics processing unit ([p. 5 §3.2] "When L is small, the naive implementation performs better than expected due to GPU parallelization of the convolution operations. However, when L is large, our efficient implementation starts to significantly outperform the naive method.") and further provides the code used to execute the method on a computer ([p. 1 §1 footnote 3]).  

	Claims 4 and 24 are rejected under 35 U.S.C. §103 as being unpatentable over the combination of Wang, Paine, Jinyu Li, and Mohammadi (US 10186252 B1).

	Regarding claim 4, the combination of Wang, Paine, and Jinyu Li teaches the method of claim 3.
	However, the combination of Wang, Paine, and Jinyu Li doesn't explicitly teach the ground-truth output example and the predicted output example are speech waveforms, wherein the training further comprises:
	generating a first magnitude spectrogram of the ground-truth output example;
	generating a second magnitude spectrogram of the predicted output example;
	determining a second gradient with respect to the feedforward parameters to minimize a magnitude loss that depends on the difference between the first and second magnitude spectrograms, and wherein determining the update to the current values of the feedforward parameters comprises determining the update based at least in part on the second gradient.

	Mohammadi, in the same field of endeavor, teaches The method of claim 3, wherein the ground-truth output example and the predicted output example are speech waveforms, wherein the training further comprises:([Abstract] "The text is decomposed into a sequence of phonemes and a text feature matrix constructed to define the manner in which the phonemes are pronounced and accented. A spectrum generator then queries a neural network to produce normalized spectrograms based on the input of the sequence of phonemes and features." [Col. 2 l. 10-13] "the pitch contours associated with phonemes are also normalized before retrieval and subsequently de-normalized based on the associated duration. The de-normalized pitch contours are used to convert the de-normalized spectrograms into waveforms that are concatenated into the synthetic speech.")
	generating a first magnitude spectrogram of the ground-truth output example;(FIG. 1 training audio 112 is passed through 122 and 120 to create spectrum matrix 140. [Col. 3 l. 42] "the system also includes a converter 122 that converts the audio representation of the speaker from an audio file to a representation in terms of Mel Cepstral coefficients and pitch...Based on the output of the forced alignment 116 and the mel cepstal coefficients, a spectrogram generator 120 produces a spectrum matrix 140.")
	generating a second magnitude spectrogram of the predicted output example;([Abstract] "The text is decomposed into a sequence of phonemes and a text feature matrix constructed to define the manner in which the phonemes are pronounced and accented. A spectrum generator then queries a neural network to produce normalized spectrograms based on the input of the sequence of phonemes and features." see also 140.)
	determining a second gradient with respect to the feedforward parameters to minimize a magnitude loss that depends on the difference between the first and second magnitude spectrograms, and wherein determining the update to the current values of the feedforward parameters comprises determining the update based at least in part on the second gradient.([Col. 4 l. 17] "Thereafter, the neural network trainer 150 trains three deep neural networks, one for the spectrum data, one for the duration data, and one for the pitch data...Y=[y1, . . . , yN] represent the output matrix which is the spectrum matrix" [Col. 4 l. 44] "The goal of the DNN training stage is to optimize the F function by estimating the parameters of the model: Ŷ=F(X) such that Ŷ is the most similar to Y" [Col. 5 l. 16] "The vectors X and Y can be used to train the deep neural network using a batch training process that evaluates all data at once in each iteration before updating the weights and biases using a gradient descent algorithm" Mohammadi explicitly teaches using gradient descent based on the known and predicted spectrogram matrices.).
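
	For illustration only, a magnitude loss of the kind recited in claim 4 can be sketched as follows. The sketch is not taken from Mohammadi or from the instant specification; the Hann window, the frame and hop sizes, and the mean-squared form of the loss are assumptions, since the claim requires only that the loss depend on the difference between the two magnitude spectrograms. The gradient with respect to the feedforward parameters is not shown and would ordinarily be obtained by automatic differentiation.

```python
import cmath
import math


def magnitude_spectrogram(waveform, frame_len=8, hop=4):
    """Return |DFT| of Hann-windowed frames (plain O(n^2) DFT, for illustration)."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(waveform) - frame_len + 1, hop):
        frame = [waveform[start + n] * window[n] for n in range(frame_len)]
        bins = []
        # Keep only the non-negative frequency bins of the real-valued signal.
        for k in range(frame_len // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            bins.append(abs(acc))
        frames.append(bins)
    return frames


def magnitude_loss(ground_truth, predicted, frame_len=8, hop=4):
    """Mean squared difference between the two magnitude spectrograms (assumed form)."""
    first = magnitude_spectrogram(ground_truth, frame_len, hop)
    second = magnitude_spectrogram(predicted, frame_len, hop)
    diffs = [(a - b) ** 2
             for f1, f2 in zip(first, second)
             for a, b in zip(f1, f2)]
    return sum(diffs) / len(diffs)


# Hypothetical waveforms: a sinusoid and a half-amplitude copy of it.
truth = [math.sin(2 * math.pi * 2 * n / 16) for n in range(32)]
pred = [0.5 * x for x in truth]
loss = magnitude_loss(truth, pred)
```

	The loss is zero when the two waveforms are identical and positive for the half-amplitude prediction above.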

	The combination of Wang, Paine, and Jinyu Li as well as Mohammadi are directed towards using neural networks for text to speech generation.  Therefore, the combination of Wang, Paine, and Jinyu Li as well as Mohammadi are analogous art in the same field of endeavor.  It would have been obvious before the effective filing date of the claimed invention to combine the teachings of the combination of Wang, Paine, and Jinyu Li with the teachings of Mohammadi by generating spectrograms. Spectrograms are well known in the field of signal processing and more specifically speech synthesis, and the usage of spectrograms would have been obvious to one of ordinary skill in the art.  Mohammadi specifically teaches that in speech synthesis it is common that ([Abstract] “A spectrum generator then queries a neural network to produce normalized spectrograms based on the input of the sequence of phonemes and features”) and ([Col. 5 l. 38-50] “is difficult to encode in a HMM and results in various approximations that reduce the accuracy of the spectrogram model. In contrast to the prior art, FIG. 2B illustrates spectrogram words encoding in accordance with the preferred embodiment of Phrase level the present invention”).

Regarding claim 24, claim 24 is directed towards a system comprising computers and storage devices storing instructions that when executed by the one or more computers, cause the computers to perform the method of claim 4.  Therefore, the rejection applied to claim 4 also applies to claim 24.  Paine explicitly teaches that the generation is performed using a computer graphics processing unit ([p. 5 §3.2] "When L is small, the naive implementation performs better than expected due to GPU parallelization of the convolution operations. However, when L is large, our efficient implementation starts to significantly outperform the naive method.") and further provides the code used to execute the method on a computer ([p. 1 §1 footnote 3]).  

	Claims 5-8 and 25-26 are rejected under 35 U.S.C. §103 as being unpatentable over the combination of Wang, Paine, Jinyu Li, and Bo Li (US 2017/0278513 A1).

	 Regarding claim 5, the combination of Wang, Paine, and Jinyu Li teaches determining a third gradient with respect to the feedforward parameters to minimize a perceptual loss that depends on a measure of difference between the features of the ground-truth output example and the features of the predicted output example, and wherein determining the update to the current values of the feedforward parameters comprises determining the update based at least in part on the third gradient.(Jinyu Li [p. 1 Col. 2 Sec. 2] "The training criterion is to minimize cross entropy which is reduced to minimize the negative log likelihood because every frame has only one target label st (4)...The DNN parameters are optimized with back propagation using stochastic gradient descent." See also Eqn. 5 and 6).
	However, the combination of Wang, Paine, and Jinyu Li doesn't explicitly teach the training further comprises: processing the ground-truth output example using a trained feature generation neural network to obtain features of the ground-truth output example, wherein the trained feature generation neural network is a pre-trained neural network that takes a waveform as input;
	processing the predicted output example using the trained feature generation neural network to obtain features of the predicted output example.

	Bo Li, in the same field of endeavor, teaches the training further comprises: processing the ground-truth output example using a trained feature generation neural network to obtain features of the ground-truth output example, wherein the trained feature generation neural network is a pre-trained neural network that takes a waveform as input; (FIG. 4, element 410, which takes a two-channel waveform as input)
	processing the predicted output example using the trained feature generation neural network to obtain features of the predicted output example,([¶0073] "The computing system 320 may provide the output of the neural network 323 to a filter and sum module 325." See FIG. 3 output of 323 passed to 327 [¶0074] "For example, the neural network 327 indicates likelihoods that time-frequency feature representations correspond to different speech units when the time-frequency feature representations are output by filter module 325 and based on audio waveform samples 321" Acoustic Model neural network interpreted as synonymous with trained feature generation network.).
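
	For illustration only, a perceptual loss of the kind recited in claim 5 can be sketched as follows. The sketch is not taken from Bo Li or from the instant specification; feature_network is a hypothetical stand-in for a frozen, pre-trained network (here a single fixed convolution followed by a ReLU), and the mean-squared measure of difference between the features is an assumption.

```python
def feature_network(waveform, kernel=(0.25, 0.5, 0.25)):
    """Hypothetical stand-in for a frozen, pre-trained feature network.

    Takes a raw waveform as input and returns the activations of one
    layer: a fixed 1-D convolution followed by a ReLU.
    """
    width = len(kernel)
    features = []
    for i in range(len(waveform) - width + 1):
        activation = sum(kernel[j] * waveform[i + j] for j in range(width))
        features.append(max(0.0, activation))
    return features


def perceptual_loss(ground_truth, predicted):
    """Mean squared difference between the features of the two waveforms (assumed form)."""
    f_true = feature_network(ground_truth)
    f_pred = feature_network(predicted)
    return sum((a - b) ** 2 for a, b in zip(f_true, f_pred)) / len(f_true)


# Hypothetical ground-truth and predicted waveforms of equal length.
truth = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0]
pred = [0.0, 0.8, 0.1, -0.9, 0.0, 1.1]
loss = perceptual_loss(truth, pred)
```

	Both waveforms pass through the same fixed network, so the loss compares them in feature space rather than sample by sample.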

	The combination of Wang, Paine, and Jinyu Li as well as Bo Li are directed towards using neural networks for speech synthesis.  Therefore, the combination of Wang, Paine, and Jinyu Li as well as Bo Li are analogous art in the same field of endeavor.  It would have been obvious before the effective filing date of the claimed invention to combine the teachings of the combination of Wang, Paine, and Jinyu Li with the teachings of Bo Li by taking a waveform as an input.  Bo Li teaches as a motivation for combination ([¶0038] “The training process can be enhanced using gated feedback. Recognition information from acoustic model reflects the content of speech and is believed to help earlier layers of the network. Augmenting the network input at each frame with the prediction from the previous frame can improve performance”).  

	 Regarding claim 6, the combination of Wang, Paine, Jinyu Li, and Bo Li teaches The method of claim 5, wherein the feature generation neural network is a speech recognition neural network.(Bo Li [Abstract] "Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for neural network adaptive beamforming for multichannel speech recognition are disclosed.").
	
	Regarding claim 7, the combination of Wang, Paine, Jinyu Li, and Bo Li teaches the method of claim 5, wherein the features are outputs of an intermediate layer in the feature generation network (Bo Li [¶0087] "In some implementations, the neural network trained as an acoustic model includes a convolutional layer and multiple hidden layers." The output of an intermediate layer is interpreted as synonymous with the output of a hidden layer.).
	
	 Regarding claim 8, the combination of Wang, Paine, Jinyu Li, and Bo Li teaches The method of claim 5, wherein the feature generation neural network is a trained autoregressive generative neural network.(Paine [p. 1 §1] "Wavenet (Oord et al., 2016), a deep generative model of raw audio waveforms, has drawn a tremendous amount of attention since it was first released. It changed existing paradigms in audio generation by directly modeling the raw waveform of audio signals" [p. 6 §4] "generation scheme can be applied anytime one wants to perform auto-regressive generation or online prediction using a model with dilated convolution layers").

Regarding claims 25-26, claims 25-26 are directed towards a system comprising computers and storage devices storing instructions that when executed by the one or more computers, cause the computers to perform the method of claims 5-6, respectively.  Therefore, the rejection applied to claims 5-6 also applies to claims 25-26.  Paine explicitly teaches that the generation is performed using a computer graphics processing unit ([p. 5 §3.2] "When L is small, the naive implementation performs better than expected due to GPU parallelization of the convolution operations. However, when L is large, our efficient implementation starts to significantly outperform the naive method.") and further provides the code used to execute the method on a computer ([p. 1 §1 footnote 3]).  

	Claim 11 is rejected under 35 U.S.C. §103 as being unpatentable over the combination of Wang, Paine, Jinyu Li, and Aosen Wang (US 2019/0050710 A1).

	 Regarding claim 11, the combination of Wang, Paine, and Jinyu Li teaches The method of claim 1.
	However, the combination of Wang, Paine, and Jinyu Li does not explicitly teach wherein the first divergence is a Jensen-Shannon Divergence.

	Aosen Wang, in the same field of endeavor, teaches the first divergence is a Jensen-Shannon Divergence ([¶0062] "In some embodiments, Jensen-Shannon divergence between the two statistical distributions for each layer (or for the model as a whole) is used to identify the optimal bit-widths with the least information loss for that layer.").

	The combination of Wang, Paine, and Jinyu Li as well as Aosen Wang are directed towards using neural networks for speech synthesis.  Therefore, the combination of Wang, Paine, and Jinyu Li as well as Aosen Wang are analogous art in the same field of endeavor.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to substitute the KL Divergence of the combination of Wang, Paine, and Jinyu Li with the Jensen-Shannon Divergence in Aosen Wang.  A Jensen-Shannon divergence is well known in the art, and the substitution of a KL Divergence with a Jensen-Shannon divergence would have been obvious to one of ordinary skill.  Aosen Wang teaches as a motivation for combination ([¶0062] "In some embodiments, Jensen-Shannon divergence between the two statistical distributions for each layer (or for the model as a whole) is used to identify the optimal bit-widths with the least information loss for that layer.").
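
	For illustration only, the relationship between the KL Divergence and the Jensen-Shannon Divergence discussed above can be sketched as follows, using the standard textbook definitions for discrete distributions. The example distributions p and q are hypothetical and are not taken from any cited reference.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their midpoint mixture."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical likelihood distributions over three possible output values.
p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
js_pq = js_divergence(p, q)
js_qp = js_divergence(q, p)
```

	The KL Divergence depends on the order of its arguments, whereas the Jensen-Shannon Divergence is symmetric and, with natural logarithms, is bounded above by ln 2.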

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Arik (“Deep Voice: Real-time Neural Text-to-Speech”, 2017) is directed towards an auto-regressive neural network for text to speech.  Faria (“Discriminative Acoustic Features for Deployable Speech Recognition”, 2016) is directed towards a text-to-speech system and teaches ([p. 56 Ch. 4] "However, the quality of commercially produced transcription for other media can vary from 5-10% word error rate [137]. In these cases, the problem of long audio alignment has been addressed by recursive [124], [138] or multi-pass [137] strategies that attempt to recognize speech and then perform text-level alignment [139], [140]. An alternative single-pass approach proceeds with the alignment of incremental chunks of audio or text data [141], and may include a modified Viterbi search algorithm [142]; a similar greedy variant of DTW is considered in [143]. Presegmentation at phrase boundaries has also been used for long audio alignment [144].").
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SIDNEY VINCENT BOSTWICK whose telephone number is (571)272-4720.  The examiner can normally be reached on M-F 7:30am-5:00pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang, can be reached at (571)270-7092.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/SB/Examiner, Art Unit 2124



/MIRANDA M HUANG/Supervisory Patent Examiner, Art Unit 2124