Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Remarks
This Office Action is responsive to Applicants' Amendment filed on April 14, 2022, in which claims 1, 3, and 13 are amended. Claims 15-20 are withdrawn from consideration. Claims 1-14 and 21-26 are currently pending.

Information Disclosure Statement
The information disclosure statement (IDS) submitted on February 17, 2022 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Response to Arguments
The rejections of claims 3 and 13 under 35 U.S.C. § 112(b) are hereby withdrawn, as necessitated by applicant's amendments and remarks.
Applicant’s arguments with respect to rejection of claims 1-14 under 35 U.S.C. 101 based on amendment have been considered and are persuasive. The rejections are hereby withdrawn, as necessitated by applicant's amendments and remarks made to the rejections.
Applicant’s arguments with respect to rejection of claims 1-14 under 35 U.S.C. 103(a) based on amendment have been considered and are persuasive as to the previous ground of rejection. However, the arguments are moot in view of the new ground of rejection set forth below.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: 
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


	Claims 1, 3, 9-10, 12-13, 21, and 23 are rejected under 35 U.S.C. 103 as being unpatentable over Chouireb (“Towards a high quality Arabic speech synthesis system based on neural networks and residual excited vocal tract model”, 2008) in view of Jinyu Li (“Learning Small Size DNN with Output Distribution Based Criteria”, 2014).

	Regarding claim 1, Chouireb teaches A computer-implemented method of training a feedforward generative neural network having a plurality of feedforward parameters and configured to generate an output audio speech waveform conditioned on features of an input text segment ([p. 74 §1] "This paper proposes multi-layer back-propagation neural networks models for the generation of coder parameter vectors and prosodic information, as well as a residual excited LPC coder to perform standard Arabic TTS system. The present paper will concentrate essentially on the neural network which converts phonetic and timing information to the LSF parameters required to generate the speech waveform." [p. 80 §6] "In our system, three standard feedforward backpropagation neural networks are used to perform prosodic generation")
	wherein each output example includes a respective output sample at each of a plurality of generation time steps, ([p. 79 §5] "The value of time index i during frame j is calculated using Eq. (6) (we have chosen β = 0.2), such that time index i reaches its maximum value during frame j = I" The calculated value of the time index during frame j is interpreted as synonymous with a respective output sample at each of a plurality of time steps.)
	wherein the training comprises: obtaining a training context input; ([p. 75 §4] "In order to train and validate a neural network to perform phonetic-to-acoustic mapping, it was necessary to prepare an appropriate database. This database, consisting of a set of recordings of speech from a single speaker, was then labeled phonetically, syntactically and prosodically" Preparing an appropriate database interpreted as synonymous with obtaining a training context input.)
	processing a training feedforward input comprising the training context input using the feedforward generative neural network in accordance with current values of the feedforward parameters to generate a training feedforward output; ([p. 79 §5.2] "The network is trained using back-propagation algorithm. A block diagram of the training part is shown in Fig. 5" See FIG. 5.  Training parameters interpreted as synonymous with feedforward parameters to generate a training feedforward output.)
	processing the training context input using a trained autoregressive generative neural network, ([p. 74 §2] "The complete system is shown in Fig. 1. The text-to-speech system includes a text-to-linguistic description subsystem, three neural networks used to assign prosodic information such as: duration, gain and pitch to each phonetic segment, another neural network used to convert the linguistic description and phoneme duration into a series of coder parameter vectors and the synthesis section of a parametric speech coder which uses a source-filter model. This coder presents an autoregressive filter, using line spectral frequencies (LSF) to drive this filter" See also FIG. 5 for processing of training context input. Using an autoregressive filter in a trained generative neural network is interpreted as synonymous with using a trained autoregressive generative neural network.)
	However, Chouireb does not explicitly teach wherein the feedforward generative neural network is configured to receive a feedforward input comprising the features of the input text segment and to process the feedforward input to generate a feedforward output that defines, for each of the generation time steps, a respective likelihood distribution over possible values for the output audio speech waveform at the generation time step, and 
	wherein the trained autoregressive generative neural network has been trained to autoregressively generate, for each of the plurality of generation time steps, an autoregressive output that defines a likelihood distribution over possible values for an output audio waveform of the text segment being spoken at the generation time step conditioned on output samples at preceding generation time steps; 
	determining a first gradient with respect to the feedforward parameters to minimize a divergence loss that depends on, for each of the generation time steps, a first divergence from the likelihood distribution defined by the autoregressive output for the generation time step and the likelihood distribution for the generation time step defined by the training feedforward output; and 
	determining an update to the current values of the feedforward parameters based at least in part on the first gradient.  

Jinyu Li, in the same field of endeavor, teaches wherein the feedforward generative neural network is configured to receive a feedforward input comprising the features of the input text segment and to process the feedforward input to generate a feedforward output that defines, for each of the generation time steps, a respective likelihood distribution over possible values for the output audio speech waveform at the generation time step, and ([p. 2 Col. 1 Sec. 3.1] "In this paper, we propose to directly minimize the KL divergence between the output distribution of the small-size DNN and the large-size DNN by leveraging large amounts of un-transcribed data to get a better small-size DNN than using the standard training method with only transcribed data." Eqn. 5, 6. See Eqn. 2, 3 for relationship between output and state 's'. Equations 5 and 6 describe the process of generating a likelihood distribution over a range of time steps conditioned on a context output.)
	wherein the trained autoregressive generative neural network has been trained to [autoregressively] generate ([p. 2 §3.2] “We also want to generate a small senone set with a better accuracy than using the standard senone generation method”), for each of the plurality of generation time steps ([p. 2 §3.2] "where fi(x) is the i-th feature function for input x, λsi is the weight for the s-th class and i-th feature" i-th feature function for input x interpreted as synonymous with time step), an [autoregressive] output that defines a likelihood distribution over possible values for an output [audio waveform] of the [text] segment being spoken at the generation time step conditioned on output samples at preceding generation time steps; ([p. 2 Col. 1 Sec. 3.1] "In this paper, we propose to directly minimize the KL divergence between the output distribution of the small-size DNN and the large-size DNN by leveraging large amounts of un-transcribed data to get a better small-size DNN than using the standard training method with only transcribed data." Eqn. 5, 6. See Eqn. 2, 3 for relationship between output and state 's'. Equations 5 and 6 describe the process of generating a likelihood distribution over a range of time steps conditioned on an output of the previous generation time step.)
	determining a first gradient ([p. 1 §2] “The DNN parameters are optimized with back propagation using stochastic gradient descent”) with respect to the feedforward parameters ([p. 2 §3 Col. 2] “For each mini-batch, do forward propagation of both large-size and small-size DNNs”) to minimize a divergence loss that depends on, for each of the generation time steps, a first divergence from the likelihood distribution defined by the [autoregressive] output for the generation ([p. 2 §3.2] “We also want to generate a small senone set with a better accuracy than using the standard senone generation method”) time step ([p. 2 §3.2] "where fi(x) is the i-th feature function for input x, λsi is the weight for the s-th class and i-th feature" i-th feature function for input x interpreted as synonymous with time step) and the likelihood distribution for the generation time step defined by the training feedforward output; and ([p. 1 Col. 2 Sec. 2] "The training criterion is to minimize cross entropy which is reduced to minimize the negative log likelihood because every frame has only one target label st (4) ...The DNN parameters are optimized with back propagation using stochastic gradient descent." See also Eqn. 5 and 6)
	determining an update to the current values of the feedforward parameters based at least in part on the first gradient. ([p. 1 Col. 2 Sec. 2] "The DNN parameters are optimized with back propagation using stochastic gradient descent."). 

	Chouireb and Jinyu Li are both directed towards generative neural networks for speech synthesis. Therefore, Chouireb and Jinyu Li are analogous art in the same field of endeavor. It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of Chouireb and Jinyu Li by using a KL divergence to probabilistically minimize the loss function in the neural network. While Jinyu Li does not explicitly teach an autoregressive input or output, this feature is already taught by the primary reference Chouireb. Jinyu Li teaches as a motivation for combination ([p. 2 Col. 2] “Without the need for transcriptions, the small-size DNN trained based on optimizing Eq. (6) can use much more training data than trained with Eq. (4) for standard DNN training… This training criterion is particularly useful for the industry scenario, where the amount of un-transcribed data is much larger than the amount of transcribed data due to the deployment feed-back loop.”).
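
	For illustration only, the divergence-based training step recited in claim 1 can be sketched as follows. The sketch is not taken from Chouireb, Jinyu Li, or the application. It assumes a hypothetical toy setting in which the "feedforward parameters" are simply per-time-step logits, the trained autoregressive (teacher) network is represented by fixed per-time-step probability lists, and the divergence is taken as KL(teacher || student). The names softmax, kl_divergence, divergence_loss_and_grad, and sgd_step are illustrative assumptions.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def divergence_loss_and_grad(teacher_probs, student_logits):
    """Sum KL(teacher_t || student_t) over time steps.

    Returns the loss and its gradient with respect to the student logits.
    For a softmax student, d KL / d logits = student_probs - teacher_probs.
    """
    loss = 0.0
    grads = []
    for p, z in zip(teacher_probs, student_logits):
        q = softmax(z)
        loss += kl_divergence(p, q)
        grads.append([qi - pi for pi, qi in zip(p, q)])
    return loss, grads

def sgd_step(student_logits, grads, lr=0.5):
    """Update the (toy) student parameters using the gradient."""
    return [[z - lr * g for z, g in zip(zs, gs)]
            for zs, gs in zip(student_logits, grads)]

# Assumed toy data: two time steps, three possible output values each.
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

loss_before, grads = divergence_loss_and_grad(teacher, logits)
logits = sgd_step(logits, grads)
loss_after, _ = divergence_loss_and_grad(teacher, logits)
```

	In this toy setting a single update reduces the summed divergence. A real system would backpropagate the same gradient through the feedforward network's weights rather than updating logits directly.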

	While not relied upon, special attention is directed towards the art of Kan Li (“The Kernel Adaptive Autoregressive-Moving-Average Algorithm”, 2015) which, with the exception of taking text input and outputting speech audio waveforms, teaches an autoregressive generative neural network similar to that described by the claim language. Kan Li is seen as a secondary mapping which could be provided as an alternate rejection. While the amendments significantly changed the scope of the claims, Kan Li still teaches the neural network system with the exception of the audio waveform and input text segment, whose deficiencies are cured by Chouireb. The mapping below is only provided in order to strengthen the obviousness of the claimed invention by combination, and not in any way to replace the previously introduced mapping of Chouireb and Jinyu Li.

Kan Li teaches A computer-implemented method of training a feedforward generative neural network having a plurality of feedforward parameters and configured to generate an output [audio speech waveform] conditioned on features of an input [text] segment ([p. 334 Col. 1] "In this paper, we present a novel kernel adaptive recurrent filtering algorithm based on the autoregressive moving-average (ARMA) model, which is trained with recurrent stochastic gradient descent...We demonstrate its capabilities to provide exact solutions with compact structures by solving a set of benchmark nondeterministic polynomial-complete problems involving grammatical inference." [p. 340 Col. 1] "During training, we can treat the recurrent network as a single multistage feedforward network" [p. 337 Col. 1] "By the representer theorem, the SSM defined by (14) and (15) can be expressed as the following set of weights" Solving interpreted as synonymous with generating output examples. Weights interpreted as a form of feedforward parameter.)
wherein each output example includes a respective output sample at each of a plurality of generation time steps, ([p. 339 Col. 1] Algorithm 1. output y is generated at each time step t.)
wherein the training comprises: obtaining a training context input; ([p. 335 Col. 1] "By mapping the input symbols into a potentially infinite dimensional feature space, an adaptive filter with feedback can be trained to approximate any dynamical or nonlinear time-dependent relationship" input symbols interpreted as synonymous with context input.  Obtaining interpreted as a function contained within mapping.).
processing a training feedforward input comprising the training context input using the feedforward generative neural network in accordance with current values of the feedforward parameters to generate a training feedforward output; ([p. 337 Col. 1] FIG. 2, Eqn. 10, 11. "This augmented state vector si ∈ Rns is formed by concatenating the output yi with the original state vector xi" current parameter values interpreted as synonymous with state vector.)
processing the training context input using a trained autoregressive generative neural network, ([p. 334 Col. 1] "In this paper, we present a novel kernel adaptive recurrent filtering algorithm based on the autoregressive moving-average (ARMA) model, which is trained with recurrent stochastic gradient descent...We demonstrate its capabilities to provide exact solutions with compact structures by solving a set of benchmark nondeterministic polynomial-complete problems involving grammatical inference." See also FIG. 2.).

It would have been obvious to combine the disclosure of Chouireb with that of Kan Li by using text segments as input into the neural network to generate output audio waveforms, as is well-known in the field of text-to-speech. The further combination with the loss function of Jinyu Li would be sufficient to teach the entirety of the claim limitations.

Regarding claim 3, the combination of Chouireb and Jinyu Li teaches The method of claim 1, wherein the training further comprises: obtaining a ground-truth output example for the training context input; and (Chouireb [p. 80 §5.2] "We divided up our database into three subsets: 60% of the database for the training set, 25% for the validation set, and 15% for the test set." ground-truth output example interpreted as synonymous with test set.)
	generating, from the training feedforward output, a predicted output example by sampling from the likelihood distributions. (Jinyu Li [p. 4 Col. 1 Sec. 4.2] "The first one uses the standard decision-tree-based process to generate a 1k senone set...the standard method which splits the decision tree by using the likelihood from single Gaussians." See also section 3.2. The smaller senone set is sampled from a Gaussian probability distribution.)

	Regarding claim 9, the combination of Chouireb and Jinyu Li teaches The method of claim 1, wherein the training further comprises: obtaining a different context input; (Chouireb [p. 80 §5] "We divided up our database into three subsets: 60% of the database for the training set, 25% for the validation set, and 15% for the test set" Validation set and test set both interpreted as different context inputs.)
	processing the different context input using the trained autoregressive generative neural network to obtain, for each of the plurality of generation time steps, a respective different autoregressive output; and (Jinyu Li [p. 2 Col. 1 Sec. 3.1] "In this paper, we propose to directly minimize the KL divergence between the output distribution of the small-size DNN and the large-size DNN by leveraging large amounts of un-transcribed data to get a better small-size DNN than using the standard training method with only transcribed data." Eqn. 5, 6. See Eqn. 2, 3 for relationship between output and state 's'. Equations 5 and 6 describe the process of generating a likelihood distribution over a range of time steps conditioned on a context input.)
	determining a fourth gradient with respect to the feedforward parameters to maximize a contrastive loss that depends at least in part on, for each of the generation time steps, a second divergence from the likelihood distribution defined by the different autoregressive output for the generation time step and the likelihood distribution for the generation time step defined by the training feedforward output, and wherein determining the update to the current values of the feedforward parameters comprises determining the update based at least in part on the fourth gradient. (Jinyu Li [p. 1 Col. 2 Sec. 2] "The training criterion is to minimize cross entropy which is reduced to minimize the negative log likelihood because every frame has only one target label st (4) ...The DNN parameters are optimized with back propagation using stochastic gradient descent." See also Eqn. 5 and 6). 

	Regarding claim 10, the combination of Chouireb and Jinyu Li teaches The method of claim 1, wherein the first divergence is a KL divergence. (Jinyu Li [p. 2 Col. 1 Sec. 3.1] "In this paper, we propose to directly minimize the KL divergence between the output distribution of the small-size DNN and the large-size DNN by leveraging large amounts of un-transcribed data to get a better small-size DNN than using the standard training method with only transcribed data." Eqn. 5, 6. See Eqn. 2, 3 for relationship between output and state 's'. Equations 5 and 6 describe the process of generating a likelihood distribution over a range of time steps conditioned on a context input.)

	Regarding claim 12, the combination of Chouireb and Jinyu Li teaches The method of claim 1, wherein the divergence loss depends at least in part on a sum of the first divergences at each of the time steps. (Jinyu Li [p. 2 Col. 2 Sec. 3.1] "For that mini-batch, calculate the error signal of Eq. (6), and then do back propagation for the small-size DNN." Eqn. 6 shows calculation of divergence as a summation at each time step.)
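
	For reference, a divergence loss formed as a sum of per-time-step divergences, as recited in claim 12, can be written in the following general form. The notation is assumed for illustration and is not reproduced from Jinyu Li or from the application: T is the number of generation time steps, P_t is the distribution defined by the trained (teacher) network at step t, Q_t is the distribution defined by the network being trained at step t, and v ranges over the possible output values. The direction of the KL divergence shown is one possible choice.

```latex
\mathcal{L}_{\mathrm{div}}
  = \sum_{t=1}^{T} D_{\mathrm{KL}}\!\left(P_t \,\|\, Q_t\right)
  = \sum_{t=1}^{T} \sum_{v} P_t(v)\,\log\frac{P_t(v)}{Q_t(v)}
```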
	
	Regarding claim 13, claim 13 is substantially similar to claim 1.  Therefore, the rejection applied to claim 1 also applies to claim 13. 

	Regarding claims 21 and 23, claims 21 and 23 are directed towards a system comprising computers and storage devices storing instructions that when executed by the one or more computers, cause the computers to perform the method of claims 1 and 3, respectively. Therefore, the rejections applied to claims 1 and 3 also apply to claims 21 and 23. Chouireb explicitly teaches that the generation is performed on a computer ([p. 74 §I] "In recent years, with the increasing power of modern computers, there has been a growing interest on neural networks for prosodic prediction").

	Claims 2, 14, and 22 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Chouireb and Jinyu Li, and further in view of Kan Li (“The Kernel Adaptive Autoregressive-Moving-Average Algorithm”, 2015).

	Regarding claim 2, the combination of Chouireb and Jinyu Li teaches The method of claim 1.
	However, the combination of Chouireb and Jinyu Li does not explicitly teach wherein the feedforward input further comprises a respective noise input at each of the generation time steps.  

Kan Li, in the same field of endeavor, teaches wherein the feedforward input further comprises a respective noise input at each of the generation time steps.  ([p. 335 Col. 2] "The extended KRLS (Ex-KRLS) algorithm [6] is the kernelized extended RLS algorithm [27] and can only model a random walk where wi is the state or process noise." See Eqn. 7 where noise is added to the input.). 

	Chouireb, Jinyu Li, and Kan Li are all directed towards generative neural networks for speech synthesis. Therefore, Chouireb, Jinyu Li, and Kan Li are all analogous art in the same field of endeavor. It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of Chouireb and Jinyu Li with the teachings of Kan Li by adding a noise input at each of the time steps. The addition of noise is well known in signal processing and more specifically speech synthesis and would be obvious to one of ordinary skill in the art. Kan Li provides as motivation for combination ([p. 335 §I] “We demonstrate the computational power of the KAARMA algorithm by solving a set of benchmark grammatical inference problems and comparing its performance with RNNs operating on equivalent recurrent architectures in the input space. Furthermore, we show that KAARMA-based DFA can outperform LSMs on spike data, which opens the door for many novel neuroscience applications”).

Regarding claim 14, claim 14 is substantially similar to claim 2.  Therefore, the rejection applied to claim 2 also applies to claim 14.

Regarding claim 22, claim 22 is directed towards a system comprising computers and storage devices storing instructions that when executed by the one or more computers, cause the computers to perform the method of claim 2. Therefore, the rejection applied to claim 2 also applies to claim 22. Chouireb explicitly teaches that the generation is performed on a computer ([p. 74 §I] "In recent years, with the increasing power of modern computers, there has been a growing interest on neural networks for prosodic prediction").

	Claims 4 and 24 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Chouireb and Jinyu Li, and further in view of Mohammadi (US 10186252 B1).

	Regarding claim 4, the combination of Chouireb and Jinyu Li teaches The method of claim 3.
	However, the combination of Chouireb and Jinyu Li does not explicitly teach the ground-truth output example and the predicted output example are speech waveforms, wherein the training further comprises:
	generating a first magnitude spectrogram of the ground-truth output example; 
	generating a second magnitude spectrogram of the predicted output example; 
	determining a second gradient with respect to the feedforward parameters to minimize a magnitude loss that depends on the difference between the first and second magnitude spectrograms, and wherein determining the update to the current values of the feedforward parameters comprises determining the update based at least in part on the second gradient.  

Mohammadi, in the same field of endeavor, teaches the ground-truth output example and the predicted output example are speech waveforms, wherein the training further comprises: ([Abstract] "The text is decomposed into a sequence of phonemes and a text feature matrix constructed to define the manner in which the phonemes are pronounced and accented. A spectrum generator then queries a neural network to produce normalized spectrograms based on the input of the sequence of phonemes and features." [Col. 2 l. 10-13] "the pitch contours associated with phonemes are also normalized before retrieval and subsequently de-normalized based on the associated duration. The de-normalized pitch contours are used to convert the de-normalized spectrograms into waveforms that are concatenated into the synthetic speech.")
	generating a first magnitude spectrogram of the ground-truth output example; (FIG. 1 training audio 112 is passed through 122 and 120 to create spectrum matrix 140. [Col. 3 l. 42] "the system also includes a converter 122 that converts the audio representation of the speaker from an audio file to a representation in terms of Mel Cepstral coefficients and pitch...Based on the output of the forced alignment 116 and the mel cepstal coefficients, a spectrogram generator 120 produces a spectrum matrix 140.")
	generating a second magnitude spectrogram of the predicted output example ( [Abstract] "The text is decomposed into a sequence of phonemes and a text feature matrix constructed to define the manner in which the phonemes are pronounced and accented. A spectrum generator then queries a neural network to produce normalized spectrograms based on the input of the sequence of phonemes and features." see also 140.)
	determining a second gradient with respect to the feedforward parameters to minimize a magnitude loss that depends on the difference between the first and second magnitude spectrograms, and wherein determining the update to the current values of the feedforward parameters comprises determining the update based at least in part on the second gradient. ([Col. 4 l. 17] "Thereafter, the neural network trainer 150 trains three deep neural networks, one for the spectrum data, one for the duration data, and one for the pitch data...Y=[y1, . . . , yN] represent the output matrix which is the spectrum matrix" [Col. 4 l. 44] "The goal of the DNN training stage is to optimize the F function by estimating the parameters of the model: Ŷ=F(X) such that Ŷ is the most similar to Y" [Col. 5 l. 16] "The vectors X and Y can be used to train the deep neural network using a batch training process that evaluates all data at once in each iteration before updating the weights and biases using a gradient descent algorithm" Mohammadi explicitly teaches using gradient descent based on the known and predicted spectrogram matrices.). 

	Chouireb, Jinyu Li, and Mohammadi are all directed towards using neural networks for speech synthesis. Therefore, Chouireb, Jinyu Li, and Mohammadi are all analogous art in the same field of endeavor. It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the methods of Chouireb and Jinyu Li with that of Mohammadi by generating spectrograms. Spectrograms are well known in the field of signal processing and more specifically speech synthesis, and the usage of spectrograms would be obvious to one of ordinary skill in the art. Mohammadi specifically teaches that in speech synthesis it is common that ([Abstract] “A spectrum generator then queries a neural network to produce normalized spectrograms based on the input of the sequence of phonemes and features”) and ([Col. 5 l. 38-50] “is difficult to encode in a HMM and results in various approximations that reduce the accuracy of the spectrogram model. In contrast to the prior art, FIG. 2B illustrates spectrogram encoding in accordance with the preferred embodiment of the present invention”).
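
	For illustration only, a magnitude loss of the kind recited in claim 4 can be sketched as follows. The sketch is not taken from Mohammadi or from the application. It assumes a naive framed discrete Fourier transform with no window function, a squared-difference loss, and hypothetical names (magnitude_spectrogram, magnitude_loss, frame_len, hop). The gradient step itself is omitted; in practice it would be obtained by automatic differentiation through the network that produced the predicted waveform.

```python
import cmath
import math

def magnitude_spectrogram(wave, frame_len=8, hop=4):
    """Return a list of frames, each a list of DFT magnitudes.

    Assumption: rectangular framing, bins 0..frame_len/2 only.
    """
    frames = []
    for start in range(0, len(wave) - frame_len + 1, hop):
        frame = wave[start:start + frame_len]
        mags = []
        for k in range(frame_len // 2 + 1):
            acc = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n, x in enumerate(frame))
            mags.append(abs(acc))
        frames.append(mags)
    return frames

def magnitude_loss(ground_truth, predicted, frame_len=8, hop=4):
    """Sum of squared differences between the two magnitude spectrograms."""
    first = magnitude_spectrogram(ground_truth, frame_len, hop)
    second = magnitude_spectrogram(predicted, frame_len, hop)
    return sum((a - b) ** 2
               for f1, f2 in zip(first, second)
               for a, b in zip(f1, f2))

# Assumed toy waveforms: a sine wave and a half-amplitude copy of it.
truth = [math.sin(2 * math.pi * n / 8) for n in range(32)]
pred = [0.5 * x for x in truth]

loss_same = magnitude_loss(truth, truth)
loss_diff = magnitude_loss(truth, pred)
```

	The loss is zero when the predicted waveform matches the ground truth and positive otherwise, which is the property a gradient-based update would exploit.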
Regarding claim 24, claim 24 is directed towards a system comprising computers and storage devices storing instructions that when executed by the one or more computers, cause the computers to perform the method of claim 4. Therefore, the rejection applied to claim 4 also applies to claim 24. Chouireb explicitly teaches that the generation is performed on a computer ([p. 74 §I] "In recent years, with the increasing power of modern computers, there has been a growing interest on neural networks for prosodic prediction").

	Claims 5-8 and 25-26 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Chouireb and Jinyu Li, and further in view of Bo Li (US 2017/0278513 A1).

	Regarding claim 5, the combination of Chouireb and Jinyu Li teaches The method of claim 3, determining a third gradient with respect to the feedforward parameters to minimize a perceptual loss that depends on a measure of difference between the features of the ground-truth output example and the features of the predicted output example, and wherein determining the update to the current values of the feedforward parameters comprises determining the update based at least in part on the third gradient (Jinyu Li [p. 1 Col. 2 Sec. 2] "The training criterion is to minimize cross entropy which is reduced to minimize the negative log likelihood because every frame has only one target label st (4) ...The DNN parameters are optimized with back propagation using stochastic gradient descent." See also Eqn. 5 and 6).
	However, the combination of Chouireb and Jinyu Li does not explicitly teach the training further comprises: processing the ground-truth output example using a trained feature generation neural network to obtain features of the ground-truth output example, wherein the trained feature generation neural network is a pre-trained neural network that takes a waveform as input;
	processing the predicted output example using the trained feature generation neural network to obtain features of the predicted output example,  

Bo Li, in the same field of endeavor, teaches The method of claim 3, wherein the training further comprises: processing the ground-truth output example using a trained feature generation neural network to obtain features of the ground-truth output example, wherein the trained feature generation neural network is a pre-trained neural network that takes a waveform as input; (FIG. 4, element 410. Element 410 takes a two-channel waveform as input.)
	processing the predicted output example using the trained feature generation neural network to obtain features of the predicted output example, ([¶0073] "The computing system 320 may provide the output of the neural network 323 to a filter and sum module 325." See FIG. 3 output of 323 passed to 327 [¶0074] "For example, the neural network 327 indicates likelihoods that time-frequency feature representations correspond to different speech units when the time-frequency feature representations are output by filter module 325 and based on audio waveform samples 321" Acoustic Model neural network interpreted as synonymous with trained feature generation network.). 

Chouireb, Jinyu Li, and Bo Li are all directed towards using neural networks for speech synthesis. Therefore, Chouireb, Jinyu Li, and Bo Li are all analogous art in the same field of endeavor. It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to substitute the input data types of Chouireb and Jinyu Li with that of Bo Li by taking a waveform as an input. Bo Li teaches as a motivation for combination ([¶0038] “The training process can be enhanced using gated feedback. Recognition information from acoustic model reflects the content of speech and is believed to help earlier layers of the network. Augmenting the network input at each frame with the prediction from the previous frame can improve performance”).

	Regarding claim 6, the combination of Chouireb, Jinyu Li, and Bo Li teaches The method of claim 5, wherein the feature generation neural network is a speech recognition neural network. (Bo Li [Abstract] "Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for neural network adaptive beamforming for multichannel speech recognition are disclosed."). 

	Regarding claim 7, the combination of Chouireb, Jinyu Li, and Bo Li teaches The method of claim 5, wherein the features are outputs of an intermediate layer in the feature generation network. (Bo Li [¶0087] "In some implementations, the neural network trained as an acoustic model includes a convolutional layer and multiple hidden layers." The output of an intermediate layer is interpreted as synonymous with the output of a hidden layer.). 

	Regarding claim 8, the combination of Chouireb, Jinyu Li, and Bo Li teaches The method of claim 5, wherein the feature generation neural network is a trained autoregressive generative neural network. (Chouireb [p. 74 §2] "The complete system is shown in Fig. 1. The text-to-speech system includes a text-to-linguistic description subsystem, three neural networks used to assign prosodic information such as: duration, gain and pitch to each phonetic segment, another neural network used to convert the linguistic description and phoneme duration into a series of coder parameter vectors and the synthesis section of a parametric speech coder which uses a source-filter model. This coder presents an autoregressive filter, using line spectral frequencies (LSF) to drive this filter" See also FIG. 5 for processing of training context input. Using an autoregressive filter in a trained generative neural network is interpreted as synonymous with using a trained autoregressive generative neural network.). 

Regarding claims 25-26, claims 25-26 are directed towards a system comprising computers and storage devices storing instructions that, when executed by the one or more computers, cause the computers to perform the method of claims 5-6, respectively.  Therefore, the rejection applied to claims 5-6 also applies to claims 25-26.  Chouireb explicitly teaches that the generation is performed on a computer ([p. 74 §I] "In recent years, with the increasing power of modern computers, there has been a growing interest on neural networks for prosodic prediction").

Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Chouireb and Jinyu Li, and in further view of Wang (US 2019/0050710 A1).

	Regarding claim 11, the combination of Chouireb and Jinyu Li teaches The method of claim 1.  

	However, the combination of Chouireb and Jinyu Li does not explicitly teach that the first divergence is a Jensen-Shannon Divergence.  

Wang, in the same field of endeavor, teaches The method of claim 1, wherein the first divergence is a Jensen-Shannon Divergence. ([¶0062] "In some embodiments, Jensen-Shannon divergence between the two statistical distributions for each layer (or for the model as a whole) is used to identify the optimal bit-widths with the least information loss for that layer."). 

	Chouireb, Jinyu Li, and Wang are all directed towards using neural networks for speech synthesis.  Therefore, Chouireb, Jinyu Li, and Wang are all analogous art in the same field of endeavor.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to substitute the KL Divergence of the combination of Jinyu Li and Chouireb with the Jensen-Shannon Divergence of Wang.  The Jensen-Shannon divergence is well known in the art, and the substitution of a KL Divergence with a Jensen-Shannon divergence would have been obvious to one of ordinary skill.  Wang teaches as a motivation for combination ([¶0062] "In some embodiments, Jensen-Shannon divergence between the two statistical distributions for each layer (or for the model as a whole) is used to identify the optimal bit-widths with the least information loss for that layer.").
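
For illustration only, the relationship between the two divergences named above can be sketched as follows. This sketch is not drawn from Chouireb, Jinyu Li, or Wang; it assumes discrete probability distributions supplied as equal-length sequences, and the function names kl_divergence and js_divergence are hypothetical. It shows the standard definition of the Jensen-Shannon divergence as a symmetrized KL divergence measured against the mixture of the two distributions.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions of equal length.

    Terms with p_i == 0 contribute nothing. This sketch assumes q_i > 0
    wherever p_i > 0; otherwise the divergence is undefined (infinite).
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their mixture m."""
    # The mixture m is positive wherever p or q is positive, so both
    # KL terms below are always finite.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


if __name__ == "__main__":
    p = [0.9, 0.1]
    q = [0.5, 0.5]
    print("KL(p||q):", kl_divergence(p, q))
    print("KL(q||p):", kl_divergence(q, p))
    print("JS(p, q):", js_divergence(p, q))
    print("JS(q, p):", js_divergence(q, p))
```

Under these assumptions, the KL divergence generally differs when its arguments are swapped, whereas the Jensen-Shannon divergence is symmetric and, with natural logarithms, never exceeds ln 2.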

Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SIDNEY VINCENT BOSTWICK whose telephone number is (571)272-4720. The examiner can normally be reached M-F 7:30am-5:00pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang, can be reached at (571)270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/SB/Examiner, Art Unit 2124
/LUIS A SITIRICHE/Primary Examiner, Art Unit 2126