DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Objections
Claims 4 to 5 are objected to because of the following informalities:  
Independent claims 4 and 5 set forth a limitation of a method “determining features based on the text input”, but there is a lack of express antecedent basis for “the text input”.  Applicants have incorporated some of the limitations of independent claim 1, but have omitted the step of “receiving text input”.
Appropriate correction is required.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 4 to 5, 8, 11 to 12, 15, and 18 to 19 are rejected under 35 U.S.C. 103 as being unpatentable over Fructuoso et al. (U.S. Patent Publication 2015/0186359) in view of Fructuoso et al. (U.S. Patent Publication 2016/0343366).

Concerning independent claims 1, 8, and 15, Fructuoso et al. (‘359) discloses a method, system, and computer-readable medium for prosody generation in text-to-speech synthesis, comprising:
“receiving text input” – computing system 120 obtains text 121 for which synthesized speech should be generated, where text 121 may be provided by any appropriate source, including client device 110, accessed from storage, received from another computing device, or another source (¶[0030]: Figure 1); computing system 120 obtains prosody information for text 202, which includes the phrase ‘hello there’; data indicating a set of linguistic features corresponding to text is obtained (¶[0044]: Figure 2);
“determining unit-level features based on the text input” – computing device 120 obtains data indicating linguistic features 122 corresponding to text 121; computing device may access a lexicon to identify a sequence of phonetic units, e.g., phonemes, in a phonetic representation of text 121 (¶[0031]: Figure 1); computing device 120 extracts linguistic features, e.g., phonemes, from text 202; computing device 120 determines a sequence 203 of phonetic units 204a – 204g (“unit-level features”) that form a phonetic representation of text 202 (¶[0044]: Figure 2); data indicating a set of linguistic features corresponding to text is obtained; a sequence of phonetic units, e.g., phonemes, in a phonetic representation of text can be obtained (¶[0056]: Figure 3); here, linguistic features as phonetic units are “unit-level features based on the text input”;
“providing the features as input to a unit-level [recurrent] neural network ([R]NN)” – computing system 120 provides data indicating linguistic features to neural network 140 (¶[0025]: Figure 1); to obtain prosody information, computing system 120 provides an input data set 230 to trained neural network 240 (¶[0049]: Figure 2); neural network 240, then, is a ‘unit-level neural network’ because it operates on phonetic units;
“determining unit-level embedded data from one or more activations of a hidden layer of the [R]NN” – neural network 140 provides neural network outputs 142 that indicate prosody information for linguistic features 122 (¶[0035]: Figure 1); for each set of input data provided, neural network 240 provides a corresponding set of output data; neural network provides output values that indicate information shown in output data set 250 in response to receiving input data set 230; output data can indicate prosody information for the linguistic group indicated by input data set 230 (¶[0050]: Figure 2); Applicants’ “embedded data” appears to be a ‘coined’ term representing information internal to layers of the neural network during processing; Figure 1 illustrates neural network 140 including an input layer on the left side, a plurality of three hidden layers in the middle, and an output layer on the right side, where an input layer, a plurality of hidden layers, and an output layer are conventional in any neural network; implicitly, a neural network operates by “one or more activations of a hidden layer”; “embedded data” is being construed as internal data that is implicitly being passed between hidden layers of a neural network;

“determining speech data based on a speech unit search, wherein the speech unit search selects, from a database, speech units using the unit-level embedded data as an input to the speech unit search[, and wherein the speech data further depends on the output of the frame-level RNN]” – prosody information from outputs 142 may be used to estimate prosody targets for selection of units, e.g., samples of human speech; unit selection speech synthesis systems commonly use a cost function to evaluate different units; units determined to minimize a cost function may be selected for inclusion in audio output; prosody characteristics can be used in the cost function for selection of units, which biases a selection process so that units are selected to match or approximate the prosody characteristic indicated by outputs 142 (¶[0040]: Figure 1); prosody information may be used to generate synthesized speech data for text 202 using unit-selection speech synthesis; prosody information may be used to set prosody targets for selecting units in a unit selection synthesis system; prosody information for a particular phonetic unit can be used in a target cost as a way to select a unit that matches prosody estimated by neural network 240 for a particular phonetic unit (¶[0054]: Figure 2);
“causing a speech output to be generated based on the speech data” – computing system 120 uses outputs 142 from neural network 140 to generate audio data representing text 121 (¶[0039]: Figure 1); computing device 120 provides audio data 160 to client device 110, and client device 110 may then play the audio with a speaker for a user 102 to hear, or store the audio data 160 for later use (¶[0042]: Figure 1); audio data representing the text is generated using output of the neural network (¶[0062]: Figure 3: Step 308). 
Concerning independent claims 1, 8, and 15, Fructuoso et al. (‘359) discloses all of the limitations of these independent claims, but does not expressly disclose a neural network that is “recurrent”, and omits limitations of “providing an input to a frame-level recurrent neural network based on an input of the unit-level recurrent neural network” and determining speech data based on a speech unit search “wherein the speech data further depends on an output of the frame-level RNN.”  That is, Fructuoso et al. (‘359) only discloses using one neural network that operates on phonetic representations of text, where these phonetic representations are equivalent to phonetic units as described by Applicants’ Specification, so only “a unit-level neural network” is disclosed by Fructuoso et al. (‘359), but not “a frame-level neural network”.  Generally, a “recurrent” neural network is conventionally one of the most common varieties of neural networks, so that it would be an obvious choice of a neural network to select one that is “recurrent”.  Moreover, Applicants’ limitations of determining “embedded data from one or more activations of a hidden layer” and a neural network “during training having an input layer, one or more hidden layers including said hidden layer, and an output layer” are maintained to be implicit for Fructuoso et al. (‘359).  Here, Fructuoso et al. (‘359) actually illustrates a neural network having an input layer, one or more hidden layers, and an output layer as neural network 140 in Figure 1.  Implicitly, neural networks operate by ‘activations’ of nodes of hidden layers given input data received at an input layer and producing output data at an output layer.  Applicants’ “embedded data” is simply a ‘coined’ term representing internal data of a neural network at hidden layers.
Concerning independent claims 1, 8, and 15, Fructuoso et al. (‘366) teaches whatever limitations are omitted by Fructuoso et al. (‘359), which are directed to “recurrent” neural networks, “providing an input to a frame-level recurrent neural network based on an input of the unit-level recurrent neural network”, and “wherein the speech data further depends on an output of the frame-level RNN.”  Generally, Fructuoso et al. (‘366) teaches speech synthesis using two neural networks as illustrated in Figures 1 and 2.  Fructuoso et al. (‘366) states that these neural networks can be “a recurrent neural network”.  (¶[0010] and ¶[0069]: Figure 3)  Here, Fructuoso et al. (‘366)’s first neural network 120 corresponds to “a unit-level recurrent neural network” and second neural network 130 corresponds to “a frame-level recurrent neural network”.  Fructuoso et al. (‘366) is similar to Fructuoso et al. (‘359) in teaching that linguistic features of phonetic units are mapped to acoustic features by first neural network 120.  (¶[0029] - ¶[0031]: Figure 1)  Then, these acoustic features are “frame-level features” that are an input to second neural network 130.  A representation of acoustic features 124 may be real values which parameterize audio including a spectrum, fundamental frequency, and excitation parameters.  (¶[0032]: Figure 1)  Compare Specification, ¶[0041] - ¶[0042] and ¶[0062], which describes ‘frame-level features’ as spectrum parameters, fundamental frequency, or excitation.  The representation of acoustic features 224 may be provided to a second neural network 230, where second neural network 230 may receive data that indicates a particular quantity of frames.  (¶[0053]: Figure 2)  A second neural network may receive data that indicates a particular quantity of frames of audio data that are to be generated, and a number of acoustic features which may be needed in order to generate each linguistic feature.  (¶[0073]: Figure 3)  Accordingly, acoustic features that parameterize audio as a spectrum, fundamental frequency, and excitation parameters are at a “frame-level”.  Conventionally, an audio spectrum, fundamental frequency, and excitation parameters are calculated in audio processing for each frame.  Fructuoso et al. (‘366), then, teaches “providing an input to a frame-level recurrent neural network based on an output of the unit-level recurrent neural network” because first neural network 120 takes linguistic features at a unit-level of phonemes and outputs acoustic features characteristic of Fructuoso et al. (‘366) in a neural network for prosody generation in speech synthesis of Fructuoso et al. (‘359) for a purpose of enabling improved handling and synthesis of combinations of unseen linguistic units.
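By way of illustration only, and not as a characterization of the disclosure of either reference, the construction of “embedded data” as internal data at a hidden layer can be sketched as follows.  The network shape, the weights, and the names forward, W_HIDDEN, and W_OUT are hypothetical and do not appear in Fructuoso et al. (‘359) or Fructuoso et al. (‘366).  The sketch shows only that a forward pass necessarily computes hidden-layer activations before producing an output, so that those activations are available as intermediate data.

```python
import math


def forward(features, w_hidden, w_out):
    """Forward pass of a network with one hidden layer.

    Returns a pair (hidden_activations, outputs). The hidden activations
    are the internal data computed between the input and output layers.
    """
    # Hidden layer: weighted sum of the inputs passed through tanh.
    hidden = [
        math.tanh(sum(w * x for w, x in zip(row, features)))
        for row in w_hidden
    ]
    # Output layer: linear combination of the hidden activations.
    outputs = [sum(w * h for w, h in zip(row, hidden)) for row in w_out]
    return hidden, outputs


# Hypothetical weights: 2 inputs, 3 hidden nodes, 1 output.
W_HIDDEN = [[0.5, -0.5], [1.0, 1.0], [0.0, 2.0]]
W_OUT = [[1.0, 0.0, -1.0]]

# 'embedded' holds the hidden-layer activations; 'prosody' holds the output.
embedded, prosody = forward([1.0, 0.0], W_HIDDEN, W_OUT)
```

In this sketch the hidden activations are returned alongside the output rather than discarded, which is the sense in which internal data is treated above as being implicitly present in any such network.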
 
Concerning independent claim 4, Fructuoso et al. (‘359) discloses a method comprising: 
“determining features based on the text input” – computing device 120 obtains data indicating linguistic features 122 corresponding to text 121; computing device may access a lexicon to identify a sequence of phonetic units, e.g., phonemes, in a phonetic representation of text 121 (¶[0031]: Figure 1); computing device 120 extracts linguistic features, e.g., phonemes, from text 202; computing device 120 determines a sequence 203 of phonetic units 204a – 204g that form a phonetic representation of text 202 (¶[0044]: Figure 2); data indicating a set of linguistic features corresponding to text is obtained; a sequence of phonetic units, e.g., phonemes, in a phonetic representation of text can be obtained (¶[0056]: Figure 3); here, linguistic features as phonetic units are “features based on the text input”;
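“providing the features as input to a first [recurrent] neural network ([R]NN)” – computing system 120 provides data indicating linguistic features to neural network 140 (¶[0025]: Figure 1); to obtain prosody information, computing system 120 provides an input data set 230 to trained neural network 240 (¶[0049]: Figure 2);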
“determining embedded data from one or more activations of a hidden layer of the first [R]NN, the first [R]NN during training having an input layer, one or more hidden layers including said hidden layer, and an output layer” – neural network 140 provides neural network outputs 142 that indicate prosody information for linguistic features 122 (¶[0035]: Figure 1); computing system 120 provides an input data set 230 to a trained neural network 240 (¶[0049]: Figure 2); a neural network can be a neural network that is trained using speech (¶[0057]: Figure 3); a neural network used to obtain prosody information can be a trained neural network, where a state of training can be represented by internal weight values and other parameters defining the properties of the neural network (¶[0064]: Figure 3); Applicants’ “embedded data” appears to be a ‘coined’ term representing information internal to layers of the neural network during processing; Figure 1 illustrates neural network 140 including an input layer on the left side, a plurality of three hidden layers in the middle, and an output layer on the right side, where an input layer, a plurality of hidden layers, and an output layer are conventional in any neural network; implicitly, a neural network operates by “one or more activations of a hidden layer”;
“determining target prosody features [using the second RNN]” – to obtain prosody information (“target prosody features”), computing system 120 provides an input data set 230 to a trained neural network (¶[0049]: Figure 2); output data can indicate prosody information for a linguistic group indicated by input data set 230; neural network 240 can map linguistic groups to prosody values (¶[0050]: Figure 2); a neural network is trained to provide output indicating prosody information (¶[0057]: Figure 3); prosody characteristics indicated by output of a neural network may be used to select speech samples to include in an audio representation, e.g., by unit-selection synthesis (¶[0062]: Figure 3);
“determining speech data based on a speech unit search, wherein the speech unit search selects, from a database, speech units using embedded data as an input to the speech unit search” – prosody information from outputs 142 may be used to estimate prosody targets for selection of units, e.g., samples of human speech; unit selection speech synthesis systems commonly use a cost function to evaluate different units; units determined to minimize a cost function may be selected for inclusion in audio output; prosody characteristics can be used in the cost function for selection of units, which biases a selection process so that units are selected to match or approximate the prosody characteristic indicated by outputs 142 (¶[0040]: Figure 1); prosody information may be used to generate synthesized speech data for text 202 using unit-selection speech synthesis; prosody information may be used to set prosody targets for selecting units in a unit selection synthesis system; prosody information for a particular phonetic unit can be used in a target cost as a way to select a unit that matches prosody estimated by neural network 240 for a particular phonetic unit (¶[0054]: Figure 2); 
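“causing speech output to be generated based on the speech data, wherein the speech unit search is performed using the embedded data and the target prosody features as inputs to the speech unit search” – computing system 120 uses outputs 142 from neural network 140 to generate audio data representing text 121 (¶[0039]: Figure 1); computing device 120 provides audio data 160 to client device 110, and client device 110 may then play the audio with a speaker for a user 102 to hear, or store the audio data 160 for later use (¶[0042]: Figure 1); audio data representing the text is generated using output of the neural network (¶[0062]: Figure 3: Step 308); here, speech unit selection clearly uses prosody information (“the target prosody features”), and implicitly uses any intermediately-generated internal data (“the embedded data”).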
Concerning independent claim 4, Fructuoso et al. (‘366) teaches “providing the features and the embedded data as input to a second RNN” – acoustic features 224 may be provided to a second neural network 230 (¶[0052]: Figure 2); process 300 may include providing a particular set of linguistic features as input to a first neural network that are mapped to acoustic features, and may include providing a representation of this particular set of acoustic features as input to a second neural network (¶[0068] - ¶[0069]: Figure 3: Steps 330 and 350). 
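For illustration only, the use of prosody information as a target in a unit-selection cost function, as discussed above for ¶[0040] and ¶[0054] of Fructuoso et al. (‘359), can be sketched as follows.  The feature names f0 and duration, the weights, the candidate values, and the function names target_cost and select_unit are assumptions made for this sketch and are not taken from the reference, which does not set forth a particular formula.

```python
def target_cost(candidate, target, weights):
    """Weighted absolute distance between a candidate unit's prosody and the target."""
    return sum(weights[k] * abs(candidate[k] - target[k]) for k in weights)


def select_unit(candidates, target, weights):
    """Return the candidate unit that minimizes the target cost."""
    return min(candidates, key=lambda c: target_cost(c, target, weights))


# Hypothetical database entries for one phonetic unit (f0 in Hz, duration in ms).
CANDIDATES = [
    {"id": "a", "f0": 120.0, "duration": 80.0},
    {"id": "b", "f0": 180.0, "duration": 95.0},
    {"id": "c", "f0": 150.0, "duration": 140.0},
]

# Hypothetical prosody target, standing in for a neural network output.
TARGET = {"f0": 175.0, "duration": 100.0}
WEIGHTS = {"f0": 1.0, "duration": 0.5}

best = select_unit(CANDIDATES, TARGET, WEIGHTS)  # selects unit "b" (cost 7.5)
```

The selection is biased toward the unit whose prosody most closely approximates the target, which is the effect attributed to the cost function in the passages cited above.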

Concerning claims 11 and 18, similar considerations apply as directed to independent claim 4.

Concerning independent claim 5, Fructuoso et al. (‘359) discloses a method comprising: 
“determining features based on the text input” – computing device 120 obtains data indicating linguistic features 122 corresponding to text 121; computing device may access a lexicon to identify a sequence of phonetic units, e.g., phonemes, in a phonetic representation of text 121 (¶[0031]: Figure 1); computing device 120 extracts linguistic features, e.g., phonemes, from text 202; computing device 120 determines a sequence 203 of phonetic units 204a – 204g that form a phonetic representation of text 202 (¶[0044]: Figure 2); data indicating a set of linguistic features corresponding to text is obtained; a sequence of phonetic units, e.g., phonemes, in a phonetic representation of text can be obtained (¶[0056]: Figure 3); here, linguistic features as phonetic units are “features based on the text input”;
“providing the features as input to a first [recurrent] neural network ([R]NN)” – computing system 120 provides data indicating linguistic features to neural network 140 (¶[0025]: Figure 1); to obtain prosody information, computing system 120 provides an input data set 230 to trained neural network 240 (¶[0049]: Figure 2);
“determining embedded data from one or more activations of a hidden layer of the [second] RNN, the [second] RNN during training having an input layer, one or more hidden layers including said hidden layer, and an output layer” – neural network 140 provides neural network outputs 142 that indicate prosody information for linguistic features 122 (¶[0035]: Figure 1); for each set of input data provided, neural network 240 provides a corresponding set of output data; neural network provides output values that indicate information shown in output data set 250 in response to receiving input data set 230; output data can indicate prosody information for the linguistic group indicated by input data set 230 (¶[0050]: Figure 2); Applicants’ “embedded data” appears to be a ‘coined’ term representing information internal to layers of the neural network during processing; Figure 1 illustrates neural network 140 including an input layer on the left side, a plurality of three hidden layers in the middle, and an output layer on the right side, where an input layer, a plurality of hidden layers, and an output layer are conventional in any neural network; implicitly, a neural network operates by “one or more activations of a hidden layer”;
“determining target prosody features using the [third R]NN” – to obtain prosody information (“target prosody features”), computing system 120 provides an input data set 230 to a trained neural network (¶[0049]: Figure 2); output data can indicate prosody information for a linguistic group indicated by input data set 230; neural network 240 can map linguistic groups to prosody values (¶[0050]: Figure 2); a neural network is trained to provide output indicating prosody information (¶[0057]: Figure 3); prosody characteristics indicated by output of a neural network may be used to select speech samples to include in an audio representation, e.g., by unit-selection synthesis (¶[0062]: Figure 3);
“determining speech data based on a speech unit search, wherein the speech unit search selects, from a database, speech units using embedded data as an input to the speech unit search” – prosody information from outputs 142 may be used to estimate prosody targets for selection of units, e.g., samples of human speech; unit selection speech synthesis systems commonly use a cost function to evaluate different units; units determined to minimize a cost function may be selected for inclusion in audio output; prosody characteristics can be used in the cost function for selection of units, which biases a selection process so that units are selected to match or approximate the prosody characteristic indicated by outputs 142 (¶[0040]: Figure 1); prosody information may be used to generate synthesized speech data for text 202 using unit-selection speech synthesis; prosody information may be used to set prosody targets for selecting units in a unit selection synthesis system; prosody information for a particular phonetic unit can be used in a target cost as a way to select a unit that matches prosody estimated by neural network 240 for a particular phonetic unit (¶[0054]: Figure 2);
“causing speech output to be generated based on the speech data, wherein the speech unit search is performed using the embedded data and the target prosody features as inputs to the speech unit search” – computing system 120 uses outputs 142 from neural network 140 to generate audio data representing text 121 (¶[0039]: Figure 1); computing device 120 provides audio data 160 to client device 110, and client device 110 may then play the audio with a speaker for a user 102 to hear, or store the audio data 160 for later use (¶[0042]: Figure 1); audio data representing the text is generated using output of the neural network (¶[0062]: Figure 3: Step 308); here, speech unit selection clearly uses prosody information (“the target prosody features”), and implicitly uses any intermediately-generated internal data (“the embedded data”).
Concerning independent claim 5, Fructuoso et al. (‘366) teaches:
“determining target duration output from the first RNN” – second neural network 230 may receive data that indicates a particular quantity of frames that are to be generated, or, in other words, a duration of time in which samples from the model to which it maps acoustic features 224 will occupy in synthesized speech may be communicated to second neural network 230; that is, duration information may be indicative of the number of acoustic features 224 which may be needed in order to generate each linguistic feature (¶[0053]: Figure 2); first neural network 220, then, outputs this duration information (“target duration”) to second neural network 230;

“providing the target duration as input to the second RNN” – second neural network 230 may receive data that indicates a particular quantity of frames that are to be generated, or, in other words, a duration of time in which samples from the model to which it maps acoustic features 224 will occupy in synthesized speech may be communicated to second neural network 230; that is, duration information may be indicative of the number of acoustic features 224 which may be needed in order to generate each linguistic feature (¶[0053]: Figure 2); first neural network 220, then, outputs this duration information (“target duration”) to second neural network 230;
“providing the features and the embedded data as input to a third RNN” – a third neural network positioned upstream from both first neural network 220 and second neural network 230, but downstream from linguistic feature extractor 210, may be provided for estimating duration information, e.g., a quantity of frames of audio data to be generated; output of a third neural network that maps linguistic features 214 to duration information may be provided directly to second neural network 230, or may provide first neural network 220 with linguistic features 214 that it has received from linguistic feature extractor 210 (¶[0055]: Figure 2).

Claims 2, 9, and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Fructuoso et al. (U.S. Patent Publication 2015/0186359) in view of Fructuoso et al. (U.S. Patent Publication 2016/0343366) as applied to claims 1, 8, and 15 above, and further in view of Chicote et al. (U.S. Patent No. 10,475,438).
Fructuoso et al. (‘366) teaches a recurrent neural network (RNN), but not a long short term memory RNN (LSTM-RNN).  However, LSTM neural networks are commonly used in the prior art of speech processing.  Specifically, Chicote et al. teaches contextual text-to-speech processing, where an encoder may be implemented as a recurrent neural network, which can be a long short-term memory RNN (LSTM-RNN) or a gated recurrent unit RNN (GRU-RNN).  An RNN is a tool whereby a network of nodes may be represented numerically and where each node representation includes information about the preceding portions of the network.  (Column 14, Lines 40 to 46)  A speech synthesis engine 914 may perform speech synthesis using unit selection, where a unit selection engine 930 matches a symbolic linguistic representation against a database of recorded speech.  Matching units are selected and concatenated together to form a speech output.  One benefit of unit selection is that, depending on the size of the database, a natural sounding speech output may be generated.  (Column 17, Lines 4 to 29: Figure 9)  Units may be selected based on a cost function which represents how well particular units fit the speech segments to be synthesized, where the cost function may represent a combination of a target cost and a join cost.  (Column 19, Line Chicote et al. to perform speech unit selection in Fructuoso et al. (‘359) for a purpose of providing more continuity between synthesized speech segments.

Claims 3, 10, and 16 to 17 are rejected under 35 U.S.C. 103 as being unpatentable over Fructuoso et al. (U.S. Patent Publication 2015/0186359) in view of Fructuoso et al. (U.S. Patent Publication 2016/0343366) as applied to claims 1, 8, and 15 above, and further in view of Chua et al. (U.S. Patent Publication 2017/0358293).
Fructuoso et al. (‘359) does not expressly disclose “wherein the embedded data comprises one or more vectors of speech unit embeddings (SUEs)”.  Still, Applicants’ “speech unit embeddings (SUEs)” is simply a coined term, similar to “embedded data”, which represents internal processing data of a neural network.  Chua et al. teaches pronunciation prediction for a text-to-speech system, where a neural network, e.g., a long short-term memory (LSTM) recurrent neural network (“an activation of a hidden layer of a long short term memory RNN (LSTM-RNN)”), may receive a spelling of a word as input and generate an output sequence with a stress pattern for a word.  A system may then generate an audible synthetization of the input word.  (¶[0016])  Pronunciation generation module 116 may generate a series of vectors that together represent word data 108, where each letter or grapheme of a word may be identified by a different vector.  These vectors may be one-hot vectors 118 for word data 108.  Pronunciation generation module 116 provides the one-hot input vectors 118 to a Chua et al., then, teaches “speech unit embeddings (SUEs)”, where “speech units” are letters or graphemes of a word and “embeddings” are output vectors 122.  An objective is to provide a pronunciation generation system to indicate the stress pattern and syllabification for a text-to-speech system.  (¶[0006])  It would have been obvious to one having ordinary skill in the art to provide speech unit embeddings as taught by Chua et al. as embedded data of Fructuoso et al. (‘359) for a purpose of indicating a stress pattern and syllabification for a text-to-speech system.
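For illustration only, the representation of each letter or grapheme of a word by a different one-hot vector, as described above for Chua et al., can be sketched as follows.  The function name one_hot_vectors and the three-letter alphabet are hypothetical and are not taken from the reference.

```python
def one_hot_vectors(word, alphabet):
    """Return one one-hot vector per character of word, indexed by position in alphabet."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    vectors = []
    for ch in word:
        vec = [0] * len(alphabet)
        vec[index[ch]] = 1  # exactly one element is set for each character
        vectors.append(vec)
    return vectors


# Hypothetical three-letter alphabet; each letter of "cab" receives its own vector.
vectors = one_hot_vectors("cab", "abc")
# vectors == [[0, 0, 1], [1, 0, 0], [0, 1, 0]]
```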

Claims 6 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Fructuoso et al. (U.S. Patent Publication 2015/0186359) in view of Fructuoso et al. (U.S. Patent Publication 2016/0343366) as applied to claims 1, 8, and 15 above, and further in view of Coorman et al. (U.S. Patent Publication 2005/0182629).
Fructuoso et al. (‘359) discloses speech synthesis by unit selection to estimate prosody targets for selection of units, where the unit is determined to minimize a cost function to evaluate different units.  (¶[0040]: Figure 1)  Here, minimizing a cost function is equivalent to “minimizes a loss function”.  That is, “a loss function” is defined equivalently to a cost function, where a cost is a loss.  Fructuoso et al. (‘359) does not expressly disclose using “dynamic programming optimization” to minimize this cost function.  Coorman et al. teaches that concatenative synthesis is performed for generating speech waveforms by re-sequencing and concatenating digital segments that are extracted from recorded speech, where the speech segments are obtained from a constrained optimization program that is typically solved by dynamic programming.  (¶[0009] - ¶[0010])  The speech segments that are extracted from this data to generate speech are often referred to as ‘speech units’.  (¶[0017])  A dynamic programming algorithm is used to find the lowest cost path through all possible sequences of candidate basic speech units (BSUs) taking into account a well-chosen balance between target costs and concatenation costs.  (¶[0022])  Using dynamic programming, the best sequence of candidate speech units is selected for output to speech waveform concatenator 151.  (¶[0027]: Figure 1)  An objective is to generate synthesized speech through concatenation of speech segments that are derived from a large prosodically-rich corpus of speech segments.  (Abstract)  It would have been obvious to one having ordinary skill in the art to use dynamic programming as taught by Coorman et al. to minimize a cost function of Fructuoso et al. (‘359) for a purpose of generating synthesized speech through concatenation of speech segments from a large corpus of speech segments.  
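For illustration only, a dynamic programming search for the lowest cost path through sequences of candidate speech units, balancing target costs against concatenation costs, can be sketched as follows.  This is a generic Viterbi-style sketch and is not asserted to be the particular algorithm of Coorman et al.; the numeric units, the cost functions, and the names select_units, TARGETS, and CANDIDATES are hypothetical.

```python
def select_units(candidates, target_cost, join_cost):
    """Find the lowest-cost unit sequence by dynamic programming.

    candidates[i] lists the candidate units for position i.
    Returns (best_sequence, total_cost).
    """
    # best[i][j] = (cost of cheapest path ending at candidates[i][j], backpointer)
    best = [[(target_cost(0, u), None) for u in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for u in candidates[i]:
            # Cheapest way to arrive at u from any candidate at the previous position.
            prev_cost, prev_j = min(
                (best[i - 1][j][0] + join_cost(p, u), j)
                for j, p in enumerate(candidates[i - 1])
            )
            row.append((prev_cost + target_cost(i, u), prev_j))
        best.append(row)
    # Backtrack from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total


# Hypothetical example: each unit is represented only by its pitch in Hz.
TARGETS = [100, 110, 120]
CANDIDATES = [[95, 130], [108, 150], [90, 125]]

path, total = select_units(
    CANDIDATES,
    target_cost=lambda i, u: abs(u - TARGETS[i]),  # distance from the prosody target
    join_cost=lambda p, u: 0.5 * abs(u - p),       # penalty for a pitch jump at a join
)
# path == [95, 108, 125], total == 27.0
```

The search keeps only the cheapest path into each candidate at each position, so its cost grows with the product of adjacent candidate counts rather than with the number of complete sequences.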

Claims 7, 14, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Fructuoso et al. (U.S. Patent Publication 2015/0186359) in view of Fructuoso et al. (U.S. Patent Publication 2016/0343366) as applied to claims 1, 8, and 15 above, and further in view of Senior et al. (U.S. Patent Publication 2015/0073804).
Fructuoso et al. (‘359) does not expressly disclose “determining a waveform based on the speech data, and generating the speech output based on the waveform.”  However, it is known that speech synthesis can generally be described as generating speech waveforms as output after unit selection to provide speech output.  Specifically, Senior et al. teaches deep neural networks for unit selection speech synthesis given linguistic features to predict acoustic features, where acoustic features can be a vector of elements that together represent a sound waveform.  The neural network may output target acoustic features that are a vector of elements that represent a waveform.  (¶[0007])  The acoustic features may be a vector of elements that together represent a sound waveform.  (¶[0029])  An objective is to produce artificial human speech in circumstances where it is difficult for people to read text.  (¶[0002])  It would have been obvious to one having ordinary skill in the art to generate speech output based on a waveform representing speech data as taught by Senior et al. to produce audio output of Fructuoso et al. (‘359) for a purpose of producing artificial human speech in circumstances where it is difficult for people to read text.

Response to Arguments
Applicants’ arguments filed 02 February 2021 have been considered but are moot pursuant to new grounds of rejection.  
Applicants amend independent claims 1, 8, and 15 to set forth new limitations directed to “unit-level” features that are provided to a “unit-level” recurrent neural 
Applicants present arguments directed against the prior rejection of independent claims 1, 8, and 15 as being obvious under 35 U.S.C. §103 over Senior et al. (U.S. Patent Publication 2015/0073804) in view of Fructuoso et al. (U.S. Patent Publication 2016/0343366).  Generally, Applicants attempt to distinguish these new limitations as directed to “a unit-level recurrent neural network” and “a frame-level neural network” over Fructuoso et al. (‘366).  Applicants note that the Specification describes that a first recurrent neural network may operate at a unit level, which may be smaller than a word, e.g., a diphone, phone, half-phone, or demi-phone, but that a second recurrent neural network operates at an acoustic frame level, e.g., constant temporal intervals.  Additionally, Applicants summarize what was discussed during a telephone interview as to a hidden layer being construed as an output layer when a single neural network was being replaced by two neural networks in Fructuoso et al. (‘366).  Applicants allege that Fructuoso et al. (‘366) maps acoustic features to a particular model using a model identifier.  Applicants contend that Fructuoso et al. (‘366) teaches that phonetic units may have stress, but argue that this is not equivalent to prosody features of independent claims 4 and 5, and argue that even if speech output manifests a target prosody based on an analysis of text, there is nothing suggesting determining prosody using a neural network.  Applicants argue that a model identifier performs a function of an acoustic sample selector of Senior et al.  Additionally, Applicants maintain that a 
New claim objections are noted for independent claims 4 and 5.  
Generally, Applicants’ arguments are moot in view of new grounds of rejection directed against the independent claims for obviousness under 35 U.S.C. §103 over Fructuoso et al. (U.S. Patent Publication 2015/0186359) in view of Fructuoso et al. (U.S. Patent Publication 2016/0343366).  Here, Fructuoso et al. (‘359) is being substituted in the rejection of the independent claims for Senior et al.  These new grounds of rejection are maintained to be fairly necessitated by amendment insofar as independent claims 1, 8, and 15 are amended by Applicants, and Fructuoso et al. (‘359) better addresses at least a new limitation of “unit-level” features and a “unit-level” neural network.  The rejection of dependent claims 7, 14, and 20 continues to rely upon Senior et al.  Similarly, the rejection of some of the dependent claims continues to rely upon Chicote et al. (U.S. Patent No. 10,475,438), Chua et al. (U.S. Patent Publication 2017/0358293), and Coorman et al. (U.S. Patent Publication 2005/0182629).  Applicants should be advised that independent claims 1, 8, and 15 could be withdrawn under a doctrine of election by original presentation. 
 The main problem with Applicants’ arguments is that Fructuoso et al. (‘359) and Fructuoso et al. (‘366) do disclose and teach the new limitations of “unit-level features”, “a unit-level neural network”, and “a frame-level neural network” in accordance with the embodiments described in their Specification.  Applicants’ Specification, ¶[0041], Fructuoso et al. (‘359).  Here, a phone is equivalent to a phoneme, but the point is that all of these ‘units’ are really equivalent to ‘phonetic units’.  Fructuoso et al. (‘359) discloses that data including a set of linguistic features corresponding to text is a sequence of phonetic units, i.e., phonemes, and these linguistic features of phonetic units are provided to a trained neural network.  (¶[0056] - ¶[0057]: Figure 3: Steps 302 to 304)  Similarly, Fructuoso et al. (‘359) discloses determining a sequence of phonetic units 204a – 204g, which are provided as an input data set 230 to trained neural network 240.  (¶[0044] - ¶[0049]: Figure 2)  Fructuoso et al. (‘359)’s neural network, then, is “a unit-level neural network” that is receiving “unit-level features” because features are phonetic units in a same manner as the embodiments of Applicants’ Specification.
Now, Fructuoso et al. (‘359) only uses one neural network to convert phonetic units to acoustic units, but two neural networks are taught by Fructuoso et al. (‘366), and this second neural network is equivalent to “a frame-level recurrent neural network” in a manner equivalent to the embodiments described in the Specification.  Here, Applicants’ Specification, ¶[0041] - ¶[0042] and ¶[0062], characterizes frame-level features as spectral parameters, Mel-Cepstral Coefficients, fundamental frequency, log f0 values, variance of fundamental frequency, etc.  Again, this is precisely equivalent to acoustic features that are received by a second neural network of Fructuoso et al. (‘366).  That is, a first neural network takes phonetic units (“unit-level features”) and converts them into acoustic features (“frame-level features”) that are provided to a Fructuoso et al. (‘366), ¶[0009], ¶[0032], ¶[0049], and ¶[0069], clearly states that a set of acoustic features includes one or more of spectrum parameters, fundamental frequency parameters, and mixed excitation parameters.  Fructuoso et al. (‘366)’s acoustic features, then, are the same as those defined as frame-level features by Applicants’ Specification.  Moreover, Fructuoso et al. (‘366) teaches that a second neural network receives data indicating a quantity of frames of audio data that are to be generated.  (¶[0053] - ¶[0055])  Fructuoso et al. (‘366)’s second neural network, then, is equivalent to “a frame-level recurrent neural network” because it operates on the same frame-level features as described in Applicants’ Specification.  Accordingly, Applicants’ amendments to independent claims 1, 8, and 15 do not distinguish over a combination of Fructuoso et al. (‘359) and Fructuoso et al. (‘366).
Applicants’ remaining arguments directed against independent claims 1, 8, and 15 are being considered, but do not overcome the rejection.  Conceptually, the rejection is dividing a single neural network of Fructuoso et al. (‘359) into two neural networks of Fructuoso et al. (‘366).  One skilled in the art could appreciate that if a single neural network is divided into two neural networks, then this can have an advantage of task specialization.  Generally, this is what is being done by Fructuoso et al. (‘366), where a single task of converting linguistic features into acoustic features of Fructuoso et al. (‘359) is being supplemented with a second neural network of Fructuoso et al. (‘366), so that a second neural network can perform duration modeling.  Fructuoso et al. (‘359), ¶[0051], briefly suggests that duration modeling can be performed with only one neural Fructuoso et al. (‘366).  
Moreover, it is contended that both embedded data and output of a first neural network can be broadly construed as applied to a second neural network and a speech unit search of Fructuoso et al. (‘366).  Clearly, Fructuoso et al. (‘366) is receiving output of a first neural network at a second neural network because a first neural network outputs acoustic features to a second neural network.  (Fructuoso et al. (‘366), ¶[0037], notes that neural networks include hidden layers as is illustrated in Figure 1.)  However, it is contended that the second neural network implicitly receives embedded data from at least one hidden layer of a first neural network because the output data of a first neural network implicitly incorporates any intermediately-produced embedded data.  Unit selection is based on output of a second neural network and embedded data of a first neural network because, broadly construed, output of a second neural network is based on output of a first neural network, and output of a first neural network is based on embedded data from at least one of its hidden layers.  Granted, Applicants’ Figures 5, 8, and 9 may illustrate an alternative interpretation of this claim limitation as directed to “embedded data”, but limitations in a pending claim are read broadly without necessarily incorporating only an interpretation that is set forth in the Specification.  Although the claims are interpreted in light of the specification, limitations from the specification are not read into the claims.  See In re Van Geuns, 988 F.2d 1181, 26 USPQ2d 1057 (Fed. Cir. 1993).
Similarly, one skilled in the art could understand that if a single neural network is divided into two neural networks, then a given hidden layer is an output layer Fructuoso et al. (‘359) into two neural networks of Fructuoso et al. (‘366) would then conceptually convert the embedded data of the first neural network into output data of the first neural network.  Using embedded data representing internal processing data of a first neural network would be an obvious consequence of dividing one neural network into two neural networks to obtain task specialization.
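For illustration only, and not as a characterization of the actual architecture of either reference or of Applicants’ Specification, the conceptual point about dividing one neural network into two can be sketched with a toy feed-forward network.  The layer sizes, weights, and function names below are hypothetical assumptions chosen solely for the example.  The sketch shows that when a single network is split at a hidden layer, the data that was hidden-layer (“embedded”) data in the single network is the same data that the first of the two networks produces as its output.

```python
import math

def dense(weights, bias, inputs):
    """One fully connected layer with a tanh activation."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, bias)
    ]

# Hypothetical toy parameters; not taken from either reference.
W1, B1 = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]], [0.0, 0.1, -0.1]  # 2 inputs -> 3 hidden
W2, B2 = [[0.3, -0.6, 0.2], [0.7, 0.1, -0.5]], [0.05, -0.05]       # 3 hidden -> 2 outputs

def single_network(features):
    """Undivided network: returns its output and its hidden-layer data."""
    hidden = dense(W1, B1, features)  # internal ("embedded") data
    return dense(W2, B2, hidden), hidden

def first_network(features):
    """First half after the split: the former hidden layer is now the output layer."""
    return dense(W1, B1, features)

def second_network(first_output):
    """Second half after the split: consumes the first network's output."""
    return dense(W2, B2, first_output)

features = [1.0, 0.5]
whole_output, hidden = single_network(features)
first_output = first_network(features)
split_output = second_network(first_output)

print(first_output == hidden)        # former hidden data is now output data
print(split_output == whole_output)  # end-to-end result is unchanged by the split
```

In this toy case the split changes only where the boundary between the networks is drawn, not the values that are computed.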
The combination can be supported under a rationale of KSR International Co. v. Teleflex Inc. (KSR), 550 U.S. 398, 82 USPQ2d 1385 (2007).  See MPEP §2141.  Fructuoso et al. (‘359) and Fructuoso et al. (‘366) are clearly mutually pertinent because they are commonly assigned and include common inventors.  A combination of these two references can be understood as (A) Combining prior art elements according to known methods to yield predictable results or (B) Simple substitution of one known element for another to obtain predictable results.  Generally, it would be predictable to use an architecture of two neural networks of Fructuoso et al. (‘366) as an alternative architecture for a single neural network of Fructuoso et al. (‘359), or to substitute two neural networks for a single neural network in a predictable way to perform unit selection speech synthesis.  Alternatively, Fructuoso et al. (‘366) teaches a standard motivation of enabling improved handling and synthesis of combinations of unseen linguistic features.  (Abstract)  And one skilled in the art would understand that using additional neural networks can optimize a division of tasks assigned to individual neural networks.   
Fructuoso et al. (‘366) is not significant.  It is true that stress in speech is one of the indicators of prosody.  However, Fructuoso et al. (‘359), ¶[0040] and ¶[0049] - ¶[0050], expressly discloses that a neural network generates prosody information that is equivalent to Applicants’ “target prosody features” of independent claims 4 and 5.  Careful review of the limitations of these independent claims as set forth in the rejection shows how these limitations are obvious because target prosody, a target duration, and three neural networks are taught by Fructuoso et al. (‘366).
All of these new grounds of rejection are necessitated by amendment.  Applicants’ arguments are moot in view of these new grounds of rejection.  Accordingly, this rejection is properly FINAL.

Conclusion
Applicants’ amendment necessitated the new grounds of rejection presented in this Office Action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicants are reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MARTIN LERNER whose telephone number is (571) 272-7608.  The examiner can normally be reached on Monday-Thursday 8:30 AM-6:00 PM.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel Washburn can be reached on (571) 272-5551.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov.  Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).  If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/MARTIN LERNER/Primary Examiner
Art Unit 2657
February 22, 2021