EXAMINER'S AMENDMENT
An examiner’s amendment to the record appears below.  Should the changes and/or additions be unacceptable to Applicants, an amendment may be filed as provided by 37 CFR 1.312.  To ensure consideration of such an amendment, it MUST be submitted no later than the payment of the issue fee.

Authorization for this examiner’s amendment was given in an interview with J. Robin Rohlicek on 29 April 2021.

The application has been amended as follows: 

15. (Currently Amended) One or more non-transitory computer-readable storage media storing executable instructions that, when executed cause an apparatus to perform operations comprising: receiving text input; determining unit-level features based on the text input; providing the features as input to a unit-level recurrent neural network (RNN), the RNN during training having an input layer, one or more hidden layers including said hidden layer, and an output layer; providing an input to a frame-level recurrent neural network based on an output of the unit-level recurrent neural network, the frame-level recurrent neural network producing successive frame-level outputs corresponding to respective acoustic frames of speech output; determining embedded data from one or more activations of a hidden layer of the RNN; determining speech data based on a speech unit search, wherein the speech unit search selects, 

16. (Currently Amended)  The one or more non-transitory computer-readable storage media of claim 15, wherein the one or more activations of the hidden layer of the unit-level RNN comprises an activation of a hidden layer of a long short term memory RNN (LSTM-RNN).  

17. (Currently Amended)  The one or more non-transitory computer-readable storage media of claim 16, wherein the embedded data comprises one or more vectors of speech unit embeddings (SUEs).  

18. (Currently Amended)  The one or more non-transitory computer-readable storage media of claim 15, wherein the unit-level RNN is a first RNN, and the frame-level RNN is a second RNN, and wherein the operations further comprise: determining target prosody features using the second RNN; and wherein the speech unit search is performed using the unit-level embedded data and the target prosody features as inputs to the speech unit search.  

19. (Currently Amended)  The one or more non-transitory computer-readable storage media of claim 15, wherein the unit-level RNN is a second RNN, and the frame-level RNN is a third RNN, and wherein the operations further comprise: providing the features ration from output of the first RNN; providing the target duration as input to the second RNN; determining target prosody features from output of the third RNN; wherein the speech unit search is performed based on the unit-level embedded data and the target prosody features, and wherein the speech unit search comprises a dynamic programming optimization that minimizes a loss function.  

20. (Currently Amended)  The one or more non-transitory computer-readable storage media of claim 15, wherein the operations further comprise: determining a waveform based on the speech data; and generating the speech output based on the waveform.
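
For illustration only, the "dynamic programming optimization that minimizes a loss function" recited in claim 19 can be sketched as a Viterbi-style search over candidate speech units. The claims do not define the loss function. The sketch below assumes a loss made of a target cost (squared distance between a target embedding and a candidate unit's embedding) plus a join cost between adjacent units; the names `unit_search`, `squared_distance`, and `join_cost` are hypothetical. Target prosody features, which claim 19 also recites as a basis for the search, are omitted for brevity.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def unit_search(targets, candidates, join_cost):
    """Select one candidate unit per position so that the total loss is minimal.

    targets:    list of target embedding vectors, one per speech unit
    candidates: per position, a list of (unit_id, embedding) pairs
    join_cost:  function (prev_unit_id, unit_id) -> float (assumed form)
    Returns (total_loss, [unit_id, ...]).
    """
    # best[i][j] = (lowest loss of any path ending at candidate j of position i,
    #               index of the predecessor candidate on that path)
    best = [[(squared_distance(targets[0], emb), None) for _, emb in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for unit_id, emb in candidates[i]:
            target_cost = squared_distance(targets[i], emb)
            prev_loss, prev_j = min(
                (best[i - 1][k][0] + join_cost(candidates[i - 1][k][0], unit_id), k)
                for k in range(len(candidates[i - 1]))
            )
            row.append((prev_loss + target_cost, prev_j))
        best.append(row)

    # Backtrack from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j][0])
        j = best[i][j][1]
    path.reverse()
    return total, path


# Toy data: two target units, two candidates each (all values are made up).
TARGETS = [[0.0], [1.0]]
CANDIDATES = [
    [("a1", [0.0]), ("a2", [0.5])],
    [("b1", [1.0]), ("b2", [0.9])],
]


def toy_join_cost(prev_id, unit_id):
    # Assumption: units that were adjacent in the recorded database join for free.
    return 0.0 if (prev_id, unit_id) == ("a2", "b2") else 1.0


TOTAL, PATH = unit_search(TARGETS, CANDIDATES, toy_join_cost)
```

In this toy example the search prefers the pair `a2`, `b2`: each is a slightly worse match to its target than `a1` or `b1`, but the free join outweighs the larger target costs.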

EXAMINER'S STATEMENT OF REASONS FOR ALLOWANCE 
The following is an examiner’s statement of reasons for allowance:
Independent claims 1, 8, and 15 are allowable because the prior art of record does not disclose or reasonably suggest a method, apparatus, and computer-readable medium for speech synthesis comprising receiving text input, determining unit-level features based on the text input, providing the features as input to a unit-level recurrent neural network (RNN), determining unit-level embedded data from one or more activations of a hidden layer of the RNN, the RNN during training having an input layer, one or more hidden layers including said hidden layer, and an output layer, providing an input to a frame-level recurrent neural network based on an output of the unit-level recurrent neural network, the frame-level recurrent neural network producing successive frame-level outputs corresponding to respective acoustic frames of speech output, 
Fructuoso et al. (U.S. Patent Publication 2016/0343366) remains the closest prior art of record, but does not clearly disclose or reasonably suggest determining speech data based on a speech unit search “using the unit-level embedded data as input to the speech unit search”.  The prior rejection is being reconsidered for this limitation insofar as Fructuoso et al. does not provide ‘embedded data’ as input to a speech unit search in combination with a first neural network and a second neural network.  Here, the term “embedded data” is not well defined in the prior art for this context, but it denotes data that is output from a hidden layer of a neural network, and not from an output layer, as described and illustrated by Applicants’ Specification, ¶[0070] - ¶[0071], ¶[0092] - ¶[0093], ¶[0095] - ¶[0096], and ¶[0103]: Figures 5, 8, and 9.  This feature provides a patentable distinction over Fructuoso et al.  Applicants’ Figures 5, 8, and 9 all illustrate that embedded data from a hidden layer of a first neural network is used as input to a speech unit search, and that data from an output layer of the first neural network might not be used at all in the speech unit search.  Fructuoso et al. similarly discloses first and second neural networks for speech synthesis, where a first neural network is construed as “a unit level neural network” and a second neural network is construed as “a frame-level neural network”, but only uses output from a second neural Fructuoso et al.
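
The distinction drawn above between hidden-layer data and output-layer data can be illustrated with a minimal sketch. It assumes a tiny Elman-style recurrent network with made-up weights and is not the network of Applicants' Specification or of Fructuoso et al. The names `rnn_forward` and `unit_embeddings` are hypothetical. The point is only that the "embedded data" is read from the hidden activations, while the output layer is computed but not used.

```python
import math


def rnn_forward(inputs, w_in, w_rec, w_out):
    """Run a small Elman-style RNN over a sequence of feature vectors.

    Returns (hidden_states, outputs): one hidden activation vector and one
    output-layer vector per input step.
    """
    hidden_size = len(w_rec)
    h = [0.0] * hidden_size
    hidden_states, outputs = [], []
    for x in inputs:
        # Hidden layer: tanh of the input projection plus the recurrent projection.
        h = [
            math.tanh(
                sum(w_in[j][i] * x[i] for i in range(len(x)))
                + sum(w_rec[j][k] * h[k] for k in range(hidden_size))
            )
            for j in range(hidden_size)
        ]
        # Output layer: linear projection of the hidden activations.
        y = [sum(row[j] * h[j] for j in range(hidden_size)) for row in w_out]
        hidden_states.append(h)
        outputs.append(y)
    return hidden_states, outputs


def unit_embeddings(inputs, w_in, w_rec, w_out):
    """Return the hidden-layer activations as embeddings; discard the outputs."""
    hidden_states, _unused_outputs = rnn_forward(inputs, w_in, w_rec, w_out)
    return hidden_states


# Made-up weights: 2 input features, 3 hidden nodes, 1 output node.
W_IN = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.3]]
W_REC = [[0.0, 0.1, 0.0], [0.2, 0.0, -0.1], [0.0, 0.3, 0.0]]
W_OUT = [[1.0, -1.0, 0.5]]
FEATURES = [[1.0, 0.0], [0.0, 1.0]]  # two unit-level feature vectors

EMBEDDINGS = unit_embeddings(FEATURES, W_IN, W_REC, W_OUT)
```

Each entry of `EMBEDDINGS` has the dimension of the hidden layer (three values here), not the dimension of the output layer (one value here).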
Applicants’ argument is not completely persuasive that “a frame-level neural network . . . producing successive frame-level outputs corresponding to respective acoustic frames of speech output” is not disclosed by Fructuoso et al.  Here, Fructuoso et al., at ¶[0053] and ¶[0073], is maintained to sufficiently disclose this limitation, where it is stated that a second neural network receives data that indicates a particular quantity of frames of audio data.  Generally, Fructuoso et al. makes a distinction between linguistic features and acoustic features, and discloses that a first neural network maps linguistic features to acoustic features.  Fructuoso et al. then provides a second neural network using these acoustic features to produce duration information, which is “the successive frame-level outputs”.  Applicants’ argument appears to be predicated on a hypothesis that acoustic features might be ‘averaged’ over a plurality of frames for a whole phonetic linguistic unit, so as to render these acoustic features inconsistent with fixed-length frame-level characteristics of spectrum and fundamental frequency as disclosed by ¶[0032] of Fructuoso et al.  However, there is no evidence of any ‘averaging’ performed by Fructuoso et al., and it is maintained to be a reasonable reading of that reference that a second neural network produces “successive frame-Fructuoso et al.
Chun et al. (U.S. Patent Publication 2018/0268806), ¶[0083] - ¶[0085], discloses a fairly close method and system of speech synthesis expressly including “frame-level embeddings” and “unit level embeddings”.  Moreover, Chun et al., at ¶[0038]: Figure 1, discloses a linguistic encoder 114 and an acoustic encoder 116 that are recurrent neural networks, but appears to perform unit selection using only unit-level embeddings and not frame-level embeddings.  However, Chun et al. does not appear to qualify as prior art because its effective foreign priority date of 14 March 2017 is later than Applicants’ effective provisional priority date of 04 October 2016 from Provisional Patent Application 62/403771.
The Specification, ¶[0004] - ¶[0006], states an objective of improving processes of synthesizing speech so that the synthesized speech is more natural and does not have audible glitches.
Any comments considered necessary by Applicants must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee.  Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance.”
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MARTIN LERNER whose telephone number is (571) 272-7608.  The examiner can normally be reached on Monday-Thursday 8:30 AM-6:00 PM.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair.  Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/MARTIN LERNER/
Primary Examiner
Art Unit 2657
April 29, 2021