DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1-2, 6-7, 11-12, 16-17 and 20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Wu et al. (US 2020/0051583).

Claims 1, 11 and 20,
Wu teaches a method for speech synthesis in parallel, comprising: splitting a piece of text into a plurality of segments; based on the piece of text, obtaining a plurality of initial hidden states of the plurality of segments for a recurrent neural network; and synthesizing the plurality of segments in parallel based on the plurality of initial hidden states and input features of the plurality of segments ([Figs. 1 & 5] [0015-0029] [0068-0073] a speech synthesis system receives input text and processes the input text through a series of neural networks to generate speech; each character in the sequence of characters in the input text is converted into a one-hot vector, and each one-hot vector is embedded in a continuous vector; the input character sequence representation is processed to generate a Mel-frequency spectrogram; the input text sequence is processed using one or more convolutional layers followed by an LSTM layer; the neurons in each convolutional layer receive input from a small subset of neurons in a previous layer; this neuron connectivity allows the convolutional layers to learn filters that activate when particular hidden features appear in particular positions in a sequence of characters; the TTS model is a parallel feed-forward neural network).
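For illustration only, the three limitations recited in claim 1 (splitting, obtaining initial hidden states, and synthesizing in parallel) can be sketched as follows. This is a toy rendering of the claim language and is neither Wu's disclosed network nor applicant's implementation. All function names are hypothetical, whitespace-delimited words stand in for segments, and a scalar tanh cell stands in for the recurrent neural network.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def split_text(text):
    """Split text into segments; whitespace-separated words stand in for segments."""
    return text.split()


def predict_initial_state(segment):
    """Hypothetical stand-in for a trained predictor of a segment's initial hidden state."""
    return math.tanh(len(segment) / 10.0)


def segment_features(segment):
    """Toy input features: one number in [0, 1) per character of the segment."""
    return [(ord(c) % 32) / 32.0 for c in segment]


def synthesize_segment(h0, features):
    """Run a toy scalar recurrent cell serially over one segment's input features."""
    h, outputs = h0, []
    for x in features:
        # Recurrent update: the new state depends on the previous state.
        h = math.tanh(0.5 * h + 0.5 * x)
        outputs.append(h)
    return outputs


def synthesize_parallel(text):
    """Synthesize all segments concurrently, each from its own initial state."""
    segments = split_text(text)
    states = [predict_initial_state(s) for s in segments]
    features = [segment_features(s) for s in segments]
    # No segment needs another segment's final state, so they can run concurrently.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(synthesize_segment, states, features))
```

In this sketch each segment starts from its own predicted initial hidden state rather than from the final state of the preceding segment, which is what allows the segments to be processed without waiting on one another.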

Claims 2 and 12,
Wu further teaches the method of claim 1, wherein each segment in the plurality of segments comprises any of a phoneme, a syllable and a prosodic word, and synthesizing the plurality of segments in parallel comprises: synthesizing each segment serially in an autoregressive manner based on the initial hidden state and the input feature of each segment ([Figs. 1 & 5] [0015-0029] [0068-0073] sequence of characters of the input text; TTS model is trained in an autoregressive neural network).

Claims 6 and 16,
Wu further teaches the method of claim 1, further comprising: training a speech synthesis model based on the recurrent neural network by using training data; and training a hidden state prediction model by using the training data and the trained speech synthesis model ([0035] speech synthesis system trained based on neural network; predicting Mel-frequency spectrograms for each time step, a concatenation of the output of the LSTM subnetwork and the fixed-length context vector is projected to a scalar and passed through a sigmoid activation to predict the probability that the output sequence of Mel frequency spectrograms has completed).

Claims 7 and 17,
Wu further teaches the method of claim 6, wherein training the speech synthesis model based on the recurrent neural network comprises: obtaining a frame-level input feature of a training text in the training data and a speech sample point of a training speech corresponding to the training text, in which, the frame-level input feature comprises at least one of phoneme context, prosody context, a frame position and a fundamental frequency; and training the speech synthesis model by using the frame-level input feature of the training text and the speech sample point of the training speech ([0023] [0029-0035] a neural network for generating a frame-unit-based Mel-frequency spectrogram; an autoregressive neural network that is conditioned on Mel-frequency spectrograms generates time-domain audio waveforms; one or more convolutional layers process the predicted Mel-frequency spectrogram for the time step to predict a residual to add to the predicted Mel-frequency spectrogram; generating a fixed-length context vector for each frame of a Mel-frequency spectrogram).

Claim Rejections - 35 USC § 103
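The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.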


Claims 3-5, 8-10, 13-15 and 18-19 are rejected under 35 U.S.C. 103 as being unpatentable over Wu et al. (US 2020/0051583) and further in view of Wang (US 2018/0082675).

Claims 3 and 13,
Wu teaches obtaining the plurality of initial hidden states of the plurality of segments for the recurrent neural network, and predicting the initial hidden state of each segment by using a hidden state prediction model subjected to training ([Figs. 1 & 5] [0015-0029] [0068-0073] a neural network having a plurality of layers; the neural network includes a pre-net through which a Mel-frequency spectrogram prediction for a previous time step passes; the pre-net includes two fully-connected layers of hidden ReLUs).
The difference between the prior art and the claimed invention is that Wu does not explicitly teach determining a phoneme-level input feature of each segment in the plurality of segments; and based on the phoneme-level input feature of each segment, predicting the initial hidden state of each segment by using a hidden state prediction model subjected to training.
Wang teaches determining a phoneme-level input feature of each segment in the plurality of segments; and based on the phoneme-level input feature of each segment, predicting the initial hidden state of each segment by using a hidden state prediction model subjected to training ([0018-0022] a TTS system for parsing the series of text and generating a plurality of phonemes corresponding to the text series using HMM-based speech synthesis).
Wu is analogous art with Wang because they both involve speech synthesis. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Wu with the teachings of Wang (Wang [0016]).

Claims 4 and 14,
Wu further teaches the method of claim 3, wherein synthesizing the plurality of segments in parallel comprises: determining a frame-level input feature of each segment in the plurality of segments; based on the frame-level input feature, obtaining a sample-point level feature by utilizing an acoustic condition model; and based on the initial hidden state and the sample-point level feature of each segment, synthesizing respective segments by using a speech synthesis model based on the recurrent neural network ([0023] [0029-0035] a neural network for generating a frame-unit-based Mel-frequency spectrogram; an autoregressive neural network that is conditioned on Mel-frequency spectrograms generates time-domain audio waveforms; one or more convolutional layers process the predicted Mel-frequency spectrogram for the time step to predict a residual to add to the predicted Mel-frequency spectrogram).

Claims 5 and 15,
Wu further teaches the method of claim 4, wherein obtaining the sample-point level feature by utilizing the acoustic condition model comprises: obtaining the sample-point level feature by repeating up-sampling ([0023] generating a fixed-length context vector for each frame of a Mel-frequency spectrogram).
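For illustration only, the "repeating up-sampling" recited in claims 5 and 15 can be read as copying each frame-level feature once for every sample point that falls within the frame. The sketch below assumes a fixed number of sample points per frame; the function name and the `samples_per_frame` parameter are hypothetical and do not appear in the claims or in Wu.

```python
def repeat_upsample(frame_features, samples_per_frame):
    """Repeat each frame-level feature vector once per sample point in that frame."""
    if samples_per_frame <= 0:
        raise ValueError("samples_per_frame must be positive")
    # Each frame is copied so that later edits to one copy do not affect the others.
    return [list(frame) for frame in frame_features for _ in range(samples_per_frame)]
```

With two frames and three sample points per frame, the result is six sample-point level feature vectors, the first three equal to the first frame and the last three equal to the second.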


Claims 8 and 18,
Wu teaches all the limitations in claims 7 and 17 ([Figs. 1 & 5] [0015-0029] [0068-0073] a neural network having a plurality of layers; the neural network includes a pre-net through which a Mel-frequency spectrogram prediction for a previous time step passes; the pre-net includes two fully-connected layers of hidden ReLUs).
The difference between the prior art and the claimed invention is that Wu does not explicitly teach wherein training the hidden state prediction model comprises: obtaining a phoneme-level input feature of the training text, in which the phoneme-level input feature comprises at least one of the phoneme context and the prosody context; obtaining a phoneme-level hidden state of each phoneme from the trained speech synthesis model; and training the hidden state prediction model by using the phoneme-level input feature and the phoneme-level hidden state.
Wang teaches wherein training the hidden state prediction model comprises: obtaining a phoneme-level input feature of the training text, in which the phoneme-level input feature comprises at least one of the phoneme context and the prosody context; obtaining a phoneme-level hidden state of each phoneme from the trained speech synthesis model; and training the hidden state prediction model by using the phoneme-level input feature and the phoneme-level hidden state ([0018-0022] a TTS system for parsing the series of text and generating a plurality of phonemes corresponding to the text series using HMM-based speech synthesis).
Wu is analogous art with Wang because they both involve speech synthesis. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Wu with the teachings of Wang by modifying the neural network for synthesizing speech for text prediction/training using a plurality of phonemes of the parsed text as taught by Wang instead of word sequences as taught by Wu for the benefit of reducing the requirements on (Wang [0016]).

Claims 9 and 19,
Wang further teaches the method of claim 8, wherein training the hidden state prediction model further comprises: clustering the phoneme-level hidden state of each phoneme to generate a phoneme-level clustering hidden state; and training the hidden state prediction model by using the phoneme-level input feature and the phoneme-level clustering hidden state ([0025] [0029] each speech segment of the speech segments includes a plurality of text labels; text labels are for labelling relationships of the phonemes 1-M; training the HMM model based on text label in the speech segments).
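For illustration only, the clustering limitation recited in claims 9 and 19 can be sketched as replacing each phoneme-level hidden state with a representative of the cluster it falls into. The claims do not name a clustering algorithm; plain k-means is assumed here, and the function names, the seeding by the first `k` points, and the iteration count are all hypothetical choices.

```python
def kmeans(points, k, iterations=10):
    """Plain k-means on lists of floats; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        assign = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign


def clustered_states(states, k):
    """Replace each phoneme-level hidden state with the centroid of its cluster."""
    centroids, assign = kmeans(states, k)
    return [centroids[a] for a in assign]
```

Under this reading, the hidden state prediction model would be trained against the centroid targets returned by `clustered_states` rather than against the raw per-phoneme hidden states.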

Claim 10,
Wu further teaches the method of claim 8, wherein obtaining the phoneme-level hidden state of each phoneme from the trained speech synthesis model comprises: determining an initial hidden state of a first sample point in a plurality of sample points corresponding to each phoneme as the phoneme-level hidden state of each phoneme ([0023] generating a fixed-length context vector for each frame of a Mel-frequency spectrogram).
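For illustration only, the limitation recited in claim 10 amounts to reading, for each phoneme, the hidden state at the first of its sample points. The sketch below assumes that `sample_states[i]` holds the hidden state entering sample point `i` and that `phoneme_starts` lists the index of each phoneme's first sample point; both names are hypothetical.

```python
def phoneme_level_states(sample_states, phoneme_starts):
    """Pick, for each phoneme, the hidden state at its first sample point."""
    return [sample_states[start] for start in phoneme_starts]
```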

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Wang (US 2018/0082675) teaches a text-to-speech method that includes: receiving a text series, and generating a plurality of phonemes corresponding to the text series, wherein the phonemes form a phoneme series; inserting a pause phoneme into the phoneme series; dividing the phoneme series and the pause phoneme into a plurality of phoneme sub-series by using the pause phoneme as a dividing point, and generating a plurality of speech segments according to the phoneme sub-series; and performing a speech synthesis operation individually on the speech segments to generate a plurality of speech outputs corresponding to the plurality of speech segments. The pause phoneme is the last phoneme of the phoneme sub-series in which the pause phoneme is located.
Wu et al. (US 2020/0051583) teaches generating, from an input character sequence, an output sequence of audio data representing the input character sequence. The output sequence of audio data includes a respective audio output sample for each of a number of time steps. One example method includes, for each of the time steps: generating a Mel-frequency spectrogram for the time step by processing a representation of a respective portion of the input character sequence using a decoder neural network; generating a probability distribution over a plurality of possible audio output samples for the time step by processing the Mel-frequency spectrogram for the time step using a vocoder neural network; and selecting the audio output sample for the time step from the possible audio output samples in accordance with the probability distribution.
Pollet et al. (US 2009/0048841) teaches a speech segment database that references speech segments having various different speech representational structures. A speech segment selector selects from the speech segment database a sequence of speech segment candidates corresponding to a target text. A speech segment sequencer generates from the speech segment candidates sequenced speech segments corresponding to the target text. A speech segment synthesizer combines the selected sequenced speech segments to produce a synthesized speech signal output corresponding to the target text.
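For illustration only, the division step summarized above for Wang (US 2018/0082675), in which the pause phoneme serves as the dividing point and is the last phoneme of its sub-series, can be sketched as follows. The function name and the `"sil"` pause label are hypothetical and are not taken from Wang.

```python
def split_on_pause(phonemes, pause="sil"):
    """Divide a phoneme series at each pause phoneme; the pause closes its sub-series."""
    sub_series, current = [], []
    for ph in phonemes:
        current.append(ph)
        if ph == pause:
            # The pause phoneme is kept as the last element of the sub-series it ends.
            sub_series.append(current)
            current = []
    if current:
        # Trailing phonemes after the final pause form their own sub-series.
        sub_series.append(current)
    return sub_series
```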
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHREYANS A PATEL whose telephone number is (571)270-0689. The examiner can normally be reached on Monday-Friday 8am-5pm PST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached at 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.


SHREYANS A. PATEL

Art Unit 2657



/SHREYANS A PATEL/ Examiner, Art Unit 2656