DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
Applicant’s arguments with respect to the claim(s) have been considered but they are not persuasive.
Examiner respectfully maintains that the prior art of record fully teaches determining, by the device and using a duration model within the Tacotron system, respective temporal durations of each of the text components (Fig. 17’s Tacotron-WaveNet model, including using the Deep Voice 2 vocal model embodiment; para 111; Figs 3A, 3B depict multi-speaker Deep Voice 2 embodiments, including vocal model 355; para 80; and note that the vocal model 355 receives phoneme durations 260 from the trained duration model 340, which are used by the vocal model in synthesizing speech; para 99; Examiner notes here that the claimed “text components” are associated with the multiple phonemes 235, which in turn have corresponding phoneme durations 260).

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claim(s) 1 – 3, 5 – 10, 12 – 17, 19 and 20 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Arik et al. (hereinafter Arik, U.S. Patent Application Publication 2018/0336880). Note that Arik incorporates van den Oord, "WaveNet: A Generative Model for Raw Audio," arXiv:1609.03499, 2016 (see para 44), and is derived from the subject matter of Arik et al., "Deep Voice 2: Multi-speaker neural text-to-speech," arXiv preprint arXiv:1705.08947, 2017; see U.S. 10,896,669, the granted patent of the mapped publication ‘6880, detailing the incorporation under “OTHER PUBLICATIONS” and at col. 4 lines 1 – 3.

Regarding Claim 1, Arik discloses:
A method (e.g. operation of combined Tacotron-WaveNet model of Fig. 17; and in order to produce higher quality audio from Tacotron, a speaker-conditioned vocal model is used; para 111; note that the paper the Arik Patent Application Publication is derived from, ["Deep Voice 2: Multi-speaker neural text-to-speech," arXiv preprint arXiv:1705.08947, 2017 incorporated in its entirety under Arik et al. U.S. 10,896,669, granted patent of the mapped publication detailing incorporation under “OTHER PUBLICATIONS” in col. 2 of page 2, and col. 4 lines 1 – 3] details that the model used is equivalent to the Deep Voice 2 vocal model; see section 4.2.2), comprising:
receiving, by a device executing a Tacotron system, a text input that includes a sequence of text components (e.g. Tacotron character-to-spectrogram; para 108; note also neural text-to-speech systems; abstract; and neural TTS models Deep Voice 2 and Tacotron);
determining, by the device and using a duration model within the Tacotron system, respective temporal durations of each of the text components (Fig. 17’s Tacotron-WaveNet model, including using the Deep Voice 2 vocal model embodiment; para 111; Figs 3A, 3B depict multi-speaker Deep Voice 2 embodiments, including vocal model 355; para 80; and note that the vocal model 355 receives phoneme durations 260 from the trained duration model 340, which are used by the vocal model in synthesizing speech; para 99; Examiner notes here, the claimed “text components” are associated with the multiple phonemes 235, which in turn have corresponding phoneme durations 260.);

generating, by the device and using the Tacotron system, a second set of spectra by replicating respective spectra of the first set of spectra using respective replication factors that are determined based on the respective temporal durations of the sequence of text components (e.g. Tacotron-WaveNet model, including using the Deep Voice 2 vocal model embodiment; para 111; Figs 3A, 3B as noted above; further, multi-speaker frequency model 325 produces the frequency profiles; para 97; further, the predicted phoneme durations are upsampled for the frequency model, and input features [frequency information] corresponding to a phoneme will be repeated in two frames; para 64);
generating, by the device and using the Tacotron system, a spectrogram frame based on the second set of spectra (e.g. Tacotron-WaveNet model, including using the Deep Voice 2 vocal model embodiment; para 111; Figs 3A, 3B as noted above; further, note the vocal model for a character-to-spectrogram model; paras 108+ and 114);
generating, by the device and using the Tacotron system, an audio waveform based on the spectrogram frame (e.g. converting the spectrogram to audio using the Griffin-Lim or vocal model according to embodiments as detailed in Fig. 17; para 28); and
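
For illustration only, the duration-based replication addressed in the mapping above can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Arik or from the application: the 10 ms frame length, the function names replication_factors and replicate_spectra, and the sample values are all hypothetical. It shows per-component durations being converted to whole-number replication factors, and each spectrum of a first set being repeated by its factor to form a second set.

```python
from typing import List, Sequence

# Assumed frame length in milliseconds (hypothetical value for illustration).
FRAME_MS = 10.0


def replication_factors(durations_ms: Sequence[float],
                        frame_ms: float = FRAME_MS) -> List[int]:
    """Convert per-component durations into whole-frame repeat counts.

    Each component is kept for at least one frame.
    """
    return [max(1, round(d / frame_ms)) for d in durations_ms]


def replicate_spectra(first_set: Sequence[Sequence[float]],
                      factors: Sequence[int]) -> List[List[float]]:
    """Build a second set of spectra by repeating each spectrum of the
    first set according to its replication factor."""
    if len(first_set) != len(factors):
        raise ValueError("need exactly one replication factor per spectrum")
    second_set: List[List[float]] = []
    for spectrum, count in zip(first_set, factors):
        # Copy the spectrum once per output frame it should occupy.
        second_set.extend(list(spectrum) for _ in range(count))
    return second_set


# Hypothetical input: three text components with one spectrum each.
durations = [20.0, 10.0, 30.0]
first = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
factors = replication_factors(durations)
second = replicate_spectra(first, factors)
```

Under these assumptions, a component lasting 20 ms occupies two frames, so its spectrum appears twice in the second set, and the second set generally contains a different number of spectra than the first.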


Regarding Claim 2, in addition to the elements stated above regarding claim 1, Arik further discloses:
wherein the text components are phonemes (e.g. text-to-speech system inputs text converted to phonemes 235; para 106).

Regarding Claim 3, in addition to the elements stated above regarding claim 1, Arik further discloses:
wherein the text components are characters (e.g. Tacotron is a character to waveform model; para 107+; alternatively consider text-to-speech system inputs text converted to phonemes 235; para 106).

Regarding Claim 5, in addition to the elements stated above regarding claim 1, Arik further discloses:
wherein the second set of spectra comprise mel-frequency cepstrum spectra (e.g. note repeating as detailed above, including upsampling; para 64; and further, in one or more embodiments, mel-frequency cepstral coefficients (MFCCs) computed after resampling the input to a constant sampling frequency were used; para 138).

Regarding Claim 6, in addition to the elements stated above regarding claim 1, Arik further discloses:
training the duration model using a set of prediction frames and training text components (e.g. training procedure for the multi-speaker Deep Voice 2 embodiment; para 80; and note formulating duration prediction in the duration model; para 63).

Regarding Claim 7, in addition to the elements stated above regarding claim 1, Arik further discloses:
training the duration model using a hidden Markov Model forced alignment technique (e.g. estimation of phoneme locations in Deep Voice 2 embodiments; paras 60, 87; note also alignment of phonemes with audio in claim 1 as well; and further consider initializing RNN hidden states in the duration model; para 89; note also the improvement on prior approaches including a traditional Hidden Markov model using hidden representations; para 48; consider alternatively the combined Tacotron-WaveNet model of Fig. 17, with Arik detailing the incorporation of van den Oord, "WaveNet: A Generative Model for Raw Audio," arXiv:1609.03499, 2016; see para 44, and van den Oord detailing WaveNet using HMM in TTS applications; section 3.2).

Regarding Claim 20, in addition to the elements stated above regarding claim 15, Arik further discloses:
wherein the second set of spectra includes a different number of spectra than as compared to the first set of spectra (e.g. Tacotron-WaveNet model, including using the Deep Voice 2 vocal model embodiment; para 111; Figs 3A, 3B as noted above, further, multi-speaker frequency 

Claims 8 and 15 are rejected under the same grounds stated above regarding claim 1.

Claims 9 and 16 are rejected under the same grounds stated above regarding claim 2.

Claims 10 and 17 are rejected under the same grounds stated above regarding claim 3.

Claims 12 and 19 are rejected under the same grounds stated above regarding claim 5.

Claim 13 is rejected under the same grounds stated above regarding claim 6.

Claim 14 is rejected under the same grounds stated above regarding claim 7.

Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to THOMAS H MAUNG whose telephone number is (571)270-5690.  The examiner can normally be reached on Monday-Friday, 9am-6pm, EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Vivian Chin can be reached on 1-(571) 272-7848.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to 




/THOMAS H MAUNG/Primary Examiner, Art Unit 2654