DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1, 3, 6-8, 11, 14-17 and 20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Zhao et al. (US 2014/0257815).

Claim 1,
Zhao teaches a text-to-speech synthesis system, comprising: a speech engine; a processing unit; and a neural network; wherein, in a training mode: the speech engine is configured to generate synthetic speech data for a first input text; the processing unit is configured to compare the synthetic speech data to recorded reference speech data corresponding to the first input text, the processing unit ([0013-0019] pronunciation issue detector 26 determines possible pronunciation issues for synthesized speech generated by the TTS engine using evaluations performed at multiple levels; the pronunciation issue detector 26 evaluates results obtained at multiple levels of the TTS flow and the SR flow (e.g. phone, word, and signal level) by using the corresponding human recordings 104 as the reference for the synthesized speech generated from text 106, and outputs results 108 that list possible pronunciation issues; a signal level (e.g. signal level for phone sequences) may be used to determine similarities/differences between the human recorded speech and the TTS output; a model level checker may provide results to the pronunciation issue detector to check the similarities of the TTS and the SR phone set including mapping relations; results from a comparison of the SR output and the recordings may also be evaluated by the pronunciation issue detector; TTS flow 220 illustrates steps from input text 205 to the TTS output 240; SR flow 250 shows speech recognition steps from speech signals 244 to recognized text determined from the SR flow; the signal level includes the acoustic feature f0 (fundamental frequency); adjusting for the mismatch between the recognized text of the synthesized speech and the input text by comparing the similarity of the recognized text between synthesized speech and the corresponding recording).

Claims 11 and 20 contain subject matter similar to claim 1, and thus are rejected under similar rationale.

Claim 3,
Zhao further teaches the text-to-speech synthesis system of claim 1, wherein the text-to-speech synthesis system is a parametric text-to-speech synthesis system ([0015] Text-To-Speech (TTS) is a function of a human-machine speech interface).

Claim 6,
Zhao further teaches the text-to-speech synthesis system of claim 1, wherein in the training mode, the processing unit is further configured to align the synthetic speech data and the recorded reference speech data preceding the comparison ([0020-0021] detection modules of the text levels are based on the Dynamic Programming (DP) algorithm for label sequence alignment, comparing the recognized text sequence with the reference sequence, and also comparing the recognized text sequences of the synthesized speech and the recordings at both the phone and word levels).

Claim 15 contains subject matter similar to claim 6, and thus is rejected under similar rationale.
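
For illustration only, the dynamic-programming label sequence alignment cited above for claim 6 can be sketched as a standard unit-cost edit-distance alignment. This is a minimal sketch under stated assumptions, not Zhao's implementation: the function name align_labels, the unit costs, and the example phone labels are hypothetical and do not appear in Zhao or in the application.

```python
def align_labels(ref, hyp):
    """Align two label sequences (e.g. phones or words) by unit-cost edit distance.

    Hypothetical helper: returns (distance, pairs), where each pair is
    (ref_label, hyp_label) and None marks a deletion or an insertion.
    """
    n, m = len(ref), len(hyp)
    # cost[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitute = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            delete = cost[i - 1][j] + 1
            insert = cost[i][j - 1] + 1
            cost[i][j] = min(substitute, delete, insert)

    # Trace back from the bottom-right cell to recover the aligned pairs.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])):
            pairs.append((ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((ref[i - 1], None))  # label missing from hyp
            i -= 1
        else:
            pairs.append((None, hyp[j - 1]))  # extra label in hyp
            j -= 1
    pairs.reverse()
    return cost[n][m], pairs


# Hypothetical phone labels: reference recording vs. recognized synthesized speech.
distance, pairs = align_labels(["k", "ae", "t"], ["k", "t"])
```

In this sketch, a pair containing None or two differing labels marks a position where the recognized sequence departs from the reference sequence.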

Claim 7,
Zhao further teaches the text-to-speech synthesis system of claim 6, wherein the processing unit is further configured to implement one or more of pitch shifting, time normalization, and time alignment between the synthetic speech data and the recorded reference speech data ([0022] the detection is based on a fundamental frequency (f0) comparison for the consistency of the synthesized speech and the corresponding recordings inside the phones; the phone segment information is based on the HTK forced alignment of the recognized phone sequence and the input speech signals).

Claim 16 contains subject matter similar to claim 7, and thus is rejected under similar rationale.
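
For illustration only, a per-phone f0 consistency check of the kind cited above for claim 7 might look like the following. This is a minimal sketch under stated assumptions, not Zhao's method: the function name, the mean-absolute-difference metric, the 20 Hz threshold, the convention that 0 marks an unvoiced frame, and the assumption that both contours are already aligned frame by frame are all hypothetical.

```python
def f0_mismatch_by_phone(f0_synth, f0_ref, segments, threshold_hz=20.0):
    """Flag phone segments whose f0 contours differ by more than a threshold.

    f0_synth, f0_ref: per-frame f0 values in Hz, assumed already time-aligned;
                      0 is assumed to mark an unvoiced frame.
    segments:         (label, start_frame, end_frame) tuples, end exclusive,
                      e.g. as produced by a forced alignment.
    Returns a list of (label, mean_abs_diff_hz) for the flagged segments.
    """
    flagged = []
    for label, start, end in segments:
        # Compare only frames that are voiced in both signals.
        diffs = [
            abs(s - r)
            for s, r in zip(f0_synth[start:end], f0_ref[start:end])
            if s > 0 and r > 0
        ]
        if not diffs:
            continue  # nothing comparable in this segment
        mean_diff = sum(diffs) / len(diffs)
        if mean_diff > threshold_hz:
            flagged.append((label, mean_diff))
    return flagged


# Hypothetical data: two phones of two frames each.
result = f0_mismatch_by_phone(
    [100.0, 100.0, 150.0, 150.0],
    [100.0, 105.0, 100.0, 100.0],
    [("a", 0, 2), ("b", 2, 4)],
)
```

Here only the second segment exceeds the assumed threshold, so it alone is reported.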

Claim 8,
Zhao further teaches the text-to-speech synthesis system of claim 1, wherein the at least one feature extracted include a sequence of excitation vectors corresponding to the at least one difference between the synthetic speech data and the recorded reference speech data for the first input text ([0018] [0022] comparing the synthesized speech and the recordings at multiple levels (e.g. text levels and signal level); the signal level includes the acoustic feature f0, and detection is based on the fundamental frequency f0).

Claim 14,
Zhao further teaches the text-to-speech synthesis method of claim 11, wherein the synthetic speech data generated is further based on, at least in part, the recorded reference speech data pre-recorded by a speaker ([0016] TTS system, SRAE framework 200 uses recordings 242 (e.g. human recording of text 205) as a reference).

Claim 17,
Zhao further teaches the text-to-speech synthesis method of claim 11 further comprising training a neural network based on, at least in part, the at least one feature to generate the speech gap filling model ([0019] using the constrained text may assist in removing errors from the SR engine by adjusting for the mismatch between the recognized text of the synthesized speech and the input text by comparing the similarity of the recognized text between synthesized speech and the corresponding recording).

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 4-5 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Zhao et al. (US 2014/0257815) in view of Stefan et al. (US 2011/0282668).

Claim 4,
Zhao teaches all the limitations in claim 1. The difference between the prior art and the claimed invention is that Zhao does not explicitly teach wherein the synthetic speech data, as generated by the speech engine, is based on, at least in part, at least one of a parametric acoustic model and a linguistic model pre-configured for a speaker.
Stefan teaches wherein the synthetic speech data, as generated by the speech engine, is based on, at least in part, at least one of a parametric acoustic model and a linguistic model pre-configured for a speaker ([0028] linguistic models, acoustic models and the like can be stored in memory of one of the servers and/or databases for TTS processing).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Zhao with teachings of Stefan by modifying the speech recognition assisted evaluation on text-to-speech pronunciation issue detection as (Stefan [0027]).

Claim 13 contains subject matter similar to claim 4, and thus is rejected under similar rationale.

Claim 5,
Zhao further teaches the text-to-speech synthesis system of claim 4, wherein the synthetic speech data, as generated by the speech engine, is further based on, at least in part, the recorded reference speech data pre-recorded by the speaker ([0016] TTS system, SRAE framework 200 uses recordings 242 (e.g. human recording of text 205) as a reference).

Allowable Subject Matter
Claims 2, 9-10, 12 and 18-19 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHREYANS A PATEL whose telephone number is (571)270-0689. The examiner can normally be reached Monday-Friday 8am-5pm PST.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached at 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

SHREYANS A. PATEL
Examiner
Art Unit 2657



/SHREYANS A PATEL/Examiner, Art Unit 2656