DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 102
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1, 3, 4, 6, 8, 14, 15, and 20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Li, U.S. PAP 2018/0254034 A1.


Regarding claim 1, Li teaches a system, comprising a processor to:
receive a linguistic sequence and a prosody info offset (process input text, including pre-processing, word segmentation, part-of-speech tagging, polyphone prediction, prosodic hierarchy prediction, and the like, see par. [0004]); 
generate, via a trained prosody info predictor, combined prosody info comprising a plurality of observations based on the linguistic sequence, wherein the plurality of observations comprise linear combinations of statistical measures evaluating a prosodic component over a predetermined period of time (performing a prosody prediction on the text to be synthesized after the part-of-speech tagging via a prosody prediction model, see par. [0075]; the prosodic features, and the context features of the text to be synthesized can be input to an acoustic prediction model, so as to perform the acoustic prediction on 
and generate, via a trained neural network, an acoustic sequence based on the combined prosody info, the prosody info offset, and the linguistic sequence (the result of phonetic notation, the prosodic features, and context features of the text to be synthesized are input to the first target user acoustic model, and an acoustic prediction is performed on the text to be synthesized via the first target user acoustic model, to generate an acoustic parameter sequence of the text to be synthesized, see par. [0078]; after the obtaining module 120 obtains the speech data of the target user, based on the reference acoustic model, the second model training module 130 can train the first target user acoustic model using the speech data of the target user and via an adaptive technology (for example, via a long short-term memory (LSTM for short) neural network structure or a bidirectional LSTM neural network structure), see par. [0104]).
Regarding claim 3, Li teaches the system of claim 1, wherein the processor is to train the prosody info predictor based on an embedded linguistic sequence generated by the system trained with the observed prosody info (the first target user acoustic model can be obtained by adaptive training and updating, such that the first target user acoustic model further has speech features of the target user as well as general information in the reference acoustic model, see par. [0054]).
Regarding claim 4 Li teaches the system of claim 1, wherein the processor is to train the neural network based on observed spectra extracted from recordings during training, the neural network comprising a sequence-to-sequence neural network including a prosody info encoder (prosody prediction module), a linguistic encoder (word 
Regarding claim 6, Li teaches the system of claim 1, wherein the processor is to generate, via a linguistic encoder, an embedded linguistic sequence based on the linguistic sequence (the word segmentation module 220 is configured to perform word segmentation on the text to be synthesized; the part-of-speech tagging module 230 is configured to perform part-of-speech tagging on the text to be synthesized after the word segmentation, see par. [0122]-[0123]).
 
Regarding claim 8, Li teaches a computer-implemented method (method for speech synthesis, see par. [0014]), comprising:

receiving a linguistic sequence and a prosody info offset (process input text, including pre-processing, word segmentation, part-of-speech tagging, polyphone prediction, prosodic hierarchy prediction, and the like, see par. [0004]); 
generating, via a trained prosody info predictor, combined prosody info comprising a plurality of observations based on the linguistic sequence, wherein the plurality of observations comprise linear combinations of statistical measures evaluating a prosodic component over a predetermined period of time (performing a prosody prediction on the text to be synthesized after the part-of-speech tagging via a prosody 
Regarding claim 14, Li teaches the computer-implemented method of claim 8, comprising generating an audio based on the acoustic sequence (generating a speech synthesis result of the text to be synthesized according to the acoustic parameter sequence, see par. [0020]).
Regarding claim 15 Li teaches a computer program product for automatically controlling prosody, the computer program product comprising a computer-readable storage medium having program code embodied therewith, wherein the computer 
receive a linguistic sequence and a prosody info offset (process input text, including pre-processing, word segmentation, part-of-speech tagging, polyphone prediction, prosodic hierarchy prediction, and the like, see par. [0004]); 
generate, via a trained prosody info predictor, combined prosody info comprising a plurality of observations based on the linguistic sequence, wherein the plurality of observations comprise linear combinations of statistical measures evaluating a prosodic component over a predetermined period of time (performing a prosody prediction on the text to be synthesized after the part-of-speech tagging via a prosody prediction model, see par. [0075]; the prosodic features, and the context features of the text to be synthesized can be input to an acoustic prediction model, so as to perform the acoustic prediction on the text to be synthesized, and to generate the corresponding acoustic parameter sequence such as duration, a spectrum, fundamental frequency, and the like, see par. [0079]); 
and generate, via a trained neural network, an acoustic sequence based on the combined prosody info, the prosody info offset, and the linguistic sequence (the result of phonetic notation, the prosodic features, and context features of the text to be synthesized are input to the first target user acoustic model, and an acoustic prediction is performed on the text to be synthesized via the first target user acoustic model, to generate an acoustic parameter sequence of the text to be synthesized, see par. [0078]; after the obtaining module 120 obtains the speech data of the target user, based on the reference 
Regarding claim 20, Li teaches the computer program product of claim 15, further comprising program code executable by the processor to generate an audio based on the acoustic sequence (generating a speech synthesis result of the text to be synthesized according to the acoustic parameter sequence, see par. [0020]).

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
Claims 2, 5, 7, 9-13, and 16-19 are rejected under 35 U.S.C. 103 as being unpatentable over Li, U.S. PAP 2018/0254034 A1, in view of Gopala, U.S. PAP 2021/0074260 A1.


Regarding claim 2, Li does not teach the system of claim 1, wherein the processor is to train the prosody info predictor based on observed prosody info extracted from unlabeled training data.
However, Gopala teaches that with unsupervised or unlabeled data, a few hundred thousand instances of audio content may be needed to train a neural network to identify fake media with 70% accuracy, see par. [0115].
It would have been obvious to one of ordinary skill in the art to combine the Li invention with the teachings of Gopala for the benefit of using a diverse speech dataset for different people that can then be used to accurately simulate the specific voice of a particular individual, see par. [0002].
Regarding claim 5, Gopala teaches the system of claim 1, wherein the processor is to modify the plurality of observations based on the prosody info offset to adjust a prosody of the acoustic sequence in a particular predetermined manner (selectively adds the prosodic characteristic of the speech by the individual in the output speech based at least in part on the prediction, see abstract).
It would have been obvious to one of ordinary skill in the art to combine the Li invention with the teachings of Gopala for the benefit of using a diverse speech dataset for different people that can then be used to accurately simulate the specific voice of a particular individual, see par. [0002].
Regarding claim 7, Li does not teach the claimed elements; however, Gopala teaches the system of claim 1, wherein the prosodic component comprises a pace component, a pitch component, a loudness component, or any combination thereof (the voice synthesis engine predicts positions and duration of a prosodic characteristic of speech by the individual, and selectively adds the prosodic characteristic of the speech by the individual in the output speech based at least in part on the prediction. Note that the prosodic characteristic may include: pauses in the speech by the individual, and/or disfluences in the speech by the individual, see par. [0143]).
It would have been obvious to one of ordinary skill in the art to combine the Li invention with the teachings of Gopala for the benefit of using a diverse speech dataset for different people that can then be used to accurately simulate the specific voice of a particular individual, see par. [0002].

Regarding claim 9, Li does not teach the claimed elements; however, Gopala teaches the computer-implemented method of claim 8, comprising:
generating, via a trained encoder, an embedded linguistic sequence based on the linguistic sequence (the transformation may include a neural network and/or the representation may include word embedding or sense embedding of words in the audio content, see par. [0050]); 
and combining by summation or concatenation and encoding the plurality of observations to generate an embedded prosody info, and concatenating the embedded prosody info with the embedded linguistic sequence (the voice synthesis engine predicts positions and duration of a prosodic characteristic of speech by the individual, and 
It would have been obvious to one of ordinary skill in the art to combine the Li invention with the teachings of Gopala for the benefit of using a diverse speech dataset for different people that can then be used to accurately simulate the specific voice of a particular individual, see par. [0002].
Regarding claim 10, Li does not teach the claimed elements; however, Gopala teaches the computer-implemented method of claim 8, comprising modifying the plurality of observations based on the prosody info offset (selectively adds the prosodic characteristic of the speech by the individual in the output speech based at least in part on the prediction, see abstract).
It would have been obvious to one of ordinary skill in the art to combine the Li invention with the teachings of Gopala for the benefit of using a diverse speech dataset for different people that can then be used to accurately simulate the specific voice of a particular individual, see par. [0002].
Regarding claim 11, Gopala teaches the computer-implemented method of claim 10, wherein modifying the plurality of observations comprises adding the prosody info offset to corresponding observations (selectively adds the prosodic characteristic of the speech by the individual in the output speech based at least in part on the prediction, see abstract).
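For illustration only, the claim 11 limitation of adding the prosody info offset to corresponding observations can be read as an element-wise addition. The following sketch assumes, hypothetically, that the observations and the offset are numeric vectors of equal length. The function name `apply_prosody_offset` and the sample values are not taken from the application, Li, or Gopala.

```python
from typing import List, Sequence


def apply_prosody_offset(observations: Sequence[float],
                         offset: Sequence[float]) -> List[float]:
    """Add each offset component to the corresponding observation.

    Hypothetical helper: the equal-length vector layout is an assumption.
    """
    if len(observations) != len(offset):
        raise ValueError("observations and offset must have the same length")
    return [obs + off for obs, off in zip(observations, offset)]


# Hypothetical predicted observations and a user-supplied offset.
predicted = [1.0, 0.5]
offset = [0.25, -0.5]
adjusted = apply_prosody_offset(predicted, offset)
print(adjusted)
```

Under this reading, a zero offset leaves the predicted observations unchanged, and a nonzero component shifts only its corresponding observation.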
Regarding claim 12, Li does not teach the claimed elements; however, Gopala teaches the computer-implemented method of claim 8, wherein the plurality of observations are evaluated at an utterance level (the prediction of the temporal positions and duration of the prosodic characteristic may be based at least in part on a predetermined histogram of occurrences of the prosodic characteristic as a function of time interval in the individual's speech, see par. [0090]).
It would have been obvious to one of ordinary skill in the art to combine the Li invention with the teachings of Gopala for the benefit of using a diverse speech dataset for different people that can then be used to accurately simulate the specific voice of a particular individual, see par. [0002].

Regarding claim 13, Li does not teach the claimed elements; however, Gopala teaches the computer-implemented method of claim 8, wherein the plurality of observations are evaluated locally and hierarchically at different temporal spans (the prediction of the temporal positions and duration of the prosodic characteristic may be based at least in part on a predetermined histogram of occurrences of the prosodic characteristic as a function of time interval in the individual's speech, see par. [0090]).
It would have been obvious to one of ordinary skill in the art to combine the Li invention with the teachings of Gopala for the benefit of using a diverse speech dataset for different people that can then be used to accurately simulate the specific voice of a particular individual, see par. [0002].
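For illustration only, the distinction between observations evaluated at an utterance level (claim 12) and observations evaluated locally at different temporal spans (claim 13) can be sketched as follows. The sketch assumes, hypothetically, that an observation is a weighted sum of the mean and the range of a pitch contour and that local spans are fixed-size, non-overlapping windows. The weights, names, and sample values are assumptions. The sketch does not represent the applicant's implementation or the histogram-based prediction of Gopala.

```python
from statistics import fmean
from typing import List, Sequence


def observation(frames: Sequence[float],
                w_mean: float = 1.0,
                w_range: float = 0.5) -> float:
    """Hypothetical observation: linear combination of two statistical
    measures (mean and range) evaluated over one span of frames."""
    return w_mean * fmean(frames) + w_range * (max(frames) - min(frames))


def local_observations(frames: Sequence[float], span: int) -> List[float]:
    """Evaluate the observation over consecutive, non-overlapping spans."""
    return [observation(frames[i:i + span])
            for i in range(0, len(frames), span)]


# Hypothetical pitch contour in Hz, one value per frame.
pitch_hz = [100.0, 120.0, 140.0, 160.0]
utterance_level = observation(pitch_hz)            # whole utterance
span_level = local_observations(pitch_hz, span=2)  # shorter local spans
print(utterance_level, span_level)
```

Calling `local_observations` with several different `span` values would give one set of observations per temporal span, which is one possible reading of a hierarchical evaluation.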
Regarding claim 16, Li does not teach the claimed elements; however, Gopala teaches the computer program product of claim 15, further comprising program code executable by 
It would have been obvious to one of ordinary skill in the art to combine the Li invention with the teachings of Gopala for the benefit of using a diverse speech dataset for different people that can then be used to accurately simulate the specific voice of a particular individual, see par. [0002].
Regarding claim 17, Li does not teach the claimed elements; however, Gopala teaches the computer program product of claim 15, further comprising program code executable by the processor to modify the plurality of observations based on the prosody info offset (selectively adds the prosodic characteristic of the speech by the individual in the output speech based at least in part on the prediction, see abstract).
It would have been obvious to one of ordinary skill in the art to combine the Li invention with the teachings of Gopala for the benefit of using a diverse speech dataset for different people that can then be used to accurately simulate the specific voice of a particular individual, see par. [0002].
Regarding claim 18, Li does not teach the claimed elements; however, Gopala teaches the computer program product of claim 15, further comprising program code executable by the processor to add the prosody info offset to corresponding observations of the prosody info (selectively adds the prosodic characteristic of the speech by the individual in the output speech based at least in part on the prediction, see abstract).
It would have been obvious to one of ordinary skill in the art to combine the Li invention with the teachings of Gopala for the benefit of using a diverse speech dataset for different people that can then be used to accurately simulate the specific voice of a particular individual, see par. [0002].
Regarding claim 19, Li does not teach the computer program product of claim 15, further comprising program code executable by the processor to train the prosody info predictor based on observed prosody info extracted from unlabeled training data.
In the same field of endeavor, Gopala teaches techniques for generating output speech that includes one or more prosodic characteristics of an individual, see par. [0001]. Gopala discloses a neural network using a diverse speech dataset for different people that can then be used to accurately simulate the specific voice of a particular individual, see par. [0002]. With supervised or labeled data, a few hundred instances of audio content may be needed to train a neural network to identify fake media with 80-85% accuracy. Alternatively, with unsupervised or unlabeled data, a few hundred thousand instances of audio content may be needed to train a neural network to identify fake media with 70% accuracy, see par. [0118].
It would have been obvious to one of ordinary skill in the art to combine the Li invention with the teachings of Gopala for the benefit of using a diverse speech dataset for different people that can then be used to accurately simulate the specific voice of a particular individual, see par. [0002].
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Pertinent prior art available on form 892.
Ando ’679 teaches an estimation model which uses a deep neural network as a prosodic feature estimation model.
Kim ’998 teaches a speech synthesizer which uses a sequential prosody feature.
Fernandez ’546 teaches prosody prediction using a parametric model.
Fructuoso ’359 teaches methods for multilingual prosody generation which use neural networks to provide output indicating prosody information.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Michael Ortiz-Sanchez, whose telephone number is (571) 270-3711. The examiner can normally be reached Monday-Friday, 9AM-6PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached at 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications 






/MICHAEL ORTIZ-SANCHEZ/
Primary Examiner, Art Unit 2656