DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority
Applicant’s claim for the benefit of a prior-filed application under 35 U.S.C. 119(e) or under 35 U.S.C. 120, 121, or 365(c) is acknowledged.   


Information Disclosure Statement
The references listed in the Information Disclosure Statement submitted on 01/23/2021, 02/17/2021, 02/18/2021, 02/19/2021, 02/22/2021, 02/23/2021, 02/10/2022 and 03/09/2022, have been considered by the examiner (see attached PTO-1449). 

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.  Patentability shall not be negated by the manner in which the invention was made.


Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over LUAN et al. (IDS: US 2015/0243275) hereinafter referenced as LUAN in view of HONEYCUTT (IDS: US 2012/0265533). 
As per claim 10, LUAN discloses ‘voice font speaker and prosody interpolation’ (title) for “synthesizing speech from text” (paragraphs (p) 5-6), comprising: 
converting (or ‘parses’) an input text (‘the text’) to (‘into’) phonemes (‘a phoneme sequence’) corresponding to the input text, (Fig. 2A, ‘206’, p5, p22);   
inputting the phonemes and a duration [speaker] identifier (read on ‘duration weight for j-th voice font’) into a trained duration model (read on ‘prediction models for the duration’, or ‘duration prediction model’ of the corresponding ‘voice font’, wherein an existing ‘voice font’ ‘includes’ the ‘prediction models’, is ‘trained from a recording corpus collected from a voice talent’ (i.e. a speaker, so that the prediction models are also trained), and ‘has a number of associated parameters’ defining ‘the sound’ to ‘render the computer generated speech’) that uses the phonemes (same as ‘i-th phoneme’ in the ‘phoneme sequence’) and the duration [speaker] identifier (same as stated above) to output phoneme durations corresponding to the phonemes of the input text and the duration [speaker] identifier, (Fig. 2A, ‘206’, p17, p27); 
inputting the phonemes, the phoneme durations, and a frequency [speaker] identifier (read on ‘f0 weight for j-th voice font’) into a trained frequency model (read on ‘prediction models for…fundamental frequency (f0)’, or ‘f0 prediction model’ of the corresponding ‘voice font’) to predict frequency profiles (read on ‘f0 contour’ or ‘f0’ for ‘each of frame’ of the ‘phoneme sequence’) for the phonemes, in which a frequency profile for a phoneme comprises (Fig. 2A, ‘206’, p17, p23, p30): 
a probability that the phoneme is voiced (read on ‘V/UV probability value for the phoneme’) (p23); and 
a fundamental frequency profile (read on ‘f0 contour’ or ‘f0’ for ‘each of frame’ of corresponding phoneme(s)), (p17, p23); and 
using a trained vocal model (read on ‘prediction models for…spectral envelope’, ‘spectrum prediction model’, or ‘f0 spectral trajectory prediction model’ of the corresponding ‘voice font’) that receives as an input a vocal [speaker] identifier (read on ‘spectrum weight for j-th voice font’), the phonemes (same as stated above), the phoneme durations (same as stated above), and the frequency profiles for the phonemes (same as stated above) to synthesize (‘generating’, ‘produced’, or ‘computer-generated’) a signal (‘speech’) representing synthesized speech of the input text, in which the synthesized speech has audio characteristics corresponding to the speaker (read on ‘having the desired speaker characteristics and prosody’) [identity] (Fig. 2A, ‘206’, p16-p17, p20-p23, p28). 
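For illustration only, the data flow of the limitations mapped above (text to phonemes, then a duration model, a frequency model, and a vocal model, each receiving its own speaker identifier) may be sketched as follows. This is a toy sketch under stated assumptions: every function name, the letter-per-phoneme conversion, and all numeric values are hypothetical and are taken from neither the application nor LUAN.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FrequencyProfile:
    voiced_probability: float  # probability that the phoneme is voiced
    f0_contour: List[float]    # fundamental frequency per frame, in Hz


def text_to_phonemes(text: str) -> List[str]:
    """Hypothetical stand-in for grapheme-to-phoneme conversion: one unit per letter."""
    return [ch for ch in text.lower() if ch.isalpha()]


def duration_model(phonemes: List[str], duration_speaker_id: int) -> List[int]:
    """Toy duration model: a frame count per phoneme that depends on the identifier."""
    return [2 + duration_speaker_id for _ in phonemes]


def frequency_model(phonemes: List[str], durations: List[int],
                    frequency_speaker_id: int) -> List[FrequencyProfile]:
    """Toy frequency model: a voicing probability and a flat f0 contour per phoneme."""
    base_f0 = 100.0 + 20.0 * frequency_speaker_id
    return [
        FrequencyProfile(0.9 if ph in "aeiou" else 0.2, [base_f0] * n_frames)
        for ph, n_frames in zip(phonemes, durations)
    ]


def vocal_model(phonemes: List[str], durations: List[int],
                profiles: List[FrequencyProfile], vocal_speaker_id: int) -> List[float]:
    """Toy vocal model: one output sample per frame, scaled by the identifier."""
    gain = 1.0 + 0.5 * vocal_speaker_id
    signal: List[float] = []
    for _ph, n_frames, profile in zip(phonemes, durations, profiles):
        for f0 in profile.f0_contour[:n_frames]:
            signal.append(gain * profile.voiced_probability * f0)
    return signal


def synthesize(text: str, duration_speaker_id: int,
               frequency_speaker_id: int, vocal_speaker_id: int) -> List[float]:
    """Chains the four steps in the order recited in the claim."""
    phonemes = text_to_phonemes(text)
    durations = duration_model(phonemes, duration_speaker_id)
    profiles = frequency_model(phonemes, durations, frequency_speaker_id)
    return vocal_model(phonemes, durations, profiles, vocal_speaker_id)
```

In this sketch, `synthesize("hi", 1, 0, 0)` yields six samples (two phonemes of three frames each), and changing any one of the three identifiers changes only the stage that receives it.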
It is noted that even though LUAN discloses that a ‘voice font’ is ‘trained with recording corpus’ obtained/collected ‘from a/one voice talent’ (p1 and p17), ‘generating interpolated voice fonts having the desired speaker characteristics and prosody’ (p16), and that ‘adjusting the weight given to each voice font…alters the speaker characteristics and/or prosody of the computer-generated speech’ (p19 and p25), LUAN does not expressly disclose processing/synthesizing the phonemes/speech with the related models by using speaker identifiers or a speaker identity. However, the same/similar concept/feature is well known in the art, as evidenced by HONEYCUTT, who, in the same field of endeavor, discloses ‘voice assignment for text-to-speech output’ (title), comprising: a ‘speaker profile’ including ‘information used to determine a voice characteristics’; a ‘TTS engine’ performing ‘text-to-phoneme or grapheme-to-phoneme conversion’; a ‘synthesizer’ incorporating ‘a model of human voice track or other voice characteristics’ and ‘computation of target prosody (i.e. pitch contour, phoneme durations)’; ‘a speaker’s voice’ recorded and analyzed to generate voice data (p19-p22); and a ‘voice database’ with information in the ‘speaker profile’ including ‘a unique identifier associated with a set of voice parameters or recorded speech that can be used by TTS engine to generate speech output based on the speaker profile’ (p27).  Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to combine the teachings of LUAN and HONEYCUTT by providing a mechanism of associating a unique identifier/identity related to a speaker (such as a speaker identifier or speaker profile) with a set of voice parameters (such as parameters of voice fonts/data, including related weight factors to be adjusted to the speaker characteristics and/or prosody of the speaker) used by the TTS engine to generate/synthesize speech (i.e. computer-generated speech closely resembling the desired speaker characteristics and/or prosody of the speaker) corresponding to the unique identifier/identity related to the speaker, for the purpose (motivation) of producing speech output having voice characteristics that best match the speaker profile, and/or providing a speech output that allows speaker recognition while providing a more enjoyable and entertaining experience for the listener (HONEYCUTT: abstract, p4). 
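For illustration only, the proposed combination (a unique speaker identifier associated with a set of voice parameters, here per-voice-font weights used for interpolation) may be sketched as follows. The table contents, the identifier strings, and the normalization of the weights by their sum are assumptions made for this sketch; they are not taken from LUAN or HONEYCUTT.

```python
from typing import Dict, List

# Hypothetical per-voice-font duration predictions (frames per phoneme)
# for one three-phoneme sequence.
FONT_DURATIONS: Dict[str, List[float]] = {
    "font_a": [4.0, 6.0, 8.0],
    "font_b": [8.0, 10.0, 4.0],
}

# Hypothetical speaker profile table: a unique identifier keyed to a set of
# voice parameters (here, the duration weight given to each voice font).
SPEAKER_PROFILES: Dict[str, Dict[str, float]] = {
    "speaker_001": {"font_a": 0.75, "font_b": 0.25},
    "speaker_002": {"font_a": 0.0, "font_b": 1.0},
}


def interpolate(per_font: Dict[str, List[float]],
                weights: Dict[str, float]) -> List[float]:
    """Weighted average across voice fonts, normalized by the weight sum (assumed)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    length = len(next(iter(per_font.values())))
    return [
        sum(weights[name] * per_font[name][i] for name in weights) / total
        for i in range(length)
    ]


def durations_for_speaker(speaker_id: str) -> List[float]:
    """Looks up the weights by identifier, then interpolates the font predictions."""
    return interpolate(FONT_DURATIONS, SPEAKER_PROFILES[speaker_id])
```

With these assumed values, `durations_for_speaker("speaker_001")` returns `[5.0, 7.0, 7.0]`, and `"speaker_002"` simply reproduces the predictions of `font_b`.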

Double Patenting
The applicant identified the instant application as a divisional (DIV) application of Application 15/974,397, which has issued as US 10,896,669.  However, it is noted that the divisional application does not result entirely from a restriction/election requirement made by an examiner of the PTO, and the claims of the instant application have not all been shown to be patentably distinct from those of the parent patent US 10,896,669.  Therefore, a double patenting rejection is applicable to the instant application (see below).
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159.  See MPEP §§ 706.02(l)(1) - 706.02(l)(3) for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b). 
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/process/file/efs/guidance/eTD-info-I.jsp.

Claims 10-15 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1-6 of U.S. Patent No. 10,896,669, hereinafter referenced as P669. Although the claims at issue are not identical, they are not patentably distinct from each other for the following reason(s).
Regarding claims 10-15, a comparison of the limitations of the claims of the instant application and the claims of P669 is as follows: 
Claims of the instant application (listed first) | Claims of P669 (listed second)
10. A computer-implemented method for synthesizing speech from text, comprising: 

converting an input text to phonemes corresponding to the input text; 

inputting the phonemes and a duration speaker identifier into a trained duration model that uses the phonemes and the duration speaker identifier to output phoneme durations corresponding to the phonemes of the input text and the duration speaker identifier; 
    
inputting the phonemes, the phoneme durations, and a frequency speaker identifier into a trained frequency model to predict frequency profiles for the phonemes, in which a frequency profile for a phoneme comprises:
    a probability that the phoneme is voiced; and    
    a fundamental frequency profile; and 

using a trained vocal model that receives as an input a vocal speaker identifier, the phonemes, the phoneme durations, and the frequency profiles for the phonemes to synthesize a signal representing synthesized speech of the input text, in which the synthesized speech has audio characteristics corresponding to the duration speaker identifier, the frequency speaker identifier, and the vocal speaker identifier. 


11. The computer-implemented method of Claim 10 
wherein the duration speaker identifier, the frequency speaker identifier, and the vocal speaker identifier comprise a shared speaker embedding representation and, when required for a site-specific use in the trained duration model, the trained frequency model, or the trained vocal model, respectively, is transformed to an appropriate dimension and form.  

12. The computer-implemented method of Claim 11 
wherein, for the trained duration model, a first site-specific embedding representation of the duration speaker identifier is used to initialize a neural network's (NN) hidden states and a second site-specific embedding representation of the duration speaker identifier is provided as input to a first NN layer by concatenating it with feature representations of the phonemes.
  
13. The computer-implemented method of Claim 11 
wherein, for the trained frequency model, one or more recurrent layers are initialized with a first site-specific speaker embedding representation of the frequency speaker identifier and a fundamental frequency prediction of a phoneme is computing using a second site-specific speaker embedding of the frequency speaker identifier and trained model parameters.  

14. The computer-implemented method of Claim 11 
wherein, for the trained vocal model, a site-specific speaker embedding representation of the vocal speaker identifier is concatenated onto each input frame in a conditioner network of the trained vocal model. 
 
15. The computer-implemented method of Claim 10 
the trained duration model, the trained frequency model, and the trained vocal model were obtained by training a duration model, a frequency model, and a vocal model by performing steps comprising: 

converting an input training text to phonemes corresponding to the input training text, which is a transcription corresponding to training audio comprising utterances of a speaker, to phonemes corresponding to the input training text and training audio; 

using the training audio, the phonemes corresponding to the input training text, and at least a portion of a speaker identifier input indicating an identity of the speaker corresponding to the training audio to train a segmentation model to output segmented utterances by identifying phoneme boundaries in the training audio by aligning it with the corresponding phonemes; 

using the phonemes corresponding to the input training text, the segmented utterances obtained from a segmentation model, and at least a portion of the speaker identifier input indicating an identity of the speaker to train the duration model to output phoneme durations of the phonemes in the segmented utterances; 

using the training audio, the phonemes corresponding to the input training text, one or more frequency profiles of the training audio, and the at least a portion of speaker identifier input indicating an identity of the speaker to train the frequency model to output one or more frequency profiles; and 

using the training audio, the phonemes corresponding to the input training text, the segmented utterances, the one or more frequency profiles, and the at least a portion of speaker identifier input indicating an identity of the speaker to train the vocal model to output a signal representing synthesized speech, in which the synthesized speech has audio characteristics corresponding to the speaker.  

1. A computer-implemented method for synthesizing speech from text, comprising:

converting an input text to phonemes corresponding to the input text; 

inputting the phonemes and a duration speaker identifier into a trained duration model that uses the phonemes and the duration speaker identifier to output phoneme durations corresponding to the phonemes of the input text and the duration speaker identifier; 

inputting the phonemes, the phoneme durations, and a frequency speaker identifier into a trained frequency model to predict frequency profiles for the phonemes, in which a frequency profile for a phoneme comprises: 
     a probability that the phoneme is voiced; and 
     a fundamental frequency profile; and 

using a trained vocal model that receives as an input a vocal speaker identifier, the phonemes, the phoneme durations, and the frequency profiles for the phonemes to synthesize a signal representing synthesized speech of the input text, in which the synthesized speech has audio characteristics corresponding to the duration speaker identifier, the frequency speaker identifier, and the vocal speaker identifier; 




wherein the duration speaker identifier, the frequency speaker identifier, and the vocal speaker identifier comprise a shared speaker embedding representation and, when required for a site-specific use in the trained duration model, the trained frequency model, or the trained vocal model, respectively, is transformed to an appropriate dimension and form.

2. The computer-implemented method of claim 1 
wherein, for the trained duration model, a first site-specific embedding representation of the duration speaker identifier is used to initialize a neural network's (NN) hidden states and a second site-specific embedding representation of the duration speaker identifier is provided as input to a first NN layer by concatenating it with feature representations of the phonemes.

3. The computer-implemented method of claim 1 
wherein, for the trained frequency model, one or more recurrent layers are initialized with a first site-specific speaker embedding representation of the frequency speaker identifier and a fundamental frequency prediction of a phoneme is computing using a second site-specific speaker embedding of the frequency speaker identifier and trained model parameters.

4. The computer-implemented method of claim 1 
wherein, for the trained vocal model, a site-specific speaker embedding representation of the vocal speaker identifier is concatenated onto each input frame in a conditioner network of the trained vocal model.

6. The computer-implemented method of claim 1 wherein 
the trained duration model, the trained frequency model, and the trained vocal model were obtained by training a duration model, a frequency model, and a vocal model by performing steps comprising: 

converting an input training text to phonemes corresponding to the input training text, which is a transcription corresponding to training audio comprising utterances of a speaker, to phonemes corresponding to the input training text and training audio; 

using the training audio, the phonemes corresponding to the input training text, and at least a portion of a speaker identifier input indicating an identity of the speaker corresponding to the training audio to train a segmentation model to output segmented utterances by identifying phoneme boundaries in the training audio by aligning it with the corresponding phonemes; 

using the phonemes corresponding to the input training text, the segmented utterances obtained from a segmentation model, and at least a portion of the speaker identifier input indicating an identity of the speaker to train the duration model to output phoneme durations of the phonemes in the segmented utterances; 

using the training audio, the phonemes corresponding to the input training text, one or more frequency profiles of the training audio, and the at least a portion of speaker identifier input indicating an identity of the speaker to train the frequency model to output one or more frequency profiles; and 

using the training audio, the phonemes corresponding to the input training text, the segmented utterances, the one or more frequency profiles, and the at least a portion of speaker identifier input indicating an identity of the speaker to train the vocal model to output a signal representing synthesized speech, in which the synthesized speech has audio characteristics corresponding to the speaker.



Based on the above comparison, it is noted that each limitation of the claims of the instant application is read on, or anticipated by, the corresponding limitation of the claims of P669.  
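For illustration only, the limitation common to claim 11 of the instant application and claim 1 of P669 (a shared speaker embedding that is transformed to an appropriate dimension and form at each site of use) may be sketched as follows. The use of a linear projection as the transformation, the site names, and all numeric values are assumptions made for this sketch and are not recited in the claims.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]

# Hypothetical shared speaker embeddings, one per speaker identifier.
SHARED_EMBEDDINGS: Dict[int, Vector] = {
    0: [1.0, 0.0, 2.0],
    1: [0.0, 1.0, 1.0],
}

# Hypothetical per-site projection matrices; each maps the shared
# 3-dimensional embedding to the dimension needed at that site.
SITE_PROJECTIONS: Dict[str, Matrix] = {
    "duration_hidden_init": [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
    "vocal_conditioner": [[1.0, 1.0, 0.0]],
}


def site_specific_embedding(speaker_id: int, site: str) -> Vector:
    """Projects the shared embedding to the dimension required at one site."""
    shared = SHARED_EMBEDDINGS[speaker_id]
    return [sum(w * x for w, x in zip(row, shared))
            for row in SITE_PROJECTIONS[site]]


def concatenate_onto_frames(frames: List[Vector], site_embedding: Vector) -> List[Vector]:
    """Appends the site-specific embedding to every input frame."""
    return [frame + site_embedding for frame in frames]
```

Under these assumptions, `site_specific_embedding(0, "duration_hidden_init")` returns `[1.0, 2.0]`, which could serve as an initial hidden state, while the one-dimensional `"vocal_conditioner"` embedding is appended to each conditioner input frame by `concatenate_onto_frames`.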

Allowable Subject Matter
Claims 1-9 and 16-20 are allowed.

It is noted that a prior art search has been conducted by the examiner (see attached search report and PTO-892 form).
 
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to QI HAN whose telephone number is (571)272-7604.  The examiner can normally be reached on 9-19:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir, can be reached at 571-272-7799.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
 

QH/qh
July 2, 2022
/QI HAN/Primary Examiner, Art Unit 2659