DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statements (IDS) were submitted on 06/02/2021, 08/27/2021, and 02/14/2022.  The submissions are in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statements are being considered by the examiner.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1, 10, and 19 are directed to an abstract idea. The claims recite “determining a phoneme feature and a prosodic word boundary feature of sample text data” and “inserting a pause character into the phoneme feature according to the prosodic word boundary feature to obtain a combined feature of the sample text data.” The claim elements under their broadest reasonable interpretation cover the concepts of determining a phoneme feature and a prosodic word boundary feature of text and inserting a pause character into the phoneme feature. These elements are mental processes and can be performed in the human mind or with pen and paper by a person reading a text, determining the phonemes and prosodic word boundaries of words in the document, and writing a pause character with a pen after phonemes in the document (see MPEP § 2106.04(a)(2), subsection III).
However, this judicial exception is integrated into a practical application. The claims also recite “training an initial speech synthesis model according to the combined feature of the sample text data, to obtain a target speech synthesis model.” This element is meaningful because it limits the use of the abstract idea to the practical application of creating a specific speech synthesis model. Therefore the claims are patent eligible.

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1-3 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Yang et al. (“Unsupervised Prosodic Labeling of Speech Synthesis Databases Using Context-Dependent HMMs”), hereinafter Yang.

Regarding claim 1, Yang teaches a method for training a speech synthesis model (summary, page 1449, lines 1-5), comprising: 
determining a phoneme feature (Sect. 3.1 para. 2, page 1452, lines 1-4; the text analysis module determines the phonetic labels, i.e. phoneme feature, for the utterances in the speech database, i.e. the sample text data) and a prosodic word boundary feature of sample text data (Sect. 3.1 para. 1, pages 1451-1452, lines 1-9; the prosodic word boundaries are determined); 
inserting a pause character into the phoneme feature according to the prosodic word boundary feature to obtain a combined feature of the sample text data (Sect. 3.1 para. 2, page 1452, lines 4-7; a phonetic symbol, “sp,” standing for short pause, i.e. a pause character, is inserted into the phonetic transcription at each prosodic word. The phonetic transcription comprising the phonetic and prosodic labels and the pause symbol is considered to be the combined feature of the sample text data); and
training an initial speech synthesis model according to the combined feature of the sample text data, to obtain a target speech synthesis model (Sect. 3.2 para. 1, page 1452, lines 1-2; the speech synthesis model is trained according to the context features listed in Table 1, which correspond to the elements of the combined feature of the sample text data as detailed above. Sect. 5.1.3 para. 1, page 1456, lines 1-4; speech synthesis systems, i.e. target speech synthesis models, are constructed according to the training described above).
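
For illustration only, the pause-insertion limitation as mapped above can be sketched as follows. This is a minimal sketch and not code from Yang or from the application: the function name `insert_pauses`, the phoneme labels, and the choice to place a pause symbol after every prosodic word (including the last) are assumptions made for the example. Only the symbol “sp” is taken from the cited passage.

```python
from typing import List

PAUSE = "sp"  # short-pause symbol named in the cited passage of Yang


def insert_pauses(prosodic_words: List[List[str]], pause: str = PAUSE) -> List[str]:
    """Flatten per-prosodic-word phoneme lists into one sequence,
    appending a pause symbol after each prosodic word (assumed placement)."""
    combined: List[str] = []
    for phonemes in prosodic_words:
        combined.extend(phonemes)  # phoneme feature of this prosodic word
        combined.append(pause)     # pause character at the word boundary
    return combined


# Hypothetical sample: two prosodic words with invented phoneme labels.
sample_words = [["n", "i"], ["h", "ao"]]
combined_feature = insert_pauses(sample_words)
```

Here the prosodic word boundary feature is represented implicitly by the grouping of phonemes into per-word lists; other encodings (for example, a list of boundary indices) would serve equally well.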

Regarding claim 2, Yang further teaches wherein the inserting the pause character into the phoneme feature according to the prosodic word boundary feature to obtain the combined feature of the sample text data comprises: 
determining a prosodic word position in the phoneme feature according to the prosodic word boundary feature (Sect. 3.2 para. 1, pages 1451-1452; the system determines whether each prosodic word boundary is a prosodic phrase boundary or not, i.e. if the prosodic word occurs at the end of the phrase or at some other location in the phrase, which is the prosodic word’s position in the phoneme feature); and
inserting the pause character at the prosodic word position to obtain the combined feature of the sample text data (Sect. 3.1 para. 2, page 1452, lines 4-7; a phonetic symbol, “sp,” standing for short pause, i.e. a pause character, is inserted into the phonetic transcription at each prosodic word. The phonetic transcription comprising the phonetic and prosodic labels and the pause symbol is considered to be the combined feature of the sample text data).

Regarding claim 3, Yang further teaches wherein the training the initial speech synthesis model according to the combined feature of the sample text data comprises: 
determining a pause hidden feature distribution according to the combined feature and an acoustic feature of sample audio data, the sample audio data being associated with the sample text data (Sect. 3.1 para. 2, page 1452, lines 10-28; the system performs a state alignment between the phonetic transcription comprising the phonetic and prosodic labels and the pause symbol, which are considered to be the combined feature as detailed above, and the acoustic features [Sect. 2.2 para. 1, page 1451, lines 1-6; the method uses acoustic features extracted from the speech waveforms of the labeled utterance], i.e. acoustic features of sample audio data, the sample audio data being associated with the sample text data, to determine the duration of the pauses at each prosodic word boundary to establish a relationship between the pause durations and different classes of word boundaries. The relationship between pause duration and word boundary class is considered to be a pause hidden feature and the division of prosodic word boundaries into classes according to the pause hidden feature is considered to be a pause hidden feature distribution); and
performing unsupervised training on the initial speech synthesis model according to the combined feature and the pause hidden feature distribution (Sect. 5.1.1 “CD-HMM-based unsupervised labeling”, page 1454, lines 1-9; unsupervised model training is performed using the methods for determining the combined feature and the pause hidden feature distribution described above on sample utterances).

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 4-20 are rejected under 35 U.S.C. 103 as being unpatentable over Yang in view of Zhang et al. (Doc. ID US 20200380952 A1), hereinafter Zhang.

Regarding claim 4, Yang teaches the method according to claim 3, wherein the determining the pause hidden feature distribution according to the combined feature and the acoustic feature of sample audio data comprises:
aligning the combined feature and the acoustic feature of the sample audio data to obtain an acoustic feature of the pause character in the combined feature (Sect. 3.1 para. 2, page 1452, lines 10-28; the system performs a state alignment between the phonetic transcription comprising the phonetic and prosodic labels and the pause symbol, which are considered to be the combined feature as detailed above, and the acoustic features [Sect. 2.2 para. 1, page 1451, lines 1-6; the method uses acoustic features extracted from the speech waveforms of the labeled utterance], i.e. acoustic features of sample audio data, to determine the duration of the pause symbol “sp,” i.e. an acoustic feature of the pause character in the combined feature).
Yang however fails to teach processing the acoustic feature of the pause character through a variational auto-encoder to obtain the pause hidden feature distribution. Zhang teaches a method for synthesizing speech from an input text sequence (Spec. page 1, [0005], lines 1-2). Zhang further teaches that the system includes a variational autoencoder which encodes latent factors, such as prosody and background noise, from input audio features into a latent embedding, e.g. a target mel spectrogram representation of the input training utterance (Spec. pages 4-5, [0037]).
Adapting Yang’s method of aligning the combined feature and the acoustic feature of the sample audio data to use the variational autoencoder of Zhang produces the method according to claim 3, wherein the determining the pause hidden feature distribution according to the combined feature and the acoustic feature of sample audio data comprises:
aligning the combined feature and the acoustic feature of the sample audio data to obtain an acoustic feature of the pause character in the combined feature (Sect. 3.1 para. 2, page 1452, lines 10-28; the system performs a state alignment between the phonetic transcription comprising the phonetic and prosodic labels and the pause symbol, which are considered to be the combined feature as detailed above, and the acoustic features [Yang, Sect. 2.2 para. 1, page 1451, lines 1-6; the method uses acoustic features extracted from the speech waveforms of the labeled utterance], i.e. acoustic features of sample audio data, to determine the duration of the pause symbol “sp,” i.e. an acoustic feature of the pause character in the combined feature); and
processing the acoustic feature of the pause character through a variational auto-encoder to obtain the pause hidden feature distribution (the system of Yang, now adapted to use the variational autoencoder of Zhang to process the acoustic features and generate a mel spectrogram representation of the pauses in the sample audio data; Zhang, Spec. pages 4-5, [0037]; the system includes a variational autoencoder which encodes latent factors, such as prosody and background noise, from input audio features into a latent embedding, e.g. a target mel spectrogram representation of the input training utterance. Pauses can be considered to be latent factors, and thus the mel spectrogram representation is considered to be a pause hidden feature distribution).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Yang to incorporate the teachings of Zhang. Both Yang and Zhang are directed to speech synthesis models. Further, Yang recognizes the importance of precision in the labeling of speech databases for the construction of speech synthesis systems to improve the naturalness of the synthesized speech (Sect. 1 para. 1, page 1449). Zhang recognizes that latent factors such as prosody are generally not well represented in the input text sequences representing training utterances and would thus be missed in the conditioning of the decoder during training (Spec. page 4, [0037], lines 1-14), and provides a solution in the form of the variational autoencoder which includes the latent factors in the training, which may result in increased precision in recognizing these factors in target inputs. Therefore, it would have been predictable to one of ordinary skill in the art at the time of filing to combine the disclosures.
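
For illustration only, the general operation of a variational auto-encoder's encoder, which maps an input feature to the parameters of a latent distribution, can be sketched as follows. This is a generic textbook-style sketch under stated assumptions and does not reproduce the architecture of Zhang or of the application: the single linear layer, the diagonal Gaussian latent, and all names (`linear`, `encode_pause`, `kl_to_standard_normal`) and weights are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def linear(x: Vector, w: Matrix, b: Vector) -> Vector:
    """Affine map: one output value per row of w."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def encode_pause(x: Vector, w_mu: Matrix, b_mu: Vector,
                 w_logvar: Matrix, b_logvar: Vector) -> Tuple[Vector, Vector]:
    """Return (mean, log-variance) of a diagonal Gaussian over the latent,
    given an acoustic feature vector x of a pause segment (assumed input)."""
    return linear(x, w_mu, b_mu), linear(x, w_logvar, b_logvar)


def kl_to_standard_normal(mu: Vector, logvar: Vector) -> float:
    """KL(N(mu, sigma^2) || N(0, I)), the usual VAE regularization term."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))


# Hypothetical 2-dimensional pause feature and 1-dimensional latent.
mu, logvar = encode_pause([1.0, 2.0], [[0.5, 0.5]], [0.0], [[0.0, 0.0]], [0.0])
```

In this sketch the pair `(mu, logvar)` plays the role of a hidden feature distribution; in a trained model the weights would be learned rather than fixed by hand.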

Regarding claim 5, Yang teaches the method according to claim 3 as detailed above, however Yang fails to teach wherein the performing unsupervised training on the initial speech synthesis model according to the combined feature and the pause hidden feature distribution comprises: determining a combined feature vector representation according to the combined feature; performing sampling on the pause hidden feature distribution to obtain a pause hidden feature; using the combined feature vector representation as an input of a decoder in the initial speech synthesis model, and concatenating an output of the decoder and the pause hidden feature to obtain a context vector; and encoding the context vector through an encoder in the initial speech synthesis model to obtain an acoustic feature outputted by the initial speech synthesis model.
Zhang teaches a method for synthesizing speech from an input text sequence (Spec. page 1, [0005], lines 1-2). Zhang further teaches that the system includes a text encoder, and in particular teaches 
determining a combined feature vector representation according to the combined feature (Spec. page 3, [0025], lines 3-8; the system includes a text encoder that receives a feature representation of input text and generates a context vector representation of it, i.e. the combined feature vector); 
performing sampling on the pause hidden feature distribution to obtain a pause hidden feature (Spec. page 3, [0025], lines 10-14; a decoder outputs a mel-frequency spectrogram comprised of frames, each frame representing a sample of the input signal, thus the frames can represent a feature, e.g. the pause hidden feature, of the input signal sampled from the mel-frequency spectrogram, which can be considered to represent a distribution of features);
using the combined feature vector representation as an input of a decoder (Spec. page 3, [0026], lines 1-3; the decoder neural network receives as input the context vectors, i.e. the combined feature vector) in the initial speech synthesis model, and concatenating an output of the decoder and the pause hidden feature to obtain a context vector (Spec. page 4, [0031], lines 1-4; the pre-net passes through a mel-frequency spectrogram comprised of frames, similar to the frame representing the pause hidden feature as detailed above. [0032], lines 1-6; the decoder architecture includes a Long Short-Term Memory (LSTM) subnetwork. The LSTM subnetwork receives a concatenation of the output of the pre-net [the mel-frequency spectrogram frame, i.e. the pause hidden feature as described above] and the combined feature vector. The concatenation is considered to be the context vector of the claim limitation); and
encoding the context vector through an encoder in the initial speech synthesis model to obtain an acoustic feature outputted by the initial speech synthesis model (Spec. page 4, [0032], lines 8-10; the decoder neural network includes a linear projection which takes as input the context vector output of the LSTM subnetwork and outputs a mel-frequency spectrogram prediction, i.e. an acoustic feature).
Adapting the method as taught by Yang for performing unsupervised training on the initial speech synthesis model according to the combined feature and the pause hidden feature distribution to incorporate these features provides the claimed features of claim 5. It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Yang to incorporate the teachings of Zhang. Both Yang and Zhang are directed to speech synthesis models. Yang describes a method for training the model using the acoustic features extracted from sample speech waveforms combined with the phonetic and prosodic labeling (Sect. 2.3, page 1451). Zhang discloses the details on obtaining those acoustic features necessary for the model to train on. Therefore, it would have been obvious to combine the features of both disclosures to provide the acoustic features. 
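
For illustration only, the claimed steps of sampling a pause hidden feature from a distribution and concatenating it with another vector to form a context vector can be sketched as follows. This is a minimal sketch following the wording of the claim limitation, not an implementation from Yang, Zhang, or the application: the Gaussian parameterization, the reparameterized sampling, and the names `sample_latent` and `build_context` are assumptions.

```python
import math
import random
from typing import List


def sample_latent(mu: List[float], logvar: List[float],
                  rng: random.Random) -> List[float]:
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), per latent dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]


def build_context(first_stage_output: List[float],
                  pause_hidden: List[float]) -> List[float]:
    """Concatenate the first-stage output with the sampled pause hidden
    feature; the result stands in for the claimed context vector."""
    return list(first_stage_output) + list(pause_hidden)


# Hypothetical values: a 2-dimensional first-stage output and a
# 1-dimensional latent with near-zero variance.
rng = random.Random(0)
z = sample_latent([0.3], [-1000.0], rng)
context = build_context([1.0, 2.0], z)
```

The near-zero variance in the example makes the sample effectively equal to the mean, which keeps the illustration deterministic; with a larger variance the sampled value would differ from draw to draw.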

Regarding claim 6, Yang teaches the method according to claim 1 as detailed above, however Yang fails to teach training an initial vocoder by using an output of the initial speech synthesis model and sample audio data, to obtain a target vocoder.
Zhang teaches a method for synthesizing speech from an input text sequence (Spec. page 1, [0005], lines 1-2). Zhang further teaches the use of a vocoder as part of a speech synthesis system to synthesize the speech (Spec. page 3, [0024]). Adapting the teachings of Yang with the disclosure of Zhang provides the method according to claim 1, further comprising:
training an initial vocoder (Zhang, Spec. page 6, [0046], lines 14-19) by using an output of the initial speech synthesis model and sample audio data (Yang, Sect. 2.1 para. 1, page 1450; a new speech synthesis model, i.e. an initial speech synthesis model, is trained iteratively using acoustic features from sample audio data. As the process is iterative, the input for the second iteration is the output of the initial speech synthesis model. The model is adapted to use the vocoder from Zhang for speech synthesis), to obtain a target vocoder (the result of the vocoder training as detailed in Zhang above).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Yang to incorporate the teachings of Zhang. Both Yang and Zhang are directed to speech synthesis models. Yang describes synthesizing utterances using the speech synthesis model (Sect. 5.1.3 para. 1, page 1456, lines 6-9) but does not detail a structure for performing the synthesis within the model. Zhang however provides that the wave synthesizer for the speech synthesis model may be a vocoder, and moreover details that the use of a vocoder produces high fidelity audio (Spec. page 6, [0046], lines 19-22). Therefore it would have been obvious to combine the features of both disclosures to provide higher quality synthesized speech.

Regarding claim 7, Yang teaches a method for synthesizing a speech, comprising: 
determining a phoneme feature (Sect. 5.1.2 para. 3, page 1456, lines 17-22; the labeling results of the models run after the training were compared, thus the processes described during the training of the model above with sample text data were repeated with the trained models on target input utterances; Sect. 3.1 para. 2, page 1452, lines 1-4; the text analysis module determines the phonetic labels, i.e. phoneme feature, for the utterances to be synthesized, i.e. the target text data) and a prosodic word boundary feature of target text data (Sect. 3.1 para. 1, pages 1451-1452, lines 1-9; the prosodic word boundaries are determined); and
inserting a pause character into the phoneme feature according to the prosodic word boundary feature to obtain a combined feature of the target text data (Sect. 3.1 para. 2, page 1452, lines 4-7; a phonetic symbol, “sp,” standing for short pause, i.e. a pause character, is inserted into the phonetic transcription at each prosodic word. The phonetic transcription comprising the phonetic and prosodic labels and the pause symbol is considered to be the combined feature of the target text data).
However, Yang fails to teach obtaining, based on the target speech synthesis model obtained according to claim 1, an acoustic feature according to the combined feature of the target text data, and synthesizing a target speech by using the acoustic feature. Zhang teaches a method for synthesizing speech from an input text sequence (Spec. page 1, [0005], lines 1-2). Zhang further teaches obtaining a mel spectrogram, i.e. an acoustic feature, from the input text to be synthesized and synthesizing a target speech by using the acoustic feature (Spec. pages 2-3, [0024]). Adapting the teachings of Yang with the disclosure of Zhang provides a method for synthesizing a speech (Sect. 5.1.3 para. 1, page 1456, lines 6-9), comprising:
determining a phoneme feature (Sect. 5.1.2 para. 3, page 1456, lines 17-22; the labeling results of the models run after the training were compared, thus the processes described during the training of the model above with sample text data were repeated with the trained models on target input utterances; Sect. 3.1 para. 2, page 1452, lines 1-4; the text analysis module determines the phonetic labels, i.e. phoneme feature, for the utterances to be synthesized, i.e. the target text data) and a prosodic word boundary feature of target text data (Sect. 3.1 para. 1, pages 1451-1452, lines 1-9; the prosodic word boundaries are determined);
inserting a pause character into the phoneme feature according to the prosodic word boundary feature to obtain a combined feature of the target text data (Sect. 3.1 para. 2, page 1452, lines 4-7; a phonetic symbol, “sp,” standing for short pause, i.e. a pause character, is inserted into the phonetic transcription at each prosodic word. The phonetic transcription comprising the phonetic and prosodic labels and the pause symbol is considered to be the combined feature of the target text data); and
obtaining, based on the target speech synthesis model obtained according to claim 1, an acoustic feature according to the combined feature of the target text data, and synthesizing a target speech by using the acoustic feature (the method of Yang, now adapted to use the system of Zhang, Spec. pages 2-3, [0024], to obtain a mel spectrogram, i.e. an acoustic feature, from input text to be synthesized and synthesize a target speech by using the acoustic feature).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Yang to incorporate the teachings of Zhang. Both Yang and Zhang are directed to speech synthesis models. Yang describes training the model in detail and touches on synthesizing utterances using the trained speech synthesis model (Sect. 5.1.3 para. 1, page 1456, lines 6-9) but does not detail obtaining the acoustic feature of the target text data nor synthesizing speech from it at run-time. Zhang however provides detail on the process the speech synthesis system undergoes at run-time (Spec. page 3, [0024], lines 19-24; the system synthesizes speech from input text). Therefore it would have been obvious to combine the features of both disclosures to synthesize target speech from target text data.

Regarding claim 8, in addition to the elements stated above regarding claim 7, the combination of Yang and Zhang further teaches wherein the inserting the pause character into the phoneme feature according to the prosodic word boundary feature to obtain the combined feature of the target text data comprises:
determining a prosodic word position in the phoneme feature according to the prosodic word boundary feature (Sect. 3.2 para. 1, pages 1451-1452; the system determines whether each prosodic word boundary is a prosodic phrase boundary or not, i.e. if the prosodic word occurs at the end of the phrase or at some other location in the phrase, which is the prosodic word’s position in the phoneme feature); and 
inserting the pause character at the prosodic word position to obtain the combined feature of the target text data (Sect. 3.1 para. 2, page 1452, lines 4-7; a phonetic symbol, “sp,” standing for short pause, i.e. a pause character, is inserted into the phonetic transcription at each prosodic word. The phonetic transcription comprising the phonetic and prosodic labels and the pause symbol is considered to be the combined feature of the target text data. Sect. 5.1.2 para. 3, page 1456, lines 17-22; the labeling results of the models run after the training were compared, thus the processes described during the training of the model above were repeated with the trained models on target input utterances).

Regarding claim 9, in addition to the elements stated above regarding claim 7, the combination of Yang and Zhang further teaches wherein the obtaining, based on the target speech synthesis model, the acoustic feature according to the combined feature of the target text data comprises:
determining a target pause hidden feature according to a target pause duration desired by a user and an association relationship between a pause duration and a pause hidden feature, the association relationship being obtained at a training stage of the target speech synthesis model (Sect. 5.1.2 para. 3, page 1456, lines 17-22; the labeling results of the models run after the training were compared, thus the processes described during the training of the model above were repeated with the trained models on target input utterances; Sect. 3.1 para. 2, page 1452, lines 10-28; as detailed above with respect to claim 3, during the training stage, the system performs a state alignment between the combined feature and the acoustic features to determine the duration of the pauses at each prosodic word boundary to establish a relationship between the pause durations and different classes of word boundaries, i.e. a pause hidden feature. A normalization is applied during the determination of the pause duration to account for other features which may affect the pause duration, which can be considered to be determining the pause hidden feature according to a target pause duration desired by a user applying the normalization); and
obtaining, based on the target speech synthesis model, the acoustic feature according to the combined feature of the target text data and the target pause hidden feature (Sect. 3.1 para. 2, page 1452, lines 10-28; the system obtains the duration of the pause symbol “sp,” i.e. an acoustic feature, according to the combined feature and the pause hidden feature as detailed above with respect to claim 4).

Regarding claim 10, the claim is directed to an electronic device, comprising: 
at least one processor; and 
a memory, communicatively connected with the at least one processor, 
the memory storing instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, causing the at least one processor to perform operations, the operations comprising the claimed method of claim 1. While Yang does not disclose a processor and memory communicatively connected with the processor storing instructions, Yang does disclose all of the claimed functions of claim 1 as detailed above.
Zhang discloses an electronic device, comprising: 
at least one processor (Spec. page 7, [0061], lines 1-6); and 
a memory, communicatively connected with the at least one processor (Spec. page 7, [0061], lines 1-9), 
the memory storing instructions executable by the at least one processor (Spec. page 7, [0060]-[0061]), and the instructions, when executed by the at least one processor, causing the at least one processor to perform the disclosed methods for training a model and synthesizing speech.
Adapting Yang to incorporate the electronic device of Zhang discloses an electronic device, comprising: 
at least one processor (Spec. page 7, [0061], lines 1-6); and 
a memory, communicatively connected with the at least one processor (Spec. page 7, [0061], lines 1-9), 
the memory storing instructions executable by the at least one processor (Spec. page 7, [0060]-[0061]), and the instructions, when executed by the at least one processor, causing the at least one processor to perform operations, the operations comprising the claimed method of claim 1 (the method of Yang as detailed above with respect to claim 1, now adapted to operate on the electronic device disclosed by Zhang).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Yang to incorporate the teachings of Zhang. Both Yang and Zhang are directed to speech synthesis models. Yang describes training the model in detail and touches on synthesizing utterances using the trained speech synthesis model (Sect. 5.1.3 para. 1, page 1456, lines 6-9) but does not detail the structure on which the methods and model operate. Zhang discloses the structure for carrying out the methods of training a model and synthesizing speech. Therefore it would have been obvious to combine the features of both disclosures to produce an electronic device for training an initial speech synthesis model according to the combined feature of the sample text data, to obtain a target speech synthesis model.

Regarding claim 11, the claim is directed to the electronic device according to claim 10 for performing the claimed method of claim 2, and is rejected on the same grounds.

Regarding claim 12, the claim is directed to the electronic device according to claim 10 for performing the claimed method of claim 3, and is rejected on the same grounds.

Regarding claim 13, the claim is directed to the electronic device according to claim 12 for performing the claimed method of claim 4, and is rejected on the same grounds.

Regarding claim 14, the claim is directed to the electronic device according to claim 12 for performing the claimed method of claim 5, and is rejected on the same grounds.

Regarding claim 15, the claim is directed to the electronic device according to claim 10 for performing the claimed method of claim 6, and is rejected on the same grounds.

Regarding claim 16, the claim is directed to an electronic device, comprising:
at least one processor; and 
a memory, communicatively connected with the at least one processor, 
the memory storing instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, causing the at least one processor to perform operations, the operations comprising the claimed method of claim 7. While Yang does not disclose a processor and memory communicatively connected with the processor storing instructions, Yang does disclose all of the claimed functions of claim 7 as detailed above.
Zhang discloses an electronic device, comprising: 
at least one processor (Spec. page 7, [0061], lines 1-6); and 
a memory, communicatively connected with the at least one processor (Spec. page 7, [0061], lines 1-9), 
the memory storing instructions executable by the at least one processor (Spec. page 7, [0060]-[0061]), and the instructions, when executed by the at least one processor, causing the at least one processor to perform the disclosed methods for training a model and synthesizing speech.
Adapting Yang to incorporate the electronic device of Zhang discloses an electronic device, comprising: 
at least one processor (Spec. page 7, [0061], lines 1-6); and 
a memory, communicatively connected with the at least one processor (Spec. page 7, [0061], lines 1-9), 
the memory storing instructions executable by the at least one processor (Spec. page 7, [0060-61), and the instructions, when executed by the at least one processor, causing the at least one processor to perform operations, the operations comprising the claimed method of claim 7 (the method of Yang as detailed above with respect to claim 7, now adapted to operate on the electronic device disclosed by Zhang).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Yang to incorporate the teachings of Zhang. Both Yang and Zhang are directed to speech synthesis models. Yang describes training the model in detail and touches on synthesizing utterances using the trained speech synthesis model (Sect. 5.1.3 para. 1, page 1456, lines 6-9) but does not detail the structure on which the methods and model operate. Zhang discloses the structure for carrying out the methods of training a model and synthesizing speech. Therefore it would have been obvious to combine the features of both disclosures to produce an electronic device for obtaining, based on the target speech synthesis model obtained according to claim 1, an acoustic feature according to the combined feature of the target text data, and synthesizing a target speech by using the acoustic feature.


Regarding claim 17, the claim is directed to the electronic device according to claim 16 for performing the claimed method of claim 8, and is rejected on the same grounds.

Regarding claim 18, the claim is directed to the electronic device according to claim 16 for performing the claimed method of claim 9, and is rejected on the same grounds.

Regarding claim 19, the claim is directed to a non-transitory computer readable storage medium, storing computer instructions, the computer instructions being used to cause a computer to perform operations, the operations comprising the claimed method of claim 1. While Yang does not disclose a non-transitory computer readable storage medium, storing computer instructions, Yang does disclose all of the claimed functions of claim 1 as detailed above.
Zhang discloses a non-transitory computer readable storage medium, storing computer instructions, the computer instructions being used to cause a computer to perform the disclosed methods for training a model and synthesizing speech (Spec. page 7, [0060], [0062]).
Adapting Yang to incorporate the non-transitory computer readable storage medium of Zhang discloses a non-transitory computer readable storage medium, storing computer instructions, the computer instructions being used to cause a computer to perform operations (Spec. page 7, [0060], [0062]), the operations comprising the claimed method of claim 1 (the method of Yang as detailed above with respect to claim 1, now adapted to operate with the non-transitory computer readable storage medium disclosed by Zhang).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Yang to incorporate the teachings of Zhang. Both Yang and Zhang are directed to speech synthesis models. Yang describes training the model in detail and touches on synthesizing utterances using the trained speech synthesis model (Sect. 5.1.3 para. 1, page 1456, lines 6-9) but does not detail the structure on which the methods and model operate. Zhang discloses the structure for carrying out the methods of training a model and synthesizing speech. Therefore it would have been obvious to combine the features of both disclosures to produce a non-transitory computer readable storage medium for training an initial speech synthesis model according to the combined feature of the sample text data, to obtain a target speech synthesis model.


Regarding claim 20, the claim is directed to a non-transitory computer readable storage medium, storing computer instructions, the computer instructions being used to cause a computer to perform operations, the operations comprising the claimed method of claim 7. While Yang does not disclose a non-transitory computer readable storage medium, storing computer instructions, Yang does disclose all of the claimed functions of claim 7 as detailed above.
Zhang discloses a non-transitory computer readable storage medium, storing computer instructions, the computer instructions being used to cause a computer to perform the disclosed methods for training a model and synthesizing speech (Spec. page 7, [0060], [0062]).
Adapting Yang to incorporate the non-transitory computer readable storage medium of Zhang discloses a non-transitory computer readable storage medium, storing computer instructions, the computer instructions being used to cause a computer to perform operations (Spec. page 7, [0060], [0062]), the operations comprising the claimed method of claim 7 (the method of Yang as detailed above with respect to claim 7, now adapted to operate with the non-transitory computer readable storage medium disclosed by Zhang).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Yang to incorporate the teachings of Zhang. Both Yang and Zhang are directed to speech synthesis models. Yang describes training the model in detail and touches on synthesizing utterances using the trained speech synthesis model (Sect. 5.1.3 para. 1, page 1456, lines 6-9) but does not detail the structure on which the methods and model operate. Zhang discloses the structure for carrying out the methods of training a model and synthesizing speech. Therefore it would have been obvious to combine the features of both disclosures to produce a non-transitory computer readable storage medium for obtaining, based on the target speech synthesis model obtained according to claim 1, an acoustic feature according to the combined feature of the target text data, and synthesizing a target speech by using the acoustic feature.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Singh et al. (“Automatic Pause Marking for Speech Synthesis”) teaches an automatic system for inserting a pause tag after word boundaries for the training of speech synthesis models for improved quality of speech synthesis (Abstract, page 1790).
Zhang et al. (Doc. ID. CN 110534089 A) teaches a method for training a model and synthesizing speech by marking text input using phoneme and prosodic information (Abstract). 
Arik et al. (Pub. No. US 2019/0122651 A1) teaches architecture embodiments for speech synthesis (Spec. page 2, [0024], lines 1-2) with which input text is preprocessed by inserting pause characters (Spec. pages 3-4, [0046], [0050]).
한민수 et al. (Doc. ID. KR 100959494 B1) teaches a method for providing a speech synthesizer which can synthesize non-registered words by processing input text according to prosodic and word boundaries and inserting break information at the boundaries of the non-registered word (Abstract).
Chen et al. (Doc. ID CN 1604183 A) teaches a method for automatically identifying a natural speech pause in a text string for text-to-speech conversion comprising analyzing the text and inserting the natural speech pause into the text string of the synthesized voice signal output representation (Abstract).
Chicote et al. (Doc. ID US 10475438 B1) teaches systems and methods for performing text-to-speech synthesis on text data (Spec. Col. 3, lines 34-37) comprising inserting pauses at breaks in the text (Spec. Col. 13, lines 10-28). 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to PARKER L MAYFIELD whose telephone number is (571)272-4745. The examiner can normally be reached Monday - Friday 7:30 AM-5:30 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders, can be reached at (571)272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/PARKER L MAYFIELD/
Examiner
Art Unit 2655


/ANDREW C FLANDERS/Supervisory Patent Examiner, Art Unit 2655