EXAMINER’S AMENDMENT
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This Office Action is in response to the amendment filed July 27, 2022. Claims 1, 7, and 13 have been amended. Claims 3, 4, 9, 10, 15, and 16 have been cancelled. Claims 1, 2, 5-8, 11-14, and 17-18 remain pending.

An examiner’s amendment to the record appears below. Should the changes and/or additions be unacceptable to applicant, an amendment may be filed as provided by 37 CFR 1.312. To ensure consideration of such an amendment, it MUST be submitted no later than the payment of the issue fee.
Authorization for this examiner’s amendment was given in an interview with Rebecca Rudolph on August 23, 2022.

The application has been amended as follows: 
IN THE CLAIMS:  the claims have been amended as follows, where deletions are shown as underlined.

The following listing of claims will replace all prior versions and listings of claims in the application.


1. (Currently Amended) A computer-implemented method for training a speech synthesis model, comprising: 
taking a syllable input sequence, a phoneme input sequence and a Chinese character input sequence of a current sample as inputs of an encoder of the speech synthesis model to be trained, to obtain encoded representations of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence at an output end of the encoder by:
inputting the syllable input sequence, the phoneme input sequence and the Chinese character input sequence to three independent convolution layer transformation modules, respectively; 
obtaining a convolutional-transformed syllable input sequence, a convolutional-transformed phoneme input sequence and a convolutional-transformed Chinese character input sequence at output ends of the three independent convolution layer transformation modules, respectively; and
taking the convolutional-transformed syllable input sequence, the convolutional-transformed phoneme input sequence and the convolutional-transformed Chinese character input sequence as inputs of a sequence transformation neural network module, to obtain the encoded representations of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence at an output end of the sequence transformation neural network module;
fusing the encoded representations of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence, to obtain a weighted combination of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence;
taking the weighted combination of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence as an input of an attention module, to obtain a weighted average of the weighted combination of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence at each moment at an output end of the attention module; and
taking the weighted average of the weighted combination of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence at each moment as an input of a decoder of the model to be trained, to obtain a speech Mel spectrum of the current sample at an output end of the decoder; and
outputting, from a speaker, synthesized speech based on an output of the decoder.

2. (Original) The method according to claim 1, wherein taking the syllable input sequence, the phoneme input sequence and the Chinese character input sequence of the current sample as inputs of the encoder of the model to be trained, to obtain the encoded representations of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence at the output end of the encoder, comprises:
inputting the syllable input sequence, the phoneme input sequence and the Chinese character input sequence to a shared encoder; and 
obtaining the encoded representations of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence at an output end of the shared encoder.
3-4. (Cancelled)
5. (Original) The method according to claim 1, prior to taking the syllable input sequence, the phoneme input sequence and the Chinese character input sequence of the current sample as inputs of the encoder of the model to be trained, further comprising:
converting phonemes, syllables and Chinese characters in the current sample into respective vector representations of a fixed dimension, respectively;
converting vector representations of the syllables and the Chinese characters into vector representations having the same length as the vector representation of the phonemes, to obtain the syllable input sequence, the phoneme input sequence and the Chinese character input sequence, and 
performing the step of taking the syllable input sequence, the phoneme input sequence and the Chinese character input sequence as inputs of the encoder of the model to be trained.
6. (Original) The method according to claim 1, wherein, 
the phoneme input sequence comprises: a tone input sequence, a rhotic accent input sequence, a punctuation input sequence and input sequences of 35 independent finals; the phoneme input sequence comprises 106 phoneme units; each phoneme unit comprises 106 bits, wherein a value of a significant bit in 106 bits is 1 and a value of a non-significant bit is 0; the Chinese character input sequence comprises: input sequences of 3000 Chinese characters; the syllable input sequence comprises: input sequences of 508 syllables.
7. (Currently Amended) A training apparatus for a speech synthesis model, comprising: 
at least one processor; and
a memory communicatively connected with the at least one processor; wherein,
the memory stores instructions executable by the at least one processor, the instructions are executed by the at least one processor, so that the at least one processor can execute the method comprising:
taking a syllable input sequence, a phoneme input sequence and a Chinese character input sequence of a current sample as inputs of an encoder of a model to be trained, to obtain encoded representations of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence at an output end of the encoder by
inputting the syllable input sequence, the phoneme input sequence and the Chinese character input sequence to three independent convolution layer transformation modules, respectively and obtaining a convolutional-transformed syllable input sequence, a convolutional-transformed phoneme input sequence, and a convolutional-transformed Chinese character input sequence at output ends of the three independent convolution layer transformation modules, respectively; taking the convolutional-transformed syllable input sequence, the convolutional-transformed phoneme input sequence and the convolutional-transformed Chinese character input sequence as inputs of a sequence transformation neural network module, to obtain the encoded representations of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence at an output end of the sequence transformation neural network module;
fusing the encoded representations of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence, to obtain a weighted combination of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence; and taking the weighted combination of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence as an input of an attention module, to obtain a weighted average of the weighted combination of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence at each moment at an output end of the attention module; and
taking the weighted average of the weighted combination of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence at each moment as an input of a decoder of the model to be trained, to obtain a speech Mel spectrum of the current sample at an output end of the decoder; and
outputting, from a speaker, synthesized speech based on an output of the decoder.
8. (Original) The apparatus according to claim 7, wherein taking the syllable input sequence, the phoneme input sequence and the Chinese character input sequence of the current sample as inputs of the encoder of the model to be trained, to obtain the encoded representations of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence at the output end of the encoder, comprises:
inputting the syllable input sequence, the phoneme input sequence and the Chinese character input sequence to a shared encoder; and obtaining the encoded representations of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence at an output end of the shared encoder.
9-10. (Cancelled)
11. (Original) The apparatus according to claim 7, wherein prior to taking the syllable input sequence, the phoneme input sequence and the Chinese character input sequence of the current sample as inputs of the encoder of the model to be trained, the method further comprises: 
converting phonemes, syllables and Chinese characters in the current sample into respective vector representations of a fixed dimension, respectively; and converting vector representations of the syllables and the Chinese characters into vector representations having the same length as the vector representation of the phonemes, to obtain the syllable input sequence, the phoneme input sequence and the Chinese character input sequence;
performing the step of taking the syllable input sequence, the phoneme input sequence and the Chinese character input sequence as inputs of the encoder of the model to be trained.
12. (Original) The apparatus according to claim 7, wherein, the phoneme input sequence comprises: a tone input sequence, a rhotic accent input sequence, a punctuation input sequence and input sequences of 35 independent finals; the phoneme input sequence comprises 106 phoneme units; each phoneme unit comprises 106 bits, wherein a value of a significant bit in 106 bits is 1 and a value of a non-significant bit is 0; the Chinese character input sequence comprises: input sequences of 3000 Chinese characters; the syllable input sequence comprises: input sequences of 508 syllables.
13. (Currently Amended) A non-transitory computer-readable storage medium having computer instructions stored, wherein, the computer instructions are used to cause a computer to execute the method comprising:
taking a syllable input sequence, a phoneme input sequence and a Chinese character input sequence of a current sample as inputs of an encoder of a model to be trained, to obtain encoded representations of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence at an output end of the encoder by
inputting the syllable input sequence, the phoneme input sequence and the Chinese character input sequence to three independent convolution layer transformation modules, respectively and obtaining a convolutional-transformed syllable input sequence, a convolutional-transformed phoneme input sequence, and a convolutional-transformed Chinese character input sequence at output ends of the three independent convolution layer transformation modules, respectively; taking the convolutional-transformed syllable input sequence, the convolutional-transformed phoneme input sequence and the convolutional-transformed Chinese character input sequence as inputs of a sequence transformation neural network module, to obtain the encoded representations of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence at an output end of the sequence transformation neural network module;
fusing the encoded representations of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence, to obtain a weighted combination of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence; and taking the weighted combination of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence as an input of an attention module, to obtain a weighted average of the weighted combination of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence at each moment at an output end of the attention module; and
taking the weighted average of the weighted combination of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence at each moment as an input of a decoder of the model to be trained, to obtain a speech Mel spectrum of the current sample at an output end of the decoder; and
outputting, from a speaker, synthesized speech based on an output of the decoder.
14. (Original) The non-transitory computer-readable storage medium according to claim 13, wherein taking the syllable input sequence, the phoneme input sequence and the Chinese character input sequence of the current sample as inputs of the encoder of the model to be trained, to obtain the encoded representations of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence at the output end of the encoder, comprises:
inputting the syllable input sequence, the phoneme input sequence and the Chinese character input sequence to a shared encoder; and obtaining the encoded representations of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence at an output end of the shared encoder.
15-16. (Cancelled)
17. (Original) The non-transitory computer-readable storage medium according to claim 13, wherein prior to taking the syllable input sequence, the phoneme input sequence and the Chinese character input sequence of the current sample as inputs of the encoder of the model to be trained, the method further comprises: 
converting phonemes, syllables and Chinese characters in the current sample into respective vector representations of a fixed dimension, respectively; and converting vector representations of the syllables and the Chinese characters into vector representations having the same length as the vector representation of the phonemes, to obtain the syllable input sequence, the phoneme input sequence and the Chinese character input sequence;
performing the step of taking the syllable input sequence, the phoneme input sequence and the Chinese character input sequence as inputs of the encoder of the model to be trained.
18. (Original) The non-transitory computer-readable storage medium according to claim 13, wherein, the phoneme input sequence comprises: a tone input sequence, a rhotic accent input sequence, a punctuation input sequence and input sequences of 35 independent finals; the phoneme input sequence comprises 106 phoneme units; each phoneme unit comprises 106 bits, wherein a value of a significant bit in 106 bits is 1 and a value of a non-significant bit is 0; the Chinese character input sequence comprises: input sequences of 3000 Chinese characters; the syllable input sequence comprises: input sequences of 508 syllables.


Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ANGELA A ARMSTRONG whose telephone number is (571)272-7598. The examiner can normally be reached M,T,TH,F 11:30-8:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Desir, can be reached at 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

ANGELA A. ARMSTRONG
Primary Examiner
Art Unit 2659




/ANGELA A ARMSTRONG/Primary Examiner, Art Unit 2659