Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
Should applicant desire to obtain the benefit of foreign priority under 35 U.S.C. 119(a)-(d) prior to declaration of an interference, a certified English translation of the foreign application must be submitted in reply to this action.  37 CFR 41.154(b) and 41.202(e).
Failure to provide a certified translation may result in no benefit being accorded for the non-English application.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 06/19/2020 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1-18 are directed to non-statutory subject matter because the claims as a whole, considering all claim elements both individually and in combination, do not amount to significantly more than an abstract idea.
The independent claims 1, 7, and 13 are directed to the abstract idea of:
“A training method for a speech synthesis model, comprising: taking a syllable input sequence, a phoneme input sequence and a Chinese character input sequence of a current sample as inputs of an encoder of a model to be trained, to obtain encoded representations of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence at an output end of the encoder; fusing the encoded representations of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence, to obtain a weighted combination of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence; taking the weighted combination of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence as an input of an attention module, to obtain a weighted average of the weighted combination of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence at each moment at an output end of the attention module; and taking the weighted average of the weighted combination of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence at each moment as an input of a decoder of the model to be trained, to obtain a speech Mel spectrum of the current sample at an output end of the decoder.”

	The limitations of “taking …”, “fusing …”, “taking …”, and “taking …”, as drafted and under the broadest reasonable interpretation, cover mental processes and mathematical concepts that a human can perform. The limitation of taking inputs such as syllables, phonemes, and Chinese characters for encoding is nothing more than collecting multiple input data and using mathematical concepts to code the inputs into a new data format. For example, a human would be able to obtain three input data items corresponding to the three input sequences and then use a mathematical encoding concept, such as binary coding, to convert the inputs into a coded format using pen and paper. The limitation of fusing the encoded inputs to obtain a weighted combination is nothing more than a human placing the encoded forms of all three inputs into a single vector or matrix; furthermore, obtaining the weighted combination is nothing more than obtaining a weighted vector of the fused encoded inputs using a relevant mathematical concept. The limitation of taking the weighted combination as an input of an attention module to obtain a weighted average can be performed by a human by mathematically averaging the weights present in the weighted combination. Lastly, the limitation of using the weighted average to obtain the decoded Mel spectrum of the sample is nothing more than a human performing mathematical decoding on the encoded weighted average and obtaining the Mel-spectrum coefficients using pen and paper. These coefficients can further be used to draw the spectrum waveform by hand.
	The judicial exception is not integrated into a practical application. In particular, claims 7 and 13 recite the additional elements “processor”, “memory”, and “non-transitory computer-readable storage medium”. All of these elements are recited at a very high level of generality and do not add meaningful limits to the abstract idea being performed. The additional element “processor” is used to perform a generic abstract idea that can be performed by a human. Furthermore, a processor is considered a generic element that can be found in most computer devices. The additional element “memory” adds no meaningful limits to the claim and is used only for storing data and instructions. Similar to a processor, memory is considered generic and is known to be found in most computer devices. Lastly, the additional element “non-transitory computer-readable storage medium” (CRM) is considered a well-known generic component used conventionally in most generic computer devices. Also, it is known that a CRM or computer implementation of an abstract idea is not a factor that weighs in favor of patentability under subject matter eligibility. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.
	The dependent claims 2, 8, and 14 are directed towards the abstract idea of inputting the three input sequences into a shared encoder and obtaining encoded representations. These limitations cover nothing more than a human utilizing the mathematical concept of encoding. A person would be able to obtain the three mentioned inputs and code them into an encoded format by applying relevant mathematical encoding concepts using pen and paper.
	The dependent claims 3, 9, and 15 are directed towards the abstract idea of inputting three inputs into three convolution layers for transformation and using the transformed inputs for the encoding process. Convolutional transformation and encoding are known mathematical concepts that a human can apply using pen and paper. Therefore, an individual skilled in the art would be able to perform the claimed limitations by hand.
	The dependent claims 4, 10, and 16 are directed towards the abstract idea of inputting three inputs into three encoders and obtaining three encoded input sequences. This limitation is similar to those of the above-mentioned claims, as it is solely aimed at encoding, which a human would be able to perform by hand by utilizing relevant mathematical concepts. The three encoded outputs can be obtained by simply encoding the three inputs separately by hand, one by one.
	The dependent claims 5, 11, and 17 are directed towards the abstract idea of converting the three inputs into vector form and using the vector form for the encoding process. The limitation of converting the input data into vector form is nothing more than performing the mathematical concept of vectorization, which can also be performed by a human using pen and paper. As for the encoding aspect of the claims, as explained in the rejections above, the encoding is nothing more than a person applying a concept to code inputs into a coded form using pen and paper.
	The dependent claims 6, 12, and 18 are directed towards the abstract idea of specifying the type and amount of data that the inputs contain. This limitation simply presents detail about the inputs. Using the mentioned details, a human would be able to divide the data into the three mentioned input sequences accordingly using pen and paper.
Dependent claims 2-6, 8-12, and 14-18 do not integrate the judicial exception into a practical application and further fail to include additional elements that are sufficient to amount to significantly more than the judicial exception.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2, 7-8, and 13-14 are rejected under 35 U.S.C. 103 as being unpatentable over Dong (Document ID: “Representing raw linguistic information in chinese text-to-speech system”) in view of Weweler (Document ID: “SINGLE-SPEAKER END-TO-END NEURAL TEXT-TO-SPEECH SYNTHESIS”).
Regarding Claims 1, 7, and 13, Dong teaches a training method for a speech synthesis model, comprising:
taking a syllable input sequence, a phoneme input sequence and a Chinese character input sequence (section II, Page 168, Paragraph 1-3, mentions of input sequences which include syllables, phoneme, and Chinese characters; Also see Fig 1 and table 1 for its representation) of a current sample as inputs of an encoder of a model to be trained, to obtain encoded representations of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence at an output end of the encoder (Section II, Page 168, Paragraph 6, lines 1-9, mentions of neural network/ auto encoder being used to get long context vector; Also see Section II, Page 168, Paragraph 2-5 and fig 1 that presents detail on the input); 
taking the combination of encoded input sequence at each moment as an input of a decoder of the model to be trained, to obtain a speech Mel spectrum of the current sample at an output end of the decoder (Page 168, Section III; paragraph 3, lines 13-19 mention of getting a waveform from the found parameters using DNN neural network).
Dong fails to specifically mention the claimed limitation of: “fusing the encoded representations of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence, to obtain a weighted combination of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence; taking the weighted combination of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence as an input of an attention module, to obtain a weighted average of the weighted combination of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence at each moment at an output end of the attention module; and taking the weighted average of the weighted combination of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence at each moment as an input of a decoder of the model to be trained.”
Weweler does teach the claimed limitation of fusing the encoded inputs to obtain a weighted combination of the input sequences (Fig 10 and its description show multiple inputs being sent into the encoder to get hidden states; Page 22, Paragraph 2 and Equation 16 mention the hidden states being combined into a sequence of alignment weights, which can be equated to the weighted combination); taking the weighted combination as an input of an attention module, to obtain a weighted average of the weighted combination at each moment at an output end of the attention module (Fig 10 and its description show the weighted average being found in the form of a context vector; also see Page 22, Paragraphs 2-3 and Page 23, Eq 17); and taking the weighted average of the weighted combination at each moment as an input of a decoder (Fig 10 shows the weighted average being sent into the decoder as input). The encoder-decoder model using a weighted combination taught by Weweler can be implemented with the model taught by Dong by simply using the input sequences mentioned in the claim as inputs for the model. Weweler is considered analogous to the claimed invention because it is also aimed towards a speech synthesis system. Therefore, it would have been obvious to one skilled in the art before the effective filing date of the claimed invention to have modified Dong to implement the encoder-decoder model taught by Weweler. The use of the model shown in Weweler can help improve the computational speed and dimensionality (Page 11, Paragraphs 2-3).
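For illustration only, the attention step discussed above (alignment weights applied to encoder hidden states to give a context vector at each moment) can be sketched as follows. This is a minimal sketch that assumes a conventional softmax-normalized attention; the function names and the toy numbers are hypothetical and are not taken from Weweler, Dong, or the application.

```python
import math

def softmax(scores):
    """Normalize raw alignment scores into weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(hidden_states, scores):
    """Weighted average of encoder hidden states for one decoder step.

    hidden_states: list of equal-length vectors (one per input position).
    scores: one raw alignment score per hidden state (assumed given).
    """
    weights = softmax(scores)
    dim = len(hidden_states[0])
    return [
        sum(w * h[d] for w, h in zip(weights, hidden_states))
        for d in range(dim)
    ]

# Toy example: equal scores reduce the weighted average to a plain mean.
states = [[1.0, 0.0], [3.0, 2.0]]
print(context_vector(states, [0.0, 0.0]))
```

With equal scores the weights are uniform, so the result is the element-wise mean of the hidden states; unequal scores shift the average toward the higher-scored positions.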
As seen in the claim set, claims 1, 7, and 13 cover a similar scope of invention. However, claim 1 is a method claim, while claims 7 and 13 are apparatus and computer-readable medium (CRM) claims, respectively. The steps of the method of claim 1 correspond to the claimed elements of claims 7 and 13. Furthermore, in view of the experimental results presented by Dong in Section IV, it would have been obvious that a computer system that includes a CRM could be used to implement the method of claim 1. Therefore, claims 7 and 13 are rejected under the same rationale as applied above to claim 1.
Regarding claims 2, 8, and 14, Dong in view of Weweler teaches the method according to claim 1, the apparatus according to claim 7, and the non-transitory computer-readable storage medium according to claim 13; wherein taking the syllable input sequence, the phoneme input sequence and the Chinese character input sequence of the current sample as inputs of the encoder of the model to be trained, to obtain the encoded representations of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence at the output end of the encoder, comprises: 
inputting the syllable input sequence, the phoneme input sequence and the Chinese character input sequence to a shared encoder; and obtaining the encoded representations of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence at an output end of the shared encoder (Dong, Section II, Page 168, Paragraph 6, lines 1-9, mention of neural network/ auto encoder being used to get long context vector; Also see Section II, Page 168, Paragraph 1-5 and fig 1 that presents detail on the input which include Chinese character, syllable, and phoneme). 
Claims 4, 10, and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Dong (Document ID: “Representing raw linguistic information in chinese text-to-speech system”) in view of Weweler (Document ID: “SINGLE-SPEAKER END-TO-END NEURAL TEXT-TO-SPEECH SYNTHESIS”) in view of Banerjee (Document ID: “Relation Extraction Using Multi-Encoder LSTM Network on a Distant Supervised Dataset”).
Regarding claims 4, 10, and 16, Dong in view of Weweler teaches the method according to claim 1, the apparatus according to claim 7, and the non-transitory computer-readable storage medium according to claim 13; wherein taking the syllable input sequence, the phoneme input sequence and the Chinese character input sequence of the current sample as inputs of the encoder of the model to be trained, to obtain the encoded representations of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence at the output end of the encoder, comprises: 
However, Dong in view of Weweler fails to specifically mention the claimed limitation of: “inputting the syllable input sequence, the phoneme input sequence and the Chinese character input sequence to three independent encoders, respectively; and obtaining the encoded representations of the syllable input sequence, the phoneme input sequence and the Chinese character input sequence at output ends of the three independent encoders, respectively.”
Banerjee does teach the claimed limitation of: inputting input sequences to three independent encoders, respectively (see Fig 2, which shows a multi-encoder model in which three inputs have three separate encoder modules; also Pages 236-237, Section B “MEM: Multi-Encoder Model”); and obtaining the encoded representations at the output ends of the three independent encoders, respectively (see Fig 2, Page 237, col 1, lines 1-7, and Equation 2, which show three encoder representations being generated). Even though Banerjee does not encode exactly the same input sequences mentioned in the claim, it still presents, before the effective filing date, a multi-encoder model that takes in multiple text features, as does the claimed invention. Therefore, it would have been obvious to one skilled in the art before the effective filing date of the claimed invention to have modified Dong in view of Weweler to include the multi-encoder model taught by Banerjee, as it can improve performance when compared to a regular LSTM encoder (Page 237, Col 2, Section B, Lines 16-17).
Claims 5, 11, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Dong (Document ID: “Representing raw linguistic information in chinese text-to-speech system”) in view of Weweler (Document ID: “SINGLE-SPEAKER END-TO-END NEURAL TEXT-TO-SPEECH SYNTHESIS”) in view of Li (Document ID: "Multi-language multi-speaker acoustic modeling for LSTM-RNN based statistical parametric speech synthesis").
Regarding claims 5, 11, and 17, Dong in view of Weweler teaches the method according to claim 1, the apparatus according to claim 7, and the non-transitory computer-readable storage medium according to claim 13; prior to taking the syllable input sequence, the phoneme input sequence and the Chinese character input sequence of the current sample as inputs of the encoder of the model to be trained: 
However, Dong in view of Weweler fails to specifically mention the claimed limitation of: “converting phonemes, syllables and Chinese characters in the current sample into respective vector representations of a fixed dimension, respectively; converting vector representations of the syllables and the Chinese characters into vector representations having the same length as the vector representation of the phonemes,”
Li does teach the claimed limitation of obtaining vector representations of input features of a fixed dimension having the same length as the vector representation of another input feature (Pages 1 and 2, Section 2, Paragraph 2, lines 1-3 and lines 10-14, mention vector representations of input features being obtained and the vector representations being zero-padded to the same length for a universal representation); and performing the step of taking the inputs to the encoder (Fig 1 and Page 2, col 2, lines 5-13, show an encoder-type module being used to get features that represent vocoder parameters). Even though Li does not mention the use of the same input sequences as the claimed invention, it does mention linguistic features being inputted, which can be equated to the input sequences mentioned in the claim set. Moreover, it would have been obvious to one skilled in the art to implement zero-padding on the inputs to get a universal representation similar to what is mentioned by Li. Li is considered analogous to the claimed invention because it is also aimed towards a speech synthesis system. Therefore, it would have been obvious to one skilled in the art before the effective filing date of the claimed invention to have modified Dong in view of Weweler to implement zero-padding of inputs as taught by Li. Furthermore, one of ordinary skill in the art would have recognized that the result of the combination was predictable since the use of that known technique provides the rationale to arrive at a conclusion of obviousness. See KSR International Co. v. Teleflex Inc., 82 USPQ2d 1385 (U.S. 2007).
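For illustration only, the zero-padding technique referred to above (bringing vector representations of different lengths to a common length) can be sketched as follows. This is a minimal sketch under the assumption that shorter vectors are right-padded with zeros to the longest length; the function name and sample values are hypothetical and are not taken from Li.

```python
def zero_pad(vectors):
    """Right-pad each vector with zeros so that all share the longest length."""
    target = max(len(v) for v in vectors)
    return [list(v) + [0.0] * (target - len(v)) for v in vectors]

# Toy example: three feature vectors of different lengths.
padded = zero_pad([[1.0], [1.0, 2.0, 3.0], [4.0, 5.0]])
print(padded)
```

After padding, every vector has the same dimension, so the vectors can be supplied to a single encoder input of fixed width.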
Claims 6, 12, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Dong (Document ID: “Representing raw linguistic information in chinese text-to-speech system”) in view of Weweler (Document ID: “SINGLE-SPEAKER END-TO-END NEURAL TEXT-TO-SPEECH SYNTHESIS”) in view of Lin (Document ID: "A novel prosodic-information synthesizer based on recurrent fuzzy neural network for the Chinese TTS system") in view of Document ID: “Pinyin Chart” in view of Hu (Document ID: CN 109065044 A).
Regarding claims 6, 12, and 18, Dong in view of Weweler teaches the method according to claim 1, the apparatus according to claim 7, and the non-transitory computer-readable storage medium according to claim 13; wherein the Chinese character input sequence comprises: input sequences of 3000 Chinese characters (Dong, Page 168, col 1, Paragraph 5, lines 1-3, mentions that the most commonly used Chinese characters exceed 3000. Even though it is not shown that 3000 characters are used in the system, adding the necessary number of Chinese characters to the system could lead to 3000 as a matter of design choice.)
However, Dong in view of Weweler fails to specifically mention the claimed limitation of “the phoneme input sequence comprises: a tone input sequence, a rhotic accent input sequence, a punctuation input sequence and input sequences of 35 independent finals; the phoneme input sequence comprises 106 phoneme units; each phoneme unit comprises 106 bits, wherein a value of a significant bit in 106 bits is 1 and a value of a non-significant bit is 0; the syllable input sequence comprises: input sequences of 508 syllable.”
Lin does teach the phoneme input sequence, which consists of tone, vowel, punctuation, and consonant (Tables II and III; Page 317, Col 2, Paragraph 2 and Page 318, Col 1, Paragraph 1). Here, the vowels can be equated to the 35 independent finals (Table III and Page 318, Col 1, Paragraph 1, lines 10-12), and the combination of vowel and consonant present in the binary digit arrangement can be equated to the rhotic accent sequence. Furthermore, Lin also teaches the bit format of the input sequence mentioned in the claim, with 0 corresponding to non-significance and 1 corresponding to significance (Tables II and III; Page 317, Col 2, Paragraph 2 and Page 318, Col 1, Paragraph 1). However, Lin fails to cover the exact width of 106 bits for the input sequence. Lin is considered analogous to the claimed invention as it is also aimed towards speech synthesis. Therefore, it would have been obvious to one skilled in the art before the effective filing date of the claimed invention to have modified Dong in view of Weweler to implement the input taught by Lin, the use of which can improve the intelligibility and naturalness of synthetic speech in a Chinese TTS system (Page 323, Col 1, lines 13-18).
However, Dong in view of Weweler in view of Lin fails to specifically mention the exact number of 106 phoneme units and the 508-syllable input sequence.
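For illustration only, the bit format discussed above (1 for a significant bit, 0 for a non-significant bit) can be sketched as follows. This is a minimal sketch that assumes a one-hot reading of the claim language; the width of 106 is taken from the claim rather than from Lin, and the function name and chosen index are hypothetical.

```python
NUM_PHONEME_UNITS = 106  # width recited in the claim, not taught by Lin

def one_hot(index, width=NUM_PHONEME_UNITS):
    """Return a bit vector with a single 1 at `index` and 0 elsewhere."""
    if not 0 <= index < width:
        raise ValueError(f"index {index} outside 0..{width - 1}")
    bits = [0] * width
    bits[index] = 1  # the single significant bit
    return bits

# Toy example: the phoneme unit assigned (hypothetically) to position 3.
unit = one_hot(3)
print(len(unit), sum(unit), unit[:5])
```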
According to the evaluation of the phoneme input sequence, it is inherent that the number 106 is simply the sum of all of the inputs mentioned, which include tone, rhotic accent, punctuation, and the 35 finals. The numbers for tone, punctuation, and finals are covered in Lin (see Tables II and III), where there are 5 tones, 35 vowels/finals, and 8 punctuation marks. The rhotic accent is missing from Lin; it is commonly known in the Chinese language (or any known language) as the ‘r-sound’, based on the pitch contours shown in Lin, Fig 3. To get the total number of rhotic accents, the number of initial (consonant) and final (vowel) combinations with the ‘r-sound’ is simply multiplied by the number of tones. The NPL “Pinyin Chart” shows all the combinations of the initials (consonants) and finals (vowels). Here, combinations with ‘r’, such as ‘rao’, ‘ran’, and ‘rang’, can be equated to r-sound (rhotic) combinations, which can be pronounced in the 4 Chinese tones/contours mentioned above. This gives a total rhotic accent count of 15*4=60. The final number corresponding to the phoneme sequence can now be found as follows: 5 tones + 35 finals + 60 rhotic contours + 8 punctuation marks = 108. Although the references do not teach the exact number 106, removing certain information could lead to 106 as a matter of design choice. Applying the Document ID: “Pinyin Chart” to calculate the number of rhotic sounds would have been obvious to one skilled in the art, as it is common knowledge of the Chinese/pinyin language. Furthermore, one of ordinary skill in the art would have recognized that the result of the combination was predictable since the use of that known technique provides the rationale to arrive at a conclusion of obviousness. See KSR International Co. v. Teleflex Inc., 82 USPQ2d 1385 (U.S. 2007).
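For illustration only, the count set out above can be reproduced with the following arithmetic. This is a minimal sketch that uses only the figures stated in this action (5 tones, 35 finals, 8 punctuation marks, 15 ‘r’ combinations, 4 tones per combination); the variable and function names are hypothetical.

```python
# Figures as stated in this action (Lin, Tables II and III; "Pinyin Chart").
TONES = 5
FINALS = 35
PUNCTUATION_MARKS = 8
R_COMBINATIONS = 15       # combinations with the 'r' initial, per the action
TONES_PER_COMBINATION = 4
CLAIMED_UNITS = 106       # number recited in the claim

def phoneme_unit_total():
    """Sum the component counts used in the examiner's evaluation."""
    rhotic = R_COMBINATIONS * TONES_PER_COMBINATION  # 15 * 4 = 60
    return TONES + FINALS + rhotic + PUNCTUATION_MARKS

total = phoneme_unit_total()
print(total, total - CLAIMED_UNITS)  # prints "108 2"
```

The difference of 2 between the computed total and the claimed 106 is the gap that the action attributes to design choice.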
Dong in view of Weweler in view of Lin in view of Document ID: “Pinyin Chart” fails to teach the 508 syllables mentioned in the claim, which are understood to represent syllables without tone. Hu does teach that the number of Chinese character syllables without tone is about 400 to 500 (Page 6, Paragraph 9, line 4). Given that there are about 500 syllables without tone, the number 500 can be increased or decreased depending on the design and implementation choices of the speech synthesis system so that the syllable input sequence can represent 508 syllables. The number of syllables without tone should be considered a known fact, as it relates to a well-established language, Chinese. Therefore, it would have been obvious to one skilled in the art to have about 500 syllables without tone as input. Furthermore, one of ordinary skill in the art would have recognized that the result of the combination was predictable since the use of that known technique provides the rationale to arrive at a conclusion of obviousness. See KSR International Co. v. Teleflex Inc., 82 USPQ2d 1385 (U.S. 2007).

Allowable Subject Matter
Claims 3, 9, and 15 are objected to as being dependent upon a rejected base claim.
The following is a statement of reasons for indication of allowable subject matter:
None of the cited prior art teaches three independent convolution layers used to transform the three input sequences. Furthermore, none of the prior art specifically shows or mentions the three convolutionally transformed input sequences being encoded using a neural network.
Claims 3, 9, and 15 would be allowable if rewritten to overcome the 35 U.S.C. 101 rejection of claims 1-18 and if rewritten to overcome the objection as being dependent upon a rejected claim. One way to overcome the objection is to rewrite the objected-to claims in independent form including all of the limitations of the base claim and any intervening claims.
Conclusion
The analogous prior art made of record but not relied upon is considered pertinent to applicant’s disclosure.
Zhang (Document ID: “Joint training framework for text-to-speech and voice conversion using multi-source tacotron and wavenet”) teaches a TTS system with a multi-source, multi-encoder model.
Hu (Document ID: CN 108447486 A) teaches a speech synthesis system for speech translation which includes encoding and decoding of input sequences such as Chinese characters, phonemes, and syllables.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to NEEL P. KARELIA whose telephone number is (571)272-4377. The examiner can normally be reached Monday-Friday 6:30 am - 4:00 pm (every other Friday off).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir, can be reached on (571)272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/NEEL PIYUSHKUMAR KARELIA/Examiner, Art Unit 2659     

/PIERRE LOUIS DESIR/Supervisory Patent Examiner, Art Unit 2659