DETAILED ACTION
This communication is in response to the Application filed on 02/21/2020. Claims 1-15 are pending and have been examined, with claims 1, 6, and 11 being independent. 
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
Receipt is acknowledged of certified copies of papers required by 37 CFR 1.55.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 08/21/2020 is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
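A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.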


Claims 1, 6, and 11 are rejected under 35 U.S.C. 103 as being unpatentable over Lee et al. (US Patent No. US 11,205,417 B2), hereinafter Lee, in view of Mandal et al. (US Patent No. US 10,726,830 B1), hereinafter Mandal.
Regarding Claims 6, 1, and 11, Lee discloses:
A speech synthesis device/method, comprising: (Lee, Figure 2: the speech recognition verification device 100)
one or more processors; and (Lee, Figure 2, Column 13, Lines 1-3: the control module 170 may include devices of all kinds that are capable of processing data, such as a processor) 
a storage device configured to store one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to: (Lee, Column 15, Lines 19-28: computer programs executable through various components on a computer, and such computer programs may be recorded in computer-readable media [as storage device] e.g. hardware devices that are specially configured to store and execute program codes, such as ROM, RAM, and flash memory devices) 
input text information into an encoder of an acoustic model, to output a text feature of a current time step; (Lee, Column 11, Lines 32-43: speech synthesis module 130 may use a Tacotron algorithm to convert the verification target text item [as input text information] to a speech spectrogram, and apply a preset utterance condition to the speech spectrogram to 
input the spliced feature of the current time step into a decoder of the acoustic model to obtain a spectral feature of the current time step; and (Lee, Column 11, Lines 41-43: decoder part (not illustrated) synthesizing a speech from the text (a verification target text item) outputted from the encoder; Lee, Column 11, Lines 52-57: In the decoder, as an input value at a time step t for a decoder network [as input the spliced feature of the current time step into a decoder], the sum of a weighted sum of text encoding vectors and the last decoder output value in a time step t-1 hour. The output value of the decoder may output an R number of vectors at each time step by mel-scale spectrogram [as obtain a spectral feature of the current time step])

a non-transitory computer-readable storage medium (Lee, Column 15, Lines 19-21: computer programs executable through various components on a computer, and such computer programs may be recorded in computer-readable media)
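For illustration only, the decoder input described in the cited passage of Lee (Column 11, Lines 52-57), namely a weighted sum of text encoding vectors added to the decoder output of time step t-1, may be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the vector sizes, and the example weights are hypothetical and do not appear in Lee, and how the weights are computed is not addressed.

```python
from typing import List

Vector = List[float]


def weighted_sum(weights: Vector, vectors: List[Vector]) -> Vector:
    """Weighted sum of the text encoding vectors (one weight per vector)."""
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def decoder_input(weights: Vector, encodings: List[Vector], prev_output: Vector) -> Vector:
    """Input at time step t: weighted sum of encodings plus the output of step t-1."""
    context = weighted_sum(weights, encodings)
    return [c + p for c, p in zip(context, prev_output)]


# Hypothetical values chosen only to make the arithmetic easy to follow.
encodings = [[1.0, 0.0], [0.0, 1.0]]   # two text encoding vectors
weights = [0.25, 0.75]                 # assumed weights for time step t
prev_output = [0.5, 0.5]               # decoder output at time step t-1
step_input = decoder_input(weights, encodings, prev_output)
print(step_input)  # [0.75, 1.25]
```

Note that this sketch combines the two quantities by element-wise addition, as the cited passage states, which requires both vectors to have the same length.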
Lee does not disclose a speech synthesis device/method to:
splice the text feature of the current time step with a spectral feature of a previous time step to obtain a spliced feature of the current time step, and 
However, Mandal teaches a speech synthesis device/method to:
splice the text feature of the current time step with a spectral feature of a previous time step to obtain a spliced feature of the current time step, and 
(Mandal, Column 23, Lines 54-60: the conventional front-end 900 may perform feature normalization 920 to normalize the feature vector [as splice the text feature]. For example, the feature normalization 920 may perform causal and global mean-variance normalization. In some examples, the feature normalization 920 may subtract the mean of each coefficient from 
Lee and Mandal are considered to be analogous to the claimed invention because they are in the same field of speech synthesis. Accordingly, it would have been obvious to one of ordinary skill in the art at the time the invention was effectively filed to have combined Lee (directed to a speech synthesis device to input the spectral feature of the current time step into a neural network vocoder, to output speech) and Mandal (directed to splicing the text feature) and arrived at a speech synthesis device to splice the text feature and to input the spectral feature of the current time step into a neural network vocoder, to output speech. One of ordinary skill in the art would have been motivated to make such a combination because the feature normalization 920 balances the spectrum and improves a Signal-to-Noise ratio (SNR) or other signal quality metric of the output of the conventional front-end 900 (Mandal, Column 23, Lines 60-63).
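For illustration only, the global mean-variance normalization referred to in the cited passage of Mandal may be sketched as follows. This is a minimal sketch under stated assumptions: the function name and the example frames are hypothetical, the division by the per-coefficient standard deviation is the conventional reading of "mean-variance normalization" rather than a detail taken from Mandal, and the causal variant is not shown.

```python
import math
from typing import List


def mean_variance_normalize(frames: List[List[float]]) -> List[List[float]]:
    """Normalize each coefficient to zero mean and unit variance across all frames."""
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[i] for f in frames) / n for i in range(dim)]
    # Population standard deviation; a constant coefficient falls back to 1.0
    # so that the division below is always defined (an assumption of this sketch).
    stds = [
        math.sqrt(sum((f[i] - means[i]) ** 2 for f in frames) / n) or 1.0
        for i in range(dim)
    ]
    return [[(f[i] - means[i]) / stds[i] for i in range(dim)] for f in frames]


# Two hypothetical feature vectors with two coefficients each.
frames = [[1.0, 10.0], [3.0, 10.0]]
normalized = mean_variance_normalize(frames)
print(normalized)  # [[-1.0, 0.0], [1.0, 0.0]]
```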

Claims 2, 7, and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Lee in view of Mandal, and further in view of Arik et al. (US PGPub No. US 2018/0247636 A1), hereinafter Arik.
Regarding Claims 7, 2, and 12, Lee in view of Mandal discloses all of the limitations of Claims 6, 1, and 11 above. However, Lee in view of Mandal does not disclose a speech synthesis device/method to:
pass the text information through at least one fully connected layer and a gated recurrent unit in the encoder, to output the text feature of the current time step.
However, Arik teaches a speech synthesis device/method to:
pass the text information through at least one fully connected layer and a gated recurrent unit in the encoder, to output the text feature of the current time step. (Arik, Paragraph 36: The grapheme-to-phoneme model 115/215 converts from written text (e.g., English characters) [as pass the text information] to phonemes (e.g., encoded using a phonemic alphabet such as ARPABET); Paragraph 45: multi-layer bidirectional encoder with a gated recurrent unit (GRU) [as gated recurrent unit] nonlinearity and an equally deep unidirectional GRU decoder; Paragraph 54: The input to embodiments of the model is a sequence of phonemes with stresses, with each phoneme and stress being encoded as a one-hot vector; Embodiments of the architecture comprise two fully connected layers [as fully connected layer] with 256 units each followed by two unidirectional recurrent layers with 128 GRU cells each and finally a fully-connected output layer; Paragraph 55: The final layer produces three estimations for every input phoneme: the phoneme duration [as output the text feature], the probability that the phoneme is voiced (i.e., has a fundamental frequency), and 20 time-dependent F0 values, which are sampled uniformly over the predicted duration; Paragraph 24: Embodiments convert text to phonemes and then use an audio synthesis model to convert linguistic features [as text feature] into speech; features used by embodiments herein are phonemes with stress annotations, phoneme durations, and fundamental frequency (F0))
Lee, Mandal, and Arik are considered to be analogous to the claimed invention because they are in the same field of speech synthesis. Accordingly, it would have been obvious to one of ordinary skill in the art at the time the invention was effectively filed to have combined Lee in view of Mandal (directed to a speech synthesis device) and Arik (directed to pass text .
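For illustration only, the limitation of passing text information through a fully connected layer and a gated recurrent unit to output a text feature per time step may be sketched as follows. This is a minimal sketch under stated assumptions: all names, layer sizes, and the random weights are hypothetical; biases are omitted; the GRU update uses one common gating convention; and the sketch does not reproduce the architecture of Arik or of the present application.

```python
import math
import random
from typing import Dict, List

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def random_matrix(rng: random.Random, rows: int, cols: int) -> Matrix:
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def dense_relu(w: Matrix, x: Vector) -> Vector:
    """Fully connected layer (no bias) followed by a ReLU activation."""
    return [max(0.0, s) for s in matvec(w, x)]


def gru_step(p: Dict[str, Matrix], x: Vector, h: Vector) -> Vector:
    """One gated recurrent unit update: gates z and r, candidate state, new state."""
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h))]
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h))]
    gated = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(a + b) for a, b in zip(matvec(p["Wh"], x), matvec(p["Uh"], gated))]
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def encode(symbol_ids: List[int], vocab_size: int, fc_dim: int = 4,
           hidden_dim: int = 3, seed: int = 0) -> List[Vector]:
    """Return one text feature (the GRU hidden state) per input symbol."""
    rng = random.Random(seed)
    fc = random_matrix(rng, fc_dim, vocab_size)
    gru = {
        name: random_matrix(rng, hidden_dim, fc_dim if name[0] == "W" else hidden_dim)
        for name in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")
    }
    h = [0.0] * hidden_dim
    features = []
    for sid in symbol_ids:
        one_hot = [1.0 if i == sid else 0.0 for i in range(vocab_size)]
        h = gru_step(gru, dense_relu(fc, one_hot), h)
        features.append(h)
    return features


text_features = encode([0, 2, 1], vocab_size=3)
print(len(text_features), len(text_features[0]))  # 3 3
```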
Allowable Subject Matter
Claims 3-5, 8-10, and 13-15 would be allowable if rewritten to include all of the limitations of the base claim and any intervening claims.
Regarding Claims 8, 3, and 13, Lee in view of Mandal discloses all of the limitations of Claims 6, 1, and 11 above. Furthermore, Lee discloses a speech synthesis device/method to:
input the first spectral feature of the previous time step into another fully connected layer, to obtain a second spectral feature of the previous time step; (Lee, Column 11, Lines 13-16: applicator 131-2 may apply the second utterance condition (for example, a male voice in his teenage years) to the speech spectrogram [as first spectral feature] to generate a second speech spectrogram [as second spectral feature])
input the spliced feature of the current time step into the decoder of the acoustic model, to obtain a first spectral feature of the current time step. (Lee, Column 11, Lines 41-43: decoder part (not illustrated) synthesizing a speech from the text (a verification target text item) outputted from the encoder; Lee, Column 11, Lines 52-57: In the decoder, as an input value at a time step t for a decoder network [as input the spliced feature of the current time step into a decoder], the sum of a weighted sum of text encoding vectors and the last decoder 
However, Lee in view of Mandal does not disclose a speech synthesis device/method to:
input the spliced feature of the previous time step into at least one gated recurrent unit and a fully connected layer in the decoder to output a first spectral feature of the previous time step; 
splice the text feature of the current time step with the second spectral feature of the previous time step, to obtain the spliced feature of the current time step; 
Claims 4-5, 9-10, and 14-15 depend from Claims 3, 8, and 13, are similar in scope and content, and are therefore similarly allowable under the same rationale as applied above with respect to the functions recited in Claims 3, 8, and 13.
Hence, none of the prior art of record teaches or makes obvious the combination of limitations as presently recited in claims 3-5, 8-10, and 13-15. 
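For illustration only, the order of operations recited in Claims 3, 8, and 13 may be sketched as the loop below. This is a minimal sketch under stated assumptions: "splice" is modeled as vector concatenation, the initial previous-step feature is assumed to be all zeros, and the decoder (gated recurrent unit plus fully connected layer) and the additional fully connected layer are replaced by hypothetical stand-in functions so that only the data flow is shown. None of the names below appear in the application or in the cited references.

```python
from typing import Callable, List

Vector = List[float]


def synthesize(text_features: List[Vector],
               decoder_step: Callable[[Vector], Vector],
               extra_fc: Callable[[Vector], Vector],
               spec_dim: int) -> List[Vector]:
    """Run the splice -> decoder -> extra layer loop over all time steps."""
    second_prev = [0.0] * spec_dim      # assumed initial "previous" feature
    outputs = []
    for text_feature in text_features:
        # Splice: concatenate the current text feature with the second
        # spectral feature of the previous time step.
        spliced = text_feature + second_prev
        # Decoder stand-in yields the first spectral feature of this step.
        first_spec = decoder_step(spliced)
        outputs.append(first_spec)
        # The additional layer yields the second spectral feature, which is
        # spliced with the text feature at the next time step.
        second_prev = extra_fc(first_spec)
    return outputs


# Toy stand-ins chosen only so the arithmetic can be checked by hand.
def toy_decoder(spliced: Vector) -> Vector:
    return [sum(spliced), float(len(spliced))]


def toy_extra_fc(spec: Vector) -> Vector:
    return [0.5 * s for s in spec]


frames = synthesize([[1.0], [2.0]], toy_decoder, toy_extra_fc, spec_dim=2)
print(frames)  # [[1.0, 3.0], [4.0, 3.0]]
```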
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: 
Ping et al. (US PGPub No. US 2019/0180732 A1) teaches a first loss function 1210 by minimizing a linear combination of a regularized Kullback-Leibler (KL) divergence (between output distributions of the teacher network and student network) and a second loss function 1220 by minimizing a spectrogram frame-level loss (using ground truth dataset 1206); ground-truth mel-spectrogram 1202; frame-level loss 1220 and Kullback-Leibler (KL) divergence loss 
Ward et al. (US Patent No. US 10,210,860 B1) teaches customizing a neural network for a custom dataset, specifically where the output from CNN stack 202 and first fully-connected layer 203 is a set of features that describe acoustic features of the audio input. Ward also teaches that a Recurrent Neural Network (RNN) stack 204 receives these features from first fully-connected stack 203 and produces a third set of features (Ward, Column 10, Lines 3-8).
Wang et al. (Non-Patent Literature Document, Wang, Tacotron: Towards End-to-End Speech Synthesis, 2017, Interspeech, Google, Inc, P1-8) teaches a Tacotron seq2seq model with attention, which includes an encoder, an attention-based decoder, and a post-processing net, and the model takes characters as input and produces spectrogram frames, which are then converted to waveforms (Wang, Page 3, Lines 5-8).
Van den Oord et al. (US Patent No. US 11,080,591 B2) teaches a neural machine translation system 900 including a convolutional neural network encoder 908 that takes as input the source embedding sequence 906 and generates as output an encoded source representation 910 (Van den Oord, Column 21, Lines 13-16).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ANUP CHANDORA whose telephone number is (571)272-4202.  The examiner can normally be reached on Full-time.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.


/ANUP CHANDORA/Examiner, Art Unit 2658

/RICHEMOND DORVIL/Supervisory Patent Examiner, Art Unit 2658