Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 6/21/2019 is being considered by the examiner.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 2, 4-5, 9, 10, 12-13, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Fan et al. (Fan, Yuchen; Qian, Yao; Xie, Feng-Long; Soong, Frank K. (2014): "TTS synthesis with bidirectional LSTM based recurrent neural networks", In INTERSPEECH-2014, 1964-1968.), hereinafter referred to as Fan, in view of Reber et al. (US 20180336882 A1), hereinafter referred to as Reber.
With respect to claim 1, Fan teaches training (p3 col 1 line 6: The weights of DNN are trained by using pairs of input and output features extracted from training data to minimize the errors between the mapped output from the given input and the target output.) a larger model using long short-term memory and multi-layer perceptron feed-forward hidden layer modeling (Abstract ll 13-21: Experimental results show that a hybrid system of DNN and BLSTM-RNN, i.e., lower hidden layers with a feed-forward structure which is cascaded with upper hidden layers with a bidirectional RNN structure of LSTM, can outperform either the conventional, decision tree-based HMM…); identifying and extracting fundamental frequency values for voiced and unvoiced (p3 col 2, Sec 4, 3rd para: In the DNN-based TTS, the input feature vector contains 355 dimensions, where 319 are binary features for categorical linguistic contexts and the rest are numerical linguistic contexts. The output feature vector contains a voiced/unvoiced flag, log F0, LSP, gain, their dynamic counterparts, totally 127 dimensions) regions from the larger model.
Fan does not teach transferring and applying the fundamental frequency values for voiced regions extracted from the larger model in training a smaller model; and applying the smaller model in the text to speech system for real-time speech synthesis output.  
Reber teaches transferring and applying the fundamental frequency values for voiced regions extracted from the larger model in training a smaller model; and applying the smaller model ([0028] Embodiments described herein are directed to a technique for improving training and speech quality of the TTS system 200 having an artificial intelligence, such as a neural network… The front-end subsystem 220 is configured to provide analysis and conversion of text input 202 (i.e., symbols representing alphanumeric characters) into input vectors, each having at least a pitch contour for a phoneme, e.g., a base frequency (referred to as “f.sub.o”) 224, a phenome duration (D) 226, and a phoneme sequence 222 (e.g., a context, ph) that is processed by the back-end subsystem 230.) in the text to speech system for real-time speech synthesis output (p4 col 1 section 4.2 lines 1-5: Objective and subjective measures are used to evaluate the performance of three TTS systems on the test data. Synthesis quality is measured objectively in terms of distortions between natural test utterances of the original speaker and the synthesized speech where oracle state durations).

With respect to claims 2 and 10, Fan and Reber teach all the limitations as in claims 1 and 9 above. Furthermore, Fan teaches wherein the training of the larger model utilizes three feed-forward hidden layers (p4 col 1 Sec 4.1 last para: DNN_B: 3 hidden layers with 1024 nodes per layer).
With respect to claims 4 and 12, Fan and Reber teach all the limitations as in claims 1 and 9 above. Furthermore, Fan teaches wherein the fundamental frequency values are continuous values (p3 col 2 Sec 4, para 3 ll 6-9: Voiced/unvoiced flag is a binary feature that indicates the voicing of the current frame. An exponential decay function is used to interpolate F0 in unvoiced speech regions).
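For illustration only, the continuous F0 contour relied upon for claims 4 and 12 can be sketched as follows. Fan states only that an exponential decay function interpolates F0 in unvoiced regions and does not give its form, so the function name interpolate_f0, the rate parameter, and the specific decay formula below are assumptions made for this sketch, not Fan's disclosed method.

```python
import math

def interpolate_f0(f0, voiced, rate=0.5):
    """Fill unvoiced frames so the F0 track is continuous.

    ASSUMPTION: inside an unvoiced gap, F0 moves from the last voiced
    value toward the next voiced value with an exponential decay term.
    Fan does not specify the formula; this is one plausible reading.
    """
    n = len(f0)
    voiced_idx = [i for i in range(n) if voiced[i]]
    if not voiced_idx:
        return list(f0)  # nothing to anchor the interpolation on
    out = list(f0)
    first, last = voiced_idx[0], voiced_idx[-1]
    # Leading and trailing unvoiced frames copy the nearest voiced value.
    for i in range(first):
        out[i] = f0[first]
    for i in range(last + 1, n):
        out[i] = f0[last]
    # Interior gaps: decay from the previous voiced value toward the next.
    for a, b in zip(voiced_idx, voiced_idx[1:]):
        for i in range(a + 1, b):
            out[i] = f0[b] + (f0[a] - f0[b]) * math.exp(-rate * (i - a))
    return out

track = interpolate_f0(
    [0.0, 100.0, 0.0, 0.0, 200.0, 0.0],
    [False, True, False, False, True, False],
)
```

Under these assumptions every frame of the returned track carries a positive F0 value, while the separate voiced/unvoiced flag continues to record the voicing of each frame.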
With respect to claim 9, Fan teaches training (p3 col 1 line 6: The weights of DNN are trained by using pairs of input and output features extracted from training data to minimize the errors between the mapped output from the given input and the target output.) a first model using feed-forward hidden layer modeling (p4 col 1 Sec 4.1 last para: DNN_B: 3 hidden layers with 1024 nodes per layer); identifying and extracting fundamental frequency values for a plurality of regions of speech input (p3 col 2, Sec 4, 3rd para: In the DNN-based TTS, the input feature vector contains 355 dimensions, where 319 are binary features for categorical linguistic contexts and the rest are numerical linguistic contexts. The output feature vector contains a voiced/unvoiced flag, log F0, LSP, gain, their dynamic counterparts, totally 127 dimensions) using the first model.
Fan does not teach transferring and applying the fundamental frequency values for the specified regions of the plurality of regions extracted from the first model in training a second model; and applying the second model in the text to speech system for real-time speech synthesis output.  
Reber teaches transferring and applying the fundamental frequency values for the specified regions of the plurality of regions extracted from the first model in training a second model; and applying the second model ([0028] Embodiments described herein are directed to a technique for improving training and speech quality of the TTS system 200 having an artificial intelligence, such as a neural network… The front-end subsystem 220 is configured to provide analysis and conversion of text input 202 (i.e., symbols representing alphanumeric characters) into input vectors, each having at least a pitch contour for a phoneme, e.g., a base frequency (referred to as “f.sub.o”) 224, a phenome duration (D) 226, and a phoneme sequence 222 (e.g., a context, ph) that is processed by the back-end subsystem 230.) in the text to speech system for real-time speech synthesis output (p4 col 1 section 4.2 lines 1-5: Objective and subjective measures are used to evaluate the performance of three TTS systems on the test data. Synthesis quality is measured objectively in terms of distortions between natural test utterances of the original speaker and the synthesized speech where oracle state durations).
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify the teachings of Fan to include the teachings of Reber, motivation being to improve training and speech quality of a text-to-speech (TTS) system using artificial intelligence (Reber [0005]).
With respect to claim 17, Fan does not teach wherein the first model is larger than the second model.
Reber teaches wherein the first model is larger than the second model ([0006] Unlike prior systems that employ large and complex neural networks to implement direct input vector-to-generated speech from hundreds of hours of speech samples, the technique described herein substantially reduces neural network complexity and processing requirements by focusing efforts on capturing errors and inaccuracies in the generated speech from the pre-existing knowledgebase in the neural network. That is, instead of attempting to capture in a neural network how to generate speech directly from sound samples as in the prior art, the technique captures an error signal that is applied to previously generated speech from the pre-existing knowledgebase so as to correct imperfections (e.g., reduce perceived flaws) in the generated speech. As such, a significantly smaller neural network may be deployed in the TTS.)
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify the teachings of Fan to include the teachings of Reber, motivation being to use error signals from previously generated speech in a database to reduce the size of the network (Reber [0006]).
Claims 3 and 11 are rejected under 35 U.S.C. 103 as being unpatentable over Fan, in view of Reber and further in view of Li et al. (R. Li, Z. Wu, Y. Huang, J. Jia, H. Meng and L. Cai, "Emphatic Speech Generation with Conditioned Input Layer and Bidirectional LSTMS for Expressive Speech Synthesis," 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 2018, pp. 5129-5133, doi: 10.1109/ICASSP.2018.8461748.), hereinafter referred to as Li.
Regarding claims 3 and 11, Fan and Reber do not specifically recite wherein the three feed-forward hidden layers comprise one or more of: 1024 nodes and a long short-term memory hidden layer comprising 512 nodes.
Li recites wherein the three feed-forward hidden layers comprise one or more of: 1024 nodes and a long short-term memory hidden layer comprising 512 nodes (p3 Col 4 Sec 3.1 2nd to last para last 4 lines: Each DNN based model contains 4 hidden layers, 1024 nodes per layer. Each BLSTM based model contains 4 hidden layers, 512 nodes per layer (256 forward nodes and 256 backward nodes).)
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify the teachings of Fan and Reber to include the teachings of Li, motivation being to generate emphatic speech with bidirectional LSTM networks (Li Section 1 para 4).
Claims 5 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Fan, in view of Reber and further in view of Toth et al. (B. P. Tóth and T. G. Csapó, "Continuous fundamental frequency prediction with deep neural networks," 2016 24th European Signal Processing Conference (EUSIPCO), Budapest, Hungary, 2016, pp. 1348-1352, doi: 10.1109/EUSIPCO.2016.7760468. (Year: 2016)), hereinafter referred to as Toth.
Regarding claims 5 and 13, Fan and Reber do not specifically recite wherein the zero and undefined values for unvoiced regions are not applied.
Toth recites wherein the zero and undefined values for unvoiced regions are not applied (p1 col 2 para 3: Traditionally, using standard pitch tracking methods in vocoders, the F0 contour is discontinuous at voiced-unvoiced (V-UV) and unvoiced-voiced (UV-V) boundaries, because F0 is not defined in unvoiced sounds).
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify the teachings of Fan and Reber to include the teachings of Toth, motivation being to generate more natural synthesized speech (Toth p1 col 2 para 3).
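For illustration only, the limitation of claims 5 and 13 (zero and undefined values for unvoiced regions are not applied) can be pictured as a training error computed over voiced frames alone. The sketch below is an assumption about one way such a limitation might be realized; the function names voiced_targets and masked_mse are hypothetical and are not taken from Toth, Fan, Reber, or the application.

```python
def voiced_targets(f0, voiced):
    """Keep (frame index, F0) pairs for voiced frames only.

    ASSUMPTION: unvoiced frames hold a placeholder (zero or undefined)
    and are simply dropped rather than used as training targets.
    """
    return [(i, value) for i, (value, flag) in enumerate(zip(f0, voiced)) if flag]

def masked_mse(predicted, f0, voiced):
    """Mean squared error over voiced frames; unvoiced frames contribute nothing."""
    pairs = voiced_targets(f0, voiced)
    if not pairs:
        return 0.0  # no voiced frames, so no F0 error is defined
    return sum((predicted[i] - value) ** 2 for i, value in pairs) / len(pairs)

# Frames 0 and 2 are unvoiced; their placeholder zeros are ignored.
loss = masked_mse(
    [50.0, 110.0, 50.0, 190.0],
    [0.0, 100.0, 0.0, 200.0],
    [False, True, False, True],
)
```

In this sketch the predictions at the unvoiced frames have no effect on the error, which is the behavior the claim language describes.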
Claims 6, 7, 14 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Fan, in view of Reber and further in view of Zhu et al. (Zhu, Pengcheng; Xie, Lei; Chen, Yunlin (2015): "Articulatory movement prediction using deep bidirectional long short-term memory based recurrent neural networks and word/phone embeddings", In INTERSPEECH-2015, 2192-2196) hereinafter referred to as Zhu.
Regarding claims 6 and 14, Fan and Reber do not specifically recite wherein the training of the smaller model utilizes three feed-forward hidden layers.
Zhu recites wherein the training of the smaller model utilizes three feed-forward hidden layers (p4 col 1 Sec 4.2 ll 1-5: We tested the articulatory inversion performance of a set of network topologies with different hidden layers (F: feed forward, B: BLSTM) and node sizes (64, 128, 256). Results show that the 3-hidden-layer structures outperform the 1- and 2-hidden layer structures in general.)
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify the teachings of Fan and Reber to include the teachings of Zhu, motivation being that 3-hidden-layer structures outperform 1- and 2-hidden-layer structures (Zhu Section 4.2 ll 3-5).
Regarding claims 7 and 15, Fan and Reber do not specifically recite wherein the three feed-forward hidden layers comprise one or more of: 128 nodes and a long short-term memory hidden layer comprising 256 nodes.
Zhu recites wherein the three feed-forward hidden layers comprise one or more of: 128 nodes and a long short-term memory hidden layer comprising 256 nodes (p4 col 1 Sec 4.2 ll 1-5: We tested the articulatory inversion performance of a set of network topologies with different hidden layers (F: feed forward, B: BLSTM) and node sizes (64, 128, 256).)
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify the teachings of Fan and Reber to include the teachings of Zhu, motivation being that BLSTM networks outperform feedforward networks (Zhu Abstract).
Claims 8 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Fan, in view of Reber and further in view of Qian et al. (Y. Qian, Y. Fan, W. Hu and F. K. Soong, "On the training aspects of Deep Neural Network (DNN) for parametric TTS synthesis," 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 2014, pp. 3829-3833, doi: 10.1109/ICASSP.2014.6854318.) hereinafter referred to as Qian.
Regarding claims 8 and 16, Fan teaches the limitations as recited for claims 1 and 9. Reber further teaches transferring and applying the fundamental frequency values as recited for claims 1 and 9. Fan and Reber do not specifically recite applying a hyperbolic tangent activation function in the lower layers and a linear activation function at the output layer.
Qian recites applying a hyperbolic tangent activation function in the lower layers and a linear activation function at the output layer (page 4 col 1, 2nd para: The sigmoid and hyperbolic tangent activation functions are used for the hidden layers of DNN, while the linear activation function is employed for the output layer.)
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify the teachings of Fan and Reber to include the teachings of Qian, motivation being that the hyperbolic tangent activation function allows for faster convergence (Qian page 4 col 1, 2nd para).
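For illustration only, the arrangement recited in claims 8 and 16 (hyperbolic tangent in the lower layers, linear activation at the output layer) can be sketched as a plain forward pass. The function names dense and forward and the toy weights are assumptions introduced for this sketch; they do not reproduce the network of Qian or of the application.

```python
import math

def dense(x, weights, bias):
    """Affine layer: one output per weight row, with no activation applied."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def forward(x, hidden_layers, output_layer):
    """Apply tanh after each hidden layer and leave the output layer linear."""
    h = x
    for weights, bias in hidden_layers:
        h = [math.tanh(z) for z in dense(h, weights, bias)]
    out_weights, out_bias = output_layer
    return dense(h, out_weights, out_bias)  # linear output: not squashed

# Hypothetical toy network: one 2-node hidden layer and a single output.
hidden = [([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])]
output = ([[2.0, 3.0]], [5.0])
at_zero = forward([0.0, 0.0], hidden, output)
saturated = forward([10.0, 10.0], hidden, output)
```

Because the output layer is linear, the network output is not confined to the (-1, 1) range of the hyperbolic tangent, which suits unbounded regression targets such as log F0.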
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ATHAR N PASHA whose telephone number is (408)918-7675.  The examiner can normally be reached Monday-Thursday and alternate Fridays, 7:30-4:30 PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/A.N.P./ Examiner, Art Unit 2657
/Paras D Shah/ Primary Examiner, Art Unit 2659
03/23/2021