Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Claims 1-20 are pending. Claims 1, 8 and 15 are independent.
This Application was published as U.S. 2021/0174781.
            Apparent priority: 17 January 2019.

	The example of the one-hot feature vector in paragraphs [0025]-[0027] of the published Application drops the word “possesses.”  This may be an artifact of translation and the Applicant may correct that as supported by the original Chinese text.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1, 8, and 15 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Wu (U.S. 2020/0051583) (filed August 8, 2018).
Regarding Claim 1, Wu teaches:
1. A text-based speech synthesis method, comprising: 
obtaining target text to be recognized; [Wu, Figure 1, “input text 104.”  “[0017] The speech synthesis system 100 includes an input/output subsystem 102 configured to receive input text 104 as input and to provide speech 106 as output. The input text 104 includes a sequence of characters in a particular natural language, e.g., English, Spanish, or French. The sequence of characters can include letters, numbers, punctuation marks, and/or other special characters. …”  “[0018] The input/output subsystem 102 can include an optical character recognition (OCR) unit to convert images of typed, handwritten, or printed text into machine-encoded text….”] [Figure 5, 510: “Receive an input text.”]
discretely characterizing each character in the target text to generate a feature vector corresponding to each character; [Wu, Figure 1, “input subsystem 102” to “TTS model 108” generating “Mel-Frequency Spectrogram.”  “[0019] The input/output subsystem 102 is also configured to convert each character in the sequence of characters in the input text 104 into a one-hot vector and embed each one-hot vector in a continuous vector….” “[0021] In particular, an encoder neural network 110 of the TTS model 108 is configured to receive the character embeddings from the input/output subsystem 102 and generate a fixed-length context vector for each mel-frequency spectrogram that a decoder neural network 114 will later generate….”  “[0023] … That is, the attention network can generate a fixed-length context vector for each frame of a mel-frequency spectrogram that a decoder neural network 114 will later generate. ….” ] [Figure 5, 520.  “Process the input text to generate an input representation of a respective portion of the input text.”]
inputting the feature vector into a pre-trained spectrum conversion model to obtain a Mel-spectrum corresponding to each character in the target text output by the spectrum conversion model; and [Wu, Figure 1, “input subsystem 102” to “TTS model 108” generating “Mel-Frequency Spectrogram.”  “[0024] The decoder neural network 114 is configured to receive as input the fixed-length context vectors and generate, for each fixed-length context vector, a corresponding frame of a mel-frequency spectrogram….” ] [Figure 5, 530.  “process the input representation to generate a mel-frequency spectrum.”]
converting the Mel-spectrum into speech to obtain speech corresponding to the target text. [Wu, Figure 1, “Speech 106” as output.  “[0025] Finally, the TTS model 108 includes a vocoder network 116. The vocoder network 116 can be any network that is configured to receive mel-frequency spectrograms and generate audio output samples based on the mel-frequency spectrograms….”] [Figure 5, 540, 550.  “[0072] The system generates a probability distribution over a plurality of possible audio output samples for the time step by processing the mel-frequency spectrogram for the time step using a vocoder neural network (540)….”  “[0073] Finally, the system selects the audio sample for the time step from the plurality of possible audio samples in accordance with the probability distribution (550). Selecting one of the possible audio samples in accordance with the probability distribution for the time step can involve sampling from the probability distribution.”]
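For illustration only, the “one-hot vector” per character cited from Wu [0019] can be sketched as follows. This sketch is not taken from Wu or from the Application; the function names `build_vocab` and `one_hot_encode` and the sample text are hypothetical, and the learned embedding step that Wu applies afterward is omitted.

```python
def build_vocab(text):
    """Map each distinct character to an index (sorted for determinism)."""
    return {ch: i for i, ch in enumerate(sorted(set(text)))}


def one_hot_encode(text, vocab):
    """Return one vector per character: all zeros except a 1 at the character's index."""
    vectors = []
    for ch in text:
        vec = [0] * len(vocab)
        vec[vocab[ch]] = 1
        vectors.append(vec)
    return vectors


# Hypothetical sample input; a real system would use a fixed, language-wide vocabulary.
vocab = build_vocab("hello")
vectors = one_hot_encode("hello", vocab)
```

Each character yields exactly one vector, and repeated characters (the two “l” characters here) yield identical vectors.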

Claim 8 is a system claim with limitations corresponding to the limitations of Claim 1 and is rejected under similar rationale.  Additionally:
8. A computer device, comprising: a memory, a processor, and a computer program stored in the memory and capable of running on the processor, wherein the computer program, when executed by the processor, causes the processor to implement: [Wu, “[0080] Computers suitable for the execution of a computer program can be based on general or special-purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a central processing unit for executing instructions and one or more memory devices for storing instructions and data. The central processing unit and the memory can be supplemented by, or incorporated in, special-purpose logic circuitry.”]
…

Claim 15 is a computer-readable medium (CRM) claim with limitations corresponding to the limitations of Claim 1 and is rejected under similar rationale.  Additionally:
16. The non-transitory computer-readable storage medium as claimed in claim 15, wherein the computer program, when executed by the processor, further causes the processor to implement: [Wu, “21. One or more non-transitory computer-readable storage media storing instructions that are operable, when executed by one or more computers, to cause the one or more computers to perform operations comprising ….”]
…
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 2-4, 9-11, and 16-18 are rejected under 35 U.S.C. 103 as being unpatentable over Wu in view of Khoury (U.S. 2019/0333521).
Regarding Claim 2, Wu teaches or suggests:
2. The method as claimed in claim 1, further comprising before inputting the feature vector into the pre-trained spectrum conversion model to obtain the Mel-spectrum corresponding to each character in the target text output by the spectrum conversion model: [Wu teaches training of the model by a “teacher-forcing” method: “[0027] The encoder neural network 110 and decoder neural network 114 are trained together using a maximum likelihood training procedure. That is, during training, the decoder neural network 114 receives as input a correct output from the previous time step. This is known as teacher-forcing. The training data includes sample input texts with known mel-frequency spectrograms. The vocoder network 116 is trained separately.”]
obtaining a preset number of training text and matching speech corresponding to the training text; [Wu, “[0027] … The training data includes sample input texts with known mel-frequency spectrograms….”  “[0029] In summary, the speech synthesis system 100 can generate speech from text using neural networks trained on sample input texts and corresponding mel-frequency spectrograms of human speech alone….”]
discretely characterizing the training text to obtain a feature vector corresponding to each character in the training text; [Wu, the one-hot vector teaches a “feature vector” for “each character.”  “[0019] The input/output subsystem 102 is also configured to convert each character in the sequence of characters in the input text 104 into a one-hot vector and embed each one-hot vector in a continuous vector. …”   Figure 5, 520.  “[0070] The system processes the sequence of characters in the input text to generate an input representation of a respective portion of the sequence of characters for each of a plurality of time steps (520). The input representation can be a fixed-length context vector for the time step. Such a fixed-length context vector can be generated by processing the input character sequence using an encoder neural network, e.g., the encoder neural network 110 of FIG. 1.”]
inputting the feature vector corresponding to each character in the training text into a spectrum conversion model to be trained to obtain a Mel-spectrum output by the spectrum conversion model to be trained; and [Wu, Figure 1, “Mel Frequency Spectrum” and Figure 5, 530.  The steps of training and execution of the trained model are the same.  “[0029] In summary, the speech synthesis system 100 can generate speech from text using neural networks trained on sample input texts and corresponding mel-frequency spectrograms of human speech alone….”] 
when an error between the Mel-spectrum output by the spectrum conversion model to be trained and a Mel-spectrum corresponding to the matching speech is less than or equal to a preset threshold, obtaining the trained spectrum conversion model. [Wu, this is a standard training step and is suggested by teaching of the “gradient value”:  “[0004] …  Training a neural network involves continually performing a forward pass on the input, computing gradient values, and updating the current values of the set of parameters for each layer. Once a neural network is trained, the final set of parameters can be used to make predictions in a production system.”  “[0053] Generally, the system can perform the training to determine the trained values of the parameters using conventional supervised learning techniques, e.g., a stochastic gradient descent with backpropagation based technique….”  The Encoder/Decoder (110, 114) which generate the mel-frequency spectrum from text are trained together.  The vocoder 116 is trained separately.  This Claim ends with training the encoder/decoder which output the mel-spectrogram.]
While the updating of any model is based on a comparison of the output, cost, loss, or difference with a threshold, Wu does not expressly teach comparing the output against a threshold.
Khoury teaches:
when an error between the Mel-spectrum output by the spectrum conversion model to be trained and a Mel-spectrum corresponding to the matching speech is less than or equal to a preset threshold, obtaining the trained spectrum conversion model. [Khoury, Figure 2A shows the training and Figure 2B shows the use of the trained neural network:  “[0049] The training system 200A in FIG. 2A includes an input 210, an acoustic channel simulator (also referenced as a channel-compensation device or function) 220, a feed forward convolutional neural network (CNN) 230, a system analyzer 240 for extracting handcrafted features, and a loss function 250. A general overview of the elements of the training system 200A is provided here, followed by details of each element. …The CNN 230 is configured to provide features (coefficients) 232 corresponding to the recognition speech signal. In parallel, the signal analyzer 240 extracts handcrafted acoustic features 242 from the recognition speech signal 212. The loss function 250 utilizes both the features 232 from the CNN 230 and the handcrafted acoustic features 242 from the signal analyzer 240 to produce a loss result 252 and compare the loss result to a predetermined threshold. If the loss result is greater than the predetermined threshold T, the loss result is used to modify connections within the CNN 230, and another recognition speech signal or utterance is processed to further train the CNN 230. Otherwise, if the loss result is less than or equal to the predetermined threshold T, the CNN 230 is considered trained, and the CNN 230 may then be used for providing channel-compensated features to the speaker recognition subsystem 20. (See FIG. 2B, discussed in detail below.).”]
Wu and Khoury pertain to training a neural network for the purpose of recognizing an input vector. Although Khoury pertains to speech recognition and Wu to speech synthesis, the steps of training a neural network are the same once the input is converted to a vector of features. It would have been obvious to combine Khoury's comparison of the deviation of the output from the goal against a threshold, which determines whether the model is adequately trained, with the model training of Wu, which performs the same steps impliedly.  In reality, there is no need for Khoury because Wu inherently or impliedly teaches the steps of training in Claims 2-4. However, to expedite prosecution a second reference is added.
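For illustration only, the threshold-based stopping rule quoted from Khoury [0049] (update the model while the loss exceeds a threshold T, and treat the model as trained once the loss is less than or equal to T) can be sketched as follows. This sketch assumes a hypothetical one-parameter linear model with a mean-squared-error loss in place of the networks of Wu or Khoury; the function name, learning rate, and data are assumptions.

```python
def train_until_threshold(inputs, targets, threshold=1e-4, lr=0.1, max_steps=10_000):
    """Gradient descent on y = w * x that stops once the loss is <= threshold."""
    w = 0.0
    for step in range(max_steps):
        preds = [w * x for x in inputs]
        # Mean-squared error between model output and the known target.
        loss = sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(inputs)
        if loss <= threshold:
            # Loss at or below the threshold: the model is considered trained.
            return w, loss, step
        # Loss above the threshold: use it to update the weight, then repeat.
        grad = sum(2 * (p - t) * x for p, t, x in zip(preds, targets, inputs)) / len(inputs)
        w -= lr * grad
    return w, loss, max_steps


# Hypothetical training pairs generated by y = 2x.
w, loss, steps = train_until_threshold([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

The `max_steps` cap is an added safeguard, not something recited in either reference, so that the loop terminates even if the threshold is never reached.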

Regarding Claim 3, Wu teaches and suggests:
3. The method as claimed in claim 2, wherein inputting the feature vector corresponding to each character in the training text into the spectrum conversion model to be trained to obtain the Mel-spectrum output by the spectrum conversion model to be trained comprises: 
coding the training text through the spectrum conversion model to be trained to obtain a hidden state sequence corresponding to the training text, [Wu, Figure 1, “Encoder Neural Network 110.”  Noting that the training and execution are done by the same process.  The hidden layers of a neural network each generate “a hidden state sequence” corresponding to the input which in this case is “the training text.”   “3. …  processing, by an encoder neural network, the input character sequence to generate a feature representation of the input character sequence; and processing, by an attention network, the feature representation, to generate a fixed-length context vector for the time step.”  “[0002] …  Neural networks typically include one or more hidden layers situated between an input layer and an output layer. ….”]
wherein the hidden state sequence comprises at least two hidden nodes; [Wu, Neural Network encoders have several layers and Wu expressly mentions one or more hidden layers.  Each layer generally has many nodes/states.  Additionally, the “context” taught by Wu and the “local structure of the input character sequence around a particular character in the input sequence” which refers to “context” indicates at least one node other than the node representing the input to provide “context” for it.  “[0032] The architecture 200 also includes an LSTM subnetwork 220 with two LSTM layers. At each time step, the LSTM subnetwork 220 receives a concatenation of the output of the pre-net 210 and a fixed-length context vector 202 for the time step….”   “4. The method of claim 3, wherein the encoder neural network comprises a convolutional subnetwork and a bidirectional long short-term memory (LSTM) layer, and wherein the feature representation is a sequential feature representation that represents a local structure of the input character sequence around a particular character in the input character sequence.”  In the hidden middle layers of a neural network such as the encoder and decoder of Figure 1, the nodes are hidden.]
according to a weight of a hidden node corresponding to each character, weighting the hidden node to obtain a semantic vector corresponding to each character in the training text and decoding the semantic vector corresponding to each character, and [Wu, Figure 2. The input is a vector and the process undertaken by a neural network such as “pre-net 210” in Figure 2 or the “LSTM subnetwork 220” includes assigning weights to nodes of the neural network which are the elements of the vector input from each layer to the next.  “[0004] Each layer generates one or more outputs using the current values of a set of parameters for the layer. Training a neural network involves continually performing a forward pass on the input, computing gradient values, and updating the current values of the set of parameters for each layer. Once a neural network is trained, the final set of parameters can be used to make predictions in a production system.”  The set of parameters that are updated are the “weights” which are not expressly called by that name.  Each character is represented by a one-hot vector (like the instant Application) and these vectors are combined in a continuous vector which corresponds to the sequence of characters.  “[0019] The input/output subsystem 102 is also configured to convert each character in the sequence of characters in the input text 104 into a one-hot vector and embed each one-hot vector in a continuous vector. That is, the input/output subsystem 102 can represent each character in the sequence as a one-hot vector and then generate an embedding, i.e., a vector or other ordered collection of numeric values, of the character.”  The vector with weighted nodes is called the semantic vector by the Claim.  As provided above, the process of training and executing the trained model are similar.  
The trained model applies a series of steps to the input and the process of training goes through the same steps to obtain the appropriate parameters to use at each step.]
outputting the Mel-spectrum corresponding to each character. [Wu, Figure 2, “Mel-Frequency Spectrum 204” corresponding to the input “Fixed-Length Context Vector 202” is the output.]
Applying weights to the nodes and adjusting the weights are well-known aspects of neural networks but the weights are not expressly mentioned in Wu.
Khoury teaches:
according to a weight of a hidden node corresponding to each character, weighting the hidden node to obtain a semantic vector corresponding to each character in the training text and decoding the semantic vector corresponding to each character, and [Khoury teaches that modifying and updating the weights of the nodes of the neural network is part of the training process: “[0064] … As noted above, the loss result 252 may be used to update connection weights for nodes of the first CNN 230 when the loss result is greater than a predetermined threshold. If the loss result is less than or equal to the threshold, the training is complete….”]
Rationale for combination as provided for Claim 2.
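For illustration only, one common way that weighted hidden states are combined into a single fixed-length vector, in the manner of the attention network and “fixed-length context vector” quoted from Wu above, can be sketched as follows. This is an assumed, simplified reading and not the disclosed implementation of Wu, Khoury, or the Application; the function names, the softmax normalization, and the sample values are hypothetical.

```python
import math


def softmax(scores):
    """Normalize raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(hidden_states, scores):
    """Weight each hidden state by its normalized score and sum them."""
    weights = softmax(scores)
    dim = len(hidden_states[0])
    return [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]


# Two hypothetical hidden states with equal scores give their plain average.
ctx = context_vector([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

The output has the same length regardless of how many hidden states are supplied, which is what makes it fixed-length.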

Regarding Claim 4, Wu teaches the general process of training a neural network in [0002]-[0004].  Note that weighting is inherent in Wu's teaching of a neural network because a neural network works by adjusting the weights applied to the nodes until it gets them right.  Khoury was added because it uses the word “weight” and goes through the process of training a neural network in more detail.  Wu teaches that it generates one-hot vectors corresponding to characters and puts them together in an embedding vector, and the previous Claim defines this vector with weighted nodes as the semantic vector.  The encoder/decoder (110/114) of Figure 1 converts the “semantic vector” of the Claim into the “Mel-Frequency Spectrogram” corresponding to the input text 104.  The steps of updating the weights and applying the updated weights to the nodes, which are part of the training of a neural network and also part of its operation, are not express in Wu.
Khoury teaches:
4. The method as claimed in claim 2, further comprising after inputting the feature vector corresponding to each character in the training text into the spectrum conversion model to be trained to obtain the Mel-spectrum output by the spectrum conversion model to be trained: [Khoury, Figure 4, “[0065] FIG. 4 is a flowchart for a training operation or method 400 for training a channel-compensated feed forward convolutional neural network (e.g., 230) according to exemplary embodiments of the present disclosure….”]
when the error between the Mel-spectrum output by the spectrum conversion model to be trained and the Mel-spectrum corresponding to the matching speech is greater than the preset threshold, updating the weight of each hidden node; [Khoury, Figure 4, S450.  “[0066] …In operation S450, a loss result is calculated from the channel-compensated features and the handcrafted features….”]
weighting the hidden node whose weight is updated to obtain a semantic vector corresponding to each character in the training text; [Khoury, Figure 4, S460 and 470.  “[0066] … However, if the calculated loss is greater than the threshold, the calculated loss is used to modify connection weights (S470) of the first … CNN ….”]
decoding the semantic vector corresponding to each character, and outputting the Mel-spectrum corresponding to each character; and [Khoury, Figure 4, this corresponds to the loop from 470 where the weights of the CNN are modified back to another round of calculation of the features by the CNN at S430.]
when the error between the Mel-spectrum corresponding to each character and the Mel-spectrum corresponding to the matching speech is less than or equal to the preset threshold, stopping the updating the weight of each hidden node, and obtaining the trained spectrum conversion model.  [Khoury, Figure 4, S460. “[0068] Those having skill in the art will recognize that the threshold comparison at operation S460 may alternatively consider training complete when the calculated loss is less than the threshold, and incomplete when the calculated loss is greater than or equal to the threshold.”]
Rationale for combination as provided for Claim 2.

Claim 9 is a system claim with limitations corresponding to the limitations of Claim 2 and is rejected under similar rationale.
Claim 10 is a system claim with limitations corresponding to the limitations of Claim 3 and is rejected under similar rationale.
Claim 11 is a system claim with limitations corresponding to the limitations of Claim 4 and is rejected under similar rationale.

Claim 16 is a CRM claim with limitations corresponding to the limitations of Claim 2 and is rejected under similar rationale.
Claim 17 is a CRM claim with limitations corresponding to the limitations of Claim 3 and is rejected under similar rationale.
Claim 18 is a CRM claim with limitations corresponding to the limitations of Claim 4 and is rejected under similar rationale.

Claims 5, 12, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Wu in view of Kim (U.S. 20200082807).
Claims 5, 6, and 7 have the same limitation but depend from different Claims 1, 2, and 3.
Regarding Claim 5, Wu, Figure 1, the Mel-Frequency Spectrogram is input to the “Autoregressive Neural Network 116”:  “[0025] Finally, the TTS model 108 includes a vocoder network 116. The vocoder network 116 can be any network that is configured to receive mel-frequency spectrograms and generate audio output samples based on the mel-frequency spectrograms…. Alternatively, the vocoder network 116 can be an autoregressive neural network.”
Kim teaches:
5. The method as claimed in claim 1, wherein converting the Mel-spectrum into speech to obtain the speech corresponding to the target text comprises: 
performing an inverse Fourier transform on the Mel-spectrum through a vocoder to convert the Mel-spectrum into a speech waveform signal in a time domain to obtain the speech. [Kim, Figure 8, “[0104] Through the above-described process, a speech may be generated for each unit of the text. According to an embodiment, the text-to-speech synthesis system may acquire a speech of a mel-spectrogram for the whole text by concatenating mel-spectrograms for the time-steps in chronological order. The speech of the mel-spectrogram for the whole text may be output to a vocoder 830.”   “[0105] … Thus, the CNN or RNN of the vocoder 830 may output a linear-scale spectrogram. For example, the linear-scale spectrogram may include a magnitude spectrogram. As shown in FIG. 8, the vocoder 830 may predict the phase of the spectrogram through a Griffin-Lim algorithm. The vocoder 830 may output a speech signal in the time domain by using an inverse short-time Fourier transform.”]
Wu and Kim pertain to speech synthesis from text by first converting the input text to a mel-frequency spectrogram and then using a vocoder to convert the spectrogram to audio. It would have been obvious to replace the vocoder of Wu, which is an autoregressive neural network vocoder, with the vocoder of Kim, which converts the mel-frequency spectrogram to speech using an inverse short-time Fourier transform (ISTFT), because Wu itself states that “vocoder network 116 can be any network that is configured to receive mel-frequency spectrograms and generate audio output samples.”  This combination falls under simple substitution of one known element for another to obtain predictable results, or use of a known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
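For illustration only, the inverse Fourier transform step that returns a frequency-domain frame to a time-domain signal can be sketched as follows. This sketch is not the vocoder of Kim or of the Application: it applies a plain discrete Fourier transform and its inverse to a single hypothetical frame whose phase is known, whereas Kim [0105] describes first obtaining a linear-scale spectrogram and estimating the phase with a Griffin-Lim algorithm before applying an inverse short-time Fourier transform across frames. The function names and the sample frame are assumptions.

```python
import cmath


def dft(frame):
    """Forward discrete Fourier transform of one time-domain frame."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def inverse_dft(spectrum):
    """Inverse discrete Fourier transform; returns real time-domain samples."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


# Hypothetical four-sample frame; the round trip recovers it up to rounding error.
frame = [0.0, 1.0, 0.0, -1.0]
recovered = inverse_dft(dft(frame))
```

The round trip is exact here only because the complex spectrum (magnitude and phase) is retained; a magnitude-only spectrogram discards the phase, which is why a phase-estimation step is described in Kim.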

Claim 12 is a system claim with limitations corresponding to the limitations of Claim 5 and is rejected under similar rationale.
Claim 19 is a CRM claim with limitations corresponding to the limitations of Claim 5 and is rejected under similar rationale.

Claims 6-7, 13-14, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Wu and Khoury in view of Kim.
Claims 5, 6, and 7 have the same limitation but depend from different Claims 1, 2, and 3.  
Accordingly, the limitations of Claim 6 and 7 are mapped to Kim as was the limitation of Claim 5 above.  
Additionally, the rationale for combination of Wu/Khoury with Kim remains similar to the rationale for combination of Wu with Kim.  Khoury was combined for the details of training and does not teach the conversion by IFT or ISTFT.  The Wu/Khoury combination is combined with Kim and the vocoder is in Wu.

6. The method as claimed in claim 2, wherein converting the Mel-spectrum into speech to obtain the speech corresponding to the target text comprises: 
performing an inverse Fourier transform on the Mel-spectrum through a vocoder to convert the Mel-spectrum into a speech waveform signal in a time domain to obtain the speech. 
7. The method as claimed in claim 3, wherein converting the Mel-spectrum into speech to obtain the speech corresponding to the target text comprises: 
performing an inverse Fourier transform on the Mel-spectrum through a vocoder to convert the Mel-spectrum into a speech waveform signal in a time domain to obtain the speech.

Claim 13 is a system claim with limitations corresponding to the limitations of Claim 6 and is rejected under similar rationale.
Claim 14 is a system claim with limitations corresponding to the limitations of Claim 7 and is rejected under similar rationale.

Claim 20 is a CRM claim with limitations corresponding to the limitations of Claim 6 and is rejected under similar rationale.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Wu also includes:
 “[0022] In some implementations, the encoder neural network 110 can include one or more convolutional layers followed by a bi-directional long short-term memory ("LSTM") layer….   The bi-directional LSTM layer can be configured to process the hidden features generated by the final convolutional layer to generate a sequential feature representation of the sequence of characters. A sequential feature representation represents a local structure of the sequence of characters around a particular character. A sequential feature representation may include a sequence of feature vectors.”
“[0027] … That is, during training, the decoder neural network 114 receives as input a correct output from the previous time step….”  

Raitio (U.S. 20170345411) teaches:
1. A text-based speech synthesis method, [Raitio, Figure 5, “text-to-speech module 500”.]
comprising: 
obtaining target text to be recognized; [Raitio, Figure 5, “text-to-speech module 500” receiving “text” as input.]
discretely characterizing each character in the target text to generate a feature vector corresponding to each character; [Raitio, Figure 5, “text analysis module 502.”  “[0159] … Text analysis module 502 is configured to convert the text into a sequence of target units representing the spoken pronunciation of the text….”]
inputting the feature vector into a pre-trained spectrum conversion model to obtain a Mel-spectrum corresponding to each character in the target text output by the spectrum conversion model; and [Raitio, Figure 5, “unit selection module 504.”  “[0159] … The sequence of target units with corresponding linguistic features is forwarded to unit-selection module 504.”  “[0161] Unit-selection module 504 is configured to pre-select suitable speech segments from speech segment database 508 that best match the sequence of target units….”  “[0160] Speech segment database 508 includes a plurality of speech segments derived from recorded speech and a corresponding corpus of text. Each speech segment includes linguistic features and acoustic features (e.g., spectral shape, pitch, duration, Mel-frequency cepstral coefficients, fundamental frequency, etc.). The plurality of speech segments are indexed and stored in speech segment database 508 according to the linguistic features and acoustic features….”]
converting the Mel-spectrum into speech to obtain speech corresponding to the target text. [Raitio, Figure 5, “speech synthesizer module 510” generating “speech waveform” as output.  “[0165] Speech synthesizer module 510 is configured to receive the selected subset of pre-selected candidate speech segments from unit-selection module 504 and join the sequence of speech segments into a continuous speech waveform….”  The speech units include their acoustic features that are MFCC. ]

Raitio teaches with respect to neural networks:  “[0182] Each layer of mixture density network 900 includes multiple units. The units are the basic computational elements of mixture density network 900 and are referred to as dimensions, neurons, or nodes. As shown in FIG. 9, input layer 902 includes input units 908, hidden layers 906 include hidden units 910, and output layer 904 includes output units 912. Hidden layers 906 each include any number of hidden units 910. In a specific example, hidden layers 906 each include 512 hidden units 910. The units are interconnected by connections 914. Specifically, connections 914 connect the units of one layer to the units of a subsequent layer. Further, each connection 914 is associated with a weighting value and a bias followed by a nonlinear activation function. For simplicity, the weighting values and biases are not shown in FIG. 9.”

Arik (U.S. 20180336880) also teaches using short-time Fourier transform and its inverse (which is claimed) for conversion of spectrogram to waveform: “[0111] The original Tacotron implementation in Wang et al. uses the Griffin-Lim algorithm to convert spectrograms to time-domain audio waveform by iteratively estimating the unknown phases. Estimation of the unknown phases may be done by repeatedly converting between frequency and time domain representations of the signal using the short-time Fourier transform and its inverse, substituting the magnitude of each frequency component to the predicted magnitude at each step. …”  And it does teach:  “[0138] … In one or more embodiments, a mel-frequency cepstral coefficients (MFCCs) computed after resampling the input to a constant sampling frequency was used….”  

Any inquiry concerning this communication or earlier communications from the examiner should be directed to FARIBA SIRJANI whose telephone number is (571)270-1499. The examiner can normally be reached on 9 to 5, M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Desir can be reached on 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Fariba Sirjani/
Primary Examiner, Art Unit 2659