DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Balakrishnan et al. (US 2022/0005457).

Claim 1,
Balakrishnan teaches a method for training a speech spectrum generation model, the method comprising: inputting a first text sequence into the speech spectrum generation model to generate an analog spectrum sequence corresponding to the first text sequence, and to obtain a first loss value of the analog spectrum sequence according to a preset loss function; inputting the analog spectrum ([Fig. 8] [0019] [0072-0081] training a text-to-speech (TTS) system (e.g., neural network) and an automatic speech recognition (ASR) system (e.g., neural network) using a generative adversarial network technique; training system receiving a first text sample as input to a TTS neural network; the TTS neural network may generate a first audio sample representing the first text sample (the decoded mel spectrogram may be the audio sample representing the text sample); the ASR neural network may generate a second text sample representing the first audio sample; the decoded mel spectrogram from the TTS neural network 125 may be provided to the ASR neural network 130; the training system may calculate a first loss; the training system may calculate a second loss; the training system may train the TTS neural network by adjusting parameters of the TTS neural network based at least in part on the first and second losses; the training system may feed the first and second losses back into the TTS neural network 125; the TTS neural network 125 may adjust parameters of the text encoder 215 and the audio decoder 220 to minimize the loss values from the loss functions on future executions of the TTS neural network 125).

Claims 8 and 15 contain subject matter similar to claim 1, and thus are rejected under similar rationale.
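
For illustration only, the two-loss arrangement mapped in claim 1 above (a first loss from a preset loss function, a second loss from an adversarial component, and adjustment of the generation model based on both) can be sketched as follows. This is a minimal toy under stated assumptions, not the method of Balakrishnan or of the application: the one-parameter "model", the mean-squared-error first loss, the fixed critic used for the second loss, and every function and variable name are hypothetical.

```python
# Illustrative toy only. Every name here is hypothetical; none is taken from
# Balakrishnan or from the application under examination.

def generate_spectrum(weight, text_sequence):
    """Stand-in for the speech spectrum generation model: one scalar weight."""
    return [weight * token for token in text_sequence]


def first_loss(spectrum, reference):
    """Stand-in for the preset loss: mean squared error against a reference."""
    return sum((s - r) ** 2 for s, r in zip(spectrum, reference)) / len(spectrum)


def second_loss(spectrum, real_mean):
    """Stand-in for the adversarial loss: a fixed critic that penalizes a
    frame mean differing from the mean it associates with real spectra."""
    mean = sum(spectrum) / len(spectrum)
    return (mean - real_mean) ** 2


def total_loss(weight, text_sequence, reference, real_mean):
    """Sum of the first and second losses for a given generator weight."""
    spectrum = generate_spectrum(weight, text_sequence)
    return first_loss(spectrum, reference) + second_loss(spectrum, real_mean)


def train(weight, text_sequence, reference, real_mean, lr=0.01, steps=200):
    """Adjust the generator weight by gradient descent on the summed losses."""
    eps = 1e-6
    for _ in range(steps):
        # Central-difference estimate of d(total_loss)/d(weight).
        up = total_loss(weight + eps, text_sequence, reference, real_mean)
        down = total_loss(weight - eps, text_sequence, reference, real_mean)
        weight -= lr * (up - down) / (2 * eps)
    return weight


text_sequence = [1.0, 2.0, 3.0]
reference = [2.0, 4.0, 6.0]  # what a weight of 2.0 would produce
real_mean = 4.0              # frame mean of the reference spectrum
trained_weight = train(0.0, text_sequence, reference, real_mean)
```

In this toy both losses are minimized by the same weight, so the combined update drives the weight toward 2.0; in a real system the two losses would generally pull in different directions and would be weighted against each other.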

Claim 2,
Balakrishnan further teaches the method according to claim 1, wherein prior to inputting the analog spectrum sequence corresponding to the first text sequence into the adversarial loss function model to obtain the second loss value of the analog spectrum sequence, the method further comprises: ([Figs. 7-8] [0031-0032] [0077-0079] claim 2 refers to using additional training data, a second text sequence, for training the TTS model; the audio GAN loss subsystem 150 receives the output from the audio discriminator 135 and the mel spectrogram from the audio embedder 120 that is the “real” mel spectrogram; the audio GAN loss subsystem 150 then uses the difference between the audio discriminator 135 results and the mel spectrogram to generate a loss used to train the audio discriminator 135 and the TTS neural network 125 components; at step 830 the ASR neural network may generate a third text sample representing the second audio sample; at step 835 the TTS neural network may generate a third audio sample representing the third text sample; for example, the decoded text embedding may be fed into the TTS neural network 125; the decoded mel spectrogram may be the third audio sample; at step 840 the training system may calculate a second loss based on the difference between the second audio sample and the third audio sample; the audio loss cycle 2 subsystem 160 may calculate the second loss based on the difference between the mel spectrogram created by the audio embedder 120 from the audio sample obtained from the audio corpus 110 and the decoded mel spectrogram generated by the TTS neural network 125 from the decoded text embedding generated by the ASR neural network 130 from the original audio sample).

Claims 9 and 16 contain subject matter similar to claim 2, and thus are rejected under similar rationale.

Claim 3,
Balakrishnan further teaches the method according to claim 2, wherein training the adversarial loss function model based on the real spectrum sequence corresponding to the second text sequence and the analog spectrum sequence corresponding to the second text sequence comprises: inputting the real spectrum sequence corresponding to the second text sequence and the analog spectrum sequence corresponding to the second text sequence into the adversarial loss function model separately to obtain a third loss value; and training the adversarial loss function model based on the third loss value, wherein the third loss value represents a loss of the analog spectrum sequence corresponding to the second text sequence relative to the real spectrum sequence corresponding to the second text sequence ([0031] [0077-0079] the text GAN loss subsystem 155 is a third loss calculation subsystem that is used in the third training cycle).

Claims 10 and 17 contain subject matter similar to claim 3, and thus are rejected under similar rationale.
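
For illustration only, the arrangement recited in claim 3 above (a real spectrum sequence and a generated spectrum sequence input separately into the adversarial loss function model to obtain a third loss value, which is then used to train that model) can be sketched as follows. The sketch assumes a logistic discriminator over a single feature (the frame mean) and a standard binary cross-entropy discriminator loss; these choices and every name in the code are hypothetical and are not drawn from Balakrishnan or from the application.

```python
import math

# Illustrative toy only. All names and modeling choices are hypothetical.


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def frame_mean(spectrum):
    return sum(spectrum) / len(spectrum)


def discriminator(params, spectrum):
    """Probability that a spectrum is real, from a logistic model on its mean."""
    a, b = params
    return sigmoid(a * frame_mean(spectrum) + b)


def third_loss(params, real, generated):
    """Cross-entropy loss: real should score near 1, generated near 0."""
    p_real = discriminator(params, real)
    p_generated = discriminator(params, generated)
    return -math.log(p_real) - math.log(1.0 - p_generated)


def train_discriminator(params, real, generated, lr=0.1, steps=100):
    """Gradient descent on the third loss with respect to (a, b)."""
    a, b = params
    m_real, m_generated = frame_mean(real), frame_mean(generated)
    for _ in range(steps):
        # Derivative of each loss term with respect to its logit.
        d_real = discriminator((a, b), real) - 1.0
        d_generated = discriminator((a, b), generated)
        a -= lr * (d_real * m_real + d_generated * m_generated)
        b -= lr * (d_real + d_generated)
    return a, b


real_spectrum = [4.0, 5.0, 6.0]       # stands in for the real spectrum
generated_spectrum = [1.0, 2.0, 3.0]  # stands in for the generated spectrum
initial_params = (0.0, 0.0)
trained_params = train_discriminator(initial_params, real_spectrum, generated_spectrum)
```

After training, the third loss is lower than at initialization and the discriminator scores the real spectrum above the generated one, which is the sense in which the third loss measures the generated sequence relative to the real sequence.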

Claim 4,
Balakrishnan further teaches the method according to claim 1, wherein inputting the analog spectrum sequence corresponding to the first text sequence into the adversarial loss function model to obtain the second loss value of the analog spectrum sequence comprises: inputting the analog spectrum sequence corresponding to the first text sequence into the adversarial loss function model to obtain an original loss value; down-sampling the analog spectrum sequence corresponding to the first text ([Fig. 5] [0048-0052] the decoded mel spectrogram generated by passing the audio sample through the ASR neural network 130 and the TTS neural network 125 would match the mel spectrogram from the audio embedder 120 if the TTS neural network 125 and the ASR neural network 130 were functioning perfectly; as in cycle 1, the end result may be a garbled non-sensical sample; accordingly, the audio loss cycle 2 subsystem 160 compares the decoded mel spectrogram to the mel spectrogram from audio embedder 120 to generate a second cycle loss; this cycle may also be performed many times during training).

Claims 11 and 18 contain subject matter similar to claim 4, and thus are rejected under similar rationale.

Claim 5,
Balakrishnan further teaches the method according to claim 1, wherein the adversarial loss function model adopts a deep convolutional neural network model ([0019] text-to-speech (TTS) system (e.g., neural network) and an automatic speech recognition (ASR) system (e.g., neural network) using a generative adversarial network technique).

Claims 12 and 19 contain subject matter similar to claim 5, and thus are rejected under similar rationale.


Claim 6,
Balakrishnan further teaches the method according to claim 1, wherein the speech spectrum generation model comprises a Tacotron model ([0025] TTS neural network 125 is a text-to-speech network that receives a text embedding as input and outputs a decoded mel spectrogram representing the text embedding; Tacotron model as defined in Applicant’s specification, see paragraph [0020]).

Claims 13 and 20 contain subject matter similar to claim 6, and thus are rejected under similar rationale.

Claim 7,
Balakrishnan further teaches the method according to claim 1, wherein the speech spectrum generation model comprises a Text To Speech (TTS) model ([0019] text-to-speech neural network).

Claims 14 and 20 contain subject matter similar to claim 7, and thus are rejected under similar rationale.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Chen et al. (US 2021/0350786) teaches a method for training a generative adversarial network (GAN)-based text-to-speech (TTS) model and a speech recognition model in unison, the method including obtaining a plurality of training text utterances. At each of a plurality of output steps for each training text utterance, the method also includes generating, for output by the GAN-based TTS model, a synthetic speech representation of the corresponding training text utterance, and determining, using an adversarial discriminator of the GAN, an adversarial loss term indicative of an amount of acoustic noise disparity in one of the non-synthetic speech representations selected from the set of spoken training utterances relative to the corresponding synthetic speech representation of the corresponding training text utterance. The method also includes updating parameters of the GAN-based TTS model based on the adversarial loss term determined at each of the plurality of output steps for each training text utterance of the plurality of training text utterances.
Sinha et al. (US 2021/0192357) teaches gradient adversarial training, in which an auxiliary neural network can be trained to classify a gradient tensor that is evaluated during backpropagation in a main neural network that provides a desired task output. The main neural network can serve as an adversary to the auxiliary network in addition to a standard task-based training procedure. The auxiliary neural network can pass an adversarial gradient signal back to the main neural network, which can use this signal to regularize the weight tensors in the main neural network. Gradient adversarial training of the neural network can provide improved gradient tensors in the main network. Gradient adversarial techniques can be used to train multitask networks, knowledge distillation networks, and adversarial defense networks.
Saito et al. (“Vocoder-free text-to-speech synthesis incorporating generative adversarial networks using low-/multi-frequency STFT amplitude spectra” pages 347-363) teaches novel training algorithms for vocoder-free text-to-speech (TTS) synthesis based on generative adversarial networks (GANs) that compensate for short-term Fourier transform (STFT) amplitude spectra in low/multi frequency resolution. Vocoder-free TTS using STFT amplitude spectra can avoid degradation of synthetic speech quality caused by the vocoder-based parameterization used in conventional TTS.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHREYANS A PATEL whose telephone number is (571)270-0689. The examiner can normally be reached Monday-Friday 8am-5pm PST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached on 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

SHREYANS A. PATEL
Examiner
Art Unit 2657



/SHREYANS A PATEL/
Examiner, Art Unit 2656