Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
2.	In response to the office action mailed on 12/22/2021, applicant filed an amendment on 03/17/2022, amending claims 1, 10, and 19.  The pending claims are 1-20. 

Response to Arguments
3.	Applicant's arguments filed 03/17/2022 have been fully considered but they are not persuasive. 
	As per claim 1, applicant argues that the prior art does not teach receiving a second audio sample as input to the ASR neural network; generating, using the ASR neural network, a third text sample representing the second audio sample; and calculating a second loss based on a second difference between the second audio sample and the third audio sample.
	In addition to the paragraphs of the prior art (Chen) cited in the previous office action, applicant is referred to paragraph [0018], which teaches encoding, by the speech recognition model, the synthetic speech representation of the corresponding training text utterance output by the GAN-based TTS model; encoding, by the speech recognition model, one of the non-synthetic speech representations selected from the set of spoken training utterances; determining, using another adversarial discriminator, another adversarial loss term between the encoded synthetic speech representation and the encoded one of the non-synthetic speech representations; and updating parameters of the speech recognition model based on the other adversarial loss term determined at each of the plurality of output steps for each training text utterance of the plurality of training text utterances. See also Fig. 3A, [0069], generating a supervised loss between the unspoken text utterance 302a and the speech recognition result (second text) output by the ASR model 200.
As per the rest of the claims, and the combinations of prior art references, applicant has no further arguments besides the ones mentioned above.  Therefore, all the combinations of prior art references mentioned above are valid, and all other claims are rejected for the same reasons as set forth above. 

	If any points remain in issue which Applicant feels may be best resolved through a telephone interview, Applicant is kindly requested to contact the Examiner at the telephone number listed below.




Claim Rejections - 35 USC § 103
4.	In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-18 are rejected under 35 U.S.C. 103 as being unpatentable over Chen (US 20210350786) in view of Baskar (Semi-supervised Sequence-to-sequence ASR using Unpaired Speech and Text, Cornell University, arXiv.org > eess > arXiv:1905.01152, 2019).
As per claim 1, Chen teaches a method for training a text-to-speech (TTS) neural network and an automatic speech recognition (ASR) neural network (Abstract), the method comprising: 
receiving a first text sample as input to the TTS neural network ([0047], FIG. 3A, obtaining a plurality of training text utterances 302 that includes unspoken text); 
generating, using the TTS neural network, a first audio sample representing the first text sample ([0047], FIG. 3A, converting the unspoken text into synthetic speech representations 306); 
generating, using the ASR neural network, a second text sample representing the first audio sample that represents the first text sample ([0069], wherein the ASR model 200 receives synthetic speech 306 representing unspoken text 302a and generates speech recognition results representing the unspoken (first) text sample); 
calculating a first loss based on a first difference between the first text sample and the second text sample ([0014], a synthetic speech loss term is based on the first probability distribution over possible synthetic speech recognition hypotheses for the corresponding synthetic speech representation and the corresponding training text utterance from which the corresponding synthetic speech representation is generated.  See also, Fig. 3A, [0069], generating a supervised loss between the unspoken text utterance 302a and speech recognition result (second text) output by the ASR model 200); 
receiving a second audio sample as input to the ASR neural network ([0069], Fig. 3B, receiving non-synthetic speech by the ASR model 200);
generating, using the ASR neural network, a third text sample representing the second audio sample ([0069], FIG. 3B, generating speech recognition based on non-synthetic speech received by the ASR neural network 200); 
generating, using the TTS neural network, a third audio sample representing the third text sample ([0014], [0018], [0058], generating the TTS model synthetic speech representing transcribed text.  The transcribed text was previously generated from a non-spoken speech, as in [0047]); 
calculating a second loss based on a second difference between the second audio sample and the third audio sample ([0018] teaches encoding, by the speech recognition model, the synthetic speech representation of the corresponding training text utterance output by the GAN-based TTS model; encoding, by the speech recognition model, one of the non-synthetic speech representations selected from the set of spoken training utterances; determining, using another adversarial discriminator, another adversarial loss term between the encoded synthetic speech representation and the encoded one of the non-synthetic speech representations; and updating parameters of the speech recognition model based on the other adversarial loss term determined at each of the plurality of output steps for each training text utterance of the plurality of training text utterances.  Furthermore, [0003], [0058], determining, by the data processing hardware, using an adversarial discriminator of the GAN, an adversarial loss term indicative of an amount of acoustic noise disparity in one of the non-synthetic speech representations selected from the set of spoken training utterances relative to the corresponding synthetic speech representation of the corresponding training text utterance);
training the TTS neural network by adjusting parameters of the TTS neural network based at least in part on the first loss and the second loss ([0003], [0057]-[0060], [0080], updating, by the data processing hardware, parameters of the GAN-based TTS model based on the adversarial loss term determined at each of the plurality of output steps for each training text utterance of the plurality of training text utterances. See also [0018], updating parameters of the speech recognition model based on the other adversarial loss term determined at each of the plurality of output steps for each training text utterance of the plurality of training text utterances); and 
training the ASR neural network by adjusting parameters of the ASR neural network based at least in part on the first loss and the second loss ([0057], FIG. 3A, the GAN-based TTS model 310 used for training the ASR model 200).
Chen may not explicitly disclose the exact language of converting a first speech to text and converting the text back to a second speech, and calculating a loss based on the first speech and second speech.  However, Chen teaches an adversarial loss term between a synthetic speech and a non-synthetic speech, wherein the synthetic speech is generated based on text that is paired with the non-synthetic speech in the training data set ([0018], [0047]).  Moreover, in order to expedite prosecution, Baskar, in the same field of endeavor, was introduced for teaching deriving training procedures and losses by leveraging unpaired speech and/or text data by combining ASR with Text-to-Speech (TTS) models, wherein a text sample is converted to an audio sample, and the audio sample is converted back to another text sample; and, on the other hand, an audio sample is converted to a text sample, and the text sample is converted back to another audio sample (see Abstract, and paragraph 2, Cycle consistency training).  Therefore, it would have been obvious at the time the application was filed to use the cycle consistency training features of Baskar with the system of Chen, in order to leverage both unpaired speech and text data to outperform recently proposed related speech recognition techniques.
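For illustration only, the two round trips discussed above (text to audio to text, and audio to text to audio) can be sketched as follows. This is a minimal toy sketch and is not taken from Chen, Baskar, or the application: the functions toy_tts and toy_asr, the single "gain" parameter, and both loss definitions are hypothetical stand-ins chosen so the example runs without any library.

```python
# Toy sketch of the two cycle-consistency losses. All names are hypothetical;
# real TTS/ASR systems are neural networks, not the character maps used here.

def toy_tts(text, gain=1.0):
    """Stand-in TTS: map each character to one 'audio' value."""
    return [gain * ord(ch) for ch in text]

def toy_asr(audio, gain=1.0):
    """Stand-in ASR: map each 'audio' value back to one character."""
    return "".join(chr(int(round(v / gain))) for v in audio)

def text_loss(reference, hypothesis):
    """Fraction of character positions that differ, counting length mismatch."""
    n = max(len(reference), len(hypothesis), 1)
    mismatches = sum(a != b for a, b in zip(reference, hypothesis))
    mismatches += abs(len(reference) - len(hypothesis))
    return mismatches / n

def audio_loss(reference, hypothesis):
    """Mean squared error between two equal-length 'audio' sequences."""
    assert len(reference) == len(hypothesis)
    total = sum((a - b) ** 2 for a, b in zip(reference, hypothesis))
    return total / max(len(reference), 1)

def cycle_losses(first_text, second_audio, tts_gain, asr_gain):
    # Text cycle: first text -> first audio -> second text.
    first_audio = toy_tts(first_text, tts_gain)
    second_text = toy_asr(first_audio, asr_gain)
    first_loss = text_loss(first_text, second_text)

    # Audio cycle: second audio -> third text -> third audio.
    third_text = toy_asr(second_audio, asr_gain)
    third_audio = toy_tts(third_text, tts_gain)
    second_loss = audio_loss(second_audio, third_audio)
    return first_loss, second_loss
```

When the two toy models are consistent with each other (equal gains), both losses are zero; when they disagree, both losses become positive, which is the signal a training procedure would use to adjust the parameters of both models.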
As per claim 2, Chen teaches wherein the TTS neural network comprises a text encoder and an audio decoder, and wherein training the TTS neural network comprises adjusting one or more parameters of the text encoder or one or more parameters of the audio decoder ([0052], Fig. 3A. encoder neural network 312 and decoder neural network 314, and [0056] for adjusting one or more parameters of the text encoder or one or more parameters of the audio decoder).
As per claim 3, Chen teaches wherein the ASR neural network comprises an audio encoder and a text decoder, and wherein training the ASR neural network comprises adjusting one or more parameters of the audio encoder or one or more parameters of the text decoder ([0036]-[0037], [0075], FIG. 3C, updating parameters of the ASR model 200).
As per claim 4, Chen teaches generating, using the ASR neural network, a fourth text sample representing a fourth audio sample received as input to the ASR neural network ([0070], generating by the automatic speech recognition text samples representing received audio);
receiving, as input to a text discriminator, the fourth text sample and a fifth text sample from a textual source ([0071]-[0072], receiving the generated speech recognition samples by the text discriminator); 
generating, by the text discriminator, a third loss based on the fourth text sample and the fifth text sample; further training the TTS neural network based at least in part on the third loss ([0071]-[0072], generating loss values based on the received speech recognition result output by the speech recognition 200); and 
further training the ASR neural network based at least in part on the third loss ([0068]-[0079], training the ASR).
As per claim 5, Chen teaches wherein the text discriminator is trained to output a first value for a fake text sample, wherein the fake text sample is generated from an audio sample, and wherein the text discriminator is trained to output a second value for a real text sample, wherein the real text sample is generated from a textual source ([0060]-[0062], outputting values corresponding to the plurality of text samples corresponding to spoken text utterances generated by automatic speech recognition ASR and unspoken text utterances, not paired with any corresponding spoken utterance (real).  See also, Fig. 3C and corresponding description).
As per claim 6, Chen teaches generating, using the TTS neural network, a fourth audio sample representing a fourth text sample received as input to the TTS neural network ([0061]-[0062], generating by the TTS neural network a synthetic speech representation of the corresponding training text utterance.  See also [0084]-[0085] and corresponding description);
receiving, as input to an audio discriminator, the fourth audio sample and a fifth audio sample from an audio source (receiving synthetic speech, non-synthetic speech, and reference synthetic speech, [0060]-[0062] and Fig. 3A); 
generating, by the audio discriminator, a third loss based on the fourth audio sample and the fifth audio sample; further training the TTS neural network based at least in part on the third loss ([0061]-[0062], generating the audio discriminator loss values based on the received audio samples to fine tune the TTS model); and 
([0061], back-propagating both the adversarial loss term 320 and the consistency loss term (e.g., MSE loss) 324 through the post net 316 to teach the post-net 316 to drive the resulting synthetic speech representations 306 to have similar acoustics as the non-synthetic speech representations 304 in the set of spoken training utterances 305).
As per claim 7, Chen teaches wherein the audio discriminator is trained to output a first value for a fake audio sample, wherein the fake audio sample is generated from a text sample, and wherein the audio discriminator is trained to output a second value for a real audio sample, wherein the real audio sample is generated from an audio source (Fig. 3A, wherein a fake audio sample is synthetic speech generated by the TTS, and real audio sample is non-synthetic speech).
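For illustration only, the "first value" for a fake sample and the "second value" for a real sample recited in claims 5 and 7 can be sketched with a conventional adversarial objective. The sketch below assumes target values of 0 (fake) and 1 (real) and a binary cross-entropy loss; these choices, and the function names, are assumptions made for the example and are not asserted to be the loss used in Chen or in the application.

```python
import math

FAKE_LABEL = 0.0  # assumed "first value": sample generated by TTS or ASR
REAL_LABEL = 1.0  # assumed "second value": sample drawn from a real source

def bce(prediction, target):
    """Binary cross-entropy for one discriminator score in (0, 1)."""
    eps = 1e-12
    p = min(max(prediction, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def discriminator_loss(score_fake, score_real):
    """Low when the discriminator scores fakes near 0 and reals near 1."""
    return bce(score_fake, FAKE_LABEL) + bce(score_real, REAL_LABEL)

def generator_loss(score_fake):
    """Low when the generator's output is scored as real by the discriminator."""
    return bce(score_fake, REAL_LABEL)
```

Under these assumptions, a discriminator that scores a fake sample at 0.1 and a real sample at 0.9 incurs a smaller loss than one that scores them the other way around, while the generator's loss falls as its output is scored closer to 1.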
As per claim 8, Chen teaches deploying the trained TTS neural network ([0057]).
As per claim 9, Chen teaches deploying the trained ASR neural network ([0048], [0063]).
As per claims 10-18, system claims 10-18 and method claims 1-9 are related as apparatus and the method of using same, with each claimed element's function corresponding to the claimed method step.  Accordingly, claims 10-18 are similarly rejected under the same rationale as applied above with respect to method claims 1-9.   Further, Chen teaches a text-to-speech (TTS) neural network; an automatic speech recognition (ASR) neural network; one or more processors; and a memory having stored thereon instructions (Fig. 3A, [0089]).
Claim Rejections - 35 USC § 102
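5.	In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.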
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claim 19 is rejected under 35 U.S.C. 102(a)(2) as being anticipated by Chen (US 20210350786).
As per claim 19, Chen teaches a text-to-speech (TTS) neural network ([0003]); an automatic speech recognition (ASR) neural network ([0003]); a text discriminator and an audio discriminator (Figs. 3A-C, [0014], [0018], [0058], [0071], [0081]); and
a generative adversarial network ([0003]) configured to: 
calculate a first loss based in part on an output of the text discriminator ([0014], a synthetic speech loss term is based on the first probability distribution over possible synthetic speech recognition hypotheses for the corresponding synthetic speech representation and the corresponding training text utterance from which the corresponding synthetic speech representation is generated.  See also, Fig. 3A, [0069], generating a supervised loss between the unspoken text utterance 302a and speech recognition result (second text) output by the ASR model 200); 
calculate a second loss based in part on an output of the audio discriminator ([0018] teaches encoding, by the speech recognition model, the synthetic speech representation of the corresponding training text utterance output by the GAN-based TTS model; encoding, by the speech recognition model, one of the non-synthetic speech representations selected from the set of spoken training utterances; determining, using another adversarial discriminator, another adversarial loss term between the encoded synthetic speech representation and the encoded one of the non-synthetic speech representations; and updating parameters of the speech recognition model based on the other adversarial loss term determined at each of the plurality of output steps for each training text utterance of the plurality of training text utterances); and 
simultaneously train the TTS neural network and the ASR neural network using the first loss and the second loss ([0023], Fig. 3A is a schematic view of an example training process for training a generative adversarial network (GAN)-based text-to-speech (TTS) model and a speech recognition model in parallel; and [0063], the GAN-based TTS model 310 and the ASR model 200 are trained in unison).  
Claim 20 is rejected under 35 U.S.C. 103 as being unpatentable over Chen (US 20210350786).
As per claim 20, Chen teaches that the TTS neural network and the ASR neural network could be located within the same device or in separate devices ([0034]-[0035]).  Chen may not explicitly disclose that the TTS neural network and the ASR neural network are deployed independently.  However, it is well known in the art for a TTS neural network and an ASR neural network to be deployed independently.  Therefore, it would have been obvious at the time the application was filed for the TTS neural network of Chen to be deployed independently of the ASR neural network.  This would reduce computational and memory costs.

Conclusion
6.	The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. See PTO-892.
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to ABDELALI SERROU whose telephone number is (571)272-7638. The examiner can normally be reached M-F 9 AM - 5 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

ABDELALI SERROU
Primary Examiner
Art Unit 2659



/ABDELALI SERROU/Primary Examiner, Art Unit 2659