DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) was submitted on 04/07/2020. The submission is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Claim Objections
Claims 10-11 are objected to because of the following informalities:
Claims 10 and 11 rely on the text embedding and the acoustic embedding. Because claim 9 is where the text embedding and the acoustic embedding are introduced, claims 10 and 11 should depend on claim 9 instead of claim 8. For purposes of this action, claims 10 and 11 are interpreted as depending on claim 9.
Appropriate correction is required.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1-3, 12-15, and 17-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Lugosch, Loren, et al. "Using Speech Synthesis to Train End-to-End Spoken Language Understanding Models." arXiv preprint arXiv:1910.09463 (2019) (hereinafter referred to as Lugosch et al.).
As to independent claim 1, Lugosch et al. teaches a method for training an end-to-end (E2E) spoken language understanding (SLU) system (see ¶ 3 of Introduction: “In this paper, we propose a method for reducing, or avoiding entirely, the need to record audio data ”),
 the method comprising steps of:
receiving a training corpus comprising a set of text classified using one or more sets of semantic labels but unpaired with speech (see ¶ 3 of Introduction: “In this paper, we propose a method for reducing, or avoiding entirely, the need to record audio data to train an end-to-end SLU model. Given a .” Here, the dataset of semantically labeled text is interpreted to be analogous to the text classified using one or more sets of semantic labels. Also, the unpaired with speech portion is interpreted as analogous to the reduction or elimination of audio data used in Lugosch et al., therefore, no one-to-one relationship between text and audio data.); and
using the set of unpaired text to train the E2E SLU system to classify speech using at least one of the one or more sets of semantic labels (see ¶ 4 of Introduction: “Given a dataset of ”. Here, the semantically labeled data is interpreted as analogous to the unpaired text, which is used to train the model.).
As to independent claims 19 and 20, Lugosch et al. further teaches:
 an apparatus for training an end-to-end (E2E) spoken language understanding (SLU) system, the apparatus comprising:
a memory;
a processor coupled to the memory; and,
a computer program product for training an end-to-end (E2E) spoken language understanding (SLU) system, the computer program product comprising a non-transitory machine-readable storage medium having machine-readable program code embodied therewith (see ¶ 1 of Introduction: “End-to-end models have several advantages over the conventional SLU setup: they have reduced ”).
Regarding claim 2, Lugosch et al. teaches a method wherein the spoken language understanding system comprises a speech-to-intent system (see abstract: “End-to-end models are an attractive new approach to spoken language understanding (SLU) in which the ”. Here, the meaning of the utterance inferred directly from the audio is interpreted as a speech-to-intent system.), and
wherein the set of unpaired text comprises text-to-intent data (see 3. Proposed Method section: “a ”. Here, the input text dataset (i.e., transcript) with intent label is interpreted as a text-to-intent system.). 
Regarding claim 3, Lugosch et al. teaches a method wherein the training corpus further comprises a set of speech (see ¶ 1 of 4.4. Results combining real and synthetic speech section: “We next present results for when the model is ”),
the method further comprising:
training a text-to-intent (T2I) model using the unpaired text and the labels (see ¶1 (3. proposed method): “a . […] An end-to-end SLU model can then be trained using the generated dataset.”); and
training a speech-to-intent (S2I) model using the speech and the labels (see ¶ 1 and 2 of 4.1 Datasets section: “Fluent Speech Commands is a dataset”. There are 248 distinct sentences, each spoken by multiple speakers in both the training set and validation/test sets. […] We also use the Snips SLU Dataset […], so the model is tested entirely on sentences it has never heard before and must generalize to them to achieve high accuracy.”), and
Regarding claim 12, Lugosch et al. teaches a method wherein using the set of unpaired text to train the system comprises:
using a text-to-speech (TTS) system to generate synthetic speech from the unpaired text (see ¶ 3 of Introduction: “Given a dataset of semantically labeled text ”);
and training the E2E SLU system using the synthetic speech and the labels (see ¶ 3-4 of Introduction: “train an end-to-end SLU model using only synthetic speech”. Here, the labels are interpreted to be the labels in the semantically labeled text data which is used to generate the synthetic speech.).
Regarding claims 13 and 14, Lugosch et al. teaches a method wherein the TTS system comprises a single-speaker (claim 13) and multi-speaker (claim 14) TTS system (see ¶ 1 of 3. Proposed method section: “If the TTS has multiple speakers, each speaker is used to synthesize the transcript, so that multiple training examples per transcript are generated.” Here, the TTS may have one or multiple speakers. For the single-speaker scenario, because each speaker is used individually (which is why there may be multiple training examples), this is interpreted as analogous to using a single speaker.), and
wherein training the E2E SLU system comprises training the system using single-speaker (claim 13) and multi-speaker (claim 14) synthetic speech (see ¶ 1 of 4.3. Results for purely synthetic training sets section: “we train models using the data from one speaker, two speakers, and so on, ”).
Regarding claims 15 and 18, Lugosch et al. teaches a method wherein the training corpus further comprises a set of speech, and wherein the E2E SLU system (claim 15) and the S2I model (claim 18) are trained using both the set of the speech and the synthetic speech (see ¶ 1 of 3. Proposed method section: “If spoken training examples from real speakers are available, the real and synthetic datasets can be  An end-to-end SLU model can then be trained using the generated dataset.” Here, the S2I model is interpreted as analogous to (or as part of) the E2E SLU system.).
Regarding claim 17, Lugosch et al. further teaches a method wherein
training a text-to-intent (T2I) model using the set of unpaired text and the labels (see ¶1 (3. proposed method): “a . […] An end-to-end SLU model can then be trained using the generated dataset.”); and
training a speech-to-intent (S2I) model using the text-to-intent model and the synthetic speech (see ¶ 1 of 4.4. Results combining real and synthetic speech section: “We next present results for when the model is .” Here, the model is interpreted to be a speech-to-intent model (E2E SLU) and the synthetic speech is interpreted to be analogous or associated with the text-to-intent processing to obtain the synthetic speech as described in claim 1.).
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 5-8 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Lugosch, Loren, et al. "Using Speech Synthesis to Train End-to-End Spoken Language Understanding Models." arXiv preprint arXiv:1910.09463 (2019) (hereinafter referred to as Lugosch et al.) as applied to claims 1-3 above, and further in view of Christian Fuegen et al. (US 11107462 B1; hereinafter referred to as Fuegen et al.) and Coucke, Alice, et al. "Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces." arXiv preprint arXiv:1805.10190 (2018) (hereinafter referred to as Coucke et al.).
Regarding claim 5, Lugosch et al. teaches all of the limitations as in claim 1, above, and further teaches wherein the training corpus further comprises a set of speech and text paired with at least a portion of the set of speech (see ¶ 1 of Proposed method section: “If spoken training examples from real speakers are available, the real and synthetic datasets can be concatenated to form a single larger dataset.”).
However, Lugosch et al. does not explicitly teach wherein the method further comprises:
training a natural language understanding (NLU) model using the unpaired text and the labels;

training a spoken language understanding (SLU) model using the NLU model and the ASR model.
Fuegen et al. does teach wherein the method further comprises:
training a natural language understanding (NLU) model using the unpaired text and the labels (see Fig. 1 and Col. 3, lines 11-24: “Returning to FIG. 1, a natural language understanding component 104 may be trained using tagged transcripts 110. […] The transcripts 110 may or may not corresponding to the audio data 106 used to train the SR”. Here, the tagged transcripts, which do not necessarily correspond to audio data, are interpreted as analogous to the labeled unpaired text.);
Lugosch et al. and Fuegen et al. are both considered to be analogous to the claimed invention because they are in the same field of endeavor in language / speech processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Lugosch et al. to incorporate the teachings of Fuegen et al. of training an NLU model using the unpaired and/or paired text, which allows for a more accurate translation performed in a more resource-efficient manner (particularly in terms of processing resources) (abstract of Christian Fuegen et al. (US 11107462 B1)).
However, Lugosch et al. in combination with Fuegen et al. does not explicitly teach wherein the method further comprises:
training an automatic speech recognition (ASR) model using the paired text and speech; 

Coucke et al. does teach training an automatic speech recognition (ASR) model using the paired text and speech (see Fig. 2, ¶3 of page 3 and ¶4 of page 4: ¶3 of page 3: “The ASR engine translates a spoken utterance into text through an acoustic model, mapping raw audio to a phonetic representation, […].”; ¶4 of page 4: “To train the acoustic model, we need several hundreds to thousands of hours of audio data with corresponding transcripts.” Here, since the acoustic model is within the ASR, it can be interpreted that training the acoustic model is part of training the ASR model.); and
training a spoken language understanding (SLU) model using the NLU model and the ASR model (see 1.1 The Snips Ecosystem section: “train the corresponding Spoken Language Understanding (SLU) engine, made of an Automatic Speech Recognition (ASR) engine and a Natural Language Understanding (NLU) engine”).  
Lugosch et al. in combination with Fuegen et al. and Coucke et al. are all considered to be analogous to the claimed invention because they are in the same field of endeavor in language / speech processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Lugosch et al. in combination with Fuegen et al. to incorporate the teachings of Coucke et al. of training an ASR model using the paired text and speech and training an SLU model using the NLU model and the ASR model, which helps the resulting SLU engine (or model) be lightweight and fast to execute, making it fit for deployment on small (Coucke, Alice, et al. "Snips voice platform: an 
Regarding claim 6, Lugosch et al. in combination with Fuegen et al. and Coucke et al. teaches all of the limitations as in claims 1 and 5, above.
Fuegen et al. further teaches wherein the NLU model is trained using the unpaired text and the paired text (see Fig. 1 and  Col. 3, lines 11-24: "... The transcripts 110 may or may not corresponding to the audio data 106 used to train the SR 102."). 
Lugosch et al. in combination with Fuegen et al. and Coucke et al. are all considered to be analogous to the claimed invention because they are in the same field of endeavor in language / speech processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Lugosch et al. in combination with Fuegen et al. and Coucke et al. to incorporate further teachings of Fuegen et al. of training the NLU model using the unpaired text and the paired text, which allows for a more accurate translation performed in a more resource-efficient manner (abstract of Christian Fuegen et al. (US 11107462 B1)).
Regarding claim 7, Lugosch et al. in combination with Fuegen et al. and Coucke et al. teaches all of the limitations as in claims 1 and 5, above.
Fuegen et al. further teaches wherein the NLU model comprises a text-to-intent (T2I) model, and wherein the SLU model comprises a speech-to-intent (S2I) model (see Fig. 1 and Col. 3, lines 11-47: "...Using the tagged transcripts 110, the NLU component 104 may train various modules, such as a domain identification module 112, Here, it can be interpreted that the NLU, which has a transcript (i.e., text) as an input to the intent determination module, comprises a text-to-intent model, while the ASR, which has audio (i.e., speech) as an input to the intent determination module, comprises a speech-to-intent model.).
Lugosch et al. in combination with Fuegen et al. and Coucke et al. are all considered to be analogous to the claimed invention because they are in the same field of endeavor in language / speech processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Lugosch et al. in combination with Fuegen et al. to incorporate further teachings of Fuegen et al. of having an NLU model comprising a T2I model and an SLU model comprising an S2I model, which allows the system to choose how to allocate processing resources and, consequently, to operate in a more resource-efficient manner (Col. 3, line 59 – Col. 4, line 2 of Christian Fuegen et al. (US 11107462 B1)).
Regarding claim 8, Lugosch et al. in combination with Fuegen et al. and Coucke et al. teaches all of the limitations as in claims 1 and 5, above.
Fuegen et al. further teaches wherein the method further comprises using the paired text and speech to jointly train the NLU model and the SLU model (see Fig. 1, Col. 3, lines 11-24, and Col. 6, lines 51-64: "Returning to FIG. 1, a natural language understanding component 104 may be trained using tagged transcripts 110. The transcripts 110 may or may not corresponding to the audio data 106 used to train  Here, 100 is the SLU model and 104 is the NLU model, where the NLU model is a component of the SLU model; hence, it is interpreted that training happens jointly.).
Lugosch et al. in combination with Fuegen et al. and Coucke et al. are all considered to be analogous to the claimed invention because they are in the same field of endeavor in language / speech processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Lugosch et al. in combination with Fuegen et al. to incorporate further teachings of Fuegen et al. of using the paired text and speech to train the NLU model and the SLU model, which allows the system to choose how to allocate processing resources and, consequently, to operate in a more resource-efficient manner (Col. 3, line 59 – Col. 4, line 2 of Christian Fuegen et al. (US 11107462 B1)).
Regarding claim 16, Lugosch et al. in combination with Fuegen et al. and Coucke et al. teaches all of the limitations as in claims 1 and 5, above.
Fuegen et al. further teaches wherein the method further comprises:
training a natural language understanding (NLU) model using the text (see Fig. 1 and Col. 3, lines 11-24: “Returning to FIG. 1, a natural language understanding component 104 may be trained using tagged transcripts 110. […] The transcripts 110 may or may not corresponding to the audio data 106 used to train the SR”. Here, the tagged transcripts, which do not necessarily correspond to audio data, are interpreted as analogous to the labeled unpaired text.);
training a spoken language understanding (SLU) model using the speech (see Fig. 1 and Col. 3, lines 11-24: "[…] the audio data 106 used to train the SR". Here, the SR component is part of the SLU, so it can be interpreted that the SLU is trained using speech (i.e., audio data).); and
using the synthetic speech to jointly train the NLU model and the SLU model (see Fig. 1 and Col. 3, lines 11-24: "[…] the audio data 106 used to train the SR". Here, the SR and NLU components are both part of the SLU, so it can be interpreted that the SLU is trained using both text and speech (i.e., audio data, which could be synthesized).).
Lugosch et al. in combination with Fuegen et al. and Coucke et al. are all considered to be analogous to the claimed invention because they are in the same field of endeavor in language / speech processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Lugosch et al. in combination with Fuegen et al. to incorporate further teachings of Fuegen et al. of training a natural language understanding (NLU) model using the text, training a spoken language understanding (SLU) model using the speech, and using the synthetic speech to jointly train the NLU model and the SLU model, which allows for a more accurate translation performed in a more resource-efficient manner (particularly in terms of processing resources) (abstract of Christian Fuegen et al. (US 11107462 B1)).
Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Lugosch, Loren, et al. "Using Speech Synthesis to Train End-to-End Spoken Language Understanding Models." arXiv preprint arXiv:1910.09463 (2019) (hereinafter referred to as Lugosch et al.) as applied to claims 5-8 and 16 above, and further in view of Yun-Nung Chen et al. (US 20170372200 A1; hereinafter referred to as Chen et al.).
Regarding claim 9, Lugosch et al. in combination with Fuegen et al. and Coucke et al. teaches all of the limitations as in claims 1, 5, and 8, above.
Fuegen et al. further teaches wherein using the text paired with speech to jointly train comprises:
using the NLU model to produce a text embedding of the text paired to the speech (see Fig. 1 and 2B and Col. 3, lines 11-24 and Col. 6, lines 51-64: FIG. 2B depicts an example of an NLU model 250 suitable for performing domain identification or intent determination. The depicted example is an LSTM-based utterance classifier, in which the input words 252 are first embedded in a dense representation in an embedding layer 254, and then an LSTM network is used to encode the word sequence (the depicted example uses a 2-layer bi-directional LSTM encoder 256); "The output 120 may be in the form of text. In some embodiments, the desirable end form of the output may be something other than text, such as an audio representation of the intent or domain". Here, as mentioned in the response to claim 8, 100 is the SLU model and 104 is the NLU model, where the NLU model is a component of the SLU model; hence, it is interpreted that training happens jointly.).

using the SLU model to produce an acoustic embedding of the speech paired to the text
Chen et al. does teach wherein the method further comprises:
using the SLU model to produce an acoustic embedding of the speech paired to the text (see ¶ [0012] and [0084] and FIGS. 4A-C: ¶ [0012]: “[…] illustrate example end-to-end memory network models for contextual, e.g., multi-turn, language understanding, including multi-turn SLU according to various examples described herein.”; ¶ [0084]: “The model, e.g., model 220, can embed inputs, e.g., utterances, into a continuous space and store historic inputs, e.g., historic utterances, x embeddings to the memory.”).
Lugosch et al., Fuegen et al., Coucke et al., and Chen et al. are all considered to be analogous to the claimed invention because they are in the same field of endeavor in language / speech processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Lugosch et al. in combination with Fuegen et al. and Coucke et al. to incorporate the teachings of Chen et al. of using the SLU model to produce an acoustic embedding, which allows models to exploit contextual information from memory (¶ [0085] of Chen et al. (US 20170372200 A1)).
Allowable Subject Matter
Claims 4, 10, and 11 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
The following is a statement of reasons for the indication of allowable subject matter:
Regarding claim 4, it would be allowable for disclosing:
 The method of claim 3, wherein the text-to-intent model and the speech-to-intent model are within a shared deep neural network.
Lugosch et al. teaches all of the limitations as in claim 1, above. 
Lugosch et al. further teaches the use of an encoder-decoder model for the E2E SLU experiments, where the encoder consists of a deep neural network. 
However, Lugosch et al. fails to teach wherein the text-to-intent model and the speech-to-intent model are within a shared deep neural network. 
Regarding claim 10, it would be allowable for disclosing:
The method of claim 8, further comprising:
determining a mean square error loss between the text embedding and the acoustic embedding; and

Lugosch et al. in combination with Fuegen et al., Coucke et al., and Chen et al. teaches all of the limitations as in claims 1, 5, and 9 above.
Lugosch et al. in combination with Fuegen et al., Coucke et al., and Chen et al. teaches using the NLU model to produce a text embedding and the SLU model to produce an acoustic embedding.
However, Lugosch et al. in combination with Fuegen et al., Coucke et al., and Chen et al. fails to teach determining a mean square error loss between the text embedding and the acoustic embedding and backpropagating the mean square error loss to the SLU model.
Regarding claim 11, it would be allowable for disclosing:
 The method of claim 8, further comprising:
using a shared classification layer to derive respective labels from the acoustic embedding and from the text embedding; and
backpropagating composite cross-entropy classification loss for the respective labels to the SLU model and the NLU model.
Lugosch et al. in combination with Fuegen et al., Coucke et al., and Chen et al. teaches all of the limitations as in claims 1, 5, and 9 above.
Lugosch et al. in combination with Fuegen et al., Coucke et al., and Chen et al. teaches using the NLU model to produce a text embedding and the SLU model to produce an acoustic embedding.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Keisha Y Castillo-Torres whose telephone number is (571)272-3975. The examiner can normally be reached Monday - Friday, 8:30 am - 4:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir, can be reached at (571)272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and 

Keisha Y. Castillo-Torres
Examiner
Art Unit 2659



/Keisha Y Castillo-Torres/Examiner, Art Unit 2659


/Paras D Shah/Primary Examiner, Art Unit 2659
01/04/2022