DETAILED ACTION
This communication is in response to the Amendments and Arguments filed on
05/09/2022. Claims 8-11 are pending and have been examined. Hence,
this action has been made FINAL.
All previous objections/rejections not mentioned in this Office Action have been withdrawn by the Examiner.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments and Amendments
With respect to the 35 USC § 103 rejections, the Applicant presents several arguments, to which the Examiner responds below.


35 USC § 103 rejections

Argument 1: 
Claim 8 is amended into independent form, including the limitations of Claim 1 and 5, from which it previously depended. In rejecting Claim 8, the Examiner stated, on pages 12-13 of the Office Action, "Fuegen ... further teaches wherein the method further comprises using the paired text and speech to jointly train the NLU model and the SLU model. (see Fig. 1 and Col. 3, lines 11-24 and Col. 6, lines 51-64 : ' Returning to FIG. 1, a natural language understanding component 104 may be trained using tagged transcripts 110. The transcripts 110 may or may not corresponding to the audio data 106 used to train the SR 102.'; (54): 'The output 120 may be in the form of text. In some embodiments, the desirable end form of the output may be something other than text, such as an audio representation [...]' Here, 100 is the SLU model and 104 is the NLU model, where the NLU model is a component of the SLU model, hence it is interpreted that training happens jointly.)." 
Applicant respectfully disagrees. It is respectfully asserted that Fuegen teaches sequential and not joint training. Applicant's undersigned representative searched the text of US 11107462 on the PTO web site and the only hits for "joint" are within the phrase "joint conference" and thus not applicable. Furthermore, element 100 in Fuegen is an SLU environment not a model; it is believed that speech recognition is sequential with NLU in the cited passages/figure of Fuegen. Furthermore in this regard, FIG. 1 of Fuegen is prior art and the abstract of Fuegen indicates that the two components are typically trained separately based on different metrics. Please also see especially column 4 beginning at line 3 (with emphasis added): "Another problem with conventional SLU systems 100 is that the SR 102 and NLU 104 components are trained separately, using different criteria. For example, the SR unit 102 may be trained to minimize the word error rate (WER) when transcribing audio into text. This approach weighs all words equally when assigning a WER score, but in practice not all words contribute equally to the semantic meaning of a sentence. Thus, the SR 102 unit may be configured to devote more resources to words that, for purposes of the SLU system 100, may not be entirely consequential." 
It is thus respectfully asserted that Fuegen does not teach or suggest joint training. Furthermore, it is not seen that the Examiner points to anything in any of the other references that would cure this deficiency. Accordingly, it is respectfully asserted that even if combined as proposed by the Examiner, Lugosch, Fuegen, and Coucke fail to teach or fairly suggest at least the joint training in Claim 8.


Examiner response to Argument(s) 1:
The Examiner acknowledges the addition of limitations previously found in claims 1 and 5. 
Regarding the Applicant’s arguments (underlined above), the arguments have been considered but are not persuasive. The Examiner respectfully disagrees that Fig. 1 of Fuegen is prior art, since the figure is not labeled as prior art anywhere in the patent. Further, the text states in Col. 3, lines 39-42: “Returning again to FIG. 1, when training is complete the SLU system 100 may receive audio data 106 for which semantic tasks such as domain identification, intent determination, or slot-filling are to be performed.” Moreover, even if Fig. 1 were considered prior art, the passage cited by the Applicant above (“Another problem with conventional SLU systems 100 is that the SR 102 and NLU 104 components are trained separately,…”) concerns the SR and NLU components being trained separately, whereas claim 8 recites the NLU and SLU models being trained jointly, not the NLU and SR as in the Applicant’s argument/citation from Fuegen. Hence, the Examiner notes that since the NLU is part of the SLU as presented in Fig. 1 of Fuegen, the SLU is trained as a function of the NLU.
Therefore, the Examiner respectfully disagrees and the rejection interpretation is maintained.

Argument 2: 

Claim 9 depends from Claim 8, and is thus believed to be patentable at least by virtue of such dependency. Applicant respectfully reserves the right to argue Claim 9 separately in any future paper, including an Appeal Brief.

Examiner response to Argument(s) 2:
The Examiner respectfully disagrees and the rejection interpretation is maintained. Please refer to the Examiner response to Argument 1 above.


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over Lugosch, Loren, et al. "Using Speech Synthesis to Train End-to-End Spoken Language Understanding Models." arXiv preprint arXiv:1910.09463 (2019) (hereinafter referred to as Lugosch et al.) as applied to claims 1-3 above, and further in view of Christian Fuegen et al. (US 11107462 B1; hereinafter referred to as Fuegen et al.) and Coucke, Alice, et al. "Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces." arXiv preprint arXiv:1805.10190 (2018). (hereinafter referred to as Coucke et al.). 

Regarding claim 8, Lugosch et al. teach:
(Currently Amended) A method for training an end-to-end (E2E) spoken language understanding (SLU) system (see ¶ 3 of Introduction: “In this paper, we propose a method for reducing, or avoiding entirely, the need to record audio data”), the method comprising steps of: 
receiving a training corpus comprising a set of text classified using one or more sets of semantic labels but unpaired with speech (see ¶ 3 of Introduction: “In this paper, we propose a method for reducing, or avoiding entirely, the need to record audio data to train an end-to-end SLU model. Given a dataset of semantically labeled text, […] thus generating an audio dataset that can be used for training the model.” Here, the dataset of semantically labeled text is interpreted to be analogous to the text classified using one or more sets of semantic labels. Also, the unpaired with speech portion is interpreted as analogous to the reduction or elimination of audio data used in Lugosch et al., and therefore there is no one-to-one relationship between text and audio data.); and 
using the set of unpaired text to train the E2E SLU system to classify speech using at least one of the one or more sets of semantic labels (see ¶ 4 of Introduction: “Given a dataset of ”. Here, it is interpreted that the semantically labeled data is analogous to the unpaired text which is used to train the model.).
However, Lugosch et al. do not explicitly teach wherein the method further comprises:
training a natural language understanding (NLU) model using the unpaired text and the labels;
training an automatic speech recognition (ASR) model using the paired text and speech;
training a spoken language understanding (SLU) model using the NLU model and the ASR model; and
using the paired text and speech to jointly train the NLU model and the SLU model.

Fuegen et al. do teach:

wherein the training corpus further comprises a set of speech and text paired with at least a portion of the set of speech, the method further comprising: 
training a natural language understanding (NLU) model using the unpaired text and the labels (see ¶ 3 of Introduction: “In this paper, we propose a method for reducing, or avoiding entirely, the need to record audio data”); and
using the paired text and speech to jointly train the NLU model and the SLU model (see Fig. 1 and Col. 3, lines 11-24 and Col. 6, lines 51-64 : " Returning to FIG. 1, a natural language understanding component 104 may be trained using tagged transcripts 110. The transcripts 110 may or may not corresponding to the audio data 106 used to train the SR 102."; (54): "The output 120 may be in the form of text. In some embodiments, the desirable end form of the output may be something other than text, such as an audio representation […]" Here, 100 is the SLU model and 104 is the NLU model, where the NLU model is a component of the SLU model, hence it is interpreted that training happens jointly.).  
Lugosch et al. and Fuegen et al. are both considered to be analogous to the claimed invention because they are in the same field of endeavor in language / speech processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Lugosch et al. to incorporate the teachings of Fuegen et al. of training an NLU model using the unpaired and/or paired text and using the paired text and speech to train the NLU model and the SLU model, which allows for a more accurate translation performed in a more resource-efficient manner (particularly in terms of processing resources) (abstract of Christian Fuegen et al. (US 11107462 B1)).
However, Lugosch et al. in combination with Fuegen et al. do not explicitly teach wherein the method further comprises:
training an automatic speech recognition (ASR) model using the paired text and speech; and
training a spoken language understanding (SLU) model using the NLU model and the ASR model.

Coucke et al. do teach:
training an automatic speech recognition (ASR) model using the paired text and speech (see Fig. 2, ¶3 of page 3 and ¶4 of page 4: ¶3 of page 3: “The ASR engine translates a spoken utterance into text through an acoustic model, mapping raw audio to a phonetic representation, […].”; ¶4 of page 4: “To train the acoustic model, we need several hundreds to thousands of hours of audio data with corresponding transcripts.” Here, since the acoustic model is within the ASR, it can be interpreted that training the acoustic model is part of training the ASR model.); and
training a spoken language understanding (SLU) model using the NLU model and the ASR model (see Fig. 2, ¶3 of page 3 and ¶4 of page 4: ¶3 of page 3: “The ASR engine translates a spoken utterance into text through an acoustic model, mapping raw audio to a phonetic representation, […].”; ¶4 of page 4: “To train the acoustic model, we need several hundreds to thousands of hours of audio data with corresponding transcripts.” Here, since the acoustic model is within the ASR, it can be interpreted that training the acoustic model is part of training the ASR model.).
Lugosch et al., Fuegen et al., and Coucke et al. are all considered to be analogous to the claimed invention because they are in the same field of endeavor in language / speech processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Lugosch et al. in combination with Fuegen et al. to incorporate the teachings of Coucke et al. of training an ASR model using the paired text and speech and training an SLU model using the NLU model and the ASR model, which helps the resulting SLU engine (or model) to be lightweight and fast to execute, making it fit for deployment on small devices (Coucke, Alice, et al. "Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces." arXiv preprint arXiv:1805.10190 (2018).).

Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Lugosch, Loren, et al. "Using Speech Synthesis to Train End-to-End Spoken Language Understanding Models." arXiv preprint arXiv:1910.09463 (2019) (hereinafter referred to as Lugosch et al.) as applied to claims 1-3 above, and further in view of Christian Fuegen et al. (US 11107462 B1; hereinafter referred to as Fuegen et al.) and Coucke, Alice, et al. "Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces." arXiv preprint arXiv:1805.10190 (2018). (hereinafter referred to as Coucke et al.), as in claim 8 above, and further in view of Yun-Nung Chen et al. (US 20170372200 A1; hereinafter referred to as Chen et al.).
Regarding claim 9, Lugosch et al. in combination with Fuegen et al. and Coucke et al. teach all of the limitations as in claim 8, above.
Fuegen et al. further teaches:
(Original) The method of claim 8, wherein using the text paired with speech to jointly train comprises: 
using the NLU model to produce a text embedding of the text paired to the speech (see Fig. 1 and 2B and Col. 3, lines 11-24 and Col. 6, lines 51-64: "FIG. 2B depicts an example of an NLU model 250 suitable for performing domain identification or intent determination. The depicted example is an LSTM-based utterance classifier, in which the input words 252 are first embedded in a dense representation in an embedding layer 254, and then an LSTM network is used to encode the word sequence (the depicted example uses a 2-layer bi-directional LSTM encoder 256)"; "The output 120 may be in the form of text. In some embodiments, the desirable end form of the output may be something other than text, such as an audio representation of the intent or domain". Here, as mentioned in the response to claim 8, 100 is the SLU model and 104 is the NLU model, where the NLU model is a component of the SLU model (as presented in Fig. 1), hence it is interpreted that training happens jointly (the SLU is trained as a function of the NLU).); 

However, Lugosch et al. in combination with Fuegen et al. and Coucke et al. do not explicitly teach wherein the method further comprises:
using the SLU model to produce an acoustic embedding of the speech paired to the text
Chen et al. does teach wherein the method further comprises:
using the SLU model to produce an acoustic embedding of the speech paired to the text (see ¶ [0012] and [0084] and FIGS. 4A-C: ¶ [0012]: “[…] illustrate example end-to-end memory network models for contextual, e.g., multi-turn, language understanding, including multi-turn SLU according to various examples described herein.”; ¶ [0084]: “The model, e.g., model 220, can embed inputs, e.g., utterances, into a continuous space and store historic inputs, e.g., historic utterances, x embeddings to the memory.”).
Lugosch et al., Fuegen et al., Coucke et al., and Chen et al. are all considered to be analogous to the claimed invention because they are in the same field of endeavor in language / speech processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Lugosch et al. in combination with Fuegen et al. and Coucke et al. to incorporate the further teachings of Chen et al. of using the SLU model to produce an acoustic embedding, which allows models to exploit contextual information from memory (¶ [0085] of Chen et al. (US 20170372200 A1)).

Allowable Subject Matter

Claims 4, 19, and 20 are allowed.
The following is a statement of reasons for the indication of allowable subject matter:
Regarding claims 4, 19, and 20, they would be allowable for disclosing:
wherein the text-to-intent model and the speech-to-intent model are within a shared deep neural network.
Lugosch et al. teaches most of the limitations in (amended) independent claims 4, 19, and 20.
However, Lugosch et al. fails to teach wherein the text-to-intent model and the speech-to-intent model are within a shared deep neural network. 

Claims 10-11 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
The following is a statement of reasons for the indication of allowable subject matter:
Regarding claim 10, it would be allowable for disclosing:
(Currently Amended) The method of claim 9 [[8]], further comprising:
determining a mean square error loss between the text embedding and the acoustic embedding; and
backpropagating the mean square error loss to the SLU model not to the NLU model.
Lugosch et al. in combination with Fuegen et al., Coucke et al., and Chen et al. teach all of the limitations as in claims 1, 5, and 9 above. 
Lugosch et al. in combination with Fuegen et al., Coucke et al., and Chen et al. teach using the NLU model to produce a text embedding and the SLU model to produce an acoustic embedding.
However, Lugosch et al. in combination with Fuegen et al., Coucke et al., and Chen et al. fail to teach wherein the text embedding and the acoustic embedding are used to determine the mean square error loss between them, and backpropagating the mean square error loss to the SLU model.
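For illustration only, the limitation recited in claim 10 can be pictured with the following minimal sketch. It is not taken from the application or from any cited reference. The linear "encoders", the toy feature vectors, the learning rate, and all names (W_nlu, W_slu, mse_step_slu_only) are hypothetical assumptions chosen to keep the example small. The sketch computes a mean square error between a text embedding and an acoustic embedding and applies the resulting gradient only to the SLU-side parameters, leaving the NLU-side parameters untouched.

```python
# Illustrative sketch only: linear maps stand in for the NLU and SLU models.
# All names and values are hypothetical and are not from the application.

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mse(a, b):
    """Mean square error between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)

def mse_step_slu_only(W_slu, speech_feats, text_emb, lr=0.1):
    """One gradient step on the MSE loss; only the SLU weights are updated."""
    acoustic_emb = matvec(W_slu, speech_feats)
    d = len(text_emb)
    # dLoss/dW[i][j] = (2/d) * (acoustic_emb[i] - text_emb[i]) * speech_feats[j]
    return [
        [w - lr * (2.0 / d) * (acoustic_emb[i] - text_emb[i]) * speech_feats[j]
         for j, w in enumerate(row)]
        for i, row in enumerate(W_slu)
    ]

# Toy paired example: text features and speech features for one utterance.
W_nlu = [[0.5, -0.2], [0.1, 0.3]]             # NLU side, never modified below
W_slu = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]    # SLU side, trained below
text_feats = [1.0, 2.0]
speech_feats = [0.5, -1.0, 0.25]

# The text embedding is treated as a fixed target (no gradient reaches W_nlu).
text_emb = matvec(W_nlu, text_feats)
loss_before = mse(matvec(W_slu, speech_feats), text_emb)
for _ in range(50):
    W_slu = mse_step_slu_only(W_slu, speech_feats, text_emb)
loss_after = mse(matvec(W_slu, speech_feats), text_emb)
```

In this sketch the acoustic embedding is pulled toward the text embedding while the text-side weights stay fixed, which is one way to read "to the SLU model not to the NLU model".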
Regarding claim 11, it would be allowable for disclosing:
 (Currently Amended) The method of claim 9 [[8]], further comprising:
using a shared classification layer to derive respective labels from the acoustic embedding and from the text embedding; and
backpropagating composite class-entropy classification loss for the respective labels to the SLU model and the NLU model.
Lugosch et al. in combination with Fuegen et al., Coucke et al., and Chen et al. teach all of the limitations as in claims 1, 5, and 9 above.
Lugosch et al. in combination with Fuegen et al., Coucke et al., and Chen et al. teach using the NLU model to produce a text embedding and the SLU model to produce an acoustic embedding.
However, Lugosch et al. in combination with Fuegen et al., Coucke et al., and Chen et al. fail to teach wherein the text embedding and the acoustic embedding are used with a shared classification layer to derive respective labels, and backpropagating composite class-entropy classification loss for the respective labels to the SLU model and the NLU model.
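For illustration only, the limitation recited in claim 11 can be pictured with the following minimal sketch. It is not taken from the application or from any cited reference. It assumes that the recited classification loss behaves like a standard softmax cross-entropy and that the composite loss is a plain sum of the two per-branch losses; both are assumptions of this sketch. The classifier matrix C, the toy embeddings, and all function names are hypothetical. The sketch applies one shared classification layer to both embeddings and returns the gradient that would flow back toward each model.

```python
import math

# Illustrative sketch only: all names and values are hypothetical.

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy_and_grad(C, emb, label):
    """Return (loss, dLoss/d_emb) for a shared linear classifier C."""
    probs = softmax(matvec(C, emb))
    loss = -math.log(probs[label])
    # Gradient of softmax cross-entropy w.r.t. logits is (probs - one_hot).
    dlogits = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
    # Chain rule through logits = C @ emb.
    demb = [sum(C[k][j] * dlogits[k] for k in range(len(C)))
            for j in range(len(emb))]
    return loss, demb

def composite_loss(C, acoustic_emb, text_emb, label):
    """Sum the two branch losses; return the gradient for each branch."""
    loss_a, grad_a = cross_entropy_and_grad(C, acoustic_emb, label)
    loss_t, grad_t = cross_entropy_and_grad(C, text_emb, label)
    return loss_a + loss_t, grad_a, grad_t

# Shared classification layer: 3 labels, 2-dimensional embeddings.
C = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
acoustic_emb = [0.2, -0.1]   # would come from the SLU model
text_emb = [0.9, 0.1]        # would come from the NLU model

total, grad_to_slu, grad_to_nlu = composite_loss(C, acoustic_emb, text_emb, label=0)
```

Unlike the mean square error case, both returned gradients are non-zero here, so under these assumptions both the SLU-side and NLU-side parameters would receive an update signal.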

Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Keisha Y Castillo-Torres whose telephone number is (571)272-3975. The examiner can normally be reached Monday - Friday, 9:00 am - 4:00 pm (EST).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir, can be reached at (571)272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

Keisha Y. Castillo-Torres
Examiner
Art Unit 2659



/Keisha Y. Castillo-Torres/Examiner, Art Unit 2659
/Paras D Shah/Primary Examiner, Art Unit 2659
05/21/2022