DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 04/03/2020 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Drawings
The drawings are objected to as failing to comply with 37 CFR 1.84(p)(5) because they do not include the following reference signs mentioned in the description: Wnm in paragraph [0045], 204 in paragraph [0048], and 304 in paragraph [0048]. Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.
The drawings are objected to as failing to comply with 37 CFR 1.84(p)(5) because they include the following reference characters not mentioned in the description: 302a in Fig. 1, 302b in Fig. 1, Wmn in Fig. 2, S1 in Fig. 2, S2 in Fig. 2, and Sm in Fig. 2. Corrected drawing sheets in compliance with 37 CFR 1.121(d), or amendment to the specification to add the reference character(s) in the description in compliance with 37 CFR 1.121(b) are required in reply to the Office action to avoid abandonment of the 
Specification
The disclosure is objected to because of the following informalities: 
In paragraph [0045], lines 3-5, “a sequence of sentence representations 403 are converted by a review-level RNN 404 into a task representation 405 by a review RNN 404” should read “a sequence of sentence representations 403 are converted by a review-level RNN 404 into a task representation 405”
In paragraph [0084], line 1, “The components of computer system may include” should read “The components of the computer system may include”
In paragraph [0084], line 3, “Processor 510 may include software module that performs” should read “Processor 510 may include a software module that performs”
In paragraph [0085], line 2 “that it is accessible by computer system” should read “that it is accessible by the computer system”
Appropriate correction is required.
Claim Objections
Claims 1-7, 9, 12, 14, 18, and 20 are objected to because of the following informalities:
In claim 1, line 12, “update parameters of each of a plurality layers” should read “update parameters of each of a plurality of layers”
In claim 9, line 1, “wherein a task of neural model includes” should read “wherein a task of a neural model includes”
In claim 12, line 2, “the first and second languages” should read “the first language and the second language”
In claim 14, line 1, “wherein the training the second neural model” should read “wherein the training of the second neural model”
In claim 18, line 2, “the first and second languages” should read “the first language and the second language”
In claim 20, line 1, “wherein the training the first neural model” should read “wherein the training of the first neural model”
Each dependent claim of claim 1 is objected to based on the same rationale as the claim from which it depends.
Appropriate correction is required.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1-7, 9, and 14-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Claim 1 recites the limitation "the first language or dialect" in line 10. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the first language or dialect” has been interpreted as “a first language or dialect”.
Claim 1 recites the limitation "the second language or dialect" in line 10. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the second language or dialect” has been interpreted as “a second language or dialect”.
Claim 2 recites the limitation "the layers" in line 7. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the layers” has been interpreted as “the one or more layers” in reference to “one or more layers” in line 5.
Claim 4 recites the limitation "the task-appropriate model architecture" in line 1. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the task-appropriate model architecture” has been interpreted as “a task-appropriate model architecture”.
Claim 14 recites the limitation "the unlabeled loss function" in lines 3-4. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the unlabeled loss function” has been interpreted as “an unlabeled loss function”.
Claim 15 recites the limitation "the parallel data" in line 12. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the parallel data” has been interpreted as “the unannotated parallel data” in reference to “unannotated parallel data” in line 8.
Claim 16 recites the limitation "the neural model" in line 1. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the neural model” has been interpreted as “the first neural model or the second neural model” in reference to “a first neural model” in line 3 of claim 15 and “a second neural model” in line 9 of claim 15.
Claim 9 recites the limitation “of neural model” in line 1. This limitation lacks clarity because it is unclear whether “of neural model” is in reference to “a first neural model” in line 3 of claim 8 or “a 
Claim 16 recites the limitation “the neural model” in line 1. This limitation lacks clarity because it is unclear whether “the neural model” is in reference to “a first neural model” in line 3 of claim 15 or “a second neural model” in line 9 of claim 15. For examination purposes, “the neural model” has been interpreted as “the first neural model or the second neural model”.
Each dependent claim of claims 1 and 15 is rejected based on the same rationale as the claim from which it depends.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1 and 5-20 are rejected under 35 U.S.C. 103 as being unpatentable over Vanhoucke et al. (US 9,460,711 B1) in view of Sypniewski et al. (US 10,720,151 B2) and further in view of Conneau et al. (“XNLI: Evaluating Cross-lingual Sentence Representations”).
Regarding Claim 1,
Vanhoucke et al. teaches a system for transferring a cross-lingual neural model (Col. 2 line 54 - Col. 3 line 5: “The system may also include a means for processing a multilingual deep neural network (DNN) acoustic model based on the training data to generate a trained multilingual DNN acoustic model. The multilingual DNN acoustic model may include a feedforward neural network having multiple layers of one or more nodes. Each node of a given layer may connect with a respective weight to each node of a subsequent layer. Further, the multiple layers of one or more nodes may include one or more hidden layers of nodes that are shared between at least two of the two or more languages and a language-specific output layer of nodes corresponding to each of the two or more languages” - teaches a system for processing a multilingual deep neural network (DNN) acoustic model by sharing hidden layers for multiple languages), comprising: 
a processor and a memory (Fig. 9; Col. 10 lines 29-45: “FIG. 9 is a functional block diagram illustrating an example computing device 900 used in a computing system that is arranged in accordance with at least some embodiments described herein. The computing device 900 may be implemented to process a multilingual DNN acoustic model or perform any of the functions described above with reference to FIGS. 1-8. In a basic configuration 902, computing device 900 may typically include one or more processors 910 and system memory 920” - teaches a computing device 900 as part of a computing system, where the computing device comprises one or more processors 910 and a system memory 920), 
wherein a language or dialect of the first neural model is different from a language or dialect of the second neural model (Col. 2 lines 10-30: “The functions may include receiving training data that includes a respective training data set for each of two or more or languages. The functions may also include processing a multilingual deep neural network (DNN) acoustic model based on the training data to generate a trained multilingual DNN acoustic model. The multilingual DNN acoustic model may include a feedforward neural network having multiple layers of one or more nodes. Each node of a given layer may connect with a respective weight to each node of a subsequent layer. Further, the multiple layers of one or more nodes may include one or more hidden layers of nodes that are shared between at least two of the two or more languages and a language-specific output layer of nodes corresponding to each of the two or more languages” - teaches a multilingual acoustic DNN model consists of two or more languages with language specific output layers); 
an operating environment executing commands using the processor (Fig. 9; Col. 10 lines 29-45: “FIG. 9 is a functional block diagram illustrating an example computing device 900 used in a computing system that is arranged in accordance with at least some embodiments described herein. The computing device 900 may be implemented to process a multilingual DNN acoustic model or perform any of the functions described above with reference to FIGS. 1-8” - teaches a computing device 900 as part of a computing system used to process a multilingual acoustic DNN model using one or more processors 910) to, 
train the first neural model on annotated data based on a labeled loss function to define and update parameters of each of a plurality of layers of the first neural model (Fig. 2; Col. 5 line 60 - Col. 6 line 10: “Initially, at block 202, the method 200 includes receiving training data comprising a respective training data set for each of two or more languages. A training data set for an individual language may include one or any combination of read/spontaneous speech and supervised/unsupervised data received from any number of sources. For instance, the training data may be data recorded from voice searches, audio clips, videos, or other sources that has been transcribed. More specifically, the training data may be feature-extracted windows of speech that are associated with a label. For example, the training data may include portions of audio data that are labeled as phonetic states or sub-phonetic states” - teaches that the model is trained using training data consisting of labeled phonetic states (annotated data). Col. 8 lines 13-31: “In one example, processing the multilingual DNN acoustic model may include jointly training the multilingual DNN acoustic model based on the training data. For example, the multilingual DNN acoustic model may be processed on the training data by backpropagating derivatives of a cost function that measures the discrepancy between an expected output d and actual output produced by the multilingual DNN acoustic model for each training data case. An example cost function C may be of the form:

[Equation image (media_image1.png) not reproduced: example cost function C]

Other example threshold functions, such as the hyperbolic tangent, or generally any function with a well-behaved derivate, may also be used. Similarly, other cost functions may be used for determining errors that are used in backpropagation” - teaches that training the model using the training data involves backpropagating derivatives of a cost function (labeled loss function) for determining errors).
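For illustration only, the training step described in the quoted passage (backpropagating derivatives of a cost function that measures the discrepancy between an expected output d and the actual output) can be sketched as follows. The example cost function of the reference is not reproduced in this text, so the sketch assumes a cross-entropy cost over a softmax output layer; the function names `softmax`, `cross_entropy`, `output_gradient`, and `sgd_step` are hypothetical and do not come from any cited reference.

```python
import math


def softmax(logits):
    """Convert raw output-layer activations into probabilities."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(expected, actual):
    """Assumed cost: C = -sum_j d_j * log(p_j) for expected d and actual p."""
    return -sum(d * math.log(p) for d, p in zip(expected, actual) if d > 0)


def output_gradient(expected, actual):
    """Derivative of the assumed cost with respect to each logit: p_j - d_j."""
    return [p - d for d, p in zip(expected, actual)]


def sgd_step(weights, inputs, grad_out, lr=0.1):
    """One gradient step for a single linear layer; weights[j][i] links input i to output j."""
    return [
        [w - lr * g * x for w, x in zip(row, inputs)]
        for row, g in zip(weights, grad_out)
    ]


def forward(weights, inputs):
    """Linear layer followed by softmax."""
    logits = [sum(w * x for w, x in zip(row, inputs)) for row in weights]
    return softmax(logits)


# One labeled training case: a feature window and its one-hot label.
features = [1.0, 0.5]
label = [1.0, 0.0]
weights = [[0.0, 0.0], [0.0, 0.0]]

before = cross_entropy(label, forward(weights, features))
weights = sgd_step(weights, features, output_gradient(label, forward(weights, features)))
after = cross_entropy(label, forward(weights, features))
print(before, after)
```

The cost after the update is lower than before, which is the sense in which a labeled loss function is used to update parameters.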
Vanhoucke et al. does not appear to explicitly teach wherein a first neural model and a second neural model are stored in the memory, to train the first neural model and a second neural model on parallel data between the first language or dialect and the second language or dialect based on an unlabeled loss function to update each of a plurality of layers of the first neural model and to define and update parameters of each of a plurality layers of the second neural model, and wherein all but a lowest level layer of the first neural model is copied to the second neural model.
However, Sypniewski et al. teaches wherein a first neural model and a second neural model are stored in the memory (Col. 10 line 58 - Col. 11 line 18: “In an embodiment, the first fully-connected layer 203 is implemented as a fully-connected neural network that is repeated across the entire segment that is output by the CNN stack 202, and each copy of the fully-connected neural network accepts as input a single strided frame. … Each copy of the fully-connected neural network shares the same parameters, in particular each of the weights of all of the nodes of the fully-connected neural network, which allows for computational and memory efficiency because the size of the fully-connected neural network corresponds to a single strided frame rather than the segment and one copy of the fully-connected neural network may be stored and reused” - teaches that a fully-connected 
and wherein all but a lowest level layer of the first neural model is copied to the second neural model (Col. 10 line 58 - Col. 11 line 18: “In an embodiment, the first fully-connected layer 203 is implemented as a fully-connected neural network that is repeated across the entire segment that is output by the CNN stack 202, and each copy of the fully-connected neural network accepts as input a single strided frame. … Each copy of the fully-connected neural network shares the same parameters, in particular each of the weights of all of the nodes of the fully-connected neural network, which allows for computational and memory efficiency because the size of the fully-connected neural network corresponds to a single strided frame rather than the segment and one copy of the fully-connected neural network may be stored and reused” - teaches a fully connected layer 203 implemented as a fully-connected neural network that receives inputs from CNN stack 202 (lowest level layer) where the fully connected layer 203 (all other layers of the model) is copied and reused for subsequent networks).
	Vanhoucke et al. and Sypniewski et al. are analogous to the claimed invention because they are directed to neural models for language processing.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein a first neural model and a second neural model are stored in the memory, and wherein all but a lowest level layer of the first neural model is copied to the second neural model as taught by Sypniewski et al. to the disclosed invention of Vanhoucke et al.
	One of ordinary skill in the art would have been motivated to make this modification because "Each copy of the fully-connected neural network shares the same parameters, in particular each of the weights of all of the nodes of the fully-connected neural network, which allows for computational and memory efficiency" (Sypniewski et al. Col. 11 lines 4-8).
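For illustration only, the limitation "all but a lowest level layer of the first neural model is copied to the second neural model" can be sketched as below. This is a minimal sketch under stated assumptions: each model is represented as a plain list of layer weight matrices ordered from lowest to highest, and the helper name `transfer_upper_layers` is hypothetical. It is not the implementation of the application or of any cited reference.

```python
import copy


def transfer_upper_layers(first_model, second_model):
    """Build a second model that keeps its own lowest layer and
    receives copies of every other layer of the first model.

    Both models are assumed to be lists of layers, index 0 being the lowest.
    """
    if len(first_model) != len(second_model):
        raise ValueError("models must have the same number of layers")
    own_lowest = copy.deepcopy(second_model[0])
    copied_upper = [copy.deepcopy(layer) for layer in first_model[1:]]
    return [own_lowest] + copied_upper


# Hypothetical three-layer models; each layer is a small weight matrix.
first = [[[0.1, 0.2]], [[0.3, 0.4]], [[0.5, 0.6]]]
second = [[[9.0, 9.0]], [[0.0, 0.0]], [[0.0, 0.0]]]

merged = transfer_upper_layers(first, second)
print(merged)
```

Deep copies are used so that later updates to the second model do not silently modify the first model.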
Vanhoucke et al. in view of Sypniewski et al. does not appear to explicitly teach to train the first neural model and a second neural model on parallel data between the first language or dialect and the second language or dialect based on an unlabeled loss function to update each of a plurality of layers of the first neural model and to define and update parameters of each of a plurality layers of the second neural model.
	However, Conneau et al. teaches to train the first neural model and a second neural model on parallel data between the first language or dialect and the second language or dialect based on an unlabeled loss function to update each of a plurality of layers of the first neural model and to define and update parameters of each of a plurality layers of the second neural model (Section 4.2.3 First two paragraphs: “Training for similarity of source and target sentences in an embedding space is conceptually and computationally simpler than generating a translation in the target language from a source sentence. We propose a method for training for cross lingual similarity and evaluate approaches based on the simpler task of aligning sentence representations. Under our objective, the embeddings of two parallel sentences need not be identical, but only close enough in the embedding space that the decision boundary of the English classifier captures the similarity. We propose a simple alignment loss function to align the embedding spaces of two different languages. Specifically, we train an English encoder on NLI, and train a target encoder by minimizing the loss” - teaches training cross-lingual models using parallel embeddings (parallel data) of the two languages and an alignment loss function (unlabeled loss function) for aligning the embedding spaces of two different languages. Section 5.1 Second Paragraph: “We use pretrained 300D aligned word embeddings for both X-CBOW and X-BILSTM” - teaches that 300D aligned word embeddings are the parallel embeddings (parallel data) used to train the model).
	Vanhoucke et al., Sypniewski et al., and Conneau et al. are analogous to the claimed invention because they are directed to neural models for language processing.
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate to train the first neural model and a second neural model on parallel data between the first language or dialect and the second language or dialect based on an unlabeled loss function as taught by Conneau et al. to the disclosed invention of Vanhoucke et al. in view of Sypniewski et al.

	One of ordinary skill in the art would have been motivated to make this modification because "While machine translation baselines obtained the best results in our experiments, these approaches rely on computationally-intensive translation models either at training or at test time. We found that cross-lingual encoder baselines provide an encouraging and efficient alternative" (Conneau et al. Section 6 Conclusion).
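For illustration only, training on parallel data with an unlabeled (alignment) loss can be sketched as below. The sketch assumes a squared L2 distance between the embeddings of two parallel sentences and a linear target encoder; the exact loss of Conneau et al. is not quoted in this action and may differ. The names `encode`, `alignment_loss`, and `align_step` are hypothetical.

```python
def encode(weights, features):
    """Linear encoder: embedding[j] = sum_i weights[j][i] * features[i]."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def alignment_loss(source_emb, target_emb):
    """Assumed unlabeled loss: squared L2 distance between parallel embeddings."""
    return sum((t - s) ** 2 for s, t in zip(source_emb, target_emb))


def align_step(target_weights, target_features, source_emb, lr=0.05):
    """One gradient step that moves the target embedding toward the source embedding.

    Only the target encoder is updated; the source embedding is treated as fixed.
    """
    target_emb = encode(target_weights, target_features)
    residual = [t - s for s, t in zip(source_emb, target_emb)]
    return [
        [w - lr * 2.0 * r * x for w, x in zip(row, target_features)]
        for row, r in zip(target_weights, residual)
    ]


# One parallel sentence pair: a fixed source embedding and target-side features.
source_embedding = [1.0, -1.0]
target_features = [0.5, 2.0]
target_weights = [[0.0, 0.0], [0.0, 0.0]]

before = alignment_loss(source_embedding, encode(target_weights, target_features))
target_weights = align_step(target_weights, target_features, source_embedding)
after = alignment_loss(source_embedding, encode(target_weights, target_features))
print(before, after)
```

No label appears anywhere in the update; the only training signal is that the two sentences are translations of each other.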

Regarding Claim 5,
	Vanhoucke et al. in view of Sypniewski et al. and further in view of Conneau et al. teaches the system of claim 1.
	Additionally, Conneau et al. further teaches wherein the second neural model is trained without annotated data of the second language or dialect (Section 5.1 Second Paragraph: “We use pretrained 300D aligned word embeddings for both X-CBOW and X-BILSTM” - teaches that 300D aligned word embeddings are the parallel embeddings (parallel data) used to train the second model. Section 5.2 First Paragraph: “We use publicly available parallel datasets to learn the alignment between English and target encoders. For French, Spanish, Russian, Arabic and Chinese, we use the United Nation corpora (Ziemski et al., 2016), for German, Greek and Bulgarian, the Europarl corpora (Koehn, 2005), for Turkish, Vietnamese and Thai, the OpenSubtitles 2018 corpus (Tiedemann, 2012), and for Hindi, the IIT Bombay corpus (Anoop et al., 2018). For all the above language pairs, we were able to gather more than 500,000 parallel sentences, and we set the maximum number of parallel sentences to 2 million” - teaches that the aligned word embeddings come from using parallel datasets (parallel corpora) between the first and second languages, not annotated data in the second language).
	Vanhoucke et al., Sypniewski et al., and Conneau et al. are analogous to the claimed invention because they are directed to neural models for language processing.
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein the second neural model is trained without annotated data of the second language or dialect as taught by Conneau et al. to the disclosed invention of Vanhoucke et al. in view of Sypniewski et al.
	One of ordinary skill in the art would have been motivated to make this modification because "While machine translation baselines obtained the best results in our experiments, these approaches rely on computationally-intensive translation models either at training or at test time. We found that cross-lingual encoder baselines provide an encouraging and efficient alternative" (Conneau et al. Section 6 Conclusion).

Regarding Claim 6,
	Vanhoucke et al. in view of Sypniewski et al. and further in view of Conneau et al. teaches the system of claim 1.
	Additionally, Conneau et al. further teaches wherein the second neural model is trained without a translation system, a dictionary, or a pivot lexicon (Section 5.2 First Paragraph: “We use publicly available parallel datasets to learn the alignment between English and target encoders. For French, Spanish, Russian, Arabic and Chinese, we use the United Nation corpora (Ziemski et al., 2016), for German, Greek and Bulgarian, the Europarl corpora (Koehn, 2005), for Turkish, Vietnamese and Thai, the OpenSubtitles 2018 corpus (Tiedemann, 2012), and for Hindi, the IIT Bombay corpus (Anoop et al., 2018). For all the above language pairs, we were able to gather more than 500,000 parallel sentences, and we set the maximum number of parallel sentences to 2 million” - teaches that the second neural model is trained only using parallel datasets to learn the alignment between the first language and second language. Section 5.3 Page 9 Second Column: "At inference time, the multilingual sentence encoder approach is however much cheaper than the TRANSLATE TEST baseline, and this method also does not require any machine translation system" - teaches that the encoder approach used does not need any form of translation system).
	Vanhoucke et al., Sypniewski et al., and Conneau et al. are analogous to the claimed invention because they are directed to neural models for language processing.
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein the second neural model is trained without a translation system, a dictionary, or a pivot lexicon as taught by Conneau et al. to the disclosed invention of Vanhoucke et al. in view of Sypniewski et al.
	One of ordinary skill in the art would have been motivated to make this modification because "While machine translation baselines obtained the best results in our experiments, these approaches rely on computationally-intensive translation models either at training or at test time. We found that cross-lingual encoder baselines provide an encouraging and efficient alternative" (Conneau et al. Section 6 Conclusion).

Regarding Claim 7,
	Vanhoucke et al. in view of Sypniewski et al. and further in view of Conneau et al. teaches the system of claim 1.
	Additionally, Vanhoucke et al. further teaches wherein training resources consist of annotated data of the first language or dialect … (Fig. 2; Col. 5 line 60 - Col. 6 line 10: “Initially, at block 202, the method 200 includes receiving training data comprising a respective training data set for each of two or more languages. A training data set for an individual language may include one or any combination of read/spontaneous speech and supervised/unsupervised data received from any number of sources. For instance, the training data may be data recorded from voice searches, audio clips, videos, or other sources that has been transcribed. More specifically, the training data may be feature-extracted windows of speech that are associated with a label. For example, the training data may include portions of audio data that are labeled as phonetic states or sub-phonetic states” - teaches that the model is trained using training data consisting of labeled phonetic states (annotated data)).
	Moreover, Conneau et al. further teaches wherein training resources consist of … unannotated parallel data in both the first language or dialect and the second language or dialect (Section 4.2.3 First two paragraphs: “Training for similarity of source and target sentences in an embedding space is conceptually and computationally simpler than generating a translation in the target language from a source sentence. We propose a method for training for cross lingual similarity and evaluate approaches based on the simpler task of aligning sentence representations. Under our objective, the embeddings of two parallel sentences need not be identical, but only close enough in the embedding space that the decision boundary of the English classifier captures the similarity” - teaches training cross-lingual models using parallel embeddings (parallel data) of the two languages. Section 5.1 Second Paragraph: “We use pretrained 300D aligned word embeddings for both X-CBOW and X-BILSTM” - teaches that 300D aligned word embeddings are the parallel embeddings (parallel data) used to train the model. Section 5.2 First Paragraph: “We use publicly available parallel datasets to learn the alignment between English and target encoders. For French, Spanish, Russian, Arabic and Chinese, we use the United Nation corpora (Ziemski et al., 2016), for German, Greek and Bulgarian, the Europarl corpora (Koehn, 2005), for Turkish, Vietnamese and Thai, the OpenSubtitles 2018 corpus (Tiedemann, 2012), and for Hindi, the IIT Bombay corpus (Anoop et al., 2018). For all the above language pairs, we were able to gather more than 500,000 parallel sentences, and we set the maximum number of parallel sentences to 2 million” - teaches that the aligned word embeddings come from using parallel corpora (unannotated parallel data) between the first (English in this example) and second languages (target language in this example); the parallel corpora are considered unannotated or unlabeled because the alignment has not yet taken place).
	Vanhoucke et al., Sypniewski et al., and Conneau et al. are analogous to the claimed invention because they are directed to neural models for language processing.
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein training resources consist of … unannotated parallel data in both the first language or dialect and the second language or dialect as taught by Conneau et al. to the disclosed invention of Vanhoucke et al. in view of Sypniewski et al.
	One of ordinary skill in the art would have been motivated to make this modification because "While machine translation baselines obtained the best results in our experiments, these approaches rely on computationally-intensive translation models either at training or at test time. We found that cross-lingual encoder baselines provide an encouraging and efficient alternative" (Conneau et al. Section 6 Conclusion).

Regarding Claim 8,
Vanhoucke et al. teaches a computer implemented method for cross-lingual neural model transfer (Col. 1 line 59 - Col. 2 line 9: “The method may also include processing a multilingual deep neural network (DNN) acoustic model based on the training data to generate a trained multilingual DNN acoustic model. The multilingual DNN acoustic model may include a feedforward neural network having multiple layers of one or more nodes. Each node of a given layer may connect with a respective weight to each node of a subsequent layer. Further, the multiple layers of one or more nodes may include one or more hidden layers of nodes that are shared between at least two of the two or more languages and a language-specific output layer of nodes corresponding to each of the two or more languages” - teaches a method for processing a multilingual deep neural network (DNN) acoustic model by sharing hidden layers for multiple languages. Fig. 9; Col. 10 lines 29-45: “FIG. 9 is a functional block diagram illustrating an example computing device 900 used in a computing system that is arranged in accordance with at least some embodiments described herein. The computing device 900 may be implemented to process a multilingual DNN acoustic model or perform any of the functions described above with reference to FIGS. 1-8” - teaches that the method can be implemented by a computing device 900), comprising: 
supplying annotated data of a first language to a first neural model of a first language (Fig. 2; Col. 5 line 60 - Col. 6 line 10: “Initially, at block 202, the method 200 includes receiving training data comprising a respective training data set for each of two or more languages. A training data set for an individual language may include one or any combination of read/spontaneous speech and supervised/unsupervised data received from any number of sources. For instance, the training data may be data recorded from voice searches, audio clips, videos, or other sources that has been transcribed. More specifically, the training data may be feature-extracted windows of speech that are associated with a label. For example, the training data may include portions of audio data that are labeled as phonetic states or sub-phonetic states” - teaches that the model is trained using training data consisting of labeled phonetic states (annotated data)), 
training the first neural model of the first language on the annotated data to define and update parameters of the first neural model of the first language based on a labeled loss function (Fig. 2; Col. 5 line 60 - Col. 6 line 10: “Initially, at block 202, the method 200 includes receiving training data comprising a respective training data set for each of two or more languages. A training data set for an individual language may include one or any combination of read/spontaneous speech and supervised/unsupervised data received from any number of sources. For instance, the training data may be data recorded from voice searches, audio clips, videos, or other sources that has been transcribed. More specifically, the training data may be feature-extracted windows of speech that are associated with a label. For example, the training data may include portions of audio data that are labeled as phonetic states or sub-phonetic states” - teaches that the model is trained using training data consisting of labeled phonetic states (annotated data). Col. 8 lines 13-31: “In one example, processing the multilingual DNN acoustic model may include jointly training the multilingual DNN acoustic model based on the training data. For example, the multilingual DNN acoustic model may be processed on the training data by backpropagating derivatives of a cost function that measures the discrepancy between an expected output d and actual output produced by the multilingual DNN acoustic model for each training data case. An example cost function C may be of the form:

[Equation image media_image1.png: example cost function C; not reproduced in this text]

Other example threshold functions, such as the hyperbolic tangent, or generally any function with a well-behaved derivate, may also be used. Similarly, other cost functions may be used for determining errors that are used in backpropagation” - teaches that training the model using the training data involves backpropagating derivatives of a cost function (labeled loss function) for determining errors); 
freezing the parameters of the first neural model of the first language (Fig. 8; Col. 9 line 64 - Col. 10 line 10 "During processing of the training data for the second language, the weights of the one or more bottom hidden layers may be held fixed" - teaches the weights (parameters) from the monolingual DNN acoustic model for the first language are held fixed during training of the multilingual DNN acoustic model for a second language); 
Vanhoucke et al. does not appear to explicitly teach supplying unannotated parallel data between the first language and a second language to the first neural model of the first language and a 
	However, Sypniewski et al. teaches merging a portion of the parameters of the first neural model of the first language into the second neural model of the second language (Col. 10 line 58 - Col. 11 line 18: “In an embodiment, the first fully-connected layer 203 is implemented as a fully-connected neural network that is repeated across the entire segment that is output by the CNN stack 202, and each copy of the fully-connected neural network accepts as input a single strided frame. … Each copy of the fully-connected neural network shares the same parameters, in particular each of the weights of all of the nodes of the fully-connected neural network” - teaches that copies (second neural model) of the fully connected neural network share the same parameters as the first fully connected layer 203 (first neural model)).
	Vanhoucke et al. and Sypniewski et al. are analogous to the claimed invention because they are directed to neural models for language processing.
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate merging a portion of the parameters of the first neural model of the first language into the second neural model of the second language as taught by Sypniewski et al. to the disclosed invention of Vanhoucke et al.
	One of ordinary skill in the art would have been motivated to make this modification because "Each copy of the fully-connected neural network shares the same parameters, in particular each of the weights of all of the nodes of the fully-connected neural network, which allows for computational and memory efficiency" (Sypniewski et al. Col. 11 lines 4-8).
Vanhoucke et al. in view of Sypniewski et al. does not appear to explicitly teach supplying unannotated parallel data between the first language and a second language to the first neural model of the first language and a second neural model of a second language; and training the second neural model of the second language on the unannotated parallel data to define and update parameters of the second neural model of the second language.
	However, Conneau et al. teaches supplying unannotated parallel data between the first language and a second language to the first neural model of the first language and a second neural model of a second language (Section 5.1 Second Paragraph: “We use pretrained 300D aligned word embeddings for both X-CBOW and X-BILSTM” - teaches that 300D aligned word embeddings are the parallel embeddings (parallel data) used to train the cross-lingual models. Section 5.2 First Paragraph: “We use publicly available parallel datasets to learn the alignment between English and target encoders. For French, Spanish, Russian, Arabic and Chinese, we use the United Nation corpora (Ziemski et al., 2016), for German, Greek and Bulgarian, the Europarl corpora (Koehn, 2005), for Turkish, Vietnamese and Thai, the OpenSubtitles 2018 corpus (Tiedemann, 2012), and for Hindi, the IIT Bombay corpus (Anoop et al., 2018). For all the above language pairs, we were able to gather more than 500,000 parallel sentences, and we set the maximum number of parallel sentences to 2 million” - teaches that the aligned word embeddings come from using parallel corpora (unannotated parallel data) between the first (English in this example) and second languages (target language in this example); the parallel corpora are considered unannotated or unlabeled because the alignment has not yet taken place); 
and training the second neural model of the second language on the unannotated parallel data to define and update parameters of the second neural model of the second language (Section 4.2.3 First two paragraphs: “Training for similarity of source and target sentences in an embedding space is conceptually and computationally simpler than generating a translation in the target language from a source sentence. We propose a method for training for cross lingual similarity and evaluate approaches based on the simpler task of aligning sentence representations. Under our objective, the embeddings of two parallel sentences need not be identical, but only close enough in the embedding space that the decision boundary of the English classifier captures the similarity. We propose a simple alignment loss function to align the embedding spaces of two different languages. Specifically, we train an English encoder on NLI, and train a target encoder by minimizing the loss” - teaches training cross-lingual models using parallel embeddings (parallel data) of the two languages and an alignment loss function (unlabeled loss function) for aligning the embedding spaces of two different languages. Section 5.1 Second Paragraph: “We use pretrained 300D aligned word embeddings for both X-CBOW and X-BILSTM” - teaches that 300D aligned word embeddings are the parallel embeddings (parallel data) used to train the model. Section 5.2 First Paragraph: “We use publicly available parallel datasets to learn the alignment between English and target encoders. For French, Spanish, Russian, Arabic and Chinese, we use the United Nation corpora (Ziemski et al., 2016), for German, Greek and Bulgarian, the Europarl corpora (Koehn, 2005), for Turkish, Vietnamese and Thai, the OpenSubtitles 2018 corpus (Tiedemann, 2012), and for Hindi, the IIT Bombay corpus (Anoop et al., 2018). 
For all the above language pairs, we were able to gather more than 500,000 parallel sentences, and we set the maximum number of parallel sentences to 2 million” - teaches that the aligned word embeddings come from using parallel corpora (unannotated parallel data) between the first (English in this example) and second languages (target language in this example); the parallel corpora are considered unannotated or unlabeled because the alignment has not yet taken place).
	Vanhoucke et al., Sypniewski et al., and Conneau et al. are analogous to the claimed invention because they are directed to neural models for language processing.
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate supplying unannotated parallel data between the first language 
	One of ordinary skill in the art would have been motivated to make this modification because "While machine translation baselines obtained the best results in our experiments, these approaches rely on computationally-intensive translation models either at training or at test time. We found that cross-lingual encoder baselines provide an encouraging and efficient alternative" (Conneau et al. Section 6 Conclusion).
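For illustration only, the freezing and merging limitations discussed above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the application or of any cited reference: the dictionary-of-lists parameter layout, the group names "encoder" and "classifier", and the helper names sgd_step and merge_portion are all hypothetical.

```python
def sgd_step(params, grads, frozen, lr=0.1):
    """Apply one gradient step; parameter groups named in `frozen` are held fixed."""
    return {
        name: list(values) if name in frozen
        else [v - lr * g for v, g in zip(values, grads[name])]
        for name, values in params.items()
    }


def merge_portion(first, second, names):
    """Copy only the named parameter groups of the first model into the second."""
    merged = {name: list(values) for name, values in second.items()}
    for name in names:
        merged[name] = list(first[name])
    return merged


# Hypothetical first-language model, already trained on annotated data.
first = {"encoder": [1.0, 2.0], "classifier": [0.5]}
# Hypothetical second-language model, still to be trained on parallel data.
second = {"encoder": [0.0, 0.0], "classifier": [0.0]}
grads = {"encoder": [1.0, 1.0], "classifier": [1.0]}

# Freezing: the first model's parameters do not change during later training.
first_after = sgd_step(first, grads, frozen={"encoder", "classifier"})
# The second model's parameters are updated.
second_after = sgd_step(second, grads, frozen=set(), lr=0.5)
# Merging: only the classifier portion is carried over to the second model.
merged = merge_portion(first, second_after, ["classifier"])
```

In this sketch the second model keeps its own trained encoder and receives only the classifier group from the first model, which is one possible reading of "merging a portion of the parameters".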

Regarding Claim 9,
	Vanhoucke et al. in view of Sypniewski et al. and further in view of Conneau et al. teaches the method of claim 8.
	Additionally, Vanhoucke et al. further teaches wherein a task of neural model includes one of sentiment classification, style classification, intent understanding, message routing, duration prediction, or structured content recognition (Col. 4 line 57 - Col. 5 line 7: “The multilingual DNN acoustic model 102 may be configured to output probabilities with respect to phonetic states corresponding to the acoustic input data. Speech can be broken into phonetic segments, referred to as phones, or further broken into sub-phonetic segments referred to as senones. The multilingual DNN acoustic model 102 may be configured to output, for each of multiple possible phonetic or sub-phonetic states, the probability that the received acoustic input data corresponds to the phonetic or sub-phonetic state” - teaches that the multilingual acoustic DNN model may be configured to output the probability that each input is a phonetic or sub-phonetic state (corresponds to structured content recognition)).


Regarding Claim 10,
	Vanhoucke et al. in view of Sypniewski et al. and further in view of Conneau et al. teaches the method of claim 8.
	Additionally, Conneau et al. further teaches wherein the second neural model of the second language is trained without annotated data of the second language (Section 5.1 Second Paragraph: “We use pretrained 300D aligned word embeddings for both X-CBOW and X-BILSTM” - teaches that 300D aligned word embeddings are the parallel embeddings (parallel data) used to train the second model. Section 5.2 First Paragraph: “We use publicly available parallel datasets to learn the alignment between English and target encoders. For French, Spanish, Russian, Arabic and Chinese, we use the United Nation corpora (Ziemski et al., 2016), for German, Greek and Bulgarian, the Europarl corpora (Koehn, 2005), for Turkish, Vietnamese and Thai, the OpenSubtitles 2018 corpus (Tiedemann, 2012), and for Hindi, the IIT Bombay corpus (Anoop et al., 2018). For all the above language pairs, we were able to gather more than 500,000 parallel sentences, and we set the maximum number of parallel sentences to 2 million” - teaches that the aligned word embeddings come from using parallel datasets (parallel corpora) between the first and second languages, not annotated data in the second language).
	Vanhoucke et al., Sypniewski et al., and Conneau et al. are analogous to the claimed invention because they are directed to neural models for language processing.
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein the second neural model of the second language is trained without annotated data of the second language as taught by Conneau et al. to the disclosed invention of Vanhoucke et al. in view of Sypniewski et al.


Regarding Claim 11,
	Vanhoucke et al. in view of Sypniewski et al. and further in view of Conneau et al. teaches the method of claim 8.
	Additionally, Conneau et al. further teaches wherein the second neural model of the second language is trained without a translation system, a dictionary, and a pivot lexicon (Section 5.2 First Paragraph: “We use publicly available parallel datasets to learn the alignment between English and target encoders. For French, Spanish, Russian, Arabic and Chinese, we use the United Nation corpora (Ziemski et al., 2016), for German, Greek and Bulgarian, the Europarl corpora (Koehn, 2005), for Turkish, Vietnamese and Thai, the OpenSubtitles 2018 corpus (Tiedemann, 2012), and for Hindi, the IIT Bombay corpus (Anoop et al., 2018). For all the above language pairs, we were able to gather more than 500,000 parallel sentences, and we set the maximum number of parallel sentences to 2 million” - teaches that the second neural model is trained only using parallel datasets to learn the alignment between the first language and second language. Section 5.3 Page 9 Second Column: "At inference time, the multilingual sentence encoder approach is however much cheaper than the TRANSLATE TEST baseline, and this method also does not require any machine translation system" - teaches that the encoder approach used does not need any form of translation system, dictionary, or pivot lexicon).
	Vanhoucke et al., Sypniewski et al., and Conneau et al. are analogous to the claimed invention because they are directed to neural models for language processing.

	One of ordinary skill in the art would have been motivated to make this modification because "While machine translation baselines obtained the best results in our experiments, these approaches rely on computationally-intensive translation models either at training or at test time. We found that cross-lingual encoder baselines provide an encouraging and efficient alternative" (Conneau et al. Section 6 Conclusion).

Regarding Claim 12,
	Vanhoucke et al. in view of Sypniewski et al. and further in view of Conneau et al. teaches the method of claim 8.
	Additionally, Vanhoucke et al. further teaches wherein training resources consist of annotated data of the first language … (Fig. 2; Col. 5 line 60 - Col. 6 line 10: “Initially, at block 202, the method 200 includes receiving training data comprising a respective training data set for each of two or more languages. A training data set for an individual language may include one or any combination of read/spontaneous speech and supervised/unsupervised data received from any number of sources. For instance, the training data may be data recorded from voice searches, audio clips, videos, or other sources that has been transcribed. More specifically, the training data may be feature-extracted windows of speech that are associated with a label. For example, the training data may include portions of audio data that are labeled as phonetic states or sub-phonetic states” - teaches that the model is trained using training data consisting of labeled phonetic states (annotated data)).
Conneau et al. further teaches wherein training resources consist of … unannotated parallel data in both the first and second languages (Section 4.2.3 First two paragraphs: “Training for similarity of source and target sentences in an embedding space is conceptually and computationally simpler than generating a translation in the target language from a source sentence. We propose a method for training for cross lingual similarity and evaluate approaches based on the simpler task of aligning sentence representations. Under our objective, the embeddings of two parallel sentences need not be identical, but only close enough in the embedding space that the decision boundary of the English classifier captures the similarity” - teaches training cross-lingual models using parallel embeddings (parallel data) of the two languages. Section 5.1 Second Paragraph: “We use pretrained 300D aligned word embeddings for both X-CBOW and X-BILSTM” - teaches that 300D aligned word embeddings are the parallel embeddings (parallel data) used to train the model. Section 5.2 First Paragraph: “We use publicly available parallel datasets to learn the alignment between English and target encoders. For French, Spanish, Russian, Arabic and Chinese, we use the United Nation corpora (Ziemski et al., 2016), for German, Greek and Bulgarian, the Europarl corpora (Koehn, 2005), for Turkish, Vietnamese and Thai, the OpenSubtitles 2018 corpus (Tiedemann, 2012), and for Hindi, the IIT Bombay corpus (Anoop et al., 2018). For all the above language pairs, we were able to gather more than 500,000 parallel sentences, and we set the maximum number of parallel sentences to 2 million” - teaches that the aligned word embeddings come from using parallel corpora (unannotated parallel data) between the first (English in this example) and second languages (target language in this example); the parallel corpora are considered unannotated or unlabeled because the alignment has not yet taken place).
	Vanhoucke et al., Sypniewski et al., and Conneau et al. are analogous to the claimed invention because they are directed to neural models for language processing.

	One of ordinary skill in the art would have been motivated to make this modification because "While machine translation baselines obtained the best results in our experiments, these approaches rely on computationally-intensive translation models either at training or at test time. We found that cross-lingual encoder baselines provide an encouraging and efficient alternative" (Conneau et al. Section 6 Conclusion).

Regarding Claim 13,
	Vanhoucke et al. in view of Sypniewski et al. and further in view of Conneau et al. teaches the method of claim 8.
Additionally, Vanhoucke et al. further teaches wherein training the first neural model of the first language on the annotated data to define and update parameters of the first neural model of the first language comprises optimizing the labeled loss function of the first neural model of the first language (Fig. 2; Col. 5 line 60 - Col. 6 line 10: “Initially, at block 202, the method 200 includes receiving training data comprising a respective training data set for each of two or more languages. A training data set for an individual language may include one or any combination of read/spontaneous speech and supervised/unsupervised data received from any number of sources. For instance, the training data may be data recorded from voice searches, audio clips, videos, or other sources that has been transcribed. More specifically, the training data may be feature-extracted windows of speech that are associated with a label. For example, the training data may include portions of audio data that are labeled as phonetic states or sub-phonetic states” - teaches that the model is trained using training data consisting of labeled phonetic states (annotated data). Col. 8 lines 13-31: “In one example, processing the multilingual DNN acoustic model may include jointly training the multilingual DNN acoustic model based on the training data. For example, the multilingual DNN acoustic model may be processed on the training data by backpropagating derivatives of a cost function that measures the discrepancy between an expected output d and actual output produced by the multilingual DNN acoustic model for each training data case. An example cost function C may be of the form:

[Equation image media_image1.png: example cost function C; not reproduced in this text]

Other example threshold functions, such as the hyperbolic tangent, or generally any function with a well-behaved derivate, may also be used. Similarly, other cost functions may be used for determining errors that are used in backpropagation” - teaches that training the model using the training data involves backpropagating derivatives of a cost function (labeled loss function) for determining errors. Col. 9 lines 6-17: “Parameters such as weights and errors from each model instance 604A-C may be communicated to and updated by a centralized parameter server 606. In one example, before processing a batch of training data, each model instance may query the centralized parameter server 606 for an updated copy of model parameters. … In some instances, an adaptive learning rate procedure may also be used for each parameter” - teaches that an adaptive learning rate procedure (optimization) may be used for each parameter (including errors from the cost function) of the model).
	Vanhoucke et al., Sypniewski et al., and Conneau et al. are analogous to the claimed invention because they are directed to neural models for language processing.
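For illustration only, a labeled loss of the kind discussed above, one that measures the discrepancy between an expected output d and the actual model output, can be sketched as follows. The exact cost function of Vanhoucke et al. appears in an image that is not reproduced in this text, so the cross-entropy form used here is an assumption, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw output-layer scores into probabilities."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(d, p):
    """Assumed labeled loss: discrepancy between expected output d and actual output p."""
    return -sum(dj * math.log(pj) for dj, pj in zip(d, p) if dj > 0)


def grad_wrt_logits(d, p):
    """Derivative of the assumed loss with respect to the logits (valid when d sums to 1)."""
    return [pj - dj for dj, pj in zip(d, p)]


# Hypothetical two-state example: the label says the first state is correct.
d = [1.0, 0.0]
p = softmax([0.0, 0.0])
loss = cross_entropy(d, p)
grad = grad_wrt_logits(d, p)
```

The gradient returned by grad_wrt_logits is the quantity that backpropagation would pass down to earlier layers; an optimizer would then adjust each weight using that signal.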

Regarding Claim 14,
Vanhoucke et al. in view of Sypniewski et al. and further in view of Conneau et al. teaches the method of claim 13.
Conneau et al. further teaches wherein the training the second neural model of the second language on the unannotated parallel data to define and update parameters of the second neural model of the second language comprises optimizing the unlabeled loss function between task representations yielded by the first neural model of the first language and the second neural model of the second language on the unannotated parallel data (Section 4.2.3 First two paragraphs: “Training for similarity of source and target sentences in an embedding space is conceptually and computationally simpler than generating a translation in the target language from a source sentence. We propose a method for training for cross lingual similarity and evaluate approaches based on the simpler task of aligning sentence representations. Under our objective, the embeddings of two parallel sentences need not be identical, but only close enough in the embedding space that the decision boundary of the English classifier captures the similarity. We propose a simple alignment loss function to align the embedding spaces of two different languages. Specifically, we train an English encoder on NLI, and train a target encoder by minimizing the loss: 

[Equation image media_image2.png: alignment loss of Conneau et al. Section 4.2.3; not reproduced in this text]

where (x, y) corresponds to the source and target sentence embeddings, (xc, yc) is a contrastive term (i.e. negative sampling), λ controls the weight of the negative examples in the loss” - teaches training cross-lingual models using parallel embeddings (parallel data) of the two languages and an alignment loss function (unlabeled loss function) for aligning the embedding spaces of two different languages. This further teaches that the loss function is minimized (optimized) based on embeddings (task representations) of a first language (source language/English in this example) and a second language (target language in this example). Section 5.1 Second Paragraph: “We use pretrained 300D aligned word embeddings for both X-CBOW and X-BILSTM” - teaches that 300D aligned word embeddings are the parallel embeddings (parallel data) used to train the model. Section 5.2 First Paragraph: “We use publicly available parallel datasets to learn the alignment between English and target encoders. For French, Spanish, Russian, Arabic and Chinese, we use the United Nation corpora (Ziemski et al., 2016), for German, Greek and Bulgarian, the Europarl corpora (Koehn, 2005), for Turkish, Vietnamese and Thai, the OpenSubtitles 2018 corpus (Tiedemann, 2012), and for Hindi, the IIT Bombay corpus (Anoop et al., 2018). For all the above language pairs, we were able to gather more than 500,000 parallel sentences, and we set the maximum number of parallel sentences to 2 million” - teaches that the aligned word embeddings come from using parallel corpora (unannotated parallel data) between the first (English in this example) and second languages (target language in this example); the parallel corpora are considered unannotated or unlabeled because the alignment has not yet taken place).
Vanhoucke et al., Sypniewski et al., and Conneau et al. are analogous to the claimed invention because they are directed to neural models for language processing.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein the training the second neural model of the second language on the unannotated parallel data to define and update parameters of the second neural model of the second language comprises optimizing the unlabeled loss function between task representations yielded by the first neural model of the first language and the second neural model of the second language on the unannotated parallel data as taught by Conneau et al. to the disclosed invention of Vanhoucke et al. in view of Sypniewski et al.
One of ordinary skill in the art would have been motivated to make this modification because "While machine translation baselines obtained the best results in our experiments, these approaches rely on computationally-intensive translation models either at training or at test time. We found that cross-lingual encoder baselines provide an encouraging and efficient alternative" (Conneau et al. Section 6 Conclusion).
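For illustration only, an alignment loss of the kind quoted from Conneau et al. can be sketched as follows. The equation itself appears in an image that is not reproduced in this text, so the contrastive form dist(x, y) - λ(dist(x_c, y) + dist(x, y_c)), the Euclidean distance, and the default value of lam are assumptions; the function names are hypothetical.

```python
import math


def l2_dist(u, v):
    """Euclidean distance between two embeddings of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def alignment_loss(x, y, x_c, y_c, lam=0.25):
    """Assumed contrastive alignment loss over one parallel sentence pair.

    x, y     : source and target sentence embeddings of a parallel pair
    x_c, y_c : contrastive (negative-sample) source and target embeddings
    lam      : weight of the negative examples (hypothetical default)
    """
    return l2_dist(x, y) - lam * (l2_dist(x_c, y) + l2_dist(x, y_c))


# Hypothetical 2-D embeddings chosen so the distances are easy to check by hand.
loss = alignment_loss([0.0, 0.0], [3.0, 4.0], [0.0, 4.0], [3.0, 0.0], lam=0.5)
```

No task label appears anywhere in the computation: the loss depends only on the embeddings that the two encoders produce for parallel sentences, which is why the rejection characterizes it as an unlabeled loss function.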

Regarding Claim 15,
Vanhoucke et al. teaches a computer implemented method for cross-lingual neural model transfer (Col. 1 line 59 - Col. 2 line 9: “The method may also include processing a multilingual deep neural network (DNN) acoustic model based on the training data to generate a trained multilingual DNN acoustic model. The multilingual DNN acoustic model may include a feedforward neural network having multiple layers of one or more nodes. Each node of a given layer may connect with a respective weight to each node of a subsequent layer. Further, the multiple layers of one or more nodes may include one or more hidden layers of nodes that are shared between at least two of the two or more languages and a language-specific output layer of nodes corresponding to each of the two or more languages” - teaches a method for processing a multilingual deep neural network (DNN) acoustic model by sharing hidden layers for multiple languages. Fig. 9; Col. 10 lines 29-45: “FIG. 9 is a functional block diagram illustrating an example computing device 900 used in a computing system that is arranged in accordance with at least some embodiments described herein. The computing device 900 may be implemented to process a multilingual DNN acoustic model or perform any of the functions described above with reference to FIGS. 1-8” - teaches that the method can be implemented by a computing device 900), comprising: 
supplying annotated data of a first language to a first neural model of a first language (Fig. 2; Col. 5 line 60 - Col. 6 line 10: “Initially, at block 202, the method 200 includes receiving training data comprising a respective training data set for each of two or more languages. A training data set for an individual language may include one or any combination of read/spontaneous speech and supervised/unsupervised data received from any number of sources. For instance, the training data may be data recorded from voice searches, audio clips, videos, or other sources that has been transcribed. More specifically, the training data may be feature-extracted windows of speech that are associated with a label. For example, the training data may include portions of audio data that are labeled as phonetic states or sub-phonetic states” - teaches that the model is trained using training data consisting of labeled phonetic states (annotated data));  
and training the first neural model of the first language on the annotated data to define and update parameters of the first neural model of the first language based on a labeled loss function (Fig. 2; Col. 5 line 60 - Col. 6 line 10: “Initially, at block 202, the method 200 includes receiving training data comprising a respective training data set for each of two or more languages. A training data set for an individual language may include one or any combination of read/spontaneous speech and supervised/unsupervised data received from any number of sources. For instance, the training data may be data recorded from voice searches, audio clips, videos, or other sources that has been transcribed. More specifically, the training data may be feature-extracted windows of speech that are associated with a label. For example, the training data may include portions of audio data that are labeled as phonetic states or sub-phonetic states” - teaches that the model is trained using training data consisting of labeled phonetic states (annotated data). Col. 8 lines 13-31: “In one example, processing the multilingual DNN acoustic model may include jointly training the multilingual DNN acoustic model based on the training data. For example, the multilingual DNN acoustic model may be processed on the training data by backpropagating derivatives of a cost function that measures the discrepancy between an expected output d and actual output produced by the multilingual DNN acoustic model for each training data case. An example cost function C may be of the form:

[Equation image media_image1.png: example cost function C; not reproduced in this text]

Other example threshold functions, such as the hyperbolic tangent, or generally any function with a well-behaved derivate, may also be used. Similarly, other cost functions may be used for determining errors that are used in backpropagation” - teaches that training the model using the training data involves backpropagating derivatives of a cost function (labeled loss function) for determining errors).
Vanhoucke et al. does not appear to explicitly teach supplying unannotated parallel data between the first language and a second language to the first neural model of the first language and a second neural model of a second language; training the first neural model of the first language and the second neural model of the second language on the parallel data to update the parameters of the first neural model of the first language and define and update parameters of the second neural model of the second language; and merging a portion of the parameters of the first neural model of the first language into the second neural model of the second language.
	However, Sypniewski et al. teaches merging a portion of the parameters of the first neural model of the first language into the second neural model of the second language (Col. 10 line 58 - Col. 11 line 18: “In an embodiment, the first fully-connected layer 203 is implemented as a fully-connected neural network that is repeated across the entire segment that is output by the CNN stack 202, and each copy of the fully-connected neural network accepts as input a single strided frame. … Each copy of the fully-connected neural network shares the same parameters, in particular each of the weights of all of the nodes of the fully-connected neural network” - teaches that copies (second neural model) of the fully connected neural network share the same parameters as the first fully connected layer 203 (first neural model)).
	Vanhoucke et al. and Sypniewski et al. are analogous to the claimed invention because they are directed to neural models for language processing.
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate merging a portion of the parameters of the first neural model of the first language into the second neural model of the second language as taught by Sypniewski et al. to the disclosed invention of Vanhoucke et al.
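	One of ordinary skill in the art would have been motivated to make this modification because "Each copy of the fully-connected neural network shares the same parameters, in particular each of the weights of all of the nodes of the fully-connected neural network, which allows for computational and memory efficiency" (Sypniewski et al. Col. 11 lines 4-8).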
Vanhoucke et al. in view of Sypniewski et al. does not appear to explicitly teach supplying unannotated parallel data between the first language and a second language to the first neural model of the first language and a second neural model of a second language; and training the first neural model of the first language and the second neural model of the second language on the parallel data to update the parameters of the first neural model of the first language and define and update parameters of the second neural model of the second language.
However, Conneau et al. teaches supplying unannotated parallel data between the first language and a second language to the first neural model of the first language and a second neural model of a second language (Section 5.1 Second Paragraph: “We use pretrained 300D aligned word embeddings for both X-CBOW and X-BILSTM” - teaches that 300D aligned word embeddings are the parallel embeddings (parallel data) used to train the second model. Section 5.2 First Paragraph: “We use publicly available parallel datasets to learn the alignment between English and target encoders. For French, Spanish, Russian, Arabic and Chinese, we use the United Nation corpora (Ziemski et al., 2016), for German, Greek and Bulgarian, the Europarl corpora (Koehn, 2005), for Turkish, Vietnamese and Thai, the OpenSubtitles 2018 corpus (Tiedemann, 2012), and for Hindi, the IIT Bombay corpus (Anoop et al., 2018). For all the above language pairs, we were able to gather more than 500,000 parallel sentences, and we set the maximum number of parallel sentences to 2 million” - teaches that the aligned word embeddings come from using parallel corpora (unannotated parallel data) between the first (English in this example) and second languages (target language in this example); the parallel corpora are considered unannotated or unlabeled because the alignment has not yet taken place); 
and training the first neural model of the first language and the second neural model of the second language on the parallel data to update the parameters of the first neural model of the first language and define and update parameters of the second neural model of the second language (Section 4.2.3 First two paragraphs: “Training for similarity of source and target sentences in an embedding space is conceptually and computationally simpler than generating a translation in the target language from a source sentence. We propose a method for training for cross lingual similarity and evaluate approaches based on the simpler task of aligning sentence representations. Under our objective, the embeddings of two parallel sentences need not be identical, but only close enough in the embedding space that the decision boundary of the English classifier captures the similarity. We propose a simple alignment loss function to align the embedding spaces of two different languages. Specifically, we train an English encoder on NLI, and train a target encoder by minimizing the loss” - teaches training a first language model (source (English in this example)) and a second language model (target in this example) using parallel embeddings (parallel data) of the two languages and an alignment loss function (unlabeled loss function) for aligning the embedding spaces of two different languages. Section 5.1 Second Paragraph: “We use pretrained 300D aligned word embeddings for both X-CBOW and X-BILSTM” - teaches that 300D aligned word embeddings are the parallel embeddings (parallel data) used to train the model. Section 5.2 First Paragraph: “We use publicly available parallel datasets to learn the alignment between English and target encoders. For French, Spanish, Russian, Arabic and Chinese, we use the United Nation corpora (Ziemski et al., 2016), for German, Greek and Bulgarian, the Europarl corpora (Koehn, 2005), for Turkish, Vietnamese and Thai, the OpenSubtitles 2018 corpus (Tiedemann, 2012), and for Hindi, the IIT Bombay corpus (Anoop et al., 2018). For all the above language pairs, we were able to gather more than 500,000 parallel sentences, and we set the maximum number of parallel sentences to 2 million” - teaches that the aligned word embeddings come from using parallel corpora (unannotated parallel data) between the first (English in this example) and second languages (target language in this example); the parallel corpora are considered unannotated or unlabeled because the alignment has not yet taken place).
	Vanhoucke et al., Sypniewski et al., and Conneau et al. are analogous to the claimed invention because they are directed to neural models for language processing.
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate supplying unannotated parallel data between the first language and a second language to the first neural model of the first language and a second neural model of a second language; and training the first neural model of the first language and the second neural model of the second language on the parallel data to update the parameters of the first neural model of the first language and define and update parameters of the second neural model of the second language as taught by Conneau et al. to the disclosed invention of Vanhoucke et al. in view of Sypniewski et al.
	One of ordinary skill in the art would have been motivated to make this modification because "While machine translation baselines obtained the best results in our experiments, these approaches rely on computationally-intensive translation models either at training or at test time. We found that cross-lingual encoder baselines provide an encouraging and efficient alternative" (Conneau et al. Section 6 Conclusion).

Regarding Claim 16,
	Vanhoucke et al. in view of Sypniewski et al. and further in view of Conneau et al. teaches the method of claim 15.
	Additionally, Vanhoucke et al. further teaches wherein a task of the neural model includes one of sentiment classification, style classification, intent understanding, message routing, duration prediction, or structured content recognition (Col. 4 line 57 - Col. 5 line 7: “The multilingual DNN acoustic model 102 may be configured to output probabilities with respect to phonetic states corresponding to the acoustic input data. Speech can be broken into phonetic segments, referred to as phones, or further broken into sub-phonetic segments referred to as senones. The multilingual DNN acoustic model 102 may be configured to output, for each of multiple possible phonetic or sub-phonetic states, the probability that the received acoustic input data corresponds to the phonetic or sub-phonetic state” - teaches that the multilingual acoustic DNN model may be configured to output the probability that each input is a phonetic or sub-phonetic state (corresponds to structured content recognition)).
	Vanhoucke et al., Sypniewski et al., and Conneau et al. are analogous to the claimed invention because they are directed to neural models for language processing.

Regarding Claim 17,
Vanhoucke et al. in view of Sypniewski et al. and further in view of Conneau et al. teaches the method of claim 15.
	Additionally, Conneau et al. further teaches wherein the second neural model of the second language is trained without annotated data of the second language, a translation system, a dictionary, and a pivot lexicon (Section 5.1 Second Paragraph: “We use pretrained 300D aligned word embeddings for both X-CBOW and X-BILSTM” - teaches that 300D aligned word embeddings are the parallel embeddings (parallel data) used to train the second model. Section 5.2 First Paragraph: “We use publicly available parallel datasets to learn the alignment between English and target encoders. For French, Spanish, Russian, Arabic and Chinese, we use the United Nation corpora (Ziemski et al., 2016), for German, Greek and Bulgarian, the Europarl corpora (Koehn, 2005), for Turkish, Vietnamese and Thai, the OpenSubtitles 2018 corpus (Tiedemann, 2012), and for Hindi, the IIT Bombay corpus (Anoop et al., 2018). For all the above language pairs, we were able to gather more than 500,000 parallel sentences, and we set the maximum number of parallel sentences to 2 million” - teaches that the aligned word embeddings come from using parallel corpora (unannotated parallel data) between the first (English in this example) and second languages (target language in this example); the parallel corpora are considered unannotated or unlabeled because the alignment has not yet taken place. Section 5.3 Page 9 Second Column: "At inference time, the multilingual sentence encoder approach is however much cheaper than the TRANSLATE TEST baseline, and this method also does not require any machine translation system" - teaches that the encoder approach used does not need any form of translation system, dictionary, or pivot lexicon).
	Vanhoucke et al., Sypniewski et al., and Conneau et al. are analogous to the claimed invention because they are directed to neural models for language processing.
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein the second neural model of the second language is trained without annotated data of the second language, a translation system, a dictionary, and a pivot lexicon as taught by Conneau et al. to the disclosed invention of Vanhoucke et al. in view of Sypniewski et al.
	One of ordinary skill in the art would have been motivated to make this modification because "While machine translation baselines obtained the best results in our experiments, these approaches rely on computationally-intensive translation models either at training or at test time. We found that cross-lingual encoder baselines provide an encouraging and efficient alternative" (Conneau et al. Section 6 Conclusion).

Regarding Claim 18,
Vanhoucke et al. in view of Sypniewski et al. and further in view of Conneau et al. teaches the method of claim 15.
	Additionally, Vanhoucke et al. further teaches wherein training resources consist of annotated data of the first language … (Fig. 2; Col. 5 line 60 - Col. 6 line 10: “Initially, at block 202, the method 200 includes receiving training data comprising a respective training data set for each of two or more languages. A training data set for an individual language may include one or any combination of read/spontaneous speech and supervised/unsupervised data received from any number of sources. For instance, the training data may be data recorded from voice searches, audio clips, videos, or other sources that has been transcribed. More specifically, the training data may be feature-extracted windows of speech that are associated with a label. For example, the training data may include portions of audio data that are labeled as phonetic states or sub-phonetic states” - teaches that the model is trained using training data consisting of labeled phonetic states (annotated data)).
	Moreover, Conneau et al. further teaches wherein training resources consist of … unannotated parallel data in both the first and second languages (Section 4.2.3 First two paragraphs: “Training for similarity of source and target sentences in an embedding space is conceptually and computationally simpler than generating a translation in the target language from a source sentence. We propose a method for training for cross lingual similarity and evaluate approaches based on the simpler task of aligning sentence representations. Under our objective, the embeddings of two parallel sentences need not be identical, but only close enough in the embedding space that the decision boundary of the English classifier captures the similarity” - teaches training cross-lingual models using parallel embeddings (parallel data) of the two languages. Section 5.1 Second Paragraph: “We use pretrained 300D aligned word embeddings for both X-CBOW and X-BILSTM” - teaches that 300D aligned word embeddings are the parallel embeddings (parallel data) used to train the model. Section 5.2 First Paragraph: “We use publicly available parallel datasets to learn the alignment between English and target encoders. For French, Spanish, Russian, Arabic and Chinese, we use the United Nation corpora (Ziemski et al., 2016), for German, Greek and Bulgarian, the Europarl corpora (Koehn, 2005), for Turkish, Vietnamese and Thai, the OpenSubtitles 2018 corpus (Tiedemann, 2012), and for Hindi, the IIT Bombay corpus (Anoop et al., 2018). For all the above language pairs, we were able to gather more than 500,000 parallel sentences, and we set the maximum number of parallel sentences to 2 million” - teaches that the aligned word embeddings come from using parallel corpora (unannotated parallel data) between the first (English in this example) and second languages (target language in this example); the parallel corpora are considered unannotated or unlabeled because the alignment has not yet taken place).
	Vanhoucke et al., Sypniewski et al., and Conneau et al. are analogous to the claimed invention because they are directed to neural models for language processing.
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein training resources consist of … unannotated parallel data in both the first and second languages as taught by Conneau et al. to the disclosed invention of Vanhoucke et al. in view of Sypniewski et al.
	One of ordinary skill in the art would have been motivated to make this modification because "While machine translation baselines obtained the best results in our experiments, these approaches rely on computationally-intensive translation models either at training or at test time. We found that cross-lingual encoder baselines provide an encouraging and efficient alternative" (Conneau et al. Section 6 Conclusion).

Regarding Claim 19,
	Vanhoucke et al. in view of Sypniewski et al. and further in view of Conneau et al. teaches the method of claim 15.
	Additionally, Vanhoucke et al. further teaches wherein training the first neural model of the first language on the annotated data to define and update parameters of the first neural model comprises optimizing the labeled loss function of the first neural model of the first language (Fig. 2; Col. 5 line 60 - Col. 6 line 10: “Initially, at block 202, the method 200 includes receiving training data comprising a respective training data set for each of two or more languages. A training data set for an individual language may include one or any combination of read/spontaneous speech and supervised/unsupervised data received from any number of sources. For instance, the training data may be data recorded from voice searches, audio clips, videos, or other sources that has been transcribed. More specifically, the training data may be feature-extracted windows of speech that are associated with a label. For example, the training data may include portions of audio data that are labeled as phonetic states or sub-phonetic states” - teaches that the model is trained using training data consisting of labeled phonetic states (annotated data). Col. 8 lines 13-31: “In one example, processing the multilingual DNN acoustic model may include jointly training the multilingual DNN acoustic model based on the training data. For example, the multilingual DNN acoustic model may be processed on the training data by backpropagating derivatives of a cost function that measures the discrepancy between an expected output d and actual output produced by the multilingual DNN acoustic model for each training data case. An example cost function C may be of the form:

    [Equation image: media_image1.png (example cost function C)]

Other example threshold functions, such as the hyperbolic tangent, or generally any function with a well-behaved derivate, may also be used. Similarly, other cost functions may be used for determining errors that are used in backpropagation” - teaches that training the model using the training data involves backpropagating derivatives of a cost function (labeled loss function) for determining errors. Col. 9 lines 6-17: “Parameters such as weights and errors from each model instance 604A-C may be communicated to and updated by a centralized parameter server 606. In one example, before processing a batch of training data, each model instance may query the centralized parameter server 606 for an updated copy of model parameters. … In some instances, an adaptive learning rate procedure may also be used for each parameter” - teaches that an adaptive learning rate procedure (optimization) may be used for each parameter (including errors from the cost function) of the model).


Regarding Claim 20,
Vanhoucke et al. in view of Sypniewski et al. and further in view of Conneau et al. teaches the method of claim 19.
Additionally, Conneau et al. further teaches wherein the training the first neural model of the first language and the second neural model of the second language on the parallel data to update the parameters of the first neural model of the first language and define and update parameters of the second neural model of the second language comprises optimizing a loss function between task representations yielded by the first neural model of the first language and the second neural model of the second language on the parallel data (Section 4.2.3 First two paragraphs: “Training for similarity of source and target sentences in an embedding space is conceptually and computationally simpler than generating a translation in the target language from a source sentence. We propose a method for training for cross lingual similarity and evaluate approaches based on the simpler task of aligning sentence representations. Under our objective, the embeddings of two parallel sentences need not be identical, but only close enough in the embedding space that the decision boundary of the English classifier captures the similarity. We propose a simple alignment loss function to align the embedding spaces of two different languages. Specifically, we train an English encoder on NLI, and train a target encoder by minimizing the loss: 

    [Equation image: media_image2.png (alignment loss)]

where (x, y) corresponds to the source and target sentence embeddings, (x_c, y_c) is a contrastive term (i.e. negative sampling), λ controls the weight of the negative examples in the loss” - teaches training a first language model (source (English in this example)) and a second language model (target in this example) using parallel embeddings (parallel data) of the two languages and an alignment loss function (unlabeled loss function) for aligning the embedding spaces of two different languages. Section 5.1 Second Paragraph: “We use pretrained 300D aligned word embeddings for both X-CBOW and X-BILSTM” - teaches that 300D aligned word embeddings are the parallel embeddings (parallel data) used to train the model. Section 5.2 First Paragraph: “We use publicly available parallel datasets to learn the alignment between English and target encoders. For French, Spanish, Russian, Arabic and Chinese, we use the United Nation corpora (Ziemski et al., 2016), for German, Greek and Bulgarian, the Europarl corpora (Koehn, 2005), for Turkish, Vietnamese and Thai, the OpenSubtitles 2018 corpus (Tiedemann, 2012), and for Hindi, the IIT Bombay corpus (Anoop et al., 2018). For all the above language pairs, we were able to gather more than 500,000 parallel sentences, and we set the maximum number of parallel sentences to 2 million” - teaches that the aligned word embeddings come from using parallel corpora (unannotated parallel data) between the first (English in this example) and second languages (target language in this example); the parallel corpora are considered unannotated or unlabeled because the alignment has not yet taken place).
Vanhoucke et al., Sypniewski et al., and Conneau et al. are analogous to the claimed invention because they are directed to neural models for language processing.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein the training the first neural model of the first language and the second neural model of the second language on the parallel data to update the parameters of the first neural model of the first language and define and update parameters of the second neural model of the second language comprises optimizing a loss function between task representations yielded by the first neural model of the first language and the second neural model of the second language on the parallel data as taught by Conneau et al. to the disclosed invention of Vanhoucke et al. in view of Sypniewski et al.
One of ordinary skill in the art would have been motivated to make this modification because "While machine translation baselines obtained the best results in our experiments, these approaches rely on computationally-intensive translation models either at training or at test time. We found that cross-lingual encoder baselines provide an encouraging and efficient alternative" (Conneau et al. Section 6 Conclusion).
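For illustration only, the alignment objective quoted above from Conneau et al. Section 4.2.3 might be sketched as follows. The equation itself is not reproduced in this text, so the form used here (distance between the parallel pair minus λ times the distances to the contrastive samples), the Euclidean distance, the function names, and the default value of λ are all assumptions of this sketch rather than statements of the reference.

```python
import math


def l2_dist(u, v):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def alignment_loss(x, y, x_c, y_c, lam=0.25):
    """Assumed form: dist(x, y) - lam * (dist(x_c, y) + dist(x, y_c)).

    x, y     : source and target sentence embeddings of a parallel pair
    x_c, y_c : contrastive (negative-sample) embeddings
    lam      : weight of the negative examples (hypothetical default)
    """
    positive = l2_dist(x, y)                        # pull the parallel pair together
    negative = l2_dist(x_c, y) + l2_dist(x, y_c)    # push the negatives apart
    return positive - lam * negative


# A parallel pair that is already aligned scores lower than a misaligned one.
aligned = alignment_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0])
misaligned = alignment_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0])
```

Under these assumptions, minimizing the loss over the target encoder's parameters draws the target-language embedding toward the source-language embedding of the same sentence without any task labels for the second language.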

Claims 2-4 are rejected under 35 U.S.C. 103 as being unpatentable over Vanhoucke et al. (US 9,460,711 B1) in view of Sypniewski et al. (US 10,720,151 B2), in view of Conneau et al. (“XNLI: Evaluating Cross-lingual Sentence Representations”), and further in view of Li et al. (US 2017/0109355 A1).
Regarding Claim 2,
Vanhoucke et al. in view of Sypniewski et al. and further in view of Conneau et al. teaches the system of claim 1.
As discussed above with respect to claim 1, Vanhoucke et al. in view of Sypniewski et al. and further in view of Conneau et al. teaches a system with both a first neural model and a second neural model.
Vanhoucke et al. in view of Sypniewski et al. and further in view of Conneau et al. does not appear to explicitly teach wherein the first neural model comprises: a first embedding layer, which converts linguistic units of the first language or dialect into vector representations; a first task-appropriate model architecture having a predetermined network configuration including one or more layers; and a first prediction layer, wherein one of the layers included in the first task-appropriate model architecture is a first task representation layer, and wherein the first task representation layer immediately precedes the first prediction layer.
However, Li et al. teaches wherein the first neural model comprises: a first embedding layer, which converts linguistic units of the first language or dialect into vector representations (Fig. 2, Fig.4; [0047]: “FIG. 4 shows the detailed process of step 244 according to embodiments of the present disclosure. At step 2442, the embedding layer 210 transforms the one or more words of the input query into one or more embeddings, where each embedding is a vector that represents the corresponding word” - teaches that a neural model comprises an embedding layer 210 that converts an input of words into vectors representing the words (word embeddings)); 
a first task-appropriate model architecture having a predetermined network configuration including one or more layers (Fig. 2, Fig. 4; [0047]: “FIG. 4 shows the detailed process of step 244 according to embodiments of the present disclosure. … Then, at step 2444, the stacked-bidirectional RNN 212, to produce one or more tokens corresponding to the one or more embeddings, respectively, and binary classification features of whether each token is a part of the subject chunk or not” - teaches that a neural model comprises a stacked bidirectional RNN 212 (first task-appropriate model architecture) configured to learn to produce the features for word classification); 
and a first prediction layer (Fig. 2, Fig. 4; [0047]: “FIG. 4 shows the detailed process of step 244 according to embodiments of the present disclosure. … Next, at step 2446, based on the classification features, the logical regression layer 214 predicts the probability of each token being a part of the subject chunk” - teaches a logic regression layer 214 (prediction layer) as part of a neural model for use in predicting the probability of a word embedding being part of a subject chunk), 
wherein one of the layers included in the first task-appropriate model architecture is a first task representation layer, and wherein the first task representation layer immediately precedes the first prediction layer (Fig.2; teaches a final concatenation layer (first task representation layer) as part of the 
Vanhoucke et al., Sypniewski et al., Conneau et al., and Li et al. are analogous to the claimed invention because they are directed to neural models for language processing.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein the first neural model comprises: a first embedding layer, which converts linguistic units of the first language or dialect into vector representations; a first task-appropriate model architecture having a predetermined network configuration including one or more layers; and a first prediction layer, wherein one of the layers included in the first task-appropriate model architecture is a first task representation layer, and wherein the first task representation layer immediately precedes the first prediction layer as taught by Li et al. to the disclosed invention of Vanhoucke et al. in view of Sypniewski et al. and further in view of Conneau et al.
One of ordinary skill in the art would have been motivated to make this modification because “Extensively utilizing continuous Embedding and Stacked Bidirectional Gated-Recurrent-Units-Recurrent-Neural-Network (GRU-RNN) as sub-modules in embodiments of the system, excellent performance is obtained on all sub-modules, which collectively form a powerful yet intuitive neural pipeline” (Li et al. [0030]).
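For illustration only, the three-part structure cited above from Li et al. [0047] (embedding layer 210, stacked-bidirectional RNN 212, logical regression layer 214) might be sketched as follows. This is a minimal sketch under stated assumptions: the class name `TinyTagger`, the dimensions, the random weights, the single recurrent layer with weights shared across directions, and the plain tanh recurrence (in place of the GRU units Li et al. describe) are hypothetical simplifications, not the reference's implementation.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Small random weight matrix (illustrative initialization only)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


class TinyTagger:
    """Embedding layer -> bidirectional recurrent layer -> per-token logistic layer."""

    def __init__(self, vocab, emb_dim=4, hid_dim=3, seed=0):
        rng = random.Random(seed)
        self.index = {w: i for i, w in enumerate(vocab)}
        self.emb = make_matrix(len(vocab), emb_dim, rng)    # embedding layer
        self.w_in = make_matrix(hid_dim, emb_dim, rng)      # input-to-hidden weights
        self.w_rec = make_matrix(hid_dim, hid_dim, rng)     # hidden-to-hidden weights
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(2 * hid_dim)]

    def embed(self, words):
        """Convert each word into its vector representation."""
        return [self.emb[self.index[w]] for w in words]

    def _run(self, vectors):
        """One recurrent pass over a sequence, returning every hidden state."""
        h = [0.0] * len(self.w_rec)
        states = []
        for v in vectors:
            a, b = matvec(self.w_in, v), matvec(self.w_rec, h)
            h = [math.tanh(x + y) for x, y in zip(a, b)]
            states.append(h)
        return states

    def represent(self, words):
        """Concatenate forward and backward states: the layer feeding the predictor."""
        vecs = self.embed(words)
        fwd = self._run(vecs)
        bwd = self._run(vecs[::-1])[::-1]
        return [f + b for f, b in zip(fwd, bwd)]

    def predict(self, words):
        """Per-token probability from a logistic (sigmoid) prediction layer."""
        probs = []
        for rep in self.represent(words):
            z = sum(w * r for w, r in zip(self.w_out, rep))
            probs.append(1.0 / (1.0 + math.exp(-z)))
        return probs


tagger = TinyTagger(["who", "wrote", "hamlet"])
probs = tagger.predict(["who", "wrote", "hamlet"])
```

In this sketch the concatenated representation returned by `represent` is the layer immediately preceding the prediction layer, which is the position the rejection maps to the claimed task representation layer.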

Regarding Claim 3,
Vanhoucke et al. in view of Sypniewski et al., in view of Conneau et al., and further in view of Li et al. teaches the system of claim 2.
Vanhoucke et al. in view of Sypniewski et al. and further in view of Conneau et al. teaches a system with both a first neural model and a second neural model.
	Additionally, Li et al. further teaches wherein the second neural model comprises: a second embedding layer, which converts linguistic units of the second language or dialect into vector representations (Fig. 2, Fig.4; [0047]: “FIG. 4 shows the detailed process of step 244 according to embodiments of the present disclosure. At step 2442, the embedding layer 210 transforms the one or more words of the input query into one or more embeddings, where each embedding is a vector that represents the corresponding word” - teaches that a neural model comprises an embedding layer 210 that converts an input of words into vectors representing the words (word embeddings));  
a second task-appropriate model architecture having a predetermined network configuration including one or more layers (Fig. 2, Fig. 4; [0047]: “FIG. 4 shows the detailed process of step 244 according to embodiments of the present disclosure. … Then, at step 2444, the stacked-bidirectional RNN 212, to produce one or more tokens corresponding to the one or more embeddings, respectively, and binary classification features of whether each token is a part of the subject chunk or not” - teaches that a neural model comprises a stacked bidirectional RNN 212 (first task-appropriate model architecture) configured to learn to produce the features for word classification); 
and a second prediction layer (Fig. 2, Fig. 4; [0047]: “FIG. 4 shows the detailed process of step 244 according to embodiments of the present disclosure. … Next, at step 2446, based on the classification features, the logical regression layer 214 predicts the probability of each token being a part of the subject chunk” - teaches a logic regression layer 214 (prediction layer) as part of a neural model for use in predicting the probability of a word embedding being part of a subject chunk).
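Vanhoucke et al., Sypniewski et al., Conneau et al., and Li et al. are analogous to the claimed invention because they are directed to neural models for language processing.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein the second neural model comprises: a second embedding layer, which converts linguistic units of the second language or dialect into vector representations; a second task-appropriate model architecture having a predetermined network configuration including one or more layers; and a second prediction layer as taught by Li et al. to the disclosed invention of Vanhoucke et al. in view of Sypniewski et al. and further in view of Conneau et al.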
One of ordinary skill in the art would have been motivated to make this modification because “Extensively utilizing continuous Embedding and Stacked Bidirectional Gated-Recurrent-Units-Recurrent-Neural-Network (GRU-RNN) as sub-modules in embodiments of the system, excellent performance is obtained on all sub-modules, which collectively form a powerful yet intuitive neural pipeline” (Li et al. [0030]).

Regarding Claim 4,
Vanhoucke et al. in view of Sypniewski et al., in view of Conneau et al., and further in view of Li et al. teaches the system of claim 3.
	Additionally, Vanhoucke et al. further teaches wherein a task of the task-appropriate model architecture includes one of sentiment classification, style classification, intent understanding, message routing, duration prediction, and structured content recognition (Col. 4 line 57 - Col. 5 line 7: “The multilingual DNN acoustic model 102 may be configured to output probabilities with respect to phonetic states corresponding to the acoustic input data. Speech can be broken into phonetic segments, referred to as phones, or further broken into sub-phonetic segments referred to as senones. The multilingual DNN acoustic model 102 may be configured to output, for each of multiple possible phonetic or sub-phonetic states, the probability that the received acoustic input data corresponds to the phonetic or sub-phonetic state” - teaches that a neural model may be configured to output the probability that each input is a phonetic or sub-phonetic state (corresponds to structured content recognition)).
Vanhoucke et al., Sypniewski et al., Conneau et al., and Li et al. are analogous to the claimed invention because they are directed to neural models for language processing.


Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to BRIAN J HALES whose telephone number is (571)272-0878. The examiner can normally be reached M-Th 8:00am - 5:00pm and F 8:00am - 2:30pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kamran Afshar, can be reached on (571) 272-7796. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/BRIAN J HALES/

/KAMRAN AFSHAR/
Supervisory Patent Examiner, Art Unit 2125