DETAILED ACTION
This communication is in response to the Amendment and Arguments filed on 11/09/2021. Claims 1-7, 9-17, and 19-20 are pending and have been examined. Hence, this action has been made FINAL. All Objections/Rejections not mentioned in this OA have been withdrawn by the Examiner.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
The Applicant has amended the independent claims. Hence, the Applicant’s arguments are moot in view of new grounds for rejection. More specifically, the newly added limitations of “training a neural machine translation model using a single language corpus” and “a second language output from the first-to-second language translation model created using the single language corpus” raise new grounds for rejection.
Claims 5-11 have been amended to include structure for performing the listed functions. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. Therefore, the interpretation of these claims under 35 U.S.C. 112(f) is withdrawn.
With respect to claims 1, 5, and 9, the Applicant provides on page 7 of the Remarks an overview of the currently cited prior art of record. The Applicant notes that the currently cited prior art of record does not teach or suggest that a single language corpus, as opposed to a parallel corpus, is used for the training of the neural machine translation model as recited in claims 1, 5, and 9. While Lample (1) teaches training a neural machine translation model using a single language corpus (see p. 2, section 1, para. 6, where “by only using monolingual data, we can encode sentences of both languages into the same feature space, and from there, we can also decode/translate in any of these languages”), including a first-to-second language translation model including a first attention network and a second-to-first language translation model including a second attention network (see p. 3, section 2.1, para. 1, where “The translation model we propose is composed of an encoder and a decoder, respectively responsible for encoding source and target sentences to a latent space, and to decode from that latent space to the source or the target domain… we use a sequence-to-sequence model with attention”), Lample (1) does not teach transferring the distribution error to the first-to-second language translation model and the second-to-first language translation model.
Lample (2), in the paper “Phrase-Based & Neural Unsupervised Machine Translation” (Aug 2018) teaches comparing a distribution of the first language output from the second-to-first language translation model with a distribution of a first language sentence input to the first-to-second language translation model (see p. 4, section 3.2, col.2, where the machine translation uses a Phrase-based Statistical Machine Translation (PBSMT) system and “PBSMT first infers an alignment between source and target phrases. It then populates phrase tables, whose entries store the probability that a certain n-gram in the source/target language is mapped to another n-gram in the target/source language”) and transferring a distribution error to the first-to-second language translation model and the second-to-first language translation model. (see p.2, Figure 1, caption, where the models are trained by “starting from an observed source sentence (filled red circle) we use the current source → target model to translate (dashed arrow), yielding a potentially incorrect translation (blue cross near the empty circle). Starting from this (back) translation, we use the target → source model (continuous arrow) to reconstruct the sentence in the original language. The discrepancy between the reconstruction and the initial sentence provides error signal to train the target → source model parameters. The same procedure is applied in the opposite direction to train the source → target model.”)
The translation models disclosed in Lample (1) use a single corpus for training rather than a parallel corpus (see p. 1, Abstract, where “This work investigates how to learn to translate when having access to only large monolingual corpora in each language…models respectively obtain 28.1 and 25.2 BLEU points without using a single parallel sentence”).
Lample (1) and Lample (2) are combinable because they both disclose models for neural machine translation using monolingual corpora. Therefore, it would have been obvious for a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the unsupervised NMT training method laid out in Lample (1) with the probability distribution calculated by the traditional phrase-based statistical machine translation system of Lample (2). One would be motivated to do so because “PBSMT models are well-known to outperform neural models when labeled data is scarce because they merely count occurrences, whereas neural models typically fit hundred of millions of parameters to learn distributed representations, which may generalize better when data is abundant but is prone to overfit when data is scarce.” (Lample (2), p. 2, section 1, col. 1).
With respect to claims 2-4, 6-8 and 10-11, the Applicant states their dependencies on the presently amended independent claims, and restates the argument that the cited references do not teach or suggest the features disclosed by the amendment. The Examiner respectfully submits that they 
Hence, the Applicant’s arguments are not persuasive. 

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 5 and 6 are rejected under 35 U.S.C. 103 as being unpatentable over Lample (1), “Unsupervised Machine Translation Using Monolingual Corpora Only” (Apr 2018) in view of “Phrase-Based & Neural Unsupervised Machine Translation” (Aug 2018) by Lample (2).
Regarding claims 1 and 5, Lample (1) teaches a method of training a neural machine translation model using a single language corpus including a first-to-second language translation model including a first attention network and a second-to-first language translation model including a second attention network, (“The translation model we propose is composed of an encoder and a decoder, respectively responsible for encoding source and target sentences to a latent space, and to decode from that latent space to the source or the target domain. We use a single encoder and a single decoder for both domains… we use a sequence-to-sequence model with attention (Bahdanau et al., 2015)”, p. 3, section 2.1, para. 1. In Lample (1), the language model uses the encoder and decoder to conduct encoding and decoding from source/target to target/source, analogous to the first-to-second and second-to-first language models. The model’s decoder follows the attention mechanism introduced in Bahdanau et al.’s “Neural Machine Translation by Jointly Learning to Align and Translate” (2015). The same encoder, decoder, and attention mechanism are used for encoding source or target, and decoding to target or source. Furthermore, the method of training uses a single language corpus, as seen in p. 2, section 1, para. 5, where “In this paper, we investigate whether it is possible to train a general machine translation system without any form of supervision whatsoever. The only assumption we make is that there exists a monolingual corpus on each language.”)
the method comprising: inputting a second language output (“The input is a noisy translation (in this case, from source-to-target) produced by the model itself, M, at the previous iteration” p. 2, Figure 1, caption) from the first-to-second language translation model created using the single language corpus to (“the model is trained to translate a sentence in the other domain” p. 2, Figure 1, caption, and “by only using monolingual data, we can encode sentences of both languages into the same feature space, and from there, we can also decode/translate in any of these languages”, p. 2, section 1, para. 6) the second-to-first language translation model (“The model is symmetric, and we repeat the same process in the other language” p. 2, Figure 1, caption)  and outputting a translated first language; (“the model is trained to reconstruct a sentence from a noisy version of it” p. 2 Figure 1, caption) 
(“The key idea is to build a common latent space between the two languages (or domains) and to learn to translate by reconstructing in both domains according to two principles: (i) the model has to be able to reconstruct a sentence in a given language from a noisy version of it, as in standard denoising auto-encoders. (ii) The model also learns to reconstruct any source sentence given a noisy translation of the same sentence in the target domain, and vice versa. For (ii), the translated sentence is obtained by using a back-translation procedure, i.e. by using the learned model to translate the source sentence to the target domain”, p. 2, section 1, para. 6. Lample (1) employs a data-augmentation scheme called “back-translation” for single language corpora, taught in Sennrich et al.’s “Improving neural machine translation models with monolingual data” (2015). Back-translation is described by Lample (1) as the procedure “whereby an auxiliary translation system from the target language to the source language is first trained on the available parallel data, and then used to produce translations from a large monolingual corpus on the target side. The pairs composed of these translations with their corresponding ground truth targets are then used as additional training data for the original translation system”, p. 1, section 1, para. 2).
and comparing a distribution of the first language output from the second-to-first language translation model with a distribution of a first language sentence input to the first-to-second language translation model (“we constrain the source and target sentence latent representations to have the same distribution” p. 2, section 1, para. 6) 
(“At a high level, the model starts with an unsupervised naive translation model obtained by making word-by-word translation of sentences using a parallel dictionary learned in an unsupervised way. Then, at each iteration, the encoder and decoder are trained by minimizing an objective function that measures their ability to both reconstruct and translate from a noisy version of an input training sentence. This noisy input is obtained by dropping and swapping words in the case of the auto-encoding task, while it is the result of a translation with the model at the previous iteration in the case of the translation task. In order to promote alignment of the latent distribution of sentences in the source and the target domains, our approach also simultaneously learns a discriminator in an adversarial setting. The newly learned encoder/decoder are then used at the next iteration to generate new translations, until convergence of the algorithm” p. 3, section 2.2, para. 2. Lample (1) measures the translation and back-translation outputs (“while it is the result of the translation with the model at the previous iteration in case of the translation task”, p. 3, section 2.2, para. 2) and compares the distribution of the output in the discriminator.)

Lample (2), in the paper “Phrase-Based & Neural Unsupervised Machine Translation” (Aug 2018) teaches comparing a distribution of the first language output from the second-to-first language translation model with a distribution of a first language sentence input to the first-to-second language translation model (see p. 4, section 3.2, col.2, where the machine translation uses a Phrase-based Statistical Machine Translation (PBSMT) system and “PBSMT first infers an alignment between source and target phrases. It then populates phrase tables, whose entries store the probability that a certain n-gram in the source/target language is mapped to another n-gram in the target/source language”) and transferring a distribution error to the first-to-second language translation model and the second-to-first language translation model. (see p.2, Figure 1, caption, where the models are trained by “starting from an observed source sentence (filled red circle) we use the current source → target model to translate (dashed arrow), yielding a potentially incorrect translation (blue cross near the empty circle). Starting from this (back) translation, we use the target → source model (continuous arrow) to reconstruct the sentence in the original language. The discrepancy between the reconstruction and the initial sentence provides error signal to train the target → source model parameters. The same procedure is applied in the opposite direction to train the source → target model.”)
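For illustration only, the round trip described in the quoted Figure 1 caption of Lample (2) (translate, back-translate, and measure the discrepancy against the original sentence) can be sketched as follows. This is a toy sketch and not the method of either reference: the word-by-word lookup tables, the function names `translate` and `discrepancy`, and the token-mismatch measure are all hypothetical stand-ins for the learned models and the differentiable training loss.

```python
def translate(sentence, table):
    """Word-by-word translation (hypothetical stand-in for a learned model).

    Unknown words are passed through unchanged.
    """
    return [table.get(token, token) for token in sentence]


def discrepancy(original, reconstruction):
    """Fraction of token positions where the reconstruction differs."""
    mismatches = sum(a != b for a, b in zip(original, reconstruction))
    mismatches += abs(len(original) - len(reconstruction))
    return mismatches / max(len(original), len(reconstruction), 1)


# Hypothetical lookup tables; the target -> source table has one wrong entry.
src_to_tgt = {"the": "le", "cat": "chat", "sleeps": "dort"}
tgt_to_src = {"le": "the", "chat": "dog", "dort": "sleeps"}

source = ["the", "cat", "sleeps"]
back_translation = translate(source, src_to_tgt)           # source -> target
reconstruction = translate(back_translation, tgt_to_src)   # target -> source
error_before = discrepancy(source, reconstruction)

# The discrepancy acts as the error signal: here the faulty entry is corrected
# by hand, whereas a trained model would update its parameters instead.
tgt_to_src["chat"] = "cat"
error_after = discrepancy(source, translate(back_translation, tgt_to_src))
```

In this sketch the error falls to zero once the target-to-source table reconstructs the original sentence; the same loop run in the opposite direction would supply the signal for the source-to-target table.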
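The translation models disclosed in Lample (1) use a single corpus for training rather than a parallel corpus (see p. 1, Abstract, where “This work investigates how to learn to translate when having access to only large monolingual corpora in each language…models respectively obtain 28.1 and 25.2 BLEU points without using a single parallel sentence”).
Lample (1) and Lample (2) are combinable because they both disclose models for neural machine translation using monolingual corpora. Therefore, it would have been obvious for a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the unsupervised NMT training method laid out in Lample (1) with the probability distribution calculated by the traditional phrase-based statistical machine translation system of Lample (2). One would be motivated to do so because “PBSMT models are well-known to outperform neural models when labeled data is scarce because they merely count occurrences, whereas neural models typically fit hundred of millions of parameters to learn distributed representations, which may generalize better when data is abundant but is prone to overfit when data is scarce.” (Lample (2), p. 2, section 1, col. 1).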
As to claim 5, the device of claim 5 corresponds to the method of claim 1, with the function of each claimed element corresponding to a step of the claimed method. Accordingly, claim 5 is rejected under the same rationale as applied above with respect to method claim 1. Furthermore, Lample (1) teaches a processor and a non-transitory storage device (see p. 2, section 1, para. 5, where the authors use a “general machine translation system”, which can be assumed to involve a processor and memory).
	
Regarding claim 6, Lample (1) and Lample (2) teach the device of claim 5. Lample (1) teaches wherein the first-to-second language translation model unit comprises: (“the model is trained to translate a sentence in the other domain” p. 2, Figure 1, caption)
a first encoder network for receiving the first language as an input and modelling the first language; (“The translation model we propose is composed of an encoder…responsible for encoding source to a latent space” p. 3, section 2.1, para. 1)
a first decoder network for modelling the second language; (“The translation model we propose is composed of…decoder…responsible for [decoding] from that latent space to the source or the target domain.” p. 3, section 2.1, para. 1)
and a first attention network (“we use a sequence-to-sequence model with attention (Bahdanau et al., 2015)” p. 3, section 2.2, para. 4) for modelling word alignment information between the first language and the second language; (“The most critical component is the unsupervised word alignment technique, either in the form of a back-translation dataset generated using word-by-word translation, or in the form of pretrained embeddings which enable to map sentences of different languages in the same latent space.” p. 11, section 4.5, para. 6)
and the second-to-first language translation model unit comprises: (“The model is symmetric, and we repeat the same process in the other language” p. 2, Figure 1, caption)
a second encoder network for receiving the second language as an input and modelling the second language; (“The translation model we propose is composed of an encoder…responsible for encoding source to a latent space” p. 3, section 2.1, para. 1)
a second decoder network modelling the first language; (“The translation model we propose is composed of…decoder…responsible for [decoding] from that latent space to the source or the target domain.” p. 3, section 2.1, para. 1)
and a second attention network (“we use a sequence-to-sequence model with attention (Bahdanau et al., 2015)” p. 3, section 2.2, para. 4) for modelling word alignment information between the second language and the first language. (“The most critical component is the unsupervised word alignment technique, either in the form of a back-translation dataset generated using word-by-word translation, or in the form of pretrained embeddings which enable to map sentences of different languages in the same latent space.” p. 11, section 4.5, para. 6)
(In Lample (1), the language model uses the encoder and decoder to conduct encoding and decoding from source/target to target/source, the same as the first-to-second and second-to-first language models. The architecture of the translation models is illustrated in Figure 2 (p. 6, section 3). The model’s decoder follows the attention mechanism introduced in Bahdanau et al.’s “Neural Machine Translation by Jointly Learning to Align and Translate” (2015). The same encoder, decoder, and attention mechanism are used for encoding source or target, and decoding to target or source.)

Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over Lample (1) in view of Lample (2) and further in view of “Minimum Risk Training for Neural Machine Translation” by Shen et al. (2016), henceforth “Shen et al.”
Regarding claim 2, Lample (1) and Lample (2) teach the method of claim 1. Lample (1) and Lample (2) do not teach the comparing of the two distributions comprising comparing the two distributions using cross entropy. Shen et al. teaches an “end-to-end neural machine translation” system, similar to the one used in Lample (1) and Lample (2), that is “based on the encoder-decoder framework with an encoder to read and encode a source-language sentence into a vector, from which a decoder generates a target-language sentence” (p. 1683, section 1, cols. 1-2). Shen et al. teaches computing a cross-entropy loss over distributions in its discussion of maximum likelihood estimation (MLE), which “usually uses the cross-entropy loss focusing on word-level errors to maximize the probability of the next correct word” (p. 1684, section 2, col. 2).
The authors improve on MLE through their method of minimum risk training (MRT), which “introduces evaluation metrics as loss functions and aims to minimize expected loss on the training data” and allows for “evaluation metrics that actually quantify translation quality” between different models (p. 1683, section 1, cols. 1-2). Table 1 depicts a comparison of losses between different model predictions (p. 1685, section 3). Furthermore, the models may be trained on a single language corpus. The authors experiment with the SMT system MOSES and the NMT system RNNSEARCH as the translation models, and note regarding their training that “It is possible to exploit larger monolingual corpora for both MOSES and RNNSEARCH (Gulcehre et al., 2015; Sennrich et al., 2015)” (p. 1687, Footnote 2).
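By way of illustration, the comparison of two distributions using cross entropy discussed above can be sketched as follows. This is a minimal sketch under stated assumptions, not code from Shen et al. or from the application: the function name `cross_entropy`, the `eps` floor, and the three-word toy distributions are hypothetical.

```python
import math


def cross_entropy(p, q, eps=1e-12):
    """Cross entropy H(p, q) = -sum_w p(w) * log q(w), in nats.

    p and q map words to probabilities. eps is an assumed floor that keeps
    the logarithm finite when q gives a word zero probability.
    """
    return -sum(
        p_w * math.log(max(q.get(word, 0.0), eps))
        for word, p_w in p.items()
        if p_w > 0.0
    )


# Hypothetical word distributions: the first-language input, and two
# candidate first-language outputs of a round-trip translation.
input_dist = {"the": 0.5, "cat": 0.3, "sat": 0.2}
close_output = {"the": 0.5, "cat": 0.3, "sat": 0.2}
poor_output = {"the": 0.2, "cat": 0.2, "sat": 0.6}

loss_close = cross_entropy(input_dist, close_output)
loss_poor = cross_entropy(input_dist, poor_output)
```

The cross entropy is smallest when the output distribution matches the input distribution, in which case it equals the entropy of the input, so `loss_close` is lower than `loss_poor` here.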

Claims 3, 7, 9, and 11 are rejected under 35 U.S.C. 103 as being unpatentable over Lample (1) in view of Lample (2) and further in view of “Universal Neural Machine Translation for Extremely Low Resource Languages” by Gu et al. (2018), henceforth “Gu et al.”
Regarding claims 3 and 7, Lample (1) teaches the use of attention networks in the language model (“we use a sequence-to-sequence model with attention” p. 3, section 2.1, para. 4). Lample (1) does not teach normalizing the alignment information of the aforementioned attention network to have orthogonal relation. Gu et al. teaches normalizing alignment information in its orthogonal projection of normalized vectors (“finding an orthogonal transformation Ok that makes the projected word vectors as close as to its corresponding universal tokens” p. 4, col. 1). A “list of source word-universal token pairs” called seeds is taken from “automatic word-alignment of parallel sentences” (p. 4, section 3.1, cols. 1-2). Training the projection matrix involves “learning the optimal projection which maps the original monolingual embeddings into EK space”, where EK denotes the matrix of the universal token language. The universal token language is a representation that the authors use to develop a multilingual NMT system that can handle many different languages (one of the “components to extend the conventional multi-lingual NMT system…[is the] Universal Lexical Representation (ULR)” p. 3, section 3, col. 1). The NMT system incorporates different language embeddings trained from separate networks using monolingual data (“As shown in Figure 2, we have multiple embedding representations. EQ is language-specific embedding trained on monolingual data” p. 3, section 3.1, col. 2), teaching the use of normalizing alignment information originating from separate networks.
Lample (1) and Gu et al. are combinable because they both deal with word-alignment information in language translations. It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the attention networks disclosed by Lample (1) by adding the orthogonal projection matrix taught by Gu et al. One of ordinary skill in the art would have been motivated to make this modification in order to obtain an optimal projection for word-alignment of the language pairs from source and target languages (Gu et al., p. 4).

Regarding claim 9, the combination of Lample (1), Lample (2), and Gu et al. teaches a device for training a neural machine translation model (“neural machine translation system” Lample (1), p. 4, section 2.5, para. 1). Lample (1) teaches a first-to-second language translation model (“the model is trained to translate a sentence in the other domain” p. 2, Figure 1, caption, “The architecture is a sequence to sequence model, with both encoder and decoder operating on two languages depending on an input language identifier that swaps lookup tables” p. 6, Figure 2, caption) configured to translate the input first language into the second language and output the second language (“The input is a noisy translation (in this case, from source-to-target) produced by the model itself, M, at the previous iteration” p. 2, Figure 1, caption), 
the first-to-second language translation model comprising (“the model is trained to translate a sentence in the other domain” p. 2, Figure 1, caption)
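a first encoder network for receiving the first language as an input and modeling the first language, (“The translation model we propose is composed of an encoder…responsible for encoding source to a latent space” p. 3, section 2.1, para. 1)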
a first decoder network for modeling a second language, (“The translation model we propose is composed of…decoder…responsible for [decoding] from that latent space to the source or the target domain.” p. 3, section 2.1, para. 1)
and a first attention network (“we use a sequence-to-sequence model with attention (Bahdanau et al., 2015)” p. 3, section 2.2, para. 4) for modeling word alignment information between the first language and the second language; (“The most critical component is the unsupervised word alignment technique, either in the form of a back-translation dataset generated using word-by-word translation, or in the form of pretrained embeddings which enable to map sentences of different languages in the same latent space.” p. 11, section 4.5, para. 6; The model’s decoder follows the attention mechanism introduced in Bahdanau et al.’s “Neural Machine Translation by Jointly Learning to Align and Translate” (2015))
a second-to-first language translation model configured to (“The model is symmetric, and we repeat the same process in the other language” p. 2, Figure 1, caption)
translate the second language output from the first-to-second language translation model into the first language and output the first language, 
the second-to-first language translation model comprising a second encoder network for receiving the second language as an input and modeling the second language, 
a second decoder network for modeling the first language, and (“The translation model we propose is composed of…decoder…responsible for [decoding] from that latent space to the source or the target domain.” p. 3, section 2.1, para. 1)
a second attention network (“we use a sequence-to-sequence model with attention (Bahdanau et al., 2015)” p. 3, section 2.2, para. 4) for modeling word alignment information between the second language and the first language; (“The most critical component is the unsupervised word alignment technique, either in the form of a back-translation dataset generated using word-by-word translation, or in the form of pretrained embeddings which enable to map sentences of different languages in the same latent space.” p. 11, section 4.5, para. 6; “We use a single encoder and a single decoder for both domains” p. 3; The same encoder, decoder, and attention mechanism are used for encoding source or target, and decoding to target or source. Thus, the architecture for the first-to-second language translation model is used again in Lample (1) for the second-to-first language translation model.)
Lample (1) teaches the use of attention networks in the language model (“we use a sequence-to-sequence model with attention” p. 3, section 2.2, para. 4). Lample (1) does not teach the means for normalizing the alignment information of the aforementioned attention network to have orthogonal relation. Gu et al. teaches normalizing alignment information in its orthogonal projection of normalized vectors (“finding an orthogonal transformation Ok that makes the projected word vectors as close as to its corresponding universal tokens” p. 4, section 3.1, col. 1; “a list of source word-universal token pairs” called seeds is taken from “automatic word-alignment of parallel sentences” p. 4, section 3.1, cols. 1-2). Training the projection matrix involves “learning the optimal projection which maps the original monolingual embeddings into EK space”, where EK denotes the matrix of the universal token language. The universal token language is a representation that the authors use to develop a multilingual NMT system that can handle many different languages (one of the “components to extend the conventional multi-lingual NMT system…[is the] Universal Lexical Representation (ULR)” p. 3, section 3, col. 1). The NMT system incorporates different language embeddings trained from separate networks using monolingual data (“As shown in Figure 2, we have multiple embedding representations. EQ is language-specific embedding trained on monolingual data” p. 3, section 3.1, col. 2), teaching the use of normalizing alignment information originating from separate networks.
Lample (1) and Gu et al. are combinable because they both deal with word-alignment information in language translations. It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the attention networks disclosed by Lample (1) by adding the orthogonal projection matrix taught by Gu et al. One of ordinary skill in the art would have been motivated to make this modification in order to obtain an optimal projection for word-alignment of the language pairs from source and target languages (Gu et al., p. 4).

Regarding claim 11, Lample (1), Lample (2) and Gu et al. teach the device of claim 9, further comprising: a means for (“a discriminator in an adversarial setting” Lample (1), p. 3, section 2.2, para. 2) comparing a distribution of the first language output from the second-to-first language translation model with a distribution of the first language input to the first-to-second language translation model; (“we constrain the source and target sentence latent representations to have the same distribution” p. 2, section 1, para. 6)
(“At a high level, the model starts with an unsupervised naive translation model obtained by making word-by-word translation of sentences using a parallel dictionary learned in an unsupervised way. Then, at each iteration, the encoder and decoder are trained by minimizing an objective function that measures their ability to both reconstruct and translate from a noisy version of an input training sentence. This noisy input is obtained by dropping and swapping words in the case of the auto-encoding task, while it is the result of a translation with the model at the previous iteration in the case of the translation task. In order to promote alignment of the latent distribution of sentences in the source and the target domains, our approach also simultaneously learns a discriminator in an adversarial setting. The newly learned encoder/decoder are then used at the next iteration to generate new translations, until convergence of the algorithm” p. 3, section 2.2, para. 2. Lample (1) measures the translation and back-translation outputs (“while it is the result of the translation with the model at the previous iteration in case of the translation task”, p. 3, section 2.2, para. 2) and compares the distribution of the output in the discriminator.)
Lample (1) does not teach transferring the distribution error to the first-to-second language translation model and the second-to-first language translation model. 
Lample (2), in the paper “Phrase-Based & Neural Unsupervised Machine Translation” (Aug 2018) teaches comparing a distribution of the first language output from the second-to-first language translation model with a distribution of a first language sentence input to the first-to-second language translation model (see p. 4, section 3.2, col.2, where the machine translation uses a Phrase-based Statistical Machine Translation (PBSMT) system and “PBSMT first infers an alignment between source and target phrases. It then populates phrase tables, whose entries store the probability that a certain n-gram in the source/target language is mapped to another n-gram in the target/source language”) and transferring a distribution error to the first-to-second language translation model and the second-to-first language translation model. (see p.2, Figure 1, caption, where the models are trained by “starting from an observed source sentence (filled red circle) we use the current source → target model to translate (dashed arrow), yielding a potentially incorrect translation (blue cross near the empty circle). Starting from this (back) translation, we use the target → source model (continuous arrow) to reconstruct the sentence in the original language. The discrepancy between the reconstruction and the initial sentence provides error signal to train the target → source model parameters. The same procedure is applied in the opposite direction to train the source → target model.”)
The translation models disclosed in Lample (1) use a single corpus for training rather than a parallel corpus (see p. 1, Abstract, where “This work investigates how to learn to translate when having access to only large monolingual corpora in each language…models respectively obtain 28.1 and 25.2 BLEU points without using a single parallel sentence”).
Lample (1) and Lample (2) are combinable because they both disclose models for neural machine translation using monolingual corpora. Therefore, it would have been obvious for a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the unsupervised NMT training method laid out in Lample (1) with the probability distribution calculated by the traditional phrase-based statistical machine translation system of Lample (2). One would be motivated to do so because “PBSMT models are well-known to outperform neural models when labeled data is scarce because they merely count occurrences, whereas neural models typically fit hundred of millions of parameters to learn distributed representations, which may generalize better when data is abundant but is prone to overfit when data is scarce.” (Lample (2), p. 2, section 1, col. 1).

Claims 4, 8, and 10 are rejected under 35 U.S.C. 103 as being unpatentable over Lample (1) in view of Lample (2), Gu et al. and further in view of “Offline Bilingual Word Vectors, Orthogonal Transformation and the Inverted Softmax” by Smith et al. (2017), henceforth “Smith et al.”
Regarding claims 4, 8, and 10, Lample (1), Lample (2), and Gu et al. teach using a loss function for comparing the alignment information from the first-to-second language translation model and the second-to-first language translation model. Lample (1) describes its objective as to “constrain the model to be able to map an input sentence from the source/target domain l1 to the target/source domain l2”, to which end “the encoder and decoder are trained by minimizing an objective function that measures their ability to both reconstruct and translate from a noisy version of an input training sentence” (pp. 2-3, section 2.2, para. 2). The mapping of sentences serves an alignment purpose, and the objective function being minimized is a kind of loss function.
Smith et al. address offline bilingual word vectors, “whereby two pre-trained embeddings are aligned with a linear transformation…the linear transformation between two spaces should be orthogonal” (p. 1, Abstract). Smith et al. teach training the translation models by minimizing loss in the cost function using singular value decomposition (SVD), which finds an orthogonal transformation to align projected word vectors and their universal tokens (“U and V are composed of columns of orthonormal vectors, while Σ is a diagonal matrix containing the singular values. Our cost function is minimized by O = UV^T.” p. 3, section 2.2, para. 3). The transposes of U and V are U^T and V^T, and by “[mapping] both languages into a single space, by applying the transformation V^T to the source language and U^T to the target language” the function for “the cosine similarity of translation pairs in the dictionary” is maximized, showing that a “self-consistent linear mapping between semantic spaces” of the pre-trained embeddings “must be orthogonal” (p. 3, section 2.2, para. 1-6).
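The SVD step quoted above can be illustrated with a minimal NumPy sketch. It assumes dictionary word vectors stored as matrix rows and forms the product of the target and source dictionary matrices before decomposing it. The function name orthogonal_alignment and the synthetic data are hypothetical, and the sketch is not asserted to be the code of Smith et al.

```python
import numpy as np


def orthogonal_alignment(source_vecs, target_vecs):
    """Return the orthogonal matrix O = U V^T that aligns source to target.

    Rows of source_vecs and target_vecs are assumed to be paired
    dictionary entries (row i of one translates row i of the other).
    """
    product = target_vecs.T @ source_vecs
    u, _, v_transposed = np.linalg.svd(product)
    return u @ v_transposed


# Synthetic check: target vectors are an exact rotation of source vectors.
rng = np.random.default_rng(0)
source = rng.normal(size=(50, 3))
hidden_rotation, _ = np.linalg.qr(rng.normal(size=(3, 3)))
target = source @ hidden_rotation.T

recovered = orthogonal_alignment(source, target)
print(np.allclose(recovered, hidden_rotation))
print(np.allclose(recovered @ recovered.T, np.eye(3)))
```

In this synthetic case the recovered matrix equals the hidden rotation because the target vectors were constructed as an exact orthogonal image of the source vectors; real embeddings would only be aligned approximately.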
Lample (1) and Smith et al. are combinable because they both are concerned with mapping the word vectors from different translations into a shared semantic space. Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated normalization of the output vectors of Lample (1)’s language translation models such that they have an orthogonal relation to each other. One would be motivated to do so because Smith et al.’s orthogonal transformation improves the precision of mapping between language pairs and allows the model to distinguish between them.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SARVAJNA KALVA whose telephone number is (571) 272-4692. The examiner can normally be reached on Monday - Friday 9 to 6. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. 
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir can be reached on 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see https://ppairmy.uspto.gov/pair/PrivatePair. Should you have questions on 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SARVAJNA KALVA whose telephone number is (571)272-4692. The examiner can normally be reached Monday - Friday 9 AM to 5 PM.

/SARVAJNA KALVA/Examiner, Art Unit 2659                                                                                                                                                                                                        

/PIERRE LOUIS DESIR/Supervisory Patent Examiner, Art Unit 2659