DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Specification
The abstract of the disclosure is objected to because:
In line 5, the acronym “NLI” is used without being defined.
Correction is required.  See MPEP § 608.01(b).
The disclosure is objected to because of the following informalities:
The references in the specification to non-patent literature include the author and publication year, but do not include the title.  The non-patent literature references should include the title for the purpose of clarity.
In paragraph 0015, line 1, “process to using” should read “process to use”.
In paragraph 0031, line 1, the acronym “BERT” is used without being defined.
In paragraph 0038, line 3, “include a a model” should read “include a model”.
In paragraph 0038, line 3, “a transformer model..” should read “a transformer model.”.
In paragraph 0052, line 12, the acronym “LAMB” is used without being defined.
In paragraph 0071, line 8, the acronym “API” is used without being defined.
In paragraph 0079, lines 9-10, the meaning of “mBERT trained with CMLM and BR (f-mBERT) has a significant upon mBERT.” is not clear.
In paragraph 0086, lines 3-4, it is not clear what figure is being referenced by “(x-axis labels in fig:la)”.
In paragraph 0086, line 6, “mBERT (first row)” should read “mBERT (first column)”.
In paragraph 0101, lines 4-5, “server computer system 140” should read “server computer system 130”.
Appropriate correction is required.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1 – 2, 5 – 8, 14 and 17 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Wagner et al. (US Patent Application Publication No. 2022/0067280), hereinafter Wagner.
Regarding claim 1, Wagner discloses a computer-implemented method to train machine learning models to produce representations for language segments containing multiple tokens (Abstract, lines 1-5, "Embodiments of the present disclosure include systems and methods for training transformer models. In some embodiments, a set of input data are received. The input data comprises a plurality of tokens including masked tokens. The plurality of tokens in an embedding layer are processed."), the method comprising:
processing, by a computing system comprising one or more computing devices (Figure 9, Processor(s) 902),
one or more first language segments of a plurality of language segments with a first machine-learned language encoding model to generate a contextual language embedding (Paragraph 0024, lines 1-5, "After masking tokens in the input data, embedding layer 105 may determine token embeddings for each unmasked token in the sequence of tokens using an embedding space generated from a corpus of tokens (e.g., a vocabulary of words)."; Paragraph 0017, lines 1-9, "At an embedding layer, each token may be mapped to a word in the neural network's vocabulary represented by a vector. The embedding layer may further map one or more adjacent tokens (e.g., tokens coming before and/or after the original token in a sequence) at the same time as the original token. The embedding layer may then combine the vectors. The output of the embedding layer is provided to a transformer layer which may determine correlations between tokens."; The combination of the embedding layer and the transformer layer reads on the first machine-learned language encoding model.),
wherein each of the plurality of language segments comprises multiple tokens (Paragraph 0016, lines 3-6, "In some embodiments, a system may receive input data for a transformer model. The input data can include a set of tokens (e.g., a set of words forming a sentence) in a sequence.");
generating, by the computing system, a masked version of a subject language segment of the plurality of language segments, wherein the masked version of the subject language segment comprises one or more masked tokens (Paragraph 0016, lines 6-8, "For training purposes, a number of tokens are masked. In other words, information about the token is removed.");
combining, by the computing system, the contextual language embedding and the masked version of the subject language segment to obtain a conditioned input (Paragraph 0063, lines 1-17, "For example, in one embodiment, the present disclosure includes a system comprising a set of processing units and a non-transitory machine-readable medium storing instructions that when executed by at least one processing unit in the set of processing units cause the at least one processing unit to receive a set of input data, the input data comprising a plurality of tokens, the plurality of tokens including masked tokens; process the plurality of tokens in an embedding layer, the embedding layer being coupled to a transformer layer; process the plurality of tokens in the transformer layer, the transformer layer being coupled to a classifier layer; and process the plurality of tokens in the classifier layer, the classifier layer being coupled to a loss layer, wherein one or more of the embedding layer and the classifier layer combine masked tokens at a current position with tokens at one or more of a previous position and a subsequent position.");
processing, by the computing system, the conditioned input with a second machine-learned language encoding model to generate one or more predictions respectively for the one or more masked tokens (Paragraph 0018, lines 1-8, "A classifier layer may gather the masked tokens from the output of the transformer layer. The classifier layer may further gather one or more tokens adjacent to the gathered masked token (e.g., tokens coming before and/or after the masked token in the sequence). The classifier layer may then combine the tokens. The masked token may be mapped back to the vocabulary to produce the prediction/guess for the masked token."; The classifier layer reads on the second machine-learned language encoding model.);
and modifying, by the computing system, one or more values of one or more parameters of at least the first machine-learned language encoding model based on a loss function that compares the one or more predictions respectively with the one or more masked tokens (Paragraph 0048, lines 1-10, "Referring back to FIG. 6, token loss manager 615 is responsible for determining token losses. For instance, when token loss manager 615 receives predicted tokens for masked tokens from masked token manager 610, token loss manager 615 calculates differences (e.g., errors) between the predicted tokens and the actual values of the masked tokens (e.g., stored in label data). The calculated differences is depicted in FIG. 6 as token losses 625. Token loss manager 615 may send token losses 625 to transformer layer 110, which transformer layer 110 uses to adjust its weights.").
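For illustration, the masking step (Wagner, Paragraph 0016) and the token loss determination (Wagner, Paragraph 0048) relied upon above may be sketched as follows. The sketch is not taken from Wagner. The function names, the "[MASK]" placeholder, and the choice of a negative log-likelihood as the measure of the "differences (e.g., errors)" between predicted and actual tokens are assumptions made only for this example.

```python
import math

MASK = "[MASK]"  # hypothetical placeholder; Wagner does not name a mask symbol


def mask_tokens(tokens, positions):
    """Return (masked sequence, labels); labels maps position -> original token."""
    labels = {p: tokens[p] for p in positions}
    masked = [MASK if i in labels else tok for i, tok in enumerate(tokens)]
    return masked, labels


def masked_token_loss(predicted_probs, labels):
    """Average negative log-likelihood of the original tokens at masked positions.

    predicted_probs maps position -> {candidate token: probability}.
    Negative log-likelihood is an assumed choice of error measure.
    """
    if not labels:
        return 0.0
    total = 0.0
    for pos, true_token in labels.items():
        prob = predicted_probs[pos].get(true_token, 1e-12)  # floor avoids log(0)
        total += -math.log(prob)
    return total / len(labels)


# Usage: mask the second token, then score a hypothetical prediction for it.
masked, labels = mask_tokens(["the", "cat", "sat"], [1])
loss = masked_token_loss({1: {"cat": 0.5, "dog": 0.5}}, labels)
```

In this sketch the loss decreases as the probability assigned to the original token increases, which is the kind of signal that could be used to adjust model weights.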
Regarding claim 2, Wagner discloses the computer-implemented method as claimed in claim 1, wherein each of the plurality of language segments comprises one or more sentences (Paragraph 0022, lines 1-5, "For instances where the sequence of tokens of the input data includes several sets of words the each form a sentence, embedding layer 105 may generate a set of training data that includes the several set of words and a set of successive position values for each set of words.").
Regarding claim 5, Wagner discloses the computer-implemented method as claimed in claim 1, wherein one or both of the first machine-learned language encoding model and the second machine-learned language encoding model comprise a transformer model (Paragraph 0026, lines 3-6, "In some embodiments, transformer layer 110 is implemented by a transformer neural network (also referred to as a transformer or a transformer model).").
Regarding claim 6, Wagner discloses the computer-implemented method as claimed in claim 1, wherein combining, by the computing system, the contextual language embedding and the masked version of the subject language segment to obtain the conditioned input comprises: generating, by the computing system, a masked input embedding for the masked version of the subject language segment (Paragraph 0043, lines 4-9, "As shown in FIG. 6, masked token manager 610 receives transformer output array 620 as input. In some embodiments, transformer output array 620 is implemented in the form of an S×H array of vectors (e.g. a matrix) similar to the S×H array used to implement aggregate embeddings 235 described above.");
and concatenating, by the computing system, the contextual language embedding and the masked input embedding to generate the conditioned input (Paragraph 0045, lines 9-13, "Concatenation function 720 concatenates the three M×H arrays of vectors (e.g., matrices) from gather previous adjacent token 710-1, gather masked token 710-2, and gather next adjacent token 710-3. Concatenation function 720 produces an M×3H array of vectors.").
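For illustration, the gather-and-concatenate operation quoted from Paragraph 0045 of Wagner, which turns three M×H arrays into one M×3H array, may be sketched as follows. The function name and the zero-vector padding at the ends of the sequence are assumptions made for this example. Wagner is not quoted above as specifying how a masked token at the first or last position is handled.

```python
def gather_and_concat(transformer_output, masked_positions):
    """Build an M x 3H array from an S x H array of vectors.

    For each masked position, the previous-token vector, the masked-token
    vector, and the next-token vector are concatenated into one row.
    """
    seq_len = len(transformer_output)
    hidden = len(transformer_output[0])
    zero = [0.0] * hidden  # assumed padding when no neighbor exists
    rows = []
    for pos in masked_positions:
        prev_vec = transformer_output[pos - 1] if pos > 0 else zero
        next_vec = transformer_output[pos + 1] if pos + 1 < seq_len else zero
        rows.append(prev_vec + transformer_output[pos] + next_vec)
    return rows


# Usage: S = 4 positions, H = 2 features, M = 2 masked positions.
output = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
combined = gather_and_concat(output, [1, 3])
```

Each row of the result has length 3H, and there is one row per masked position.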
Regarding claim 7, Wagner discloses the computer-implemented method as claimed in claim 1, wherein the one or more first language segments appear prior to the subject language segment within a text source, subsequent to the subject language segment within the text source, or both prior and subsequent to the subject language segment within the text source (Paragraph 0047, lines 1-5, "Although two adjacent tokens are gathered in FIG. 7, masked token manager 610 may gather any number of previous adjacent tokens and next adjacent tokens. For instance, two, three, four, etc. previous adjacent tokens and next adjacent tokens may be gathered."; The previous adjacent tokens read on the prior language segment, and the next adjacent tokens read on the subsequent language segment.).
Regarding claim 8, Wagner discloses the computer-implemented method as claimed in claim 1, wherein modifying, by the computing system, one or more values of one or more parameters of at least the first machine-learned language encoding model based on the loss function that compares the one or more predictions respectively with the one or more masked tokens comprises jointly training, by the computing system, both the first machine-learned language encoding model and the second machine-learned language encoding model end-to-end based on the loss function (Paragraph 0048, lines 1-10, "Referring back to FIG. 6, token loss manager 615 is responsible for determining token losses. For instance, when token loss manager 615 receives predicted tokens for masked tokens from masked token manager 610, token loss manager 615 calculates differences (e.g., errors) between the predicted tokens and the actual values of the masked tokens (e.g., stored in label data). The calculated differences is depicted in FIG. 6 as token losses 625. Token loss manager 615 may send token losses 625 to transformer layer 110, which transformer layer 110 uses to adjust its weights."; Paragraph 0049, lines 1-5, "FIG. 8 illustrates process 800 for training a neural network according to some embodiments. System 100 may perform process 800. Process 800 may begin at step 805 with input data processor 105-3 receiving a set of input data for training a transformer model."; Paragraph 0053, lines 1-6, "At step 835, projection layer 730 generates a prediction for the masked tokens. At step 840, token loss manager 615 may use the prediction to train the neural network."; Using the loss calculated from the predictions to train the neural network reads on end-to-end training of the first machine-learned language encoding model and the second machine-learned language encoding model.).
Regarding claim 14, Wagner discloses a computing system, comprising:
one or more processors (Figure 9, Processor(s) 902),
and one or more non-transitory computer-readable media (Figure 9, Memory Subsystem 908) that collectively store:
a machine-learned language encoding model configured to process a language segment that comprises a plurality of tokens to generate an embedding that describes the language segment in an embedding space (Abstract, lines 1-5, "Embodiments of the present disclosure include systems and methods for training transformer models. In some embodiments, a set of input data are received. The input data comprises a plurality of tokens including masked tokens. The plurality of tokens in an embedding layer are processed."),
wherein the machine-learned language encoding model has been trained using a loss function that evaluates an ability of an additional language encoding model to perform a masked language modeling task when conditioned upon embeddings generated by the machine-learned language encoding model (Paragraph 0048, lines 1-10, "Referring back to FIG. 6, token loss manager 615 is responsible for determining token losses. For instance, when token loss manager 615 receives predicted tokens for masked tokens from masked token manager 610, token loss manager 615 calculates differences (e.g., errors) between the predicted tokens and the actual values of the masked tokens (e.g., stored in label data). The calculated differences is depicted in FIG. 6 as token losses 625. Token loss manager 615 may send token losses 625 to transformer layer 110, which transformer layer 110 uses to adjust its weights.");
and instructions that, when executed by a computing system comprising one or more computing devices, cause the computing system to perform operations, the operations comprising: obtaining an additional language segment that contains multiple tokens (Paragraph 0016, lines 3-6, "In some embodiments, a system may receive input data for a transformer model. The input data can include a set of tokens (e.g., a set of words forming a sentence) in a sequence.");
using the machine-learned language encoding model to generate an embedding for the additional language segment (Paragraph 0024, lines 1-5, "After masking tokens in the input data, embedding layer 105 may determine token embeddings for each unmasked token in the sequence of tokens using an embedding space generated from a corpus of tokens (e.g., a vocabulary of words)."; Paragraph 0017, lines 1-9, "At an embedding layer, each token may be mapped to a word in the neural network's vocabulary represented by a vector. The embedding layer may further map one or more adjacent tokens (e.g., tokens coming before and/or after the original token in a sequence) at the same time as the original token. The embedding layer may then combine the vectors. The output of the embedding layer is provided to a transformer layer which may determine correlations between tokens."; The combination of the embedding layer and the transformer layer reads on the machine-learned language encoding model.);
and performing a language task based on the embedding for the additional language segment (Paragraph 0018, lines 1-8, "A classifier layer may gather the masked tokens from the output of the transformer layer. The classifier layer may further gather one or more tokens adjacent to the gathered masked token (e.g., tokens coming before and/or after the masked token in the sequence). The classifier layer may then combine the tokens. The masked token may be mapped back to the vocabulary to produce the prediction/guess for the masked token."; The masked token prediction reads on a language task.).
Regarding claim 17, Wagner discloses one or more non-transitory computer-readable media (Figure 9, Memory Subsystem 908) that collectively store instructions that, when executed by a computing system comprising one or more computing devices (Figure 9, Processor(s) 902), cause the computing system to perform operations, the operations comprising:
processing, by the computing system, one or more sets of context data with a first machine-learned encoding model to generate a contextual embedding (Paragraph 0024, lines 1-5, "After masking tokens in the input data, embedding layer 105 may determine token embeddings for each unmasked token in the sequence of tokens using an embedding space generated from a corpus of tokens (e.g., a vocabulary of words)."; Paragraph 0017, lines 1-9, "At an embedding layer, each token may be mapped to a word in the neural network's vocabulary represented by a vector. The embedding layer may further map one or more adjacent tokens (e.g., tokens coming before and/or after the original token in a sequence) at the same time as the original token. The embedding layer may then combine the vectors. The output of the embedding layer is provided to a transformer layer which may determine correlations between tokens."; The combination of the embedding layer and the transformer layer reads on the first machine-learned encoding model.),
generating, by the computing system, a masked version of a subject language segment, wherein the masked version of the subject language segment comprises one or more masked tokens (Paragraph 0016, lines 6-8, "For training purposes, a number of tokens are masked. In other words, information about the token is removed.");
combining, by the computing system, the contextual embedding and the masked version of a subject language segment to obtain a conditioned input (Paragraph 0063, lines 1-17, "For example, in one embodiment, the present disclosure includes a system comprising a set of processing units and a non-transitory machine-readable medium storing instructions that when executed by at least one processing unit in the set of processing units cause the at least one processing unit to receive a set of input data, the input data comprising a plurality of tokens, the plurality of tokens including masked tokens; process the plurality of tokens in an embedding layer, the embedding layer being coupled to a transformer layer; process the plurality of tokens in the transformer layer, the transformer layer being coupled to a classifier layer; and process the plurality of tokens in the classifier layer, the classifier layer being coupled to a loss layer, wherein one or more of the embedding layer and the classifier layer combine masked tokens at a current position with tokens at one or more of a previous position and a subsequent position.");
processing, by the computing system, the conditioned input with a second machine-learned encoding model to generate one or more predictions respectively for the one or more masked tokens (Paragraph 0018, lines 1-8, "A classifier layer may gather the masked tokens from the output of the transformer layer. The classifier layer may further gather one or more tokens adjacent to the gathered masked token (e.g., tokens coming before and/or after the masked token in the sequence). The classifier layer may then combine the tokens. The masked token may be mapped back to the vocabulary to produce the prediction/guess for the masked token."; The classifier layer reads on the second machine-learned encoding model.);
and modifying, by the computing system, one or more values of one or more parameters of at least the first machine-learned encoding model based on a loss function that compares the one or more predictions respectively with the one or more masked tokens (Paragraph 0048, lines 1-10, "Referring back to FIG. 6, token loss manager 615 is responsible for determining token losses. For instance, when token loss manager 615 receives predicted tokens for masked tokens from masked token manager 610, token loss manager 615 calculates differences (e.g., errors) between the predicted tokens and the actual values of the masked tokens (e.g., stored in label data). The calculated differences is depicted in FIG. 6 as token losses 625. Token loss manager 615 may send token losses 625 to transformer layer 110, which transformer layer 110 uses to adjust its weights.").
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 3 – 4 are rejected under 35 U.S.C. 103 as being unpatentable over Wagner in view of Yu et al. ("QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension"), hereinafter Yu, and Melamud et al. ("context2vec: Learning Generic Context Embedding with Bidirectional LSTM"), hereinafter Melamud.
Regarding claim 3, Wagner discloses the computer-implemented method as claimed in claim 1, but does not specifically disclose: wherein processing, by the computing system, the one or more first language segments of the plurality of language segments with the first machine-learned language encoding model to generate the contextual language embedding comprises: processing, by the computing system, the one or more first language segments of the plurality of language segments with a self-attention-based encoder portion of the first machine-learned language encoding model to obtain a language segment vector.
Yu teaches:
wherein processing, by the computing system, the one or more first language segments of the plurality of language segments with the first machine-learned language encoding model to generate the contextual language embedding comprises: processing, by the computing system, the one or more first language segments of the plurality of language segments with a self-attention-based encoder portion of the first machine-learned language encoding model to obtain a language segment vector (Section 1, lines 14-15, "We instead exclusively use convolutions and self-attentions as the building blocks of encoders that separately encodes the query and context."; Section 1, lines 22-23, "The additional context-query attention is a standard module to construct the query-aware context vector for each position in the context paragraph, which is used in the subsequent modeling layers."; The context vector reads on the segment vector.).
Yu teaches using an encoder with self-attention to generate language vectors in order to implement a question answering system that is faster than systems using recurrent networks (Abstract, lines 4-9, "We propose a new Q&A architecture called QANet, which does not require recurrent networks: Its encoder consists exclusively of convolution and self-attention, where convolution models local interactions and self-attention models global interactions. On the SQuAD dataset, our model is 3x to 13x faster in training and 4x to 9x faster in inference, while achieving equivalent accuracy to recurrent models.").
Wagner and Yu are considered to be analogous to the claimed invention because they are in the same field of natural language processing systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wagner to incorporate the teachings of Yu to use an encoder with self-attention to generate language vectors.  Doing so would allow for implementing a question answering system that is faster than systems using recurrent networks.
Wagner in view of Yu does not specifically disclose: processing, by the computing system, the language segment vector with a projection head portion of the first machine-learned language encoding model to obtain the contextual language embedding.
Melamud teaches:
processing, by the computing system, the language segment vector with a projection head portion of the first machine-learned language encoding model to obtain the contextual language embedding (Section 2.2, lines 1-11, "We use a bidirectional LSTM recurrent neural network to obtain a sentence-level context representation. Let lLS be an LSTM reading the words of a given sentence from left to right, and let rLS be a reverse one reading the words from right to left. Given a sentence w1:n, our ‘shallow’ bidirectional LSTM context representation for the target wi is defined as the following vector concatenation: biLS(w1:n, i) = lLS(l1:i−1) ⊕ rLS(rn:i+1) where l/r represent distinct left-to-right/right-to-left word embeddings of the sentence words."; The sentence-level context representation reads on the contextual language embedding, the word embeddings read on the language segment vector, and the long short-term memory (LSTM) recurrent neural network reads on the first machine-learned language encoding model.).
Melamud teaches using a long short-term memory (LSTM) recurrent neural network to generate sentence-level context representations from word embeddings in order to perform natural language processing tasks such as word sense disambiguation, named entity recognition, and coreference resolution (Abstract, lines 1-8, "Context representations are central to various NLP tasks, such as word sense disambiguation, named entity recognition, coreference resolution, and many more. In this work we present a neural model for efficiently learning a generic context embedding function from large corpora, using bidirectional LSTM.").
Wagner, Yu, and Melamud are considered to be analogous to the claimed invention because they are in the same field of natural language processing systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wagner in view of Yu to incorporate the teachings of Melamud to use a long short-term memory (LSTM) recurrent neural network to generate sentence-level context representations from word embeddings.  Doing so would allow for performing natural language processing tasks such as word sense disambiguation, named entity recognition, and coreference resolution.
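For readability, the context representation quoted from Section 2.2 of Melamud, whose subscripts are flattened in the quotation, may be typeset as follows. The subscript placement is reconstructed from the quoted text and should be read against the reference itself. As the quotation states, the operator ⊕ denotes vector concatenation.

```latex
\[
\mathrm{biLS}(w_{1:n}, i) \;=\; \mathrm{lLS}(l_{1:i-1}) \,\oplus\, \mathrm{rLS}(r_{n:i+1})
\]
```

Here lLS reads the words to the left of the target word w_i from left to right, and rLS reads the words to the right of w_i from right to left.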
Regarding claim 4, Wagner in view of Yu and Melamud discloses the computer-implemented method as claimed in claim 3.
Melamud further teaches: wherein the projection head portion of the first machine-learned language encoding model comprises a neural network (Section 2.2, lines 1-3, "We use a bidirectional LSTM recurrent neural network to obtain a sentence-level context representation.").
Melamud teaches using a long short-term memory (LSTM) recurrent neural network to generate sentence-level context representations from word embeddings in order to perform natural language processing tasks such as word sense disambiguation, named entity recognition, and coreference resolution (Abstract, lines 1-8, "Context representations are central to various NLP tasks, such as word sense disambiguation, named entity recognition, coreference resolution, and many more. In this work we present a neural model for efficiently learning a generic context embedding function from large corpora, using bidirectional LSTM.").
Wagner, Yu, and Melamud are considered to be analogous to the claimed invention because they are in the same field of natural language processing systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wagner in view of Yu to incorporate the teachings of Melamud to use a long short-term memory (LSTM) recurrent neural network to generate sentence-level context representations from word embeddings.  Doing so would allow for performing natural language processing tasks such as word sense disambiguation, named entity recognition, and coreference resolution.
Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Wagner in view of Dehghani et al. ("Universal Transformers"), hereinafter Dehghani.
Regarding claim 9, Wagner discloses the computer-implemented method as claimed in claim 1, but does not specifically disclose: wherein the first machine-learned language encoding model and the second machine-learned language encoding model share one or more values for one or more parameters.
Dehghani teaches:
wherein the first machine-learned language encoding model and the second machine-learned language encoding model share one or more values for one or more parameters (Section 4, lines 1-4, "When running for a fixed number of steps, the Universal Transformer is equivalent to a multi-layer Transformer with tied parameters across all its layers. This is partly similar to the Recursive Transformer, which ties the weights of its self-attention layers across depth (Gulcehre et al., 2018). However, as the per-symbol recurrent transition functions can be applied any number of times, another and possibly more informative way of characterizing the UT is as a block of parallel RNNs (one for each symbol, with shared parameters) evolving per-symbol hidden states concurrently, generated at each step by attending to the sequence of hidden states at the previous step."; The multi-layer Transformer with tied parameters across layers reads on machine-learned language encoding models sharing one or more values for one or more parameters.).
Dehghani teaches the use of a multi-layer Transformer with tied parameters across layers in order to improve performance of language understanding tasks (Abstract, lines 11-15, "We propose the Universal Transformer (UT), a parallel-in-time self-attentive recurrent sequence model which can be cast as a generalization of the Transformer model and which addresses these issues. UTs combine the parallelizability and global receptive field of feed-forward sequence models like the Transformer with the recurrent inductive bias of RNNs."; Abstract, lines 18-22, "Our experiments show that UTs outperform standard Transformers on a wide range of algorithmic and language understanding tasks, including the challenging LAMBADA language modeling task where UTs achieve a new state of the art, and machine translation where UTs achieve a 0.9 BLEU improvement over Transformers on the WMT14 En-De dataset.").
Wagner and Dehghani are considered to be analogous to the claimed invention because they are in the same field of natural language processing systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wagner to incorporate the teachings of Dehghani to use a multi-layer Transformer with tied parameters across layers.  Doing so would allow for improving performance of language understanding tasks.
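For illustration, the notion of parameters "tied" across layers, as quoted from Section 4 of Dehghani, may be sketched as follows. This is a simplified example and not Dehghani's model. A plain linear map stands in for a Transformer layer, and the function names are hypothetical. The point shown is only that a single set of parameter values is reused at every step.

```python
def apply_layer(weights, vector):
    """Apply one linear layer (an assumed stand-in for a Transformer layer)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def tied_stack(weights, vector, num_steps):
    """Run num_steps layers that all share the same parameter values."""
    for _ in range(num_steps):
        vector = apply_layer(weights, vector)  # same weights at every depth
    return vector


# Usage: two steps through a stack whose layers share one weight matrix.
shared_weights = [[2.0, 0.0], [0.0, 3.0]]
result = tied_stack(shared_weights, [1.0, 1.0], num_steps=2)
```

An untied stack would instead hold a separate weight matrix per layer, so the number of parameters would grow with depth.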
Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Wagner in view of Wang et al. (US Patent Application Publication No. 2022/0171936), hereinafter Wang.
Regarding claim 10, Wagner discloses the computer-implemented method as claimed in claim 1, but does not specifically disclose: wherein: the one or more first language segments comprise a plurality of first language segments; and processing, by the computing system, the plurality of first language segments to generate the contextual language embedding comprises: individually processing, by the computing system, each of the plurality of first language segments with the first machine-learned language encoding model to generate a respective individual language embedding; and combining, the respective individual language embeddings for the plurality of first language segments to generate the contextual language embedding.
Wang teaches: wherein: the one or more first language segments comprise a plurality of first language segments; and processing, by the computing system, the plurality of first language segments to generate the contextual language embedding comprises: individually processing, by the computing system, each of the plurality of first language segments with the first machine-learned language encoding model to generate a respective individual language embedding; and combining, the respective individual language embeddings for the plurality of first language segments to generate the contextual language embedding (Paragraph 0128, lines 1-22, "In an embodiment, the processor 204 may determine a sentence embedding associated with each of the set of sentence nodes and a paragraph embedding associated with each of the set of paragraph nodes, based on the determination of the token embedding associated with each of the set of token nodes. For example, the processor 204 may determine the sentence embedding of a sentence based on a summation of: an average value or an aggregate value of word embeddings of a set of words in the sentence, an average value or an aggregate value of token index embeddings of one or more tokens associated with the sentence, the sentence index embedding of the sentence, and the paragraph index embedding associated with the sentence. 
In an example, the processor 204 may determine the paragraph embedding of a paragraph based on a summation of: an average value or an aggregate value of word embeddings of a set of words in each sentence in the paragraph, an average value or an aggregate value of token index embeddings of one or more tokens associated with each sentence in the paragraph, the sentence index embedding of each sentence in the paragraph, and the paragraph index embedding associated with the paragraph in the document."; The sentence embedding reads on the individual language embedding, and the paragraph embedding reads on the contextual language embedding.).
Wang teaches generating sentence embeddings from sentences and combining sentence embeddings to generate paragraph embeddings in order to capture a global structure of a document for construction of a hierarchal graph (Paragraph 0027, lines 1-17, "According to one or more embodiments of the present disclosure, the technological field of natural language processing may be improved by configuring a computing system in a manner that the computing system may be able to effectively analyze a natural language text in a document. The computing system may capture a global structure of the document for construction of the hierarchal graph, as compared to other conventional systems which may use only information associated individual sentences in the document. The disclosed system may be advantageous, as in certain scenarios, context and sentiment associated with a sentence may not be accurately ascertained based on just the information associated with the sentence. For example, the context and sentiment associated with the sentence may depend on the context and sentiment of other sentences in a paragraph or other sentences in the document as a whole.").
Wagner and Wang are considered to be analogous to the claimed invention because they are in the same field of natural language processing systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wagner to incorporate the teachings of Wang to generate sentence embeddings from sentences and combine sentence embeddings to generate paragraph embeddings.  Doing so would allow for capturing a global structure of a document for construction of a hierarchal graph.
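The mapping relied upon for claim 10 (individual embeddings per segment, then a combined contextual embedding) can be sketched as follows. This is a simplified illustration under stated assumptions: each segment embedding is a plain average of word vectors, and the contextual embedding is a plain average of the segment embeddings. The token index, sentence index, and paragraph index embeddings described in Wang, Paragraph 0128, are omitted, and the names `embed_segment` and `embed_context` are hypothetical.

```python
def mean_vector(vectors):
    """Return the elementwise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / count for i in range(dim)]


def embed_segment(tokens, word_vectors):
    """Individual language embedding: average of the segment's word vectors."""
    return mean_vector([word_vectors[token] for token in tokens])


def embed_context(segments, word_vectors):
    """Contextual language embedding: each segment is embedded on its own,
    then the individual embeddings are combined by averaging."""
    individual = [embed_segment(tokens, word_vectors) for tokens in segments]
    return mean_vector(individual)


# Toy two-dimensional word vectors (assumed values, for illustration only).
toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
context = embed_context([["a", "b"], ["c"]], toy_vectors)
```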
Claims 11 – 12 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Wagner in view of Lample et al. (“Cross-lingual Language Model Pretraining”), hereinafter Lample.
Regarding claim 11, Wagner discloses the computer-implemented method as claimed in claim 1, but does not specifically disclose: wherein at least one of the one or more first language segments comprises a first natural language and the subject language segment comprises a second natural language that is different from the first natural language.
Lample teaches:
wherein at least one of the one or more first language segments comprises a first natural language and the subject language segment comprises a second natural language that is different from the first natural language (Section 1, lines 12-13, "We introduce a new unsupervised method for learning cross-lingual representations using cross-lingual language modeling and investigate two monolingual pretraining objectives."; Section 3.1, lines 1-2, "In all our experiments we process all languages with the same shared vocabulary created through Byte Pair Encoding (BPE)"; Section 3.3, lines 1-4, "We also consider the masked language modeling (MLM) objective of Devlin et al. [14], also known as the Cloze task [36]. Following Devlin et al. [14], we sample randomly 15% of the BPE tokens from the text streams, replace them by a [MASK] token 80% of the time, by a random token 10% of the time, and we keep them unchanged 10% of the time.").
Lample teaches learning cross-lingual representations with masked language modeling in order to perform cross-lingual classification and machine translation (Abstract, lines 1-8, "Recent studies have demonstrated the efficiency of generative pretraining for English natural language understanding. In this work, we extend this approach to multiple languages and show the effectiveness of cross-lingual pretraining. We propose two methods to learn cross-lingual language models (XLMs): one unsupervised that only relies on monolingual data, and one supervised that leverages parallel data with a new cross-lingual language model objective. We obtain state-of-the-art results on cross-lingual classification, unsupervised and supervised machine translation.").
Wagner and Lample are considered to be analogous to the claimed invention because they are in the same field of natural language processing systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wagner to incorporate the teachings of Lample to learn cross-lingual representations with masked language modeling.  Doing so would allow for performing cross-lingual classification and machine translation.
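The masking procedure quoted from Lample, Section 3.3, can be sketched as follows: 15% of tokens are selected, and a selected token is replaced by [MASK] 80% of the time, replaced by a random token 10% of the time, and kept unchanged 10% of the time. The function name `mask_tokens`, the per-token independent sampling, and the toy vocabulary are assumptions made for illustration and are not taken from the reference.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted_tokens, targets).

    `targets` maps each selected position to its original token, which is
    what the model would be trained to predict at that position.
    """
    corrupted = list(tokens)
    targets = {}
    for index, token in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # position not selected for prediction
        targets[index] = token
        roll = rng.random()
        if roll < 0.8:
            corrupted[index] = MASK               # 80%: mask token
        elif roll < 0.9:
            corrupted[index] = rng.choice(vocab)  # 10%: random token
        # remaining 10%: leave the token unchanged
    return corrupted, targets


rng = random.Random(0)
tokens = ["tok%d" % i for i in range(1000)]
corrupted, targets = mask_tokens(tokens, ["x", "y", "z"], rng)
```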
Regarding claim 12, Wagner discloses the computer-implemented method as claimed in claim 1, but does not specifically disclose: further comprising additionally training, by the computing system, at least the first machine-learned language encoding model with a bitext retrieval loss function for a bitext retrieval task.
Lample teaches:
further comprising additionally training, by the computing system, at least the first machine-learned language encoding model with a bitext retrieval loss function for a bitext retrieval task (Section 5.3, lines 4-10, "In Table 1, we evaluate two types of pretrained cross-lingual encoders: an unsupervised cross-lingual language model that uses the MLM objective on monolingual corpora only; and a supervised cross-lingual language model that combines both the MLM and the TLM loss using additional parallel data. Following Conneau et al. [12], we include two machine translation baselines: TRANSLATE-TRAIN, where the English MultiNLI training set is machine translated into each XNLI language, and TRANSLATE-TEST where every dev and test set of XNLI is translated to English."; The translation reads on the bitext retrieval task.).
Lample teaches training a cross-lingual language model for a translation task in order to perform cross-lingual classification and machine translation (Abstract, lines 1-8, "Recent studies have demonstrated the efficiency of generative pretraining for English natural language understanding. In this work, we extend this approach to multiple languages and show the effectiveness of cross-lingual pretraining. We propose two methods to learn cross-lingual language models (XLMs): one unsupervised that only relies on monolingual data, and one supervised that leverages parallel data with a new cross-lingual language model objective. We obtain state-of-the-art results on cross-lingual classification, unsupervised and supervised machine translation.").
Wagner and Lample are considered to be analogous to the claimed invention because they are in the same field of natural language processing systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wagner to incorporate the teachings of Lample to train a cross-lingual language model for a translation task.  Doing so would allow for performing cross-lingual classification and machine translation.
Regarding claim 15, Wagner discloses the computing system as claimed in claim 14, but does not specifically disclose: wherein the language task comprises sentence retrieval, sentence classification, bitext or translation retrieval, sentiment analysis, or conversational response selection.
Lample teaches:
wherein the language task comprises sentence retrieval, sentence classification, bitext or translation retrieval, sentiment analysis, or conversational response selection (Section 5.3, lines 4-10, "In Table 1, we evaluate two types of pretrained cross-lingual encoders: an unsupervised cross-lingual language model that uses the MLM objective on monolingual corpora only; and a supervised cross-lingual language model that combines both the MLM and the TLM loss using additional parallel data. Following Conneau et al. [12], we include two machine translation baselines: TRANSLATE-TRAIN, where the English MultiNLI training set is machine translated into each XNLI language, and TRANSLATE-TEST where every dev and test set of XNLI is translated to English."; The translation reads on the bitext or translation retrieval task.).
Lample teaches training a cross-lingual language model for a translation task in order to perform cross-lingual classification and machine translation (Abstract, lines 1-8, "Recent studies have demonstrated the efficiency of generative pretraining for English natural language understanding. In this work, we extend this approach to multiple languages and show the effectiveness of cross-lingual pretraining. We propose two methods to learn cross-lingual language models (XLMs): one unsupervised that only relies on monolingual data, and one supervised that leverages parallel data with a new cross-lingual language model objective. We obtain state-of-the-art results on cross-lingual classification, unsupervised and supervised machine translation.").
Wagner and Lample are considered to be analogous to the claimed invention because they are in the same field of natural language processing systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wagner to incorporate the teachings of Lample to train a cross-lingual language model for a translation task.  Doing so would allow for performing cross-lingual classification and machine translation.
Claim 13 is rejected under 35 U.S.C. 103 as being unpatentable over Wagner in view of Conneau et al. (“XNLI: Evaluating Cross-lingual Sentence Representations”), hereinafter Conneau.
Regarding claim 13, Wagner discloses the computer-implemented method as claimed in claim 1, but does not specifically disclose: further comprising finetuning, by the computing system, at least the first machine-learned language encoding model with a natural language inference loss function for a premise segment and hypothesis segment that are in different languages.
Conneau teaches: further comprising finetuning, by the computing system, at least the first machine-learned language encoding model with a natural language inference loss function for a premise segment and hypothesis segment that are in different languages (Section 4.2.3, lines 13-22, "We propose a simple alignment loss function to align the embedding spaces of two different languages. Specifically, we train an English encoder on NLI, and train a target encoder by minimizing the loss: Lalign(x; y) = sim(x; y) - λ(sim(xc; y) + sim(x; yc)) where (x; y) corresponds to the source and target sentence embeddings, (xc; yc) is a contrastive term (i.e. negative sampling), λ controls the weight of the negative examples in the loss."; The source language reads on the premise segment and the target language reads on the hypothesis segment.).
Conneau teaches training an encoder to minimize the loss function between a source language and target language in order to develop natural language processing systems for cross-lingual language understanding (Abstract, lines 1-16, "State-of-the-art natural language processing systems rely on supervision in the form of annotated data to learn competent models. These models are generally trained on data in a single language (usually English), and cannot be directly used beyond that language. Since collecting data in every language is not realistic, there has been a growing interest in crosslingual language understanding (XLU) and low-resource cross-language transfer. In this work, we construct an evaluation set for XLU by extending the development and test sets of the Multi-Genre Natural Language Inference Corpus (MultiNLI) to 15 languages, including low-resource languages such as Swahili and Urdu.").
Wagner and Conneau are considered to be analogous to the claimed invention because they are in the same field of natural language processing systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wagner to incorporate the teachings of Conneau to train an encoder to minimize the loss function between a source language and target language.  Doing so would allow for developing natural language processing systems for cross-lingual language understanding.
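The alignment loss quoted from Conneau, Section 4.2.3, can be sketched as follows. The pairwise measure written "sim" in the quoted passage is passed in as a parameter; Euclidean distance is used as the default purely as an assumption, so that minimizing the loss pulls the aligned pair together and pushes the contrastive pairs apart. The names `alignment_loss` and `l2_distance` are hypothetical.

```python
import math


def l2_distance(u, v):
    """Euclidean distance between two equal-length vectors (assumed measure)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def alignment_loss(x, y, x_c, y_c, lam, measure=l2_distance):
    """measure(x, y) - lam * (measure(x_c, y) + measure(x, y_c)).

    x, y     : source and target sentence embeddings of an aligned pair
    x_c, y_c : contrastive (negative) source and target embeddings
    lam      : weight of the negative examples in the loss
    """
    return measure(x, y) - lam * (measure(x_c, y) + measure(x, y_c))


# Aligned pair already identical, negatives far away: the loss is negative.
loss = alignment_loss([0.0, 0.0], [0.0, 0.0], [3.0, 4.0], [6.0, 8.0], lam=0.5)
```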
Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over Wagner in view of Arora et al. ("A Simple but Tough-to-Beat Baseline for Sentence Embeddings"), hereinafter Arora.
Regarding claim 16, Wagner discloses the computing system as claimed in claim 14, but does not specifically disclose: further comprising, prior to performing the language task based on the embedding for the additional language segment, removing at least a first principal component from the embedding for the additional language segment.
Arora teaches:
further comprising, prior to performing the language task based on the embedding for the additional language segment, removing at least a first principal component from the embedding for the additional language segment (Section 1, lines 16-20, "Here we give a new sentence embedding method that is embarrassingly simple: just compute the weighted average of the word vectors in the sentence and then remove the projections of the average vectors on their first singular vector (“common component removal”). Here the weight of a word w is a/(a + p(w)) with a being a parameter and p(w) the (estimated) word frequency; we call this smooth inverse frequency (SIF)."; The projections of the average vectors on their first singular vector read on the first principal component from the embedding.).
Arora teaches removing the projections of the average vectors on their first singular vector in order to improve performance on textual similarity tasks (Section 1, lines 20-21, "This method achieves significantly better performance than the unweighted average on a variety of textual similarity tasks").
Wagner and Arora are considered to be analogous to the claimed invention because they are in the same field of natural language processing systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wagner to incorporate the teachings of Arora to remove the projections of the average vectors on their first singular vector.  Doing so would allow for improving performance on textual similarity tasks.
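The two steps quoted from Arora, Section 1, can be sketched as follows: a weighted average of word vectors with weight a/(a + p(w)), followed by removal of each embedding's projection on the first singular vector of the set of embeddings. This is an illustration under stated assumptions: the first singular vector is approximated by power iteration rather than a full singular value decomposition, no mean-centering is applied (consistent with the quoted description), and the function names and the default value of `a` are hypothetical.

```python
import math


def sif_embedding(tokens, word_vectors, word_prob, a=1e-3):
    """Weighted average of word vectors, with weight a / (a + p(w))."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for token in tokens:
        weight = a / (a + word_prob[token])
        for i, value in enumerate(word_vectors[token]):
            total[i] += weight * value
    return [value / len(tokens) for value in total]


def first_singular_vector(rows, iterations=200):
    """Approximate the first right singular vector of the matrix whose rows
    are the embeddings, by power iteration on X^T X. Assumes the rows are
    not all zero and the starting vector is not orthogonal to the result."""
    dim = len(rows[0])
    u = [1.0] * dim
    for _ in range(iterations):
        projections = [sum(r[i] * u[i] for i in range(dim)) for r in rows]
        u = [sum(p * r[i] for p, r in zip(projections, rows)) for i in range(dim)]
        norm = math.sqrt(sum(value * value for value in u))
        u = [value / norm for value in u]
    return u


def remove_common_component(rows):
    """Subtract from each embedding its projection on the first singular vector."""
    u = first_singular_vector(rows)
    cleaned = []
    for row in rows:
        coeff = sum(a * b for a, b in zip(row, u))
        cleaned.append([a - coeff * b for a, b in zip(row, u)])
    return cleaned


embeddings = [[2.0, 0.1], [4.0, -0.1], [6.0, 0.0]]
cleaned = remove_common_component(embeddings)
```

After removal, every embedding is orthogonal to the common direction, so only the component that differs between sentences remains.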
Claims 18 – 20 are rejected under 35 U.S.C. 103 as being unpatentable over Wagner in view of Qi et al. (“ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data”), hereinafter Qi.
Regarding claim 18, Wagner discloses the one or more non-transitory computer-readable media as claimed in claim 17, but does not specifically disclose: wherein the one or more sets of context data comprise an image.
Qi teaches: wherein the one or more sets of context data comprise an image (Section 4, lines 1-5, "Figure 4 illustrates the overall architecture of our ImageBERT model. Similar to BERT[10], we use Transformer as basic structure, but take both image visual tokens and textual tokens as input. The image and text input are encoded into different embeddings through an embedding layer, where the image visual tokens are RoI features extracted from a Faster-RCNN[24, 25] model. Then these embeddings are fed into a multi-layer bidirectional self-attention Transformer to learn a cross-modality Transformer to model the relationship between the visual regions and the linguistic tokens.").
Qi teaches using a Transformer with image and text input to model the relationship between visual regions and linguistic tokens in order to improve performance on text-to-image and image-to-text retrieval tasks (Section 1, lines 14-15, "Then, ImageBERT is proposed as a strong baseline for cross-modal pre-training, which has achieved new state-of-the-art results on text-to-image and image-to-text retrieval tasks").
Wagner and Qi are considered to be analogous to the claimed invention because they are in the same field of natural language processing systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wagner to incorporate the teachings of Qi to use a Transformer with image and text input to model the relationship between visual regions and linguistic tokens.  Doing so would allow for improving performance on text-to-image and image-to-text retrieval tasks.
Regarding claim 19, Wagner in view of Qi discloses the one or more non-transitory computer-readable media as claimed in claim 18.
Qi further teaches: wherein the subject language segment comprises a textual caption that describes content depicted by the image (Section 4, lines 1-5, "Figure 4 illustrates the overall architecture of our ImageBERT model. Similar to BERT[10], we use Transformer as basic structure, but take both image visual tokens and textual tokens as input. The image and text input are encoded into different embeddings through an embedding layer, where the image visual tokens are RoI features extracted from a Faster-RCNN[24, 25] model. Then these embeddings are fed into a multi-layer bidirectional self-attention Transformer to learn a cross-modality Transformer to model the relationship between the visual regions and the linguistic tokens."; Section 4.3, lines 3-5, "This task is the same with the MLM task in BERT[10] training. We denote the n input sub-word tokens as w = {w0, . . . , wn-1}. The input token which will be predicted afterwards is masked randomly with a probability of 15%."; Section 5.1, lines 12-14, "We use the same evaluation metrics R@K (K = 1, 5, 10) as other work, which measure the percentage of correctly matched pairs in the top K-ranked results. Since both Flickr30k and MSCOCO contain five captions per image, sentence retrieval task is easier and can get higher scores than image retrieval task.").
Qi teaches using a Transformer with image and text input, where the text is a caption for the image, to model the relationship between visual regions and linguistic tokens in order to improve performance on text-to-image and image-to-text retrieval tasks (Section 1, lines 14-15, "Then, ImageBERT is proposed as a strong baseline for cross-modal pre-training, which has achieved new state-of-the-art results on text-to-image and image-to-text retrieval tasks").
Wagner and Qi are considered to be analogous to the claimed invention because they are in the same field of natural language processing systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wagner to incorporate the teachings of Qi to use a Transformer with image and text input, where the text is a caption for the image, to model the relationship between visual regions and linguistic tokens.  Doing so would allow for improving performance on text-to-image and image-to-text retrieval tasks.
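The R@K metric referenced in the quoted Section 5.1 of Qi (the percentage of queries with a correctly matched item in the top K-ranked results) can be sketched as follows. The function name `recall_at_k` and the toy caption identifiers are hypothetical, and a query is counted as correct here if any one of its relevant items appears in the top K, which is an assumption about how multiple captions per image are scored.

```python
def recall_at_k(ranked_lists, relevant_sets, k):
    """Percentage of queries whose top-k ranked results contain a relevant item.

    ranked_lists  : one ranked list of candidate ids per query
    relevant_sets : one set of correct candidate ids per query
    """
    hits = 0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        if any(item in relevant for item in ranked[:k]):
            hits += 1
    return 100.0 * hits / len(ranked_lists)


# Two image queries, each with a ranked list of candidate captions.
ranked = [["c1", "c2", "c3"], ["c9", "c4", "c5"]]
relevant = [{"c1"}, {"c5"}]
r_at_1 = recall_at_k(ranked, relevant, 1)
r_at_3 = recall_at_k(ranked, relevant, 3)
```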
Regarding claim 20, Wagner in view of Qi discloses the one or more non-transitory computer-readable media as claimed in claim 18.
Qi further teaches: wherein the first machine-learned encoding model comprises a convolutional neural network, a long short term memory network, or a self-attention-based network (Section 4.1, lines 7-9, "Similar to linguistic embedding, image embedding is also generated from visual input by a similar process. A Faster-RCNN model is used to extract features from o RoIs, denoted by {r0, . . . , ro-1}, from the image to represent its visual content.").
Qi teaches using a Transformer with image and text input, where a convolutional neural network is used for image embedding, to model the relationship between visual regions and linguistic tokens in order to improve performance on text-to-image and image-to-text retrieval tasks (Section 1, lines 14-15, "Then, ImageBERT is proposed as a strong baseline for cross-modal pre-training, which has achieved new state-of-the-art results on text-to-image and image-to-text retrieval tasks").
Wagner and Qi are considered to be analogous to the claimed invention because they are in the same field of natural language processing systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wagner to incorporate the teachings of Qi to use a Transformer with image and text input, where a convolutional neural network is used for image embedding, to model the relationship between visual regions and linguistic tokens.  Doing so would allow for improving performance on text-to-image and image-to-text retrieval tasks.
Conclusion
The relevant art made of record and not relied upon is considered pertinent to applicant's disclosure.
Yang et al. (Yang, Ziyi, Yinfei Yang, Daniel Cer, Jax Law, and Eric Darve, “Universal Sentence Representation Learning with Conditional Masked Language Model”, November 2021, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 6216-6228.) teaches a method of using Masked Language Modeling to effectively learn sentence representations on large scale unlabeled corpora.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to James Boggs whose telephone number is (571)272-2968. The examiner can normally be reached M-F 8:00 AM - 5:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel Washburn can be reached on (571)272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/JAMES BOGGS/ Examiner, Art Unit 2657

/DANIEL C WASHBURN/ Supervisory Patent Examiner, Art Unit 2657