DETAILED ACTION
Notice of AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement submitted on 05/05/2021 has been considered by the examiner.
Claim Objections


Claims 1-20 are objected to because of the following informalities:  
Claim 1 recites, “training, generating a trained encoder model, an encoder model…” and while this language is not indefinite, it is difficult to understand and therefore the examiner suggests re-writing this limitation to read “generating a trained encoder model, by training an encoder model…”
Claims 2-12 depend from claim 1 and are objected to under the same grounds as claim 1.
Claim 10 recites, “first training, generating a partially trained encoder model, the encoder model” and while this language is not indefinite, it is difficult to understand and therefore the examiner suggests re-writing this limitation to read: “generating a partially trained encoder model, by first training the encoder model….”
Claim 10 recites, “second training, generating the trained encoder model, the second training comprising adjusting the set of parameters of the encoder model” and while this language is not indefinite, it is difficult to understand and therefore the examiner suggests re-writing this limitation to read: “generating the trained encoder model, by performing a second training step comprising adjusting the set of parameters of the encoder model”.
Claim 13 recites, “program instructions to train, generating a trained encoder model, an encoder model….” and while this language is not indefinite, it is difficult to understand and therefore the examiner suggests re-writing this limitation to read: “program instructions for generating a trained encoder model, by training an encoder model…”
Claims 14-19 depend from claim 13 and are objected to under the same grounds as claim 13.
Claim 20 recites, “program instructions to train, generating a trained encoder model, an encoder model….” and while this language is not indefinite, it is difficult to understand and therefore the examiner suggests re-writing this limitation to read: “program instructions for generating a trained encoder model, by training an encoder model…”

Appropriate correction is required.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 1, 2, 4, 6, 9, and 11 are rejected under 35 U.S.C. 103 as being unpatentable over Mehta, Divyam, et al. "A transformer-based architecture for fake news classification." Social Network Analysis and Mining (April 17, 2021), pp. 1-12, hereinafter referenced as MEHTA, in view of Li, Tianyu, et al. "Deep Heterogeneous Autoencoders for Collaborative Filtering." arXiv preprint arXiv:1812.06610 (2018), hereinafter referenced as LI.


Regarding claim 1, MEHTA discloses:
A computer-implemented method comprising: (MEHTA discloses using a transformer-based architecture (i.e., BERT), and performing data “processing”, e.g., computer-implemented; p. 4, sections 4.1-4.3, p. 6, Fig. 3)
constructing, from metadata of a corpus of natural language text documents, (LIAR and LIAR Plus datasets are collections of political news, e.g., natural language text documents, along with various metadata; p. 3, section 3)
training, generating a trained encoder model, an encoder model (the examiner notes that this limitation is being interpreted as “generating a trained encoder model, by training an encoder model…” as explained above with respect to the claim objections; MEHTA discloses fine-tuning, e.g., training, an architecture including multiple pre-trained BERT models, e.g., trained encoder models based on BERT; pp. 6-7, sections 4.3-4.4 and Figs. 3 and 4) to compute an embedding corresponding to a token of a natural language text document within the corpus, (BERT input representation includes token embeddings, using pre-trained word embeddings like word2vec, e.g., tokens corresponding to natural language text documents within a corpus, where words are converted to representative vectors, e.g., embeddings are computed; p. 6, section 4.3 and Fig. 2) the encoder model comprising a first encoder layer, (BERTbase has 12 encoder layers and BERTlarge has 24 encoder layers; p. 6, section 4.3) the first encoder layer comprising a token embedding portion, (BERT input representation includes token embeddings, including segment and position embeddings; p. 6, section 4.3 and Fig. 2) a token self-attention portion, a metadata self-attention portion, (BERT transformer comprises stacked encoder and decoder layers, where each layer has a sub-layer with a self-attention mechanism; p. 4, section 4.2 and p. 7, section 4.3; news statement tokens and metadata tokens are both input in parallel into pre-trained BERT, where pre-trained BERT uses shared weights, e.g., they share a first encoder layer with a self-attention sub-layer for text tokens and metadata tokens, respectively; pp. 6-7, sections 4.3-4.4 and Figs. 3 and 4) the training comprising adjusting a set of parameters of the encoder model. (BERT hyper-parameters are fine-tuned, e.g., trained and adjusted; pp. 8-9, sections 4.4 and 5)

However, MEHTA fails to explicitly teach:
a relativity matrix, a row-column intersection in the relativity matrix corresponding to a relationship between two instances of a type of metadata; and
the relativity matrix
a relativity embedding portion,
a fusion portion,

However, in a related field of endeavor, LI pertains to a deep heterogeneous autoencoder that analyzes categorical information (user demographics and item content) and textual information (user comments and textual item tags) in recommender systems (e.g., to recommend products in an online retail setting) (LI, p. 1, section I). The autoencoder generates a user-item interaction matrix, which represents a shared feature space representing information about users and items (LI, p. 2, section III.A and Fig. 2). LI further explains that the autoencoder includes hidden fusion layers to bridge the joint training between feature space learning and collaborative filtering (LI, p. 2, section III.B and p. 4, section IV.B).

The MEHTA-LI combination makes obvious:
constructing, from metadata of a corpus of natural language text documents, a relativity matrix, a row-column intersection in the relativity matrix corresponding to a relationship between two instances of a type of metadata; and (LI discloses a 2-variable (e.g., user-item) interaction matrix which represents a shared feature space representing information about the 2 variables; LI, p. 2, section III.A and Fig. 2; the MEHTA-LI combination now applies the autoencoder in LI to the metadata in the LIAR and/or LIAR Plus datasets in MEHTA, including metadata such as ID, label, statement, subject, speaker, job title, state info, political affiliation, to generate a matrix that represents a shared feature space representing information about two types of metadata, e.g., a relativity matrix where row-column intersections correspond to a relationship between two instances of a type of metadata; MEHTA, p. 3, section 3 with LI, p. 2, section III.A and Fig. 2)
training, generating a trained encoder model, an encoder model (see mapping above with respect to MEHTA) to compute an embedding corresponding to a token of a natural language text document within the corpus and the relativity matrix, (LI discloses a 2-variable interaction matrix which represents a shared feature space representing information between 2 variables, e.g., a relativity matrix; LI, p. 2, section III.A and Fig. 2; the MEHTA-LI combination now applies the 2-variable matrix of LI to metadata as disclosed in MEHTA, and the resulting matrix information is now embedded as part of the metadata embeddings as disclosed in MEHTA; MEHTA, pp. 6-7, sections 4.3-4.4 and Figs. 2-4 with LI, p. 2, section III.A and Fig. 2) the encoder model comprising a first encoder layer, the first encoder layer comprising a token embedding portion, a relativity embedding portion, (LI discloses a 2-variable interaction matrix which represents a shared feature space representing information between 2 variables, e.g., a relativity matrix; LI, p. 2, section III.A and Fig. 2; the MEHTA-LI combination now applies the 2-variable matrix of LI to metadata as disclosed in MEHTA, and the resulting matrix information is now embedded as part of the metadata embeddings as disclosed in MEHTA, e.g., the claimed relativity embedding portion; MEHTA, pp. 6-7, sections 4.3-4.4 and Figs. 2-4 with LI, p. 2, section III.A and Fig. 2) a token self-attention portion, a metadata self-attention portion, and a fusion portion, (LI discloses fusion layers to bridge the joint training between feature space learning and collaborative filtering; LI, p. 2, section III.B and p. 4, section IV.B; the MEHTA-LI combination now applies the fusion layers of LI to the token self-attention and metadata self-attention portions of MEHTA to bridge the joint training between the tokens and metadata; MEHTA, pp. 6-7, sections 4.3-4.4 and Figs. 3 and 4 with LI, p. 2, section III.B and p. 4, section IV.B) the training comprising adjusting a set of parameters of the encoder model. (MEHTA discloses that BERT hyper-parameters are fine-tuned, e.g., trained and adjusted; MEHTA, pp. 8-9, sections 4.4 and 5; the MEHTA-LI combination now fine-tunes the modified architecture utilizing the 2-variable matrix of LI, to train and adjust architecture hyperparameters; MEHTA, pp. 8-9, sections 4.4 and 5 with LI, p. 2, section III.A and Fig. 2)

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date to combine the teachings of LI with MEHTA, particularly the teachings relating to the 2-variable matrix generated by the autoencoder of LI and the hidden fusion layers. As disclosed in LI, one of ordinary skill in the art would be motivated to apply the teachings of LI in order to use an autoencoder to combine information from multiple domains. (LI, p. 1, section I). Further, as disclosed in LI, one of ordinary skill in the art would be further motivated to apply the fusion layer teachings of LI in order to bridge the joint training between feature spaces. (LI, p. 4, section IV.B). Moreover, as disclosed in LI, one of ordinary skill in the art would be further motivated to apply the teachings of LI in order to incorporate multiple sources of heterogeneous auxiliary information, e.g., disparate metadata, in a consistent way to alleviate data sparsity problems and for performance gains. (LI, p. 6, section V).
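
For illustration only, the claimed relativity matrix as interpreted above (a row-column intersection corresponding to a relationship between two instances of a type of metadata) may be sketched as follows. The sketch assumes hypothetical records with "speaker" and "party" fields modeled loosely on the LIAR metadata categories, and a hypothetical "same party" relationship; the function name and data are assumptions and are not taken from MEHTA, LI, or the instant specification.

```python
# Illustrative sketch only: the records, field names, and the "same party"
# relationship are hypothetical and do not come from MEHTA, LI, or the
# instant application.

def build_relativity_matrix(records, key, related):
    """Return (instances, matrix), where matrix[i][j] is 1 if the i-th and
    j-th instances of one metadata type are related, and 0 otherwise."""
    # Distinct instances of the chosen metadata type, in first-seen order.
    instances = list(dict.fromkeys(r[key] for r in records))
    # If an instance appears in several records, the last record is used.
    lookup = {r[key]: r for r in records}
    matrix = [
        [1 if related(lookup[a], lookup[b]) else 0 for b in instances]
        for a in instances
    ]
    return instances, matrix


def same_party(a, b):
    # Hypothetical relationship between two speakers.
    return a["party"] == b["party"]


records = [
    {"speaker": "speaker_a", "party": "party_x"},
    {"speaker": "speaker_b", "party": "party_y"},
    {"speaker": "speaker_c", "party": "party_x"},
]

speakers, relativity = build_relativity_matrix(records, "speaker", same_party)
```

In this sketch each row and each column corresponds to one speaker, so the entry at a given row-column intersection records whether the two speakers share a party.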

	Regarding claim 2, the MEHTA-LI combination discloses the computer-implemented method of claim 1.  MEHTA further discloses:
wherein the token embedding portion computes a set of token embeddings, a token embedding in the set of token embeddings corresponding to a token of a natural language text document within the corpus. (BERT input representation includes token embeddings, using pre-trained word embeddings like word2vec, e.g., tokens corresponding to natural language text documents within a corpus, where words are converted to representative vectors, e.g., embeddings are computed; p. 4, section 4.1 and p. 6, section 4.3 and Fig. 2)

Regarding claim 4, the MEHTA-LI combination discloses the computer-implemented method of claim 2.  MEHTA further teaches:
wherein the token embedding comprises a multidimensional (inputs and target sentences, e.g., input tokens and output tokens, are embedded into a multidimensional space; p. 4, section 4.2) numerical representation of the token. (pre-trained word embeddings such as word2vec, including 300-dimension word2vec, convert words to number vector representations; p. 3, section 2 and p. 6, section 4.3)

Regarding claim 6, the MEHTA-LI combination discloses the computer-implemented method of claim 1. The MEHTA-LI combination further makes obvious:
wherein the relativity embedding portion computes a set of relativity embeddings, (LI discloses a 2-variable interaction matrix which represents a shared feature space representing information between 2 variables, e.g., a relativity matrix; LI, p. 2, section III.A and Fig. 2; the MEHTA-LI combination now applies the 2-variable matrix of LI to metadata as disclosed in MEHTA, and the resulting matrix information is now embedded as part of the metadata embeddings as disclosed in MEHTA, e.g., the claimed relativity embedding portion; MEHTA, pp. 6-7, sections 4.3-4.4 and Figs. 2-4 with LI, p. 2, section III.A and Fig. 2) a relativity embedding in the set of relativity embeddings comprising a multidimensional numerical representation of the row-column intersection (MEHTA discloses that the BERT embeddings are represented by vectors; MEHTA, p. 6, section 4.3; the MEHTA-LI combination now represents the relativity embeddings as vectors, e.g., multidimensional numerical representations of the row-column intersection from the 2-variable interaction matrix of LI; MEHTA, p. 6, section 4.3 with LI, p. 2, section III.A and Fig. 2; the examiner also notes that the broadest reasonable interpretation of “multidimensional numerical representation” includes a vector representation as disclosed in para. 0002 to the instant specification (“multidimensional numbers also called vectors”)).

Regarding claim 9, the MEHTA-LI combination discloses the computer-implemented method of claim 1.  The MEHTA-LI combination further makes obvious:
wherein the fusion portion combines outputs of the token self-attention portion and the metadata self-attention portion. (LI discloses fusion layers to bridge the joint training between feature space learning and collaborative filtering; LI, p. 2, section III.B and p. 4, section IV.B; the MEHTA-LI combination now applies the fusion layers of LI to the token self-attention and metadata self-attention portions of MEHTA to bridge the joint training between the tokens and metadata; MEHTA, pp. 6-7, sections 4.3-4.4 and Figs. 3 and 4 with LI, p. 2, section III.B and p. 4, section IV.B)

Regarding claim 11, the MEHTA-LI combination discloses the computer-implemented method of claim 1.  MEHTA further discloses:
wherein the encoder model further comprises a first decoder layer, (BERT utilizes a transformer architecture that includes a decoder that comprises a 6-layer stack; p. 4, section 4.2) the first decoder layer comprising a decoder token self-attention portion, a decoder metadata self-attention portion, (BERT transformer comprises stacked decoder layers, where each layer has a sub-layer with a self-attention mechanism; p. 4, section 4.2 and p. 7, section 4.3; news statement tokens and metadata tokens are both input in parallel into pre-trained BERT, where pre-trained BERT uses shared weights, e.g., they share a first decoder layer with a self-attention sub-layer for text tokens and metadata tokens, respectively; pp. 6-7, sections 4.3-4.4 and Figs. 3 and 4) and a decoder attention portion, (decoder implements multi-head attention in a sublayer; p. 4, section 4.2) the training comprising adjusting a set of parameters of the first decoder layer. (BERT hyper-parameters are fine-tuned, e.g., trained and adjusted; pp. 8-9, sections 4.4 and 5)

	However, MEHTA fails to explicitly teach:
decoder fusion portion

However, in a related field of endeavor, LI explains that the autoencoder includes hidden fusion layers to bridge the joint training between feature space learning and collaborative filtering (p. 2, section III.B and p. 4, section IV.B). The MEHTA-LI combination makes obvious:
decoder fusion portion (LI discloses fusion layers to bridge the joint training between feature space learning and collaborative filtering; LI, p. 2, section III.B and p. 4, section IV.B; the MEHTA-LI combination now applies the fusion layers of LI to the token self-attention and metadata self-attention portions of the decoder of MEHTA to bridge the joint training between the tokens and metadata; MEHTA, pp. 6-7, sections 4.3-4.4 and Figs. 3 and 4 with LI, p. 2, section III.B and p. 4, section IV.B)

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date to combine the teachings of LI with MEHTA, particularly the teachings relating to the 2-variable matrix generated by the autoencoder of LI and the hidden fusion layers. As disclosed in LI, one of ordinary skill in the art would be motivated to apply the teachings of LI in order to use an autoencoder to combine information from multiple domains. (LI, p. 1, section I). Further, as disclosed in LI, one of ordinary skill in the art would be further motivated to apply the fusion layer teachings of LI in order to bridge the joint training between feature spaces. (LI, p. 4, section IV.B). Moreover, as disclosed in LI, one of ordinary skill in the art would be further motivated to apply the teachings of LI in order to incorporate multiple sources of heterogeneous auxiliary information, e.g., disparate metadata, in a consistent way to alleviate data sparsity problems and for performance gains. (LI, p. 6, section V).

Claims 3, 5, 13-16, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over the MEHTA-LI combination further in view of DEVLIN.

Regarding claim 3, the MEHTA-LI combination discloses the computer-implemented method of claim 2.  However, the MEHTA-LI combination fails to explicitly disclose:
wherein the token comprises a portion of a word of the natural language text document. 

	However, in a related field of endeavor, DEVLIN introduces and describes the BERT transformer that MEHTA uses as part of its architecture.  The MEHTA-LI-DEVLIN combination makes obvious:
wherein the token comprises a portion of a word of the natural language text document. (DEVLIN discloses that BERT operates on tokens representing partial word pieces; DEVLIN, p. 13, section A.2; the MEHTA-LI-DEVLIN combination now applies the BERT-based architecture of MEHTA to partial word pieces as disclosed in DEVLIN; MEHTA, pp. 6-7, sections 4.3-4.4 and Figs. 3 and 4 with DEVLIN, p. 13, section A.2)

	Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of DEVLIN with MEHTA and LI. Indeed, MEHTA specifically cites to DEVLIN and the MEHTA architecture is based on the BERT transformer introduced and described in DEVLIN. One of ordinary skill would further be motivated to apply the teachings of DEVLIN to take advantage of bidirectional pre-training for language representation and to achieve state-of-the-art performance on token-level tasks, including natural language processing tasks. (DEVLIN, p. 2, section 1). As disclosed in DEVLIN, one of ordinary skill would further be motivated to use the pre-trained BERT transformer and the improved fine-tuning approaches proposed by DEVLIN. (DEVLIN, p. 1, section 1).

Regarding claim 5, the MEHTA-LI combination discloses the computer-implemented method of claim 2.  However, the MEHTA-LI combination fails to explicitly teach:
wherein the token embedding comprises a combination of a multidimensional numerical representation of the token, a multidimensional numerical representation of a position of the token within the natural language text document, and a multidimensional numerical representation of a segment of the natural language text document in which the token is located. 

However, in a related field of endeavor, DEVLIN introduces and describes the BERT transformer that MEHTA uses as part of its architecture.  The MEHTA-LI-DEVLIN combination makes obvious:
wherein the token embedding comprises a combination of a multidimensional numerical representation of the token, a multidimensional numerical representation of a position of the token within the natural language text document, and a multidimensional numerical representation of a segment of the natural language text document in which the token is located. (MEHTA discloses utilizing BERT and that input tokens include token embeddings, segment embeddings, and position embeddings and embeddings are represented by vectors, such as in word2vec; MEHTA, p. 6, section 4.3 and Fig. 2; DEVLIN discloses that input embeddings are the sum, e.g., a combination, of the token embeddings, segmentation embeddings, and position embeddings and that tokens are represented by vectors; DEVLIN, p. 4, section 3 and p. 5, Fig. 2; the MEHTA-LI-DEVLIN combination now combines the token, segment, and position embeddings as vectors as disclosed in DEVLIN; MEHTA, p. 6, section 4.3 and Fig. 2 with DEVLIN, p. 4, section 3 and p. 5, Fig. 2; the examiner also notes that the broadest reasonable interpretation of “multidimensional numerical representation” includes a vector representation as disclosed in para. 0002 to the instant specification (“multidimensional numbers also called vectors”)).

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of DEVLIN with MEHTA and LI. Indeed, MEHTA specifically cites to DEVLIN and the MEHTA architecture is based on the BERT transformer introduced and described in DEVLIN. One of ordinary skill would further be motivated to apply the teachings of DEVLIN to take advantage of bidirectional pre-training for language representation and to achieve state-of-the-art performance on token-level tasks, including natural language processing tasks. (DEVLIN, p. 2, section 1). As disclosed in DEVLIN, one of ordinary skill would further be motivated to use the pre-trained BERT transformer and the improved fine-tuning approaches proposed by DEVLIN. (DEVLIN, p. 1, section 1).
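
For illustration only, the combination of token, position, and segment embeddings mapped above with respect to claim 5 may be sketched as an element-wise sum of three equal-length vectors. The embedding tables, their values, the three-element dimension, and the function names below are hypothetical and chosen for readability; they are not taken from MEHTA, DEVLIN, or the instant specification.

```python
# Illustrative sketch only: table contents and the dimension (3) are
# hypothetical; actual models use learned tables of much higher dimension.

def combine_embeddings(token_vec, position_vec, segment_vec):
    """Element-wise sum of three equal-length vectors."""
    if not (len(token_vec) == len(position_vec) == len(segment_vec)):
        raise ValueError("all three embeddings must have the same dimension")
    return [t + p + s for t, p, s in zip(token_vec, position_vec, segment_vec)]


# Hypothetical lookup tables: one vector per token, per position, per segment.
token_table = {"play": [1.0, 0.0, 2.0], "##ing": [0.0, 1.0, 1.0]}
position_table = [[0.5, 0.5, 0.5], [1.0, 1.0, 1.0]]
segment_table = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]


def embed(tokens, segment_ids):
    """Return one combined input embedding per token."""
    return [
        combine_embeddings(token_table[tok], position_table[pos], segment_table[seg])
        for pos, (tok, seg) in enumerate(zip(tokens, segment_ids))
    ]


inputs = embed(["play", "##ing"], [0, 0])
```

Each resulting vector is a single multidimensional numerical representation that carries the token identity, the token position, and the segment in which the token is located.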

Regarding claim 13, the MEHTA-LI combination discloses:
transformer-based natural language text (BERT is based on transformers and is used to process text; p. 4, section 4.2 and p. 6, section 4.3; proposed architecture utilizes pre-trained BERT; pp. 6-7, section 4.4 and Figs. 3 and 4) autoencoding (transformers use an encoder-decoder stack, e.g., an autoencoder; p. 4, section 4.2) program instructions:
The successive limitations in claim 13 claim program instructions that correspond to the computer-implemented method of claim 1, and therefore claim 13 is rejected under the same grounds under 35 U.S.C. 103 in view of the MEHTA-LI combination as explained above with respect to claim 1.

However, the MEHTA-LI combination fails to explicitly disclose:
computer program product 
one or more computer readable storage media, and program instructions collectively stored on the one or more computer readable storage media

However, in a related field of endeavor, DEVLIN introduces and describes the BERT transformer that MEHTA uses as part of its architecture.  The MEHTA-LI-DEVLIN combination makes obvious:
computer program product (DEVLIN explains that BERT is trained and fine-tuned using cloud TPUs or GPUs, and provides a reference to github source code for the BERT model, e.g., code is stored on a github server using computer storage, e.g., a computer program product; DEVLIN, p. 1, section 1 and p. 5, section 3.2; the MEHTA-LI-DEVLIN combination now implements BERT using computer storage, e.g., github server storage)
one or more computer readable storage media, and program instructions collectively stored on the one or more computer readable storage media, the program instructions comprising: (DEVLIN explains that BERT is trained and fine-tuned using cloud TPUs or GPUs, and provides a reference to github source code for the BERT model, e.g., code is stored on a github server using computer readable media, where the source code comprises program instructions; DEVLIN, p. 1, section 1 and p. 5, section 3.2; the MEHTA-LI-DEVLIN combination now implements BERT using computer storage, e.g., github server storage)

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of DEVLIN with MEHTA and LI. Indeed, MEHTA specifically cites to DEVLIN and the MEHTA architecture is based on the BERT transformer introduced and described in DEVLIN. One of ordinary skill would further be motivated to apply the teachings of DEVLIN to take advantage of bidirectional pre-training for language representation and to achieve state-of-the-art performance on token-level tasks, including natural language processing tasks. (DEVLIN, p. 2, section 1). As disclosed in DEVLIN, one of ordinary skill would further be motivated to use the pre-trained BERT transformer and the improved fine-tuning approaches proposed by DEVLIN. (DEVLIN, p. 1, section 1).

Claim 14 depends on claim 13 and claims a computer program product having program instructions that correspond to the computer-implemented method of claim 2, and is therefore rejected under the same grounds as claims 2 and 13 above.
Claim 15 depends on claim 14 and claims a computer program product having program instructions that correspond to the computer-implemented method of claim 3, and is therefore rejected under the same grounds as claims 3 and 14 above.
Claim 16 depends on claim 14 and claims a computer program product having program instructions that correspond to the computer-implemented method of claim 4, and is therefore rejected under the same grounds as claims 4 and 14 above.

Claim 20 claims a computer system comprising one or more processors, one or more computer-readable memories, and one or more computer-readable storage devices, and program instructions stored on at least one of the one or more storage devices for execution by at least one of the one or more processors via at least one of the one or more memories, where the remaining limitations in claim 20 claim program instructions that correspond to the computer-implemented method of claim 1 implemented using the storage devices and processors of claim 13, and therefore claim 20 is rejected under the same grounds under 35 U.S.C. 103 in view of the MEHTA-LI combination as explained above with respect to claims 1 and 13.

Claims 7, 8, and 12 are rejected under 35 U.S.C. 103 as being unpatentable over the MEHTA-LI combination further in view of Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems (2017) pp. 1-11, hereinafter referenced as VASWANI.

Regarding claim 7, the MEHTA-LI combination discloses the computer-implemented method of claim 1.  However, the MEHTA-LI combination fails to explicitly teach:
wherein the token self-attention portion adjusts an input token embedding according to a set of token attention weights, a token attention weight in the set of token attention weights corresponding to a relationship within the natural language text document between two tokens, the set of token attention weights computed during the training. 

However, in a related field of endeavor, VASWANI introduces and describes the transformer architecture that is explicitly cited in MEHTA (see MEHTA, p. 4, section 4.2).  The MEHTA-LI-VASWANI combination makes obvious:
wherein the token self-attention portion adjusts an input token embedding according to a set of token attention weights, (VASWANI discloses that in the transformer architecture, an attention function is computed as a weighted sum of values, e.g., token attention weights; VASWANI, pp. 3-4, section 3.2; MEHTA discloses using the transformer architecture and attention mechanisms described in VASWANI; MEHTA, p. 4, section 4.2; the MEHTA-LI-VASWANI combination now utilizes the attention function and weighted sum of values as explained in VASWANI and as incorporated in the BERT transformer architecture as modified by MEHTA; MEHTA, p. 4, section 4.2 and pp. 6-7, Figs. 3 and 4, with VASWANI, pp. 3-4, section 3.2) a token attention weight in the set of token attention weights corresponding to a relationship within the natural language text document between two tokens, the set of token attention weights computed during the training. (VASWANI discloses that an attention function maps a query, such as a first word or first token, with a set of key-value pairs, corresponding to a second word or a second token, and computing the weights based on the query and key-value pairs as a vector; VASWANI, pp. 3-4, section 3.2; VASWANI further discloses updating the transformer model, including the attention function, via training, VASWANI, pp. 7-8, section 5; MEHTA discloses using the transformer architecture and attention mechanisms described in VASWANI; MEHTA, p. 4, section 4.2; the MEHTA-LI-VASWANI combination now utilizes the attention function and weighted sum of values as explained in VASWANI and as incorporated in the BERT transformer architecture as modified by MEHTA; MEHTA, p. 4, section 4.2 and pp. 6-7, Figs. 3 and 4, with VASWANI, pp. 3-4, section 3.2 and pp. 7-8, section 5)

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of VASWANI with MEHTA and LI. Indeed, MEHTA specifically cites to VASWANI and the MEHTA architecture is based on the transformer architecture introduced and described in VASWANI. One of ordinary skill would further be motivated to apply the teachings of VASWANI to use an attention mechanism to draw global dependencies between the input and output of the transformer, e.g., input and output tokens. (VASWANI, p. 2, section 1). One of ordinary skill would further be motivated to apply the teachings of VASWANI to obtain significantly more parallelization coupled with reduced training time. (VASWANI, p. 2, section 1).
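
For illustration only, the attention function relied upon above with respect to claim 7 (a weighted sum of values, with weights computed from a query and a set of keys) may be sketched as scaled dot-product self-attention. The sketch assumes, for simplicity, that the token embeddings are used directly as queries, keys, and values; in a trained transformer these would be produced by learned projections whose parameters are adjusted during training. The function names and the two-element embeddings are hypothetical.

```python
import math

# Illustrative sketch only: embeddings are hypothetical, and the learned
# query/key/value projections of a real transformer are omitted.

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Return (weights, outputs). weights[i][j] relates token i to token j;
    outputs[i] is the weighted sum of the value vectors for token i."""
    d_k = len(keys[0])
    weights, outputs = [], []
    for q in queries:
        # Scaled dot product between this query and every key.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        row = softmax(scores)
        weights.append(row)
        outputs.append(
            [sum(w * v[j] for w, v in zip(row, values)) for j in range(len(values[0]))]
        )
    return weights, outputs


embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, adjusted = self_attention(embeddings, embeddings, embeddings)
```

Here each row of weights holds one attention weight per pair of tokens, and each entry of adjusted is an input token embedding adjusted according to those weights.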

Regarding claim 8, the MEHTA-LI combination discloses the computer-implemented method of claim 1.  However, the MEHTA-LI combination fails to explicitly teach:
wherein the metadata self-attention portion adjusts an input relativity embedding according to a set of metadata attention weights, the set of metadata attention weights computed during the training.

However, in a related field of endeavor, VASWANI introduces and describes the transformer architecture that is explicitly cited in MEHTA (see MEHTA, p. 4, section 4.2).  The MEHTA-LI-VASWANI combination makes obvious:
wherein the metadata self-attention portion adjusts an input relativity embedding according to a set of metadata attention weights, (VASWANI discloses that in the transformer architecture, an attention function is computed as a weighted sum of values, e.g., token attention weights; VASWANI, pp. 3-4, section 3.2; MEHTA discloses using the transformer architecture and attention mechanisms described in VASWANI; MEHTA, p. 4, section 4.2; LI discloses a 2-variable interaction matrix which represents a shared feature space representing information between 2 variables, e.g., a relativity matrix; LI, p. 2, section III.A and Fig. 2; the MEHTA-LI-VASWANI combination now utilizes the attention function and weighted sum of values as explained in VASWANI and as incorporated in the BERT transformer architecture as modified by MEHTA and applies such self-attention mechanism to the metadata embeddings computed by the MEHTA-LI combination as explained above with respect to claim 1; MEHTA, p. 4, section 4.2 and pp. 6-7, Figs. 3 and 4, with LI, p. 2, section III.A and Fig. 2; VASWANI, pp. 3-4, section 3.2) the set of metadata attention weights computed during the training. (VASWANI further discloses updating the transformer model, including the attention function, via training, VASWANI, pp. 7-8, section 5; MEHTA discloses using the transformer architecture and attention mechanisms described in VASWANI; MEHTA, p. 4, section 4.2; the MEHTA-LI-VASWANI combination now utilizes the attention function and weighted sum of values as explained in VASWANI and applies such self-attention mechanism to the metadata embeddings computed by the MEHTA-LI combination as explained above with respect to claim 1; MEHTA, p. 4, section 4.2 and pp. 6-7, Figs. 3 and 4, with LI, p. 2, section III.A and Fig. 2; and VASWANI, pp. 3-4, section 3.2 and pp. 7-8, section 5)

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of VASWANI with MEHTA and LI.  Indeed, MEHTA specifically cites VASWANI, and the MEHTA architecture is based on the transformer architecture introduced and described in VASWANI.  One of ordinary skill would further be motivated to apply the teachings of VASWANI to use an attention mechanism to draw global dependencies between input and output to the transformer, e.g., input and output tokens.  (VASWANI, p. 2, section 1).  One of ordinary skill would further be motivated to apply the teachings of VASWANI to obtain significantly more parallelization coupled with reduced training time.  (VASWANI, p. 2, section 1).

Regarding claim 12, the MEHTA-LI combination discloses the computer-implemented method of claim 11.  However, the MEHTA-LI combination fails to explicitly teach:
wherein the decoder attention portion adjusts an output of an encoder layer according to a set of attention weights, the set of attention weights computed during the training.

However, in a related field of endeavor, VASWANI introduces and describes the transformer architecture that is explicitly cited in MEHTA (see MEHTA, p. 4, section 4.2).  The MEHTA-LI-VASWANI combination makes obvious:
wherein the decoder attention portion adjusts an output of an encoder layer according to a set of attention weights, (VASWANI discloses that in the transformer architecture, an attention function is computed as a weighted sum of values, e.g., token attention weights; VASWANI, pp. 3-4, section 3.2; MEHTA discloses using the transformer architecture and attention mechanisms described in VASWANI; MEHTA, p. 4, section 4.2; the MEHTA-LI-VASWANI combination now utilizes the attention function and weighted sum of values as explained in VASWANI and as incorporated in the BERT transformer architecture as modified by MEHTA and applied to the decoder and its multi-head attention sub-layer as explained with respect to the mapping in claim 11 above; MEHTA, p. 4, section 4.2 and pp. 6-7, Figs. 3 and 4, with VASWANI, pp. 3-4, section 3.2) the set of attention weights computed during the training. (VASWANI further discloses updating the transformer model, including the attention function, via training, VASWANI, pp. 7-8, section 5; MEHTA discloses using the transformer architecture and attention mechanisms described in VASWANI; MEHTA, p. 4, section 4.2; the MEHTA-LI-VASWANI combination now utilizes the attention function and weighted sum of values as explained in VASWANI and as incorporated in the BERT transformer architecture as modified by MEHTA and applied to the decoder as explained with respect to the mapping in claim 11 above; MEHTA, p. 4, section 4.2 and pp. 6-7, Figs. 3 and 4, with VASWANI, pp. 3-4, section 3.2 and pp. 7-8, section 5)

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of VASWANI with MEHTA and LI.  Indeed, MEHTA specifically cites VASWANI, and the MEHTA architecture is based on the transformer architecture introduced and described in VASWANI.  One of ordinary skill would further be motivated to apply the teachings of VASWANI to use an attention mechanism to draw global dependencies between input and output to the transformer, e.g., input and output tokens.  (VASWANI, p. 2, section 1).  One of ordinary skill would further be motivated to apply the teachings of VASWANI to obtain significantly more parallelization coupled with reduced training time.  (VASWANI, p. 2, section 1).

Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over the MEHTA-LI combination and further in view of Tan, Chuanqi, et al. "A survey on deep transfer learning." International conference on artificial neural networks. (2018) pp. 270-279, hereinafter referenced as TAN.

Regarding claim 10, the MEHTA-LI combination discloses the computer-implemented method of claim 1, including the “wherein the training comprises” limitation (see claim 1).  MEHTA further discloses:
initializing a set of parameters of the token embedding portion to a base set of token embedding parameters; (MEHTA discloses using transfer learning with pre-trained BERT, which transfers the hyperparameters (and BERT has 110M parameters), e.g., parameters for the token embedding layers (see mapping in claim 1) are transferred; MEHTA, pp. 4-6, section 4.3 and p. 8 sections 4.4 and 5; the examiner notes that the broadest reasonable interpretation of “initializing a set of parameters” includes using a base set of parameters from an already-trained portion, as disclosed in para. 0060 in the instant specification)
initializing a set of parameters of the token self-attention portion to a base set of token self-attention parameters; (MEHTA discloses using transfer learning with pre-trained BERT, which transfers the hyperparameters (and BERT has 110M parameters), e.g., parameters for the token self-attention layers (see mapping in claim 1) are transferred; MEHTA, pp. 4-6, section 4.3 and p. 8 sections 4.4 and 5)
second training, generating the trained encoder model, the second training comprising adjusting the set of parameters of the encoder model. (MEHTA discloses that BERT hyperparameters are fine-tuned, e.g., trained and adjusted; MEHTA, pp. 8-9, sections 4.4 and 5; the MEHTA-LI combination now fine-tunes the modified architecture utilizing the 2-variable matrix of LI, to train and adjust architecture hyperparameters; MEHTA, pp. 8-9, sections 4.4 and 5 with LI, p. 2, section III.A and Fig. 2)

However, the MEHTA-LI combination fails to explicitly teach:
first training, generating a partially trained encoder model, the encoder model, the first training comprising adjusting a set of parameters of the relativity embedding portion and a set of parameters of the metadata self-attention portion while the set of parameters of the token embedding portion is set to the base set of token embedding parameters and the set of parameters of the token self-attention portion is set to the base set of token self-attention parameters; and

However, in a related field of endeavor, TAN pertains to transfer learning by using deep neural networks and the examiner notes that MEHTA also pertains to transfer learning (see MEHTA, p. 4, section 4.3).  The MEHTA-LI-TAN combination makes obvious:
first training, generating a partially trained encoder model, the encoder model, the first training comprising adjusting a set of parameters of the relativity embedding portion and a set of parameters of the metadata self-attention portion while the set of parameters of the token embedding portion is set to the base set of token embedding parameters and the set of parameters of the token self-attention portion is set to the base set of token self-attention parameters; and (TAN discloses network-based deep transfer learning including reuse of a partial network pre-trained in a source domain, where network structure and connection parameters are transferred to the target domain; TAN, p. 275, section 3.3; MEHTA discloses transfer learning from BERT to the MEHTA architecture and training in epochs; MEHTA, p. 4, section 4.3, pp. 6-7, section 4.4 and Figs. 3 and 4 and p. 8, section 5; the MEHTA-LI-TAN combination now utilizes the network-based deep transfer learning of TAN to partially reuse the pre-trained token self-attention and token embedding parameters of MEHTA, and then re-training and adjusting only the parameters relating to the relativity embedding portion and metadata self-attention portion in future training epochs as disclosed in MEHTA; MEHTA, p. 4, section 4.3, pp. 6-7, section 4.4 and Figs. 3 and 4 and p. 8, section 5 with TAN, p. 275, section 3.3)

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to apply the network-based deep transfer learning teachings of TAN to MEHTA and LI.  As disclosed in TAN, one of ordinary skill in the art would be motivated to utilize transfer learning to reduce the demand for training data and training time in the target domain.  (TAN, p. 271, section 1).  As further disclosed in TAN, one of ordinary skill in the art would be further motivated to transfer only a sub-network and fine-tune only such sub-network.  (TAN, p. 275, section 3.3).
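For illustration only, the two-stage training relied upon in the mapping above can be sketched as follows.  This is a toy sketch under stated assumptions and is not the training procedure of MEHTA, LI, or TAN: the parameter group names, the numeric values, the placeholder gradients, and the sgd_step helper are all hypothetical.

```python
# Hypothetical base (pre-trained) values for the transferred parameter groups.
BASE = {
    "token_embedding": [0.5, -0.2],
    "token_self_attention": [0.1, 0.3],
}

def sgd_step(params, grads, trainable, lr=0.1):
    """Apply one gradient step, updating only the groups named in `trainable`."""
    updated = {}
    for name, weights in params.items():
        if name in trainable:
            updated[name] = [w - lr * g for w, g in zip(weights, grads[name])]
        else:
            updated[name] = list(weights)  # frozen: values carried over unchanged
    return updated

# Transferred groups start at the base values; the new groups start at zero.
params = {
    "token_embedding": list(BASE["token_embedding"]),
    "token_self_attention": list(BASE["token_self_attention"]),
    "relativity_embedding": [0.0, 0.0],
    "metadata_self_attention": [0.0, 0.0],
}
# Placeholder gradients; a real model would compute these from a loss.
grads = {name: [1.0] * len(ws) for name, ws in params.items()}

# First training: adjust only the new groups while the transferred groups stay at base.
partially_trained = sgd_step(
    params, grads, trainable={"relativity_embedding", "metadata_self_attention"}
)
# Second training: adjust every parameter group.
trained = sgd_step(partially_trained, grads, trainable=set(params))
```

After the first call, the token embedding and token self-attention groups still equal their base values while the relativity embedding and metadata self-attention groups have changed; after the second call, all groups have been adjusted.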

Claims 17-19 are rejected under 35 U.S.C. 103 as being unpatentable over the MEHTA-LI-DEVLIN combination and further in view of Bhowmick et al., US 20190019233 A1, hereinafter referenced as BHOWMICK.

Regarding claim 17, the MEHTA-LI-DEVLIN combination discloses the computer program product of claim 13.  However, the MEHTA-LI-DEVLIN combination fails to explicitly teach:
wherein the stored program instructions are stored in the at least one of the one or more storage media of a local data processing system, and 
wherein the stored program instructions are transferred over a network from a remote data processing system.  

However, in a related field of endeavor, BHOWMICK discloses a real-time recommendation engine based on user interests that may be deployed in a cloud computing environment. (paras. 0003, 0118, and Fig. 9).  The MEHTA-LI-DEVLIN-BHOWMICK combination makes obvious:
wherein the stored program instructions are stored in the at least one of the one or more storage media of a local data processing system, and (BHOWMICK discloses that a host device 112, e.g., a local data processing system, includes a memory 118, e.g., storage media, and natural language processing system 122 that implements program instructions; BHOWMICK, paras. 0030, 0126-0130 and Figs. 1 and 9; LI discloses a recommender system that utilizes user profiles and item descriptions; LI, p. 1, Fig. 1; MEHTA discloses a fake news classification system for analyzing the uncontrolled dissemination of fake news over the Internet; MEHTA, p. 1, section 1; the MEHTA-LI-DEVLIN-BHOWMICK combination now applies the fake news classification architecture of MEHTA to a client-server architecture or cloud computing environment as disclosed in BHOWMICK, where the MEHTA architecture, including program instructions, may be stored on a host device; MEHTA, p. 1, section 1; with LI, p. 1, Fig. 1 and BHOWMICK, paras. 0030, 0126-0130 and Figs. 1 and 9)
wherein the stored program instructions are transferred over a network from a remote data processing system.  (BHOWMICK discloses that remote device 102 and host device 112 communicate over a network, which can be implemented as part of a cloud computing environment; BHOWMICK, paras. 0024-0026 and Fig. 1; BHOWMICK further discloses that program instructions can be downloaded via a network, such as part of a cloud computing environment; BHOWMICK, para. 0126 and Fig. 9; the MEHTA-LI-DEVLIN-BHOWMICK combination now applies the fake news classification architecture of MEHTA to a client-server architecture or cloud computing environment as disclosed in BHOWMICK, where the MEHTA architecture, including program instructions, may be downloaded to the host device from a remote device over a network such as the Internet; MEHTA, p. 1, section 1 and BHOWMICK, paras. 0024-0026, 0126 and Figs. 1 and 9)

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the cloud computing and client-server architecture teachings of BHOWMICK with MEHTA, LI, and DEVLIN.  As disclosed in BHOWMICK, one of ordinary skill in the art would be motivated to use the cloud computing features of BHOWMICK in order to take advantage of characteristics such as on-demand self-service, broad network access, resource pooling, rapid elasticity, and measured service. (BHOWMICK, paras. 0101-0107). As disclosed in BHOWMICK, one of ordinary skill in the art would further be motivated to use the cloud computing features of BHOWMICK to implement a provider’s application in a software-as-a-service (SaaS) model, so that the application may be accessible via various client devices and hosted on a cloud-based server.  (BHOWMICK, para. 0109).

Regarding claim 18, the MEHTA-LI-DEVLIN combination discloses the computer program product of claim 13.  However, the MEHTA-LI-DEVLIN combination fails to explicitly teach:
wherein the stored program instructions are stored in the at least one of the one or more storage media of a server data processing system, and
wherein the stored program instructions are downloaded over a network to a remote data processing system for use in a computer readable storage device associated with the remote data processing system.

However, in a related field of endeavor, BHOWMICK discloses a real-time recommendation engine based on user interests that may be deployed in a cloud computing environment. (paras. 0003, 0118, and Fig. 9).  The MEHTA-LI-DEVLIN-BHOWMICK combination makes obvious:
wherein the stored program instructions are stored in the at least one of the one or more storage media of a server data processing system, and (BHOWMICK discloses that a remote device 102, e.g., a server data processing system, includes a memory 108, e.g., storage media, that stores program instructions; BHOWMICK, paras. 0023, 0126-0130 and Figs. 1 and 9; LI discloses a recommender system that utilizes user profiles and item descriptions; LI, p. 1, Fig. 1; MEHTA discloses a fake news classification system for analyzing the uncontrolled dissemination of fake news over the Internet; MEHTA, p. 1, section 1; the MEHTA-LI-DEVLIN-BHOWMICK combination now applies the fake news classification architecture of MEHTA to a client-server architecture or cloud computing environment as disclosed in BHOWMICK, where the MEHTA architecture, including program instructions, may be stored on a server device; MEHTA, p. 1, section 1; with LI, p. 1, Fig. 1 and BHOWMICK, paras. 0023, 0126-0130 and Figs. 1 and 9)
wherein the stored program instructions are downloaded over a network to a remote data processing system for use in a computer readable storage device associated with the remote data processing system. (BHOWMICK discloses that remote device 102 and host device 112 communicate over a network, which can be implemented as part of a cloud computing environment; BHOWMICK, paras. 0024-0026 and Fig. 1; BHOWMICK further discloses that program instructions can be downloaded via a network, such as part of a cloud computing environment; BHOWMICK, para. 0126 and Fig. 9; the MEHTA-LI-DEVLIN-BHOWMICK combination now applies the fake news classification architecture of MEHTA to a client-server architecture or cloud computing environment as disclosed in BHOWMICK, where the MEHTA architecture, including program instructions, may be downloaded to the remote device from a host device over a network such as the Internet; MEHTA, p. 1, section 1 and BHOWMICK, paras. 0024-0026, 0126 and Figs. 1 and 9)

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the cloud computing and client-server architecture teachings of BHOWMICK with MEHTA, LI, and DEVLIN.  As disclosed in BHOWMICK, one of ordinary skill in the art would be motivated to use the cloud computing features of BHOWMICK in order to take advantage of characteristics such as on-demand self-service, broad network access, resource pooling, rapid elasticity, and measured service. (BHOWMICK, paras. 0101-0107). As disclosed in BHOWMICK, one of ordinary skill in the art would further be motivated to use the cloud computing features of BHOWMICK to implement a provider’s application in a software-as-a-service (SaaS) model, so that the application may be accessible via various client devices and hosted on a cloud-based server.  (BHOWMICK, para. 0109).

Regarding claim 19, the MEHTA-LI-DEVLIN combination discloses the computer program product of claim 13.  However, the MEHTA-LI-DEVLIN combination fails to explicitly teach:
wherein the computer program product is provided as a service in a cloud environment.

However, in a related field of endeavor, BHOWMICK discloses a real-time recommendation engine based on user interests that may be deployed in a cloud computing environment. (paras. 0003, 0118, and Fig. 9).  The MEHTA-LI-DEVLIN-BHOWMICK combination makes obvious:
wherein the computer program product is provided as a service in a cloud environment. (BHOWMICK discloses providing an application in a software-as-a-service model; BHOWMICK, para. 0109; the MEHTA-LI-DEVLIN-BHOWMICK combination now applies the fake news classification architecture of MEHTA to a cloud computing environment as disclosed in BHOWMICK, where the MEHTA fake news implementation is provided in a software-as-a-service model as disclosed in BHOWMICK; MEHTA, p. 1, section 1 and BHOWMICK, para. 0109 and Figs. 1 and 9)

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the cloud computing and client-server architecture teachings of BHOWMICK with MEHTA, LI, and DEVLIN.  As disclosed in BHOWMICK, one of ordinary skill in the art would be motivated to use the cloud computing features of BHOWMICK in order to take advantage of characteristics such as on-demand self-service, broad network access, resource pooling, rapid elasticity, and measured service. (BHOWMICK, paras. 0101-0107). As disclosed in BHOWMICK, one of ordinary skill in the art would further be motivated to use the cloud computing features of BHOWMICK to implement a provider’s application in a software-as-a-service (SaaS) model, so that the application may be accessible via various client devices and hosted on a cloud-based server.  (BHOWMICK, para. 0109).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
US 11500939 B2 (Aggarwal et al.) discloses a similarity search framework utilizing multiple modalities, including the use of metadata as part of a multi-modal graph. (col. 3, lines 59-61).
US 20210056428 A1 (Palowitch) discloses graph embeddings via metadata-orthogonal training, where metadata is utilized to enhance graph learning models.  (para. 0004).
US 20200279105 A1 (Muffat et al.) discloses a deep learning engine for context and context aware data classification.  Further discloses fine-tuning the BERT transformer where the embedded model can vectorize both metadata and content. (para. 0036).
Cho, Won Ik, et al. "Pay Attention to Categories: Syntax-Based Sentence Modeling with Metadata Projection Matrix." Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation. 2020 pp. 1-10.  Discloses a proposed system for document classification using a one-hot encoded projection matrix.  (page 4, figure 2).
Ravindranath, Manjusha, et al. "M2NN: Rare event inference through multi-variate multi-scale attention." 2020 IEEE International Conference on Smart Data Services (SMDS). 2020 pp. 53-62.  Discloses a metadata enriched multi-variate time series model, including representing a metadata graph as a metadata matrix.  (p. 55, figure 3).
Doshi, Ketan.  Transformers Explained Visually (Part 1).  Dec. 13, 2020.  https://web.archive.org/web/20201213160656/https://towardsdatascience.com/transformers-explained-visually-part-1-overview-of-functionality-95a6dd460452.  Discloses an overview of transformers, including those used by Google BERT, including an explanation of the encoder-decoder stack.
Doshi, Ketan.  Transformers Explained Visually (Part 2).  Jan. 2, 2021. https://web.archive.org/web/20210102144728/https://towardsdatascience.com/transformers-explained-visually-part-2-how-it-works-step-by-step-b49fa4a64f34.  Discloses differences between encoder and decoder models in a transformer and further describes the role of attention in a transformer.
Doshi, Ketan.  Transformers Explained Visually (Part 3).  Jan. 17, 2021. https://web.archive.org/web/20210117040743/https://towardsdatascience.com/transformers-explained-visually-part-3-multi-head-attention-deep-dive-1c1ff1024853.  Further explains multi-head attention.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHAEL C LEE whose telephone number is (571)272-4933. The examiner can normally be reached M-F 9:00 am - 5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders, can be reached at 571-272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/MICHAEL C. LEE/Examiner, Art Unit 2655
/DOUGLAS GODBOLD/Primary Examiner, Art Unit 2655