DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Drawings
The drawings are objected to as failing to comply with 37 CFR 1.84(p)(5) because they do not include the following reference sign mentioned in the description: “method 500” in paragraph 0092, line 1.
Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.
Specification
The disclosure is objected to because of the following informalities:
In paragraph 0005, lines 6-7, “can configured” should read “can be configured”.
In paragraph 0039, line 3, “cam be determined” should read “can be determined”.
In paragraph 0088, line 3, “first textual block representations 405B” should read “first textual block representations 405A”.
Figure 4B element 452 is cited inconsistently, as “textual blocks 452” in paragraph 0090, line 5, and paragraph 0091, lines 3, 5, and 8, and as “token embeddings 452” in paragraph 0090, line 8, and paragraph 0091, lines 1-2.
In paragraph 0096, line 5, “representations520B-520D” should read “representations 520B-520D”.
Appropriate correction is required.
Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. 
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitations use a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Such claim limitations are: “the sentence encoding portion” and “the document encoding portion” in claim 15.
Because these claim limitations are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, they are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have these limitations interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may:  (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1 – 3, 11 – 13 and 15 – 17 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Jiang et al. ("Semantic Text Matching for Long-Form Documents"), hereinafter Jiang.
Regarding claim 1, Jiang discloses a computer-implemented method for predicting semantic similarity between documents, the method comprising:
obtaining, by a computing system comprising one or more computing devices, a first document comprising a plurality of first sentences and a second document comprising a plurality of second sentences (Section 3.1, lines 20-24, "Given a source document ds and a set of candidate documents Dc, our goal is to estimate semantic similarity ŷ = Sim(ds, dc) between the source document ds and every candidate document dc ∈ Dc so that the target documents semantically matched to the source document have higher semantic similarity scores."; Section 3.1, lines 4-6, "To facilitate readability, we assume that there are three levels in hierarchy – paragraphs, sentences and words.");
parsing, by the computing system, the first document into a plurality of first textual blocks and the second document into a plurality of second textual blocks, wherein each of the plurality of first textual blocks comprises one or more of the plurality of first sentences and each of the plurality of second textual blocks comprises one or more of the plurality of second sentences (Section 3.1, lines 9-12, "Words in the d can be fitted into three hierarchical structures, i.e., Wp, Ws, and Ww, with depth 3 (paragraph-level), depth 2 (sentence-level), and depth 1 (word-level) respectively."; Figure 2, "The illustration of hierarchical structures with different depths for an example document. Pk and Sj are the structures of paragraphs and sentences."; The paragraphs and the sentences read on the textual blocks.);
processing, by the computing system, each of the plurality of first textual blocks with a block encoding portion of a first encoding submodel of a machine-learned semantic document encoding model to obtain a respective plurality of first textual block representations (Section 3.2, lines 7-12, "For each level, an attention-based hierarchical RNN (with corresponding level depth) is constructed as an encoder to generate representations for that level. For example, the paragraph-level encoder produces paragraph-level representations with a depth-3 encoder while the sentence-level encoder produces sentence-level representations with a depth-2 encoder."; The paragraph-level representations and sentence-level representations read on the textual block representations.);
processing, by the computing system, each of the plurality of second textual blocks with a block encoding portion of a second encoding submodel of the machine-learned semantic document encoding model to obtain a respective plurality of second textual block representations (Section 3.2, lines 7-12, "For each level, an attention-based hierarchical RNN (with corresponding level depth) is constructed as an encoder to generate representations for that level. For example, the paragraph-level encoder produces paragraph-level representations with a depth-3 encoder while the sentence-level encoder produces sentence-level representations with a depth-2 encoder."; The paragraph-level representations and sentence-level representations read on the textual block representations.);
respectively processing, by the computing system, the plurality of first textual block representations and the plurality of second textual block representations with a document encoding portion of the first encoding submodel and a document encoding portion of the second encoding submodel to obtain a first document encoding and a second document encoding (Section 3.2, lines 4-15, "For each document, MASH RNN derives an informative representation based on the knowledge from different levels of document structure. For each level, an attention-based hierarchical RNN (with corresponding level depth) is constructed as an encoder to generate representations for that level. For example, the paragraph-level encoder produces paragraph-level representations with a depth-3 encoder while the sentence-level encoder produces sentence-level representations with a depth-2 encoder. The final document representation is then acquired by concatenating the representations of different levels, comprehensively covering the knowledge in all document structure levels."; Concatenating the paragraph-level representations and sentence-level representations to acquire the document representation reads on processing the textual block representations to obtain a document encoding.);
and determining, by the computing system, a similarity metric descriptive of a semantic similarity between the first document and the second document based on the first document encoding and the second document encoding (Section 3.2, lines 15-21, "To estimate semantic similarity for semantic text matching, SMASH RNN adopts the Siamese structure with two MASH RNN towers. Given representations generated by MASH RNN for both the source and target documents, a fully-connected layer with nonlinearity infers a probabilistic score to examine the semantic relation between two documents with a sigmoid function.").
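For illustration only, the two-tower arrangement mapped above can be sketched as follows. This is a minimal sketch under stated assumptions, not Jiang's SMASH RNN: the attention-based hierarchical RNN encoders are replaced by simple mean pooling, and the names `mean_vector`, `encode_document`, and `similarity`, together with the weights and input vectors, are hypothetical.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equal-length vectors (stand-in for a learned encoder)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def encode_document(blocks):
    """One tower: encode each textual block, then pool the block representations.

    `blocks` is a list of textual blocks; each block is a list of word vectors.
    """
    block_representations = [mean_vector(block) for block in blocks]
    return mean_vector(block_representations)

def similarity(doc_a, doc_b, weights, bias):
    """Score two documents with one dense layer and a sigmoid over both encodings."""
    enc_a = encode_document(doc_a)  # first tower
    enc_b = encode_document(doc_b)  # second tower (same function, so same parameters)
    features = enc_a + enc_b        # list concatenation of the two document encodings
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical documents: two blocks each, two-dimensional word vectors.
doc_a = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]]]
doc_b = [[[0.5, 0.5]], [[1.0, 1.0]]]
score = similarity(doc_a, doc_b, weights=[0.1, 0.1, 0.1, 0.1], bias=0.0)
```

The sigmoid keeps the score strictly between 0 and 1, which corresponds to the probabilistic score described in the quoted passage of Section 3.2.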
Regarding claim 2, Jiang discloses the computer-implemented method as claimed in claim 1,
wherein the block encoding portion of the first encoding model comprises a multi-head self-attention mechanism (Section 3.2, lines 1-7, "Figure 3 and 4 illustrate the framework of our proposed Siamese multi-depth attention-based hierarchical RNN (SMASH RNN). Under the Siamese structure [32], each SMASH RNN has two multi-depth attention-based hierarchical RNN (MASH RNN) towers. For each document, MASH RNN derives an informative representation based on the knowledge from different levels of document structure."; The multi-depth attention-based hierarchical RNN reads on the multi-head self-attention mechanism.).
Regarding claim 3, Jiang discloses the computer-implemented method as claimed in claim 2, wherein processing, by the computing system, the plurality of first textual block representations with the document encoding portion of the first encoding submodel to obtain the first document encoding comprises:
processing, by the computing system, each of the plurality of first textual block representations with the document encoding portion of the first encoding submodel to obtain a respective plurality of contextual block representations (Section 2.3, lines 3-5, "Given an input sequence, the attention mechanism infers the importance of each position with a learnable context vector."; Section 3.2, lines 7-12, "For each level, an attention-based hierarchical RNN (with corresponding level depth) is constructed as an encoder to generate representations for that level. For example, the paragraph-level encoder produces paragraph-level representations with a depth-3 encoder while the sentence-level encoder produces sentence-level representations with a depth-2 encoder."; The paragraph-level representations and sentence-level representations generated by an attention-based hierarchical RNN where the attention mechanism infers the importance of each position with a learnable context vector read on obtaining contextual block representations.);
and determining, by the computing system, the first document encoding based at least in part on the plurality of contextual block representations (Section 3.2, lines 12-15, "The final document representation is then acquired by concatenating the representations of different levels, comprehensively covering the knowledge in all document structure levels."; Concatenating the paragraph-level representations and sentence-level representations to acquire the document representation reads on determining the document encoding based on the contextual block representations.).
Regarding claim 11, Jiang discloses the computer-implemented method as claimed in claim 1, wherein the method further comprises:
evaluating, by the computing system, a loss function that evaluates a difference between the similarity metric and ground truth data associated with the first document and the second document (Section 3.5, lines 2-6, "Given a tuple of training data (ds, dc, y), where y is a Boolean value showing whether two documents are semantically matched, SMASH RNN optimizes the binary cross-entropy [20] between the estimated probabilistic score ŷ and the gold standard y."; The binary cross-entropy reads on the loss function, the probabilistic score ŷ reads on the similarity metric, and the gold standard y reads on the ground truth data.);
and adjusting, by the computing system, one or more parameters of the machine-learned semantic document encoding model based at least in part on the loss function (Section 4.1, lines 18-19, "The Adam optimizer [26] is applied to optimize the parameters with an initial learning rate of 10^-5.").
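For illustration only, the loss evaluation and parameter adjustment mapped above can be sketched as follows. This is a minimal sketch under stated assumptions: the scorer is a single sigmoid unit, the Adam optimizer cited from Jiang is replaced by one plain gradient-descent step for brevity, and all names, feature values, and the learning rate are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_hat, y, eps=1e-12):
    """Binary cross-entropy between a probabilistic score y_hat and a label y in {0, 1}."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))

def predict(weights, features):
    """Probabilistic score from a single sigmoid unit (hypothetical scorer)."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)))

def gradient_step(weights, features, y, learning_rate):
    """One plain gradient-descent update (not Adam) under the loss above."""
    y_hat = predict(weights, features)
    # For a sigmoid output with binary cross-entropy, dLoss/dw_i = (y_hat - y) * x_i.
    return [w - learning_rate * (y_hat - y) * x for w, x in zip(weights, features)]

features = [0.75, 0.75, 0.75, 0.75]  # hypothetical combined document encodings
weights = [0.1, 0.1, 0.1, 0.1]       # hypothetical initial parameters
label = 1                            # ground truth: the documents match

loss_before = binary_cross_entropy(predict(weights, features), label)
updated = gradient_step(weights, features, label, learning_rate=0.5)
loss_after = binary_cross_entropy(predict(updated, features), label)
```

With a matching label, the update raises each weight and the loss on the same example decreases, which is the sense in which the parameters are adjusted based on the loss function.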
Regarding claim 12, Jiang discloses the computer-implemented method as claimed in claim 1,
wherein the machine-learned semantic document encoding model comprises a machine-learned siamese transformer neural network, and wherein the first encoder submodel and the second encoder submodel respectively comprise a first machine-learned transformer neural network and a second machine-learned transformer neural network of the machine-learned siamese transformer neural network (Section 3.2, lines 1-7, "Figure 3 and 4 illustrate the framework of our proposed Siamese multi-depth attention-based hierarchical RNN (SMASH RNN). Under the Siamese structure [32], each SMASH RNN has two multidepth attention-based hierarchical RNN (MASH RNN) towers. For each document, MASH RNN derives an informative representation based on the knowledge from different levels of document structure.").
Regarding claim 13, Jiang discloses the computer-implemented method as claimed in claim 1, wherein the similarity metric comprises:
a binary prediction whether the first document and the second document are semantically similar; or a predicted level of semantic similarity between the two documents (Section 3.1, lines 20-24, "Given a source document ds and a set of candidate documents Dc, our goal is to estimate semantic similarity ŷ = Sim(ds, dc) between the source document ds and every candidate document dc ∈ Dc so that the target documents semantically matched to the source document have higher semantic similarity scores.").
Regarding claim 15, Jiang discloses a computing system for training a machine-learned model for semantic document analysis, comprising:
one or more processors (Section 3.3, lines 10-11, “In this paper, we propose to model documents with information from different document structure levels.”; Section 3.3, lines 14-16, “The computation of encoders in MASH RNN follows a bottom-up principle with bidirectional recurrent neural networks (Bi-RNNs) with attention.”; Section 3.3, lines 30-32, “The backward pass processes the input sequence in reverse order and generates the backward hidden states”; Implementing a recurrent neural network, performing computations, and processing demonstrates the use of a processor.);

and a machine-learned semantic document encoding model comprising a first encoding submodel and a second encoding submodel, each of the first and second encoding submodels comprising a sentence encoding portion and a document encoding portion, wherein: the sentence encoding portion is configured to process a plurality of textual blocks to obtain a plurality of textual block representations (Section 3.1, lines 9-12, "Words in the d can be fitted into three hierarchical structures, i.e., Wp, Ws, and Ww, with depth 3 (paragraph-level), depth 2 (sentence-level), and depth 1 (word-level) respectively."; Section 3.2, lines 7-12, "For each level, an attention-based hierarchical RNN (with corresponding level depth) is constructed as an encoder to generate representations for that level. For example, the paragraph-level encoder produces paragraph-level representations with a depth-3 encoder while the sentence-level encoder produces sentence-level representations with a depth-2 encoder."; The paragraphs and the sentences read on the textual blocks, and the paragraph-level representations and sentence-level representations read on the textual block representations.);
and the document encoding portion is configured to process the plurality of textual block representations to obtain a plurality of contextual block representations (Section 2.3, lines 3-5, "Given an input sequence, the attention mechanism infers the importance of each position with a learnable context vector."; Section 3.2, lines 7-12, "For each level, an attention-based hierarchical RNN (with corresponding level depth) is constructed as an encoder to generate representations for that level. For example, the paragraph-level encoder produces paragraph-level representations with a depth-3 encoder while the sentence-level encoder produces sentence-level representations with a depth-2 encoder."; The paragraph-level representations and sentence-level representations generated by an attention-based hierarchical RNN where the attention mechanism infers the importance of each position with a learnable context vector read on obtaining contextual block representations.);
and one or more tangible, non-transitory computer readable media storing computer-readable instructions that when executed by the one or more processors cause the one or more processors to perform operations, the operations comprising: obtaining a plurality of first textual blocks and a plurality of second textual blocks, wherein each of the plurality of textual blocks and the plurality of second textual blocks respectively comprise one or more sentences of a first document and one or more sentences of a second document (Section 3.1, lines 20-24, "Given a source document ds and a set of candidate documents Dc, our goal is to estimate semantic similarity ŷ = Sim(ds, dc) between the source document ds and every candidate document dc ∈ Dc so that the target documents semantically matched to the source document have higher semantic similarity scores."; Section 3.1, lines 9-12, "Words in the d can be fitted into three hierarchical structures, i.e., Wp, Ws, and Ww, with depth 3 (paragraph-level), depth 2 (sentence-level), and depth 1 (word-level) respectively."; Documents ds and dc read on the first document and the second document, and the paragraphs and the sentences read on the textual blocks.);
processing the plurality of first textual blocks and the plurality of second textual blocks with the machine-learned semantic document encoding model to respectively obtain a plurality of first contextual block representations and a plurality of second contextual block representations (Section 2.3, lines 3-5, "Given an input sequence, the attention mechanism infers the importance of each position with a learnable context vector."; Section 3.2, lines 7-12, "For each level, an attention-based hierarchical RNN (with corresponding level depth) is constructed as an encoder to generate representations for that level. For example, the paragraph-level encoder produces paragraph-level representations with a depth-3 encoder while the sentence-level encoder produces sentence-level representations with a depth-2 encoder."; The paragraph-level representations and sentence-level representations generated by an attention-based hierarchical RNN where the attention mechanism infers the importance of each position with a learnable context vector read on obtaining contextual block representations.);
determining, based on at least one of the plurality of first contextual block representations and at least one of the plurality of second contextual block representations, a similarity metric descriptive of a semantic similarity between the first document and the second document (Section 3.2, lines 15-21, "To estimate semantic similarity for semantic text matching, SMASH RNN adopts the Siamese structure with two MASH RNN towers. Given representations generated by MASH RNN for both the source and target documents, a fully-connected layer with nonlinearity infers a probabilistic score to examine the semantic relation between two documents with a sigmoid function.");
evaluating a loss function that evaluates a difference between the similarity metric and ground truth data associated with the first document and the second document (Section 3.5, lines 2-6, "Given a tuple of training data (ds, dc, y), where y is a Boolean value showing whether two documents are semantically matched, SMASH RNN optimizes the binary cross-entropy [20] between the estimated probabilistic score ŷ and the gold standard y."; The binary cross-entropy reads on the loss function, the probabilistic score ŷ reads on the similarity metric, and the gold standard y reads on the ground truth data.);
and adjusting one or more parameters of the machine-learned semantic document encoding model based at least in part on the loss function (Section 4.1, lines 18-19, "The Adam optimizer [26] is applied to optimize the parameters with an initial learning rate of 10^-5.").
Regarding claim 16, Jiang discloses the computing system as claimed in claim 15, wherein processing the one or more first textual blocks and the one or more second textual blocks with the machine-learned semantic document encoding model to obtain the plurality of first contextual block representations and the plurality of second contextual block representations comprises:
processing each of the plurality of first textual blocks with the block encoding portion of the first encoding submodel to obtain a respective plurality of first textual block representations  (Section 3.1, lines 9-12, "Words in the d can be fitted into three hierarchical structures, i.e., Wp, Ws, and Ww, with depth 3 (paragraph-level), depth 2 (sentence-level), and depth 1 (word-level) respectively."; Section 3.2, lines 7-12, "For each level, an attention-based hierarchical RNN (with corresponding level depth) is constructed as an encoder to generate representations for that level. For example, the paragraph-level encoder produces paragraph-level representations with a depth-3 encoder while the sentence-level encoder produces sentence-level representations with a depth-2 encoder."; The paragraphs and the sentences read on the textual blocks, and the paragraph-level representations and sentence-level representations read on the textual block representations.);
processing each of the plurality of second textual blocks with the block encoding portion of the second encoding submodel to obtain a respective plurality of second textual block representations (Section 3.1, lines 9-12, "Words in the d can be fitted into three hierarchical structures, i.e., Wp, Ws, and Ww, with depth 3 (paragraph-level), depth 2 (sentence-level), and depth 1 (word-level) respectively."; Section 3.2, lines 7-12, "For each level, an attention-based hierarchical RNN (with corresponding level depth) is constructed as an encoder to generate representations for that level. For example, the paragraph-level encoder produces paragraph-level representations with a depth-3 encoder while the sentence-level encoder produces sentence-level representations with a depth-2 encoder."; The paragraphs and the sentences read on the textual blocks, and the paragraph-level representations and sentence-level representations read on the textual block representations.);
and respectively processing the plurality of first textual block representations and the plurality of second textual block representations with the document encoding portion of the first encoding submodel and the document encoding portion of the second encoding submodel to respectively obtain the plurality of first contextual block representations and the plurality of second contextual block representations (Section 2.3, lines 3-5, "Given an input sequence, the attention mechanism infers the importance of each position with a learnable context vector."; Section 3.2, lines 7-12, "For each level, an attention-based hierarchical RNN (with corresponding level depth) is constructed as an encoder to generate representations for that level. For example, the paragraph-level encoder produces paragraph-level representations with a depth-3 encoder while the sentence-level encoder produces sentence-level representations with a depth-2 encoder."; The paragraph-level representations and sentence-level representations generated by an attention-based hierarchical RNN where the attention mechanism infers the importance of each position with a learnable context vector read on obtaining contextual block representations.).
Regarding claim 17, Jiang discloses the computing system as claimed in claim 15,
wherein the one or more parameters of the machine-learned semantic document encoding model are shared between the first encoding submodel and the second encoding submodel (Section 3.4, lines 1-7, "The Siamese structure associated with two identical sub-networks were shown to be effective in measuring the affinity between representations of two documents modeled in the same hidden space [32, 38, 47, 54]. To address the problem of semantic text matching for long-form documents, we propose the Siamese multidepth attention-based hierarchical RNN (SMASH RNN) using a Siamese structure that fuses the outputs of two MASH RNNs."; The identical sub-networks read on one or more shared parameters between the first submodel and second submodel.).
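For illustration only, parameter sharing between two submodels can be sketched as follows. This is a minimal sketch under stated assumptions: the class `Tower` and its single parameter `scale` are hypothetical and are not taken from Jiang or from the application; the sketch only shows one common way two sub-networks can hold the same parameters.

```python
class Tower:
    """A toy encoder with one learnable parameter (hypothetical)."""

    def __init__(self, scale):
        self.scale = scale

    def encode(self, vector):
        return [self.scale * x for x in vector]

# Both submodels refer to the same object, so they share every parameter.
shared = Tower(scale=2.0)
first_submodel = shared
second_submodel = shared

# Adjusting the parameter once changes the behavior of both submodels.
shared.scale = 3.0
first_encoding = first_submodel.encode([1.0, 2.0])
second_encoding = second_submodel.encode([1.0, 2.0])
```

Because both names refer to one set of parameters, a single update applies to both towers and identical inputs yield identical encodings.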
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 4 – 5 are rejected under 35 U.S.C. 103 as being unpatentable over Jiang in view of Lu et al. (“TwinBERT: Distilling Knowledge to Twin-Structured BERT Models for Efficient Retrieval”), hereinafter Lu.
Regarding claim 4, Jiang discloses the computer-implemented method as claimed in claim 3, but does not specifically disclose: wherein processing, by the computing system, each of the plurality of first textual blocks with the sentence encoding portion of the first encoding submodel comprises: for each of the plurality of first textual blocks: processing, by the computing system, a respective first textual block with the sentence encoding portion of the first encoding submodel to obtain sentence tokens respectively corresponding to words in each of the one or more first sentences of the respective first textual block; concatenating, by the computing system, a first sentence token of the sentence tokens with a position embedding corresponding to the first sentence token to obtain a first textual block representation for the respective first textual block.
Lu teaches:
wherein processing, by the computing system, each of the plurality of first textual blocks with the sentence encoding portion of the first encoding submodel comprises: for each of the plurality of first textual blocks: processing, by the computing system, a respective first textual block with the sentence encoding portion of the first encoding submodel to obtain sentence tokens respectively corresponding to words in each of the one or more first sentences of the respective first textual block (Section 4.2, lines 12-13, "For token embeddings, TwinBERT uses the tri-letter based word embeddings introduced in [28].");
concatenating, by the computing system, a first sentence token of the sentence tokens with a position embedding corresponding to the first sentence token to obtain a first textual block representation for the respective first textual block (Section 4.2, lines 20-25, "BERT embeddings are combinations of three components: token embeddings, segment embeddings and position embeddings. While, the input of a TwinBERT encoder only contains one single sentence and segment embeddings are unnecessary. Therefore, the input embeddings only consist of the sum of token embeddings and position embeddings."; Combining token embeddings and position embeddings to generate BERT embeddings reads on concatenating sentence tokens and position embeddings to obtain textual block representations.).
Lu teaches generating embeddings by combining token embeddings and position embeddings in order to represent queries and documents for effective and efficient information retrieval (Abstract, lines 4-8, "We present Twin-BERT model for effective and efficient retrieval, which has twin-structured BERT-like encoders to represent query and document respectively and a crossing layer to combine the embeddings and produce a similarity score.").
Jiang and Lu are considered to be analogous to the claimed invention because they are in the same field of natural language processing.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Jiang to incorporate the teachings of Lu to generate embeddings by combining token embeddings and position embeddings.  Doing so would allow for representing queries and documents for effective and efficient information retrieval.
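For illustration only, the two ways of combining a token embedding with a position embedding discussed above can be sketched as follows. The quoted passage of Lu describes a sum of the two embeddings, while the claim recites concatenating; the sketch simply shows both operations side by side. The function names and the toy three-dimensional vectors are hypothetical and are not taken from Lu, Jiang, or the application.

```python
def combine_sum(token_emb, pos_emb):
    """Element-wise sum; the output keeps the input dimensionality."""
    if len(token_emb) != len(pos_emb):
        raise ValueError("embeddings must have the same length to be summed")
    return [t + p for t, p in zip(token_emb, pos_emb)]


def combine_concat(token_emb, pos_emb):
    """Concatenation; the output length is the sum of the input lengths."""
    return list(token_emb) + list(pos_emb)


# Toy values chosen only for the example (assumption).
token_emb = [0.5, -1.0, 2.0]
pos_emb = [0.25, 0.25, 0.5]

summed = combine_sum(token_emb, pos_emb)
concatenated = combine_concat(token_emb, pos_emb)
```

In this sketch the summed vector has three components and the concatenated vector has six, which is the only structural difference between the two combinations.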
Regarding claim 5, Jiang in view of Lu discloses the computer-implemented method as claimed in claim 4.
Lu further teaches:
wherein each of the sentence tokens comprises an attentional weight (Section 4.3, line 8, "weighted-average pooling introduces a weight to each token vector").
Lu teaches applying a weight to each token vector in order to represent queries and documents for effective and efficient information retrieval (Abstract, lines 4-8, "We present Twin-BERT model for effective and efficient retrieval, which has twin-structured BERT-like encoders to represent query and document respectively and a crossing layer to combine the embeddings and produce a similarity score.").
Jiang and Lu are considered to be analogous to the claimed invention because they are in the same field of natural language processing.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Jiang in view of Lu to further incorporate the teachings of Lu to apply a weight to each token vector.  Doing so would allow for representing queries and documents for effective and efficient information retrieval.
Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over Jiang in view of Lu as applied to claim 5 above, and further in view of Peters et al. (“Deep Contextualized Word Representations”), hereinafter Peters.
Regarding claim 6, Jiang in view of Lu discloses the computer-implemented method as claimed in claim 5.
Lu further teaches: wherein determining, by the computing system, the first document encoding comprises: determining, by the computing system based at least in part on the attentional weights of the sentence tokens of each of the plurality of first textual blocks, a weighted sum of the plurality of first textual block representations (Section 4.3, line 8, "weighted-average pooling introduces a weight to each token vector"; Section 4.1, lines 17-19, "The last and top layer of the encoder is the weighted pooling layer which applies a weighted sum of the final hidden vectors and produces a single embedding for each input sentence."; The weighted sum of the hidden vectors reads on the weighted sum of the textual block representations, and applying a weight to each token vector demonstrates that the weighted sum is based at least in part on the attentional weights of the sentence tokens.).
Lu teaches generating an embedding from a weighted sum of hidden vectors, and applying a weight to each token vector, in order to represent queries and documents for effective and efficient information retrieval (Abstract, lines 4-8, "We present Twin-BERT model for effective and efficient retrieval, which has twin-structured BERT-like encoders to represent query and document respectively and a crossing layer to combine the embeddings and produce a similarity score.").
Jiang and Lu are considered to be analogous to the claimed invention because they are in the same field of natural language processing.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Jiang in view of Lu to further incorporate the teachings of Lu to generate an embedding from a weighted sum of hidden vectors, and applying a weight to each token vector.  Doing so would allow for representing queries and documents for effective and efficient information retrieval.
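As a minimal sketch of the weighted pooling described in the cited passages of Lu, a single embedding may be formed as a weighted sum of per-token hidden vectors. The helper name `weighted_pool`, the hidden vectors, and the weights below are assumptions made for the example; how Lu obtains the weights is not shown here.

```python
def weighted_pool(vectors, weights):
    """Return the weighted sum of equal-length vectors, one weight per vector."""
    if len(vectors) != len(weights):
        raise ValueError("one weight is required per vector")
    dim = len(vectors[0])
    pooled = [0.0] * dim
    for vec, w in zip(vectors, weights):
        if len(vec) != dim:
            raise ValueError("all vectors must have the same length")
        for i, x in enumerate(vec):
            pooled[i] += w * x
    return pooled


# Three toy hidden vectors and their weights (illustrative values only).
hidden = [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0]]
weights = [0.5, 0.25, 0.25]

pooled = weighted_pool(hidden, weights)
```

The result is one vector of the same dimensionality as each hidden vector, corresponding to the "single embedding for each input sentence" in the quoted text.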
Jiang in view of Lu does not specifically disclose: concatenating, by the computing system, the weighted sum and the contextual block representation associated with at least one first textual block of the plurality of first textual blocks to determine the first document encoding.
Peters teaches:
concatenating, by the computing system, the weighted sum and the contextual block representation associated with at least one first textual block of the plurality of first textual blocks to determine the first document encoding (Section 3.3, lines 12-24, "Given a sequence of tokens (t1, ..., tN), it is standard to form a context-independent token representation xk for each token position using pre-trained word embeddings and optionally character-based representations. Then, the model forms a context-sensitive representation hk, typically using either bidirectional RNNs, CNNs, or feed forward networks. To add ELMo to the supervised model, we first freeze the weights of the biLM and then concatenate the ELMo vector ELMoktask with xk and pass the ELMo enhanced representation [xk; ELMoktask] into the task RNN."; The context-independent token representation xk reads on the textual block representation, the ELMo vector ELMoktask reads on the contextual block representation, and the ELMo enhanced representation [xk; ELMoktask] reads on the document encoding.).
Peters teaches concatenating context-independent representations and context-sensitive representations in order to generate text representations that improve natural language processing applications including question answering, textual entailment, and sentiment analysis (Abstract, lines 6-14, "Our word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pretrained on a large text corpus. We show that these representations can be easily added to existing models and significantly improve the state of the art across six challenging NLP problems, including question answering, textual entailment and sentiment analysis.").
Jiang, Lu, and Peters are considered to be analogous to the claimed invention because they are in the same field of natural language processing.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Jiang in view of Lu to incorporate the teachings of Peters to concatenate context-independent representations and context-sensitive representations.  Doing so would allow for generating text representations that improve natural language processing applications including question answering, textual entailment, and sentiment analysis.
Claims 7, 9 – 10 and 18 – 20 are rejected under 35 U.S.C. 103 as being unpatentable over Jiang in view of Zhang et al. ("HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document Summarization"), hereinafter Zhang.
Regarding claim 7, Jiang discloses the computer-implemented method as claimed in claim 3, but does not specifically disclose: wherein determining, by the computing system, the first document encoding comprises: concatenating, by the computing system, a sum of the plurality of first textual block representations and the contextual block representation associated with at least one textual block of the plurality of first textual blocks to determine the first document encoding; concatenating, by the computing system, a mean of the plurality of first textual block representations and the contextual block representation associated with the at least one textual block of the plurality of first textual blocks to determine the first document encoding; or determining, by the computing system, the contextual block representation associated with the at least one textual block of the plurality of first textual blocks to be the first document encoding.
Zhang teaches:
wherein determining, by the computing system, the first document encoding comprises: concatenating, by the computing system, a sum of the plurality of first textual block representations and the contextual block representation associated with at least one textual block of the plurality of first textual blocks to determine the first document encoding; concatenating, by the computing system, a mean of the plurality of first textual block representations and the contextual block representation associated with the at least one textual block of the plurality of first textual blocks to determine the first document encoding; or determining, by the computing system, the contextual block representation associated with the at least one textual block of the plurality of first textual blocks to be the first document encoding (Section 3.1, lines 41-49, "In analogy to the sentence encoder, as shown in Figure 1, the document encoder is yet another Transformer but applies on the sentence level. After running the Transformer on a sequence of sentence representations (ĥ1, ĥ2, ..., ĥ|D|), we obtain the context sensitive sentence representations (d1, d2, ..., d|D|). Now we have finished the encoding of a document with a hierarchical bidirectional transformer encoder HIBERT."; The context sensitive sentence representations obtained from the sentence representations read on the contextual block representation associated with the at least one textual block.).
Zhang teaches encoding a document by obtaining context sensitive sentence representations from sentence representations in order to perform document summarization (Section 1, lines 77-84, "In this paper, we propose HIBERT, which stands for HIerachical Bidirectional Encoder Representations from Transformers. We design an unsupervised method to pre-train HIBERT for document modeling. We apply the pre-trained HIBERT to the task of document summarization and achieve state-of-the-art performance on both the CNN/Dailymail and New York Times dataset.").
Jiang and Zhang are considered to be analogous to the claimed invention because they are in the same field of natural language processing.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Jiang to incorporate the teachings of Zhang to encode a document by obtaining context sensitive sentence representations from sentence representations.  Doing so would allow for performing document summarization.
Regarding claim 9, Jiang discloses the computer-implemented method as claimed in claim 1,
wherein, prior to obtaining the first document and the second document, the method comprises: obtaining, by the computing system, a plurality of textual training blocks from one or more training documents, wherein each of the plurality of textual training blocks comprises one or more sentences from the one or more training documents (Section 4.2, lines 11-13, "We adopt the largest publicly available Avocado Research Email Collection [36] as the experimental dataset for email attachment suggestion."; Section 4.2, lines 16-19, "To partition the emails into training, validation, and testing sets, the first 18-month emails are the training data while the following 2-month emails are utilized for validation."; Section 4.2, lines 41-45, "While using only the paragraph-level hierarchy, SMASH (P) has a similar accuracy and a better F1-score compared to HAN. After accordingly adding sentence- and word-level knowledge, SMASH (P+S) and SMASH (P+S+W) have further improved performance."; The Avocado Research Email Collection reads on the one or more training documents, and using paragraph-level hierarchy reads on obtaining a plurality of textual training blocks.);
and adjusting, by the computing system, one or more parameters of the machine-learned semantic document encoding model based at least in part on the pre-training loss function (Section 4.1, lines 18-19, "The Adam optimizer [26] is applied to optimize the parameters with an initial learning rate of 10^-5.").
Jiang does not specifically disclose: masking, by the computing system, one or more sentences of a textual training block of the plurality of textual training blocks to obtain a masked training block; processing, by the computing system, the plurality of textual training blocks with the machine-learned semantic document encoding model to obtain a respective plurality of contextual block representations, wherein the contextual block representation for the masked training block comprises a multi-class classification output comprising a predicted similarity between the masked training block and each of a plurality of additional masked training blocks from the training batch; evaluating, by the computing system, a pre-training loss function that evaluates a difference between the multi-class classification output and ground truth data associated with the masked training block and the plurality of additional masked training blocks.
Zhang teaches:
masking, by the computing system, one or more sentences of a textual training block of the plurality of textual training blocks to obtain a masked training block (Section 3.2, lines 15-19, "HIBERT aims to learn the representation of a document, where its basic units are sentences. Therefore, a natural way of pre-training a document level model (e.g., HIBERT) is to predict a sentence (or sentences) instead of a word (or words)."; Section 3.2, lines 28-30, "We randomly select 15% of the sentences in D and mask them."; Section 3.2, lines 67-70, "After the application of the above procedures to a document D = (S1, S2, ..., S|D|), we obtain the masked document D̃ = (S̃1, S̃2, ..., S̃|D|).");
processing, by the computing system, the plurality of textual training blocks with the machine-learned semantic document encoding model to obtain a respective plurality of contextual block representations (Section 3.2, lines 70-78, "Let K denote the set of indices of selected sentences in D. Now we are ready to predict the masked sentences M = {Sk | k ϵ K} using D̃. We first apply the hierarchical encoder HIBERT in Section 3.1 to D̃ and obtain its context sensitive sentence representations (d̃1, d̃2, ..., d̃|D|). We will demonstrate how we predict the masked sentence Sk = (w0k, w1k, w2k, ..., w|S|k) one word per step."; The context sensitive sentence representations read on the contextual block representations.),
wherein the contextual block representation for the masked training block comprises a multi-class classification output comprising a predicted similarity between the masked training block and each of a plurality of additional masked training blocks from the training batch (Section 3.2, lines 70-78, "Let K denote the set of indices of selected sentences in D. Now we are ready to predict the masked sentences M = {Sk | k ϵ K} using D̃. We first apply the hierarchical encoder HIBERT in Section 3.1 to D̃ and obtain its context sensitive sentence representations (d̃1, d̃2, ..., d̃|D|). We will demonstrate how we predict the masked sentence Sk = (w0k, w1k, w2k, ..., w|S|k) one word per step."; Section 3.2, lines 82-84, "As shown in Figure 1, we employ a Transformer decoder (Vaswani et al., 2017) to predict wjk with d̃k as its additional input."; Section 3.2, lines 117-119, "The probability of wjk given w0k, ..., wj-1k and D̃ is: p(wjk | w0:j-1k, D̃) = softmax(WO gj̃-1 ). Finally the probability of all masked sentences M given D̃ is p(M|D̃)"; The masked sentences M read on the masked training block, the masked document D̃ reads on the plurality of additional masked training blocks, and the probability of all masked sentences M given D̃ reads on the predicted similarity between the masked training block and each of a plurality of additional masked training blocks.);
evaluating, by the computing system, a pre-training loss function that evaluates a difference between the multi-class classification output and ground truth data associated with the masked training block and the plurality of additional masked training blocks (Section 3.5, lines 123-125, "The model above can be trained by minimizing the negative log-likelihood of all masked sentences given their paired documents."; The negative log-likelihood reads on the pre-training loss function, the paired documents read on the ground truth data, and minimizing the negative log-likelihood of all masked sentences given their paired documents reads on evaluating a difference between the multi-class classification output and ground truth data associated with the masked training block and the plurality of additional masked training blocks.).
Zhang teaches training a document encoding model by masking sentences in a training document, obtaining context sensitive sentence representations for the document, finding the probability that each sentence representation corresponds to a masked sentence from the document for all masked sentences in the document, and minimizing the negative log-likelihood of all masked sentences given their paired documents in order to perform document summarization (Section 1, lines 77-84, "In this paper, we propose HIBERT, which stands for HIerachical Bidirectional Encoder Representations from Transformers. We design an unsupervised method to pre-train HIBERT for document modeling. We apply the pre-trained HIBERT to the task of document summarization and achieve state-of-the-art performance on both the CNN/Dailymail and New York Times dataset.").
Jiang and Zhang are considered to be analogous to the claimed invention because they are in the same field of natural language processing.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Jiang to incorporate the teachings of Zhang to train a document encoding model by masking sentences in a training document, obtaining context sensitive sentence representations for the document, finding the probability that each sentence representation corresponds to a masked sentence from the document for all masked sentences in the document, and minimizing the negative log-likelihood of all masked sentences given their paired documents.  Doing so would allow for performing document summarization.
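The softmax probability and negative log-likelihood objective quoted from Zhang above can be sketched generically as follows. This is a textbook formulation under stated assumptions, not Zhang's implementation: the logits, the class count, and the function names are hypothetical, and the decoder that would produce the logits is omitted.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution over classes."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def negative_log_likelihood(logits, target_index):
    """Loss for one prediction: -log of the probability of the true class."""
    return -math.log(softmax(logits)[target_index])


# Toy scores over four candidate classes (illustrative values only).
logits = [2.0, 0.5, 0.5, -1.0]
probs = softmax(logits)
loss_correct = negative_log_likelihood(logits, 0)  # true class has highest score
loss_wrong = negative_log_likelihood(logits, 3)    # true class has lowest score
```

Minimizing this quantity raises the probability assigned to the ground-truth class, so the loss is smaller when the highest-scoring class is the correct one.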
Regarding claim 10, Jiang in view of Zhang discloses the computer-implemented method as claimed in claim 9.
Zhang further teaches:
wherein masking the one or more sentences of the masked training block comprises masking at least one word of each of the one or more sentences (Section 3.2, lines 46-48, "In 80% of the cases, we mask the selected sentence (i.e., we replace each word in the sentence with a mask token [MASK]).").
Zhang teaches masking words in a sentence when training a document encoding model in order to perform document summarization (Section 1, lines 77-84, "In this paper, we propose HIBERT, which stands for HIerachical Bidirectional Encoder Representations from Transformers. We design an unsupervised method to pre-train HIBERT for document modeling. We apply the pre-trained HIBERT to the task of document summarization and achieve state-of-the-art performance on both the CNN/Dailymail and New York Times dataset.").
Jiang and Zhang are considered to be analogous to the claimed invention because they are in the same field of natural language processing.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Jiang in view of Zhang to further incorporate the teachings of Zhang to mask words in a sentence when training a document encoding model.  Doing so would allow for performing document summarization.
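The sentence masking described in the cited passages of Zhang (select 15% of the sentences; in 80% of the cases replace each word of a selected sentence with a [MASK] token) can be sketched as below. The function name, the seed, and the sample document are assumptions for the example. The handling of the remaining 20% of cases is not covered by the quoted passages, so this sketch leaves those sentences unchanged, which is also an assumption.

```python
import random

MASK = "[MASK]"


def mask_document(sentences, select_rate=0.15, mask_prob=0.8, seed=0):
    """Return (masked_sentences, selected_indices) for a list of tokenized sentences."""
    rng = random.Random(seed)
    # Select at least one sentence so that short documents still yield a target.
    n_select = max(1, round(select_rate * len(sentences)))
    selected = sorted(rng.sample(range(len(sentences)), n_select))
    masked = [list(s) for s in sentences]
    for k in selected:
        if rng.random() < mask_prob:
            # Replace every word of the selected sentence with the mask token.
            masked[k] = [MASK] * len(sentences[k])
        # Otherwise the sentence is left as-is (assumption, see lead-in).
    return masked, selected


# A toy document of 20 two-word sentences (illustrative only).
doc = [["sentence", str(i)] for i in range(20)]
masked_doc, selected = mask_document(doc)
fully_masked, selected_all = mask_document(doc, mask_prob=1.0)
```

Each masked sentence keeps its original length, so every word position in a selected sentence is replaced by exactly one mask token.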
Regarding claim 18, Jiang discloses one or more tangible, non-transitory computer readable media storing computer-readable instructions that when executed by one or more processors cause the one or more processors to perform operations (Section 3.3, lines 10-11, “In this paper, we propose to model documents with information from different document structure levels.”; Section 3.3, lines 14-16, “The computation of encoders in MASH RNN follows a bottom-up principle with bidirectional recurrent neural networks (Bi-RNNs) with attention.”; Section 3.3, lines 30-32, “The backward pass processes the input sequence in reverse order and generates the backward hidden states”; Implementing a recurrent neural network, performing computations, and processing demonstrates the use of a processor.),
the operations comprising:
obtaining a plurality of textual training blocks from one or more training documents, wherein each of the plurality of textual training blocks comprises one or more sentences from the one or more training documents (Section 4.2, lines 11-13, "We adopt the largest publicly available Avocado Research Email Collection [36] as the experimental dataset for email attachment suggestion."; Section 4.2, lines 16-19, "To partition the emails into training, validation, and testing sets, the first 18-month emails are the training data while the following 2-month emails are utilized for validation."; Section 4.2, lines 41-45, "While using only the paragraph-level hierarchy, SMASH (P) has a similar accuracy and a better F1-score compared to HAN. After accordingly adding sentence- and word-level knowledge, SMASH (P+S) and SMASH (P+S+W) have further improved performance."; The Avocado Research Email Collection reads on the one or more training documents, and using paragraph-level hierarchy reads on obtaining a plurality of textual training blocks.);
processing each of the plurality of training blocks with a block encoding portion of a machine-learned semantic document encoding model to obtain a respective plurality of textual block representations (Section 3.2, lines 7-12, "For each level, an attention-based hierarchical RNN (with corresponding level depth) is constructed as an encoder to generate representations for that level. For example, the paragraph-level encoder produces paragraph-level representations with a depth-3 encoder while the sentence-level encoder produces sentence-level representations with a depth-2 encoder."; The paragraph-level representations and sentence-level representations read on the textual block representations.);
and adjusting one or more parameters of the machine-learned semantic document encoding model based at least in part on the pre-training loss function (Section 4.1, lines 18-19, "The Adam optimizer [26] is applied to optimize the parameters with an initial learning rate of 10^-5.").
Jiang does not specifically disclose: masking one or more sentences of a textual block representation of the plurality of textual block representations to obtain a masked block representation; adding the one or more masked sentences of each the masked block representation to a corpus of candidate sentences comprising a plurality of masked sentences from the one or more training documents; processing the plurality of textual block representations with a document encoding portion of the machine-learned semantic document encoding model to respectively obtain a plurality of contextual block representations, wherein the contextual block representation for the masked block representation comprises a multi-class classification of the one or more masked sentences of the masked block representation as being one or more respective sentences of the corpus of candidate sentences; evaluating a pre-training loss function that evaluates a difference between the multi-class classification for the masked block representation and ground truth data associated with the masked block representation and the corpus of candidate sentences.
Zhang teaches:
masking one or more sentences of a textual block representation of the plurality of textual block representations to obtain a masked block representation (Section 3.2, lines 15-19, "HIBERT aims to learn the representation of a document, where its basic units are sentences. Therefore, a natural way of pre-training a document level model (e.g., HIBERT) is to predict a sentence (or sentences) instead of a word (or words)."; Section 3.2, lines 28-30, "We randomly select 15% of the sentences in D and mask them.");
adding the one or more masked sentences of each the masked block representation to a corpus of candidate sentences comprising a plurality of masked sentences from the one or more training documents (Section 3.2, lines 67-70, "After the application of the above procedures to a document D = (S1, S2, ..., S|D|), we obtain the masked document D̃ = (S̃1, S̃2, ..., S̃|D|).");
processing the plurality of textual block representations with a document encoding portion of the machine-learned semantic document encoding model to respectively obtain a plurality of contextual block representations (Section 3.2, lines 70-78, "Let K denote the set of indices of selected sentences in D. Now we are ready to predict the masked sentences M = {Sk | k ϵ K} using D̃. We first apply the hierarchical encoder HIBERT in Section 3.1 to D̃ and obtain its context sensitive sentence representations (d̃1, d̃2, ..., d̃|D|). We will demonstrate how we predict the masked sentence Sk = (w0k, w1k, w2k, ..., w|S|k) one word per step."; The context sensitive sentence representations read on the contextual block representations.),
wherein the contextual block representation for the masked block representation comprises a multi-class classification of the one or more masked sentences of the masked block representation as being one or more respective sentences of the corpus of candidate sentences (Section 3.2, lines 70-78, "Let K denote the set of indices of selected sentences in D. Now we are ready to predict the masked sentences M = {Sk | k ϵ K} using D̃. We first apply the hierarchical encoder HIBERT in Section 3.1 to D̃ and obtain its context sensitive sentence representations (d̃1, d̃2, ..., d̃|D|). We will demonstrate how we predict the masked sentence Sk = (w0k, w1k, w2k, ..., w|S|k) one word per step."; Section 3.2, lines 82-84, "As shown in Figure 1, we employ a Transformer decoder (Vaswani et al., 2017) to predict wjk with d̃k as its additional input."; Section 3.2, lines 117-119, "The probability of wjk given w0k, ..., wj-1k and D̃ is: p(wjk | w0:j-1k, D̃) = softmax(WO gj̃-1 ). Finally the probability of all masked sentences M given D̃ is p(M|D̃)"; The masked sentences M read on the one or more masked sentences of the masked block representation, the masked document D̃ reads on the corpus of candidate sentences, and the probability of all masked sentences M given D̃ reads on the multi-class classification of the one or more masked sentences as being one or more respective sentences of the corpus of candidate sentences.);
evaluating a pre-training loss function that evaluates a difference between the multi-class classification for the masked block representation and ground truth data associated with the masked block representation and the corpus of candidate sentences (Section 3.5, lines 123-125, "The model above can be trained by minimizing the negative log-likelihood of all masked sentences given their paired documents."; The negative log-likelihood reads on the pre-training loss function, the paired documents read on the ground truth data, and minimizing the negative log-likelihood of all masked sentences given their paired documents reads on evaluating a difference between the multi-class classification for the masked block representation and ground truth data associated with the masked block representation and the corpus of candidate sentences.).
Zhang teaches training a document encoding model by masking sentences in a training document, obtaining context sensitive sentence representations for the document, finding the probability that each sentence representation corresponds to a masked sentence from the document for all masked sentences in the document, and minimizing the negative log-likelihood of all masked sentences given their paired documents in order to perform document summarization (Section 1, lines 77-84, "In this paper, we propose HIBERT, which stands for HIerachical Bidirectional Encoder Representations from Transformers. We design an unsupervised method to pre-train HIBERT for document modeling. We apply the pre-trained HIBERT to the task of document summarization and achieve state-of-the-art performance on both the CNN/Dailymail and New York Times dataset.").
Jiang and Zhang are considered to be analogous to the claimed invention because they are in the same field of natural language processing.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Jiang to incorporate the teachings of Zhang to train a document encoding model by masking sentences in a training document, obtaining context sensitive sentence representations for the document, finding the probability that each sentence representation corresponds to a masked sentence from the document for all masked sentences in the document, and minimizing the negative log-likelihood of all masked sentences given their paired documents.  Doing so would allow for performing document summarization.
Regarding claim 19, Jiang in view of Zhang discloses the one or more tangible, non-transitory media as claimed in claim 18.
Zhang further teaches:
wherein each sentence of the masked block representation is replaced with a respective masking token (Section 3.2, lines 46-48, "In 80% of the cases, we mask the selected sentence (i.e., we replace each word in the sentence with a mask token [MASK]).").
Zhang teaches replacing masked sentences with mask tokens when training a document encoding model in order to perform document summarization (Section 1, lines 77-84, "In this paper, we propose HIBERT, which stands for HIerachical Bidirectional Encoder Representations from Transformers. We design an unsupervised method to pre-train HIBERT for document modeling. We apply the pre-trained HIBERT to the task of document summarization and achieve state-of-the-art performance on both the CNN/Dailymail and New York Times dataset.").
Jiang and Zhang are considered to be analogous to the claimed invention because they are in the same field of natural language processing.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Jiang in view of Zhang to further incorporate the teachings of Zhang to replace masked sentences with mask tokens when training a document encoding model.  Doing so would allow for performing document summarization.
Regarding claim 20, Jiang in view of Zhang discloses the one or more tangible, non-transitory media as claimed in claim 19.
Zhang further teaches:
wherein the multi-class classification of the one or more masked sentences of the masked block representation as being the one or more respective sentences of the corpus of candidate sentences comprises, for each of the one or more masked sentences, a predicted similarity between a respective masked sentence and each candidate sentence of the corpus of candidate sentences (Section 3.2, lines 70-78, "Let K denote the set of indices of selected sentences in D. Now we are ready to predict the masked sentences M = {Sk | k ϵ K} using D̃. We first apply the hierarchical encoder HIBERT in Section 3.1 to D̃ and obtain its context sensitive sentence representations (d̃1, d̃2, ..., d̃|D|). We will demonstrate how we predict the masked sentence Sk = (w0k, w1k, w2k, ..., w|S|k) one word per step."; Section 3.2, lines 82-84, "As shown in Figure 1, we employ a Transformer decoder (Vaswani et al., 2017) to predict wjk with d̃k as its additional input."; Section 3.2, lines 117-119, "The probability of wjk given w0k, ..., wj-1k and D̃ is: p(wjk | w0:j-1k, D̃) = softmax(WO gj̃-1 ). Finally the probability of all masked sentences M given D̃ is p(M|D̃)"; The masked sentences M read on the respective masked sentence, the masked document D̃ reads on the corpus of candidate sentences, and the probability of all masked sentences M given D̃ reads on the predicted similarity between a respective masked sentence and each candidate sentence of the corpus of candidate sentences.).
Zhang teaches training a document encoding model by finding the probability that each sentence representation corresponds to a masked sentence from the document for all masked sentences in the document (Section 1, lines 77-84, "In this paper, we propose HIBERT, which stands for HIerachical Bidirectional Encoder Representations from Transformers. We design an unsupervised method to pre-train HIBERT for document modeling. We apply the pre-trained HIBERT to the task of document summarization and achieve state-of-the-art performance on both the CNN/Dailymail and New York Times dataset.").
Jiang and Zhang are considered to be analogous to the claimed invention because they are in the same field of natural language processing.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Jiang in view of Zhang to further incorporate the teachings of Zhang to train a document encoding model by finding the probability that each sentence representation corresponds to a masked sentence from the document for all masked sentences in the document.  Doing so would allow for performing document summarization.
Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over Jiang in view of Felderman et al. (US Patent No. 11,146,613), hereinafter Felderman.
Regarding claim 8, Jiang discloses the computer-implemented method as claimed in claim 1, but does not specifically disclose: wherein: each of the plurality of first textual blocks comprises a textual capacity; and parsing, by the computing system, the first document into the plurality of first textual blocks comprises: determining, by the computing system, that a sentence parsed from the plurality of first sentences to a textual block of the plurality of first textual blocks would exceed the textual capacity of the textual block; and parsing, by the computing system, the sentence to a second textual block of the plurality of first textual blocks.
Felderman teaches:
each of the plurality of first textual blocks comprises a textual capacity (Column 6, lines 51-59, "Further, the quantity and size (or storage capacity) of the data blocks may vary based on the desired amount of parallel processing. For example, an optimal processing arrangement may include one data block for each processing node. In this case, the quantity of data blocks may be equal to the quantity of processing nodes, and the size of the data blocks may be equal to the size of the document divided by the quantity of processing nodes (e.g., data block size=(size of document/quantity of processing nodes))."; The size of the data blocks reads on the textual capacity.);
and parsing, by the computing system, the first document into the plurality of first textual blocks comprises: determining, by the computing system, that a sentence parsed from the plurality of first sentences to a textual block of the plurality of first textual blocks would exceed the textual capacity of the textual block (Column 6, lines 26-30, "The data within the data block is compared to information within the schema for the document to identify boundaries for the various logical units (or sections) within the document, and determine whether a partition has occurred within a logical unit."; A partition occurring within a logical unit reads on a sentence exceeding the textual capacity of the textual block.);
and parsing, by the computing system, the sentence to a second textual block of the plurality of first textual blocks (Column 6, lines 30-50, "If the data block has not been partitioned on the desired logical boundary (e.g., has been partitioned within a logical unit (e.g., the data block contains an incomplete logical unit where one or more remaining portions of the logical unit reside on other data blocks)), the remaining portion completing the logical unit is extracted from a succeeding data block stored among processing nodes 150 at step 415. The extracted portion is added to the end of the data block being processed at step 420 to complete the logical unit and enable the data block to be partitioned on a logical boundary (e.g., page, etc.). However, the data may be shifted in any manner to either prior or succeeding data blocks to complete a logical unit (e.g., data from a succeeding data block may be appended to a prior data block to complete a logical unit as described above, data from a prior data block may be inserted into a succeeding data block to complete a logical unit, data block content may be shifted or adjusted in any manner, etc.). Moreover, data may be retrieved from any quantity of other (succeeding or preceding) data blocks to complete logical units."; Column 7, lines 3-7, "Further, the data block size may be regulated, where data is shifted between data blocks in a manner enabling each data block to be within a predetermined or threshold amount (e.g., a specific quantity, a percentage of the size of the data block, etc.) of a specified data block size."; Shifting a portion of a logical unit to a succeeding data block to complete a logical unit reads on parsing the sentence to a second textual block.).
Felderman teaches dividing data into data blocks and shifting a portion of a logical unit to a succeeding data block to complete a logical unit if a partition occurs within a logical unit in order to partition a document into sub-documents and process the sub-documents in parallel (Column 1, lines 44-49, "The system partitions a document into a plurality of data blocks, wherein each data block comprises one or more complete logical units of the document. A plurality of sub-documents is produced from the plurality of data blocks. The sub-documents are processed in parallel by a plurality of processing elements.").
Jiang and Felderman are considered to be analogous to the claimed invention because they are in the same field of natural language processing.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Jiang to incorporate the teachings of Felderman to divide data into data blocks and shift a portion of a logical unit to a succeeding data block to complete a logical unit if a partition occurs within a logical unit.  Doing so would allow for partitioning a document into sub-documents and processing the sub-documents in parallel.  
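For illustration only, the capacity-based parsing recited in claim 8 can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Jiang, Felderman, or the application: the function name pack_sentences, the use of a whitespace word count as the measure of textual capacity, and the choice to place an oversized sentence alone in its own block are all hypothetical.

```python
from typing import List


def pack_sentences(sentences: List[str], capacity: int) -> List[List[str]]:
    """Assign whole sentences to textual blocks without splitting a sentence.

    Assumption: "textual capacity" is measured in whitespace-separated words.
    """
    blocks: List[List[str]] = [[]]
    used = 0  # words already placed in the current (last) block

    for sentence in sentences:
        size = len(sentence.split())
        # Determine whether the sentence would exceed the capacity of the
        # current block; if so, parse it to a second (new) block instead.
        if blocks[-1] and used + size > capacity:
            blocks.append([])
            used = 0
        # Assumption: a sentence longer than the capacity is still kept
        # whole and simply occupies a block by itself.
        blocks[-1].append(sentence)
        used += size

    # Drop the initial empty block when no sentences were supplied.
    return [block for block in blocks if block]


blocks = pack_sentences(["a b c", "d e", "f g h i"], capacity=5)
```

In this sketch the first two sentences fill the first block exactly, and the third sentence is moved to a second block because adding it would exceed the assumed capacity of five words.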
Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Jiang in view of Huang et al. (“Embedding-based Retrieval in Facebook Search”), hereinafter Huang.
Regarding claim 14, Jiang discloses the computer-implemented method as claimed in claim 1, but does not specifically disclose: wherein the method further comprises indexing, by the computing system for a search system, the first document encoding as a representation of the first document.
Huang teaches:
wherein the method further comprises indexing, by the computing system for a search system, the first document encoding as a representation of the first document (Abstract, lines 9-12, "We introduce the unified embedding framework developed to model semantic embeddings for personalized search, and the system to serve embedding-based retrieval in a typical search system based on an inverted index."; Section 2.3, lines 1-6, "To learn embeddings that are optimizing the triplet loss, our model comprises three major components: a query encoder EQ = f (Q) which produces a query embedding, a document encoder ED = g(D) which produces a document embedding, and a similarity function S(EQ, ED) which produces a score between query Q and document D."; The document embedding reads on the document encoding.).
Huang teaches implementing a search system that indexes document embeddings in order to provide semantic matching in search retrieval (Section 7, lines 1-9, "It has long term benefits to introduce semantic embeddings into search retrieval to address the semantic matching issues by leveraging the advancement on deep learning research. However, it is also a highly challenging problem due to the modeling difficulty, system implementation and cross-stack optimization complexity, especially for a large-scale personalized social search engine. In this paper, we presented our approach of unified embedding to model semantics for social search, and the implementation of embedding-based retrieval in a classical inverted index based search system.").
Jiang and Huang are considered to be analogous to the claimed invention because they are in the same field of natural language processing.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Jiang to incorporate the teachings of Huang to implement a search system that indexes document embeddings.  Doing so would allow for providing semantic matching in search retrieval.
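For illustration only, indexing a document encoding for a search system can be sketched as follows. This is a minimal sketch under stated assumptions and is not the system of Huang or of the application: the class name EmbeddingIndex, its methods, and the document identifiers are hypothetical, and a brute-force cosine-similarity scan is substituted for the inverted-index-based retrieval that Huang describes.

```python
import math
from typing import Dict, List, Tuple


def _normalize(vector: List[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot index or query a zero vector")
    return [x / norm for x in vector]


class EmbeddingIndex:
    """Hypothetical in-memory index mapping document IDs to encodings."""

    def __init__(self) -> None:
        self._vectors: Dict[str, List[float]] = {}

    def add(self, doc_id: str, encoding: List[float]) -> None:
        # Store the document encoding as the representation of the document.
        self._vectors[doc_id] = _normalize(encoding)

    def search(self, query: List[float], top_k: int = 1) -> List[Tuple[str, float]]:
        # Score every indexed document against the query encoding
        # (brute force; an assumption made to keep the sketch short).
        q = _normalize(query)
        scored = [
            (doc_id, sum(a * b for a, b in zip(q, vec)))
            for doc_id, vec in self._vectors.items()
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


index = EmbeddingIndex()
index.add("doc_a", [1.0, 0.0])
index.add("doc_b", [0.0, 1.0])
results = index.search([0.9, 0.1], top_k=1)
```

Here the query encoding lies closest to the encoding stored for the hypothetical document "doc_a", so that document is returned first.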
Conclusion
The art made of record and not relied upon is considered pertinent to applicant's disclosure.
Yang et al. (Yang, Liu, Mingyang Zhang, Cheng Li, Michael Bendersky, and Marc Najork, “Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching”, 2020, Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 1725-1734.) teaches the use of a Siamese multi-depth attention-based hierarchical recurrent neural network for learning long document representations for document matching.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to James Boggs whose telephone number is (571)272-2968. The examiner can normally be reached M-F 8:00 AM - 5:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel Washburn, can be reached at (571)272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/JAMES BOGGS/Examiner, Art Unit 2657

/DANIEL C WASHBURN/Supervisory Patent Examiner, Art Unit 2657