DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority
Acknowledgment is made of applicant’s claim for foreign priority under 35 U.S.C. 119 (a)-(d). The certified copy has been filed in parent Application No. 202010034773, filed on January 14, 2020.

Information Disclosure Statement
The information disclosure statements (IDS) were submitted on 06/29/2021, 04/08/2022, and 09/28/2022. The submissions are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1, 8-9, 10-11, 17, and 19-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
The independent claims 1, 11, and 20 recite:
acquiring a word sequence obtained by performing word segmentation on two paragraphs in a text, wherein the word sequence comprises at least one specified identifier for replacing a first word;
inputting the word sequence into a to-be-trained natural language processing model to generate a word vector corresponding to a second word in the word sequence, wherein the word vector is used to represent the second word in the word sequence and a position of the second word;
inputting the word vector into a preset processing layer of the to-be-trained natural language processing model, wherein the preset processing layer comprises an encoder and a decoder;
predicting whether the two paragraphs are adjacent, and the replaced first word in the two paragraphs, to obtain a prediction result, based on a processing result output by the preset processing layer; and
acquiring reference information of the two paragraphs, and training the to-be-trained natural language processing model to obtain a trained natural language processing model, based on the prediction result and the reference information, wherein the reference information comprises adjacent information indicating whether the two paragraphs are adjacent, and the replaced first word. 
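
For illustration only, the data-preparation steps recited above (segmenting two paragraphs, replacing a first word with a specified identifier, and recording the reference information) can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from the application or from the cited references: the whitespace segmentation, the `[MASK]` and `[SEP]` strings, and the names `segment` and `build_sample` are all hypothetical.

```python
import random

MASK = "[MASK]"  # hypothetical "specified identifier"
SEP = "[SEP]"    # hypothetical paragraph separator


def segment(paragraph):
    """Whitespace word segmentation (an assumption; real segmenters differ)."""
    return paragraph.split()


def build_sample(paragraph_a, paragraph_b, adjacent, rng):
    """Return (word_sequence, reference_info) for one training sample."""
    words = segment(paragraph_a) + [SEP] + segment(paragraph_b)
    # Choose one non-separator position and replace that word with the identifier.
    candidates = [i for i, w in enumerate(words) if w != SEP]
    position = rng.choice(candidates)
    replaced_word = words[position]
    words[position] = MASK
    # Reference information: the adjacency flag and the word that was replaced.
    reference = {
        "adjacent": adjacent,
        "replaced_word": replaced_word,
        "position": position,
    }
    return words, reference


rng = random.Random(0)
sequence, reference = build_sample("my dog is cute", "he likes playing", True, rng)
```

Writing the recorded word back at the recorded position recovers the original word sequence, which is the sense in which the reference information serves as the training target in this sketch.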

The claims relate to organizing human activity. Specifically, a human could perform the following:
receiving, from another human, written words corresponding to two paragraphs in a text (the written words including a specified identifier or label);  
writing down the word sequence (i.e., a first list) of words previously obtained and, considering pre-defined criteria (i.e., model), generating a second list (i.e., word vector) corresponding to a word present in the first list of words obtained (which represents the word and its position in the first list);
writing down the second list (i.e., preset processing layer) and considering predefined criteria (i.e., model);
predicting or determining if the two paragraphs are adjacent or not;
receiving reference information of the two paragraphs and re-defining the pre-defined criteria (i.e., model) considering the prediction above and the reference information (e.g., location related / adjacency).


This judicial exception is not integrated into a practical application because, for example, claim 1 recites “encoder” and “decoder”; claim 11 recites “processor” and “storage device”; and claim 20 recites “computer readable medium” and “computer program”. As an example, in [0026, 0040 and 0042] and [0117] of the as-filed specification, “[0026] In a third aspect, an embodiment of the present disclosure provides an electronic device, including: one or more processors; and a storage device for storing one or more programs, and the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method as described in any one of the embodiments of the method for processing information.”, “[0040] As shown in Fig. 1, the system architecture 100 may include terminal devices 101, 102 and 103, network 104, and a server 105.” and “[0042] The terminal devices 101, 102, 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, the terminal devices 101, 102, 103 may be various electronic devices having display screens, including but not limited to, a smart phone, a tablet computer, an electronic book reader, a laptop computer and a desktop computer.” and “[0117] …The execution body may use the encoder to read an input word vector and encode it into an intermediate representation. Then, the execution body may use the decoder to further process the intermediate representation and output the processed word vector.” Therefore, the specification describes a general-purpose computer or computing device that is used merely as a tool to apply the abstract idea. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.
The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the integration of the abstract idea into a practical application, the additional element of using a computer amounts to no more than a general computing device, as noted. The claims are not patent eligible.

With respect to claims 8 and 17, the claims recite: 
wherein the inputting the word sequence into a to-be-trained natural language processing model to generate a word vector corresponding to a second word in the word sequence, comprises:
inputting the word sequence into an embedding layer of the to-be-trained natural language processing model;
converting, for the second word in the word sequence, the second word into an identifier of the second word through the embedding layer, and converting the identifier of the second word into a first vector;
converting position information of the second word in the word sequence into a second vector through the embedding layer;
determining paragraph position information indicating a paragraph in which the second word is located in the two paragraphs through the embedding layer, and converting the paragraph position information into a third vector; and
splicing the first vector, the second vector, and the third vector to obtain a word vector corresponding to the second word.
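
For illustration only, the embedding steps recited in claims 8 and 17 (word to identifier to first vector, position to second vector, paragraph position to third vector, then splicing) can be sketched as below. This is a sketch under stated assumptions, not the applicant's implementation: the toy lookup tables, the vocabulary, the dimension `DIM`, and the names `make_table` and `word_vector` are hypothetical, and "splicing" is read here as concatenation.

```python
def make_table(rows, dim, seed):
    """Toy deterministic lookup table standing in for learned embeddings."""
    return [[((seed + 3 * r + k) % 10) / 10.0 for k in range(dim)] for r in range(rows)]


# Hypothetical vocabulary mapping each word to its identifier.
vocab = {"my": 0, "dog": 1, "is": 2, "cute": 3, "he": 4, "likes": 5, "playing": 6}
DIM = 4
token_table = make_table(len(vocab), DIM, seed=1)
position_table = make_table(16, DIM, seed=2)
paragraph_table = make_table(2, DIM, seed=3)


def word_vector(word, position, paragraph_index):
    word_id = vocab[word]                      # word -> identifier
    first = token_table[word_id]               # identifier -> first vector
    second = position_table[position]          # position in sequence -> second vector
    third = paragraph_table[paragraph_index]   # paragraph position -> third vector
    return first + second + third              # splice (list concatenation)


vec = word_vector("likes", position=5, paragraph_index=1)
```

With concatenation, the resulting word vector has three times the dimension of each component, and each component remains recoverable as a contiguous slice.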

The claims relate to organizing human activity. Specifically, a human could perform the following:
wherein the process of writing down the word sequence (i.e., a first list) of words previously obtained and, considering pre-defined criteria (i.e., model), generating a second list (i.e., word vector) corresponding to a word present in the first list of words obtained (which represents the word and its position in the first list), consists of:
writing down the words into a structured list (i.e., embedding layer) of the pre-defined criteria (i.e., model);
defining the list of words as a classification or label (i.e., identifier);
defining the position information of the words using the structured list (i.e., embedding layer);
determining the position of paragraphs and whether a word is repeated in the two paragraphs, and writing down the position information into a third list;
relating the three lists to obtain a final list.
No additional limitations are present. 	

With respect to claims 9 and 18, the claims recite: 
wherein the preset processing layer comprises a plurality of cascaded preset processing layers; and the inputting the word vector into a preset processing layer of the to-be-trained natural language processing model, comprises:
inputting the word vector into a first preset processing layer of the plurality of cascaded preset processing layers.

The claims relate to organizing human activity. Specifically, a human could perform the following:
wherein the process of writing down the second list (i.e., preset processing layer; which is chosen from a plurality of other lists) and considering predefined criteria (i.e., model), consists of:
writing down the list into a structured/pre-defined list (i.e., preset layer) from a plurality of the structured/pre-defined lists.
No additional limitations are present. 	

With respect to claim 10, the claims recite: 
wherein, the preset processing layer comprises a plurality of processing units comprising the encoder and the decoder;
in the plurality of cascaded preset processing layers, a result of each processing unit of a previous preset processing layer is input into processing units of a posterior preset processing layer.

The claims relate to organizing human activity. Specifically, a human could perform the following:
wherein the process of writing down the second list (i.e., preset processing layer) consists of a plurality of pre-defined subsequent conversion tools (e.g., encoding/decoding);
wherein, for the second list (i.e., preset processing layer; which is chosen from a plurality of other lists), a result from a previous conversion tool (i.e., encoder/decoder) is used for a subsequent conversion tool (i.e., encoder/decoder).
Additional limitations (i.e., encoder and decoder) were discussed in independent claims, above. 	

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2, 9-12, and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Devlin et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” (https://arxiv.org/abs/1810.04805) and further in view of Vaswani et al. (Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017)). 

As to independent claim 1, Devlin et al. teaches:
1. A method for processing information (see § 3. BERT: Input/Output Representations: “To make BERT handle a variety of down-stream tasks, our input representation is able to unambiguously represent both a single sentence and a pair of sentences (e.g., ⟨Question, Answer⟩) in one token sequence…”), the method comprising:
acquiring a word sequence obtained by performing word segmentation on two paragraphs in a text (see § 3. BERT: Input/Output Representations citation as above and Fig. 2: “input: ‘[cls], my, dog, is, cute [sep] [i.e., first paragraph]; he, likes, play, ##ing, [sep] [i.e., second paragraph]’ ”; see also § 3.1 Pre-training BERT: Task #1: Masked LM: “… In order to train a deep bidirectional representation, we simply mask some percentage of the input tokens at random, and then predict those masked tokens. We refer to this procedure as a “masked LM” (MLM)”; A.1 Illustration of the Pre-training Tasks, Masked LM and the Masking Procedure: “Replace the word with the [MASK] token, e.g., my dog is hairy → my dog is [MASK]”), wherein the word sequence comprises at least one specified identifier for replacing a first word (see § 3.1 Pre-training BERT: Task #1: Masked LM and A.1 Illustration of the Pre-training Tasks, Masked LM and the Masking Procedure citations as above [i.e., [MASK] specified identifier]);
inputting the word sequence into a to-be-trained natural language processing model to generate a word vector corresponding to a second word in the word sequence (see § 3. BERT: Input/Output Representations: “Throughout this work, a “sentence” can be an arbitrary span of contiguous text, rather than an actual linguistic sentence. A “sequence” refers to the input token sequence to BERT, which may be a single sentence or two sentences packed together.” and Fig. 2: “input: ‘[cls], my, dog, is, cute [sep] [i.e., first paragraph]; he, likes, play, ##ing, [sep] [i.e., second paragraph]’ ”), wherein the word vector is used to represent the second word in the word sequence (see Fig. 2: “token embeddings: ‘E_[cls], E_my, E_dog, E_is, E_cute, E_[sep], E_he, E_likes, E_play, E_##ing, E_[sep]’ ”) and a position of the second word (see Fig. 2: “position embeddings: ‘E_0, E_1, E_2, E_3, E_4, E_5, E_6, E_7, E_8, E_9, E_10’ ”);
inputting the word vector into a preset processing layer of the to-be-trained natural language processing model (see § 3. BERT: Input/Output Representations and Figure 2 citations as in limitation above: “Figure 2: BERT input representation. The input embeddings are the sum of the token embeddings, the segmentation embeddings and the position embeddings.”), wherein the preset processing layer comprises an encoder and a decoder (see § 3. BERT: Input/Output Representations and Figure 2 citations as in limitation above and § 3. BERT: Model Architecture: “Because the use of Transformers has become common and our implementation is almost identical to the original, we will omit an exhaustive background description of the model architecture and refer readers to Vaswani et al. (2017)… BERT_BASE was chosen to have the same model size as OpenAI GPT for comparison purposes. Critically, however, the BERT Transformer uses bidirectional self-attention, while the GPT Transformer uses constrained self-attention where every token can only attend to context to its left. (Footnote 4: We note that in the literature the bidirectional Transformer is often referred to as a “Transformer encoder” while the left-context-only version is referred to as a “Transformer decoder” since it can be used for text generation.)”);
predicting whether the two paragraphs are adjacent, and the replaced first word in the two paragraphs, to obtain a prediction result, based on a processing result output by the preset processing layer (see § 3.1 Pre-training BERT: Task #1: Masked LM: “… In order to train a deep bidirectional representation, we simply mask some percentage of the input tokens at random, and then predict those masked tokens. We refer to this procedure as a “masked LM” (MLM)”; A.1 Illustration of the Pre-training Tasks, Masked LM and the Masking Procedure: “Replace the word with the [MASK] token, e.g., my dog is hairy → my dog is [MASK]”; § 3.1 Pre-training BERT: Task #2: Next Sentence Prediction (NSP): “…In order to train a model that understands sentence relationships, we pre-train for a binarized next sentence prediction task that can be trivially generated from any monolingual corpus. Specifically, when choosing the sentences A and B for each pretraining example, 50% of the time B is the actual next sentence that follows A (labeled as IsNext), and 50% of the time it is a random sentence from the corpus (labeled as NotNext)…”; and § A.1 Illustration of the Pre-training Tasks, Next Sentence Prediction: “Input = [CLS] the man went to [MASK] store [SEP]; he bought a gallon [MASK] milk [SEP]; Label = IsNext”); and
acquiring reference information of the two paragraphs, and training the to-be-trained natural language processing model to obtain a trained natural language processing model, based on the prediction result and the reference information, wherein the reference information comprises adjacent information indicating whether the two paragraphs are adjacent, and the replaced first word (see § A.2 Pre-training Procedure: “To generate each training input sequence, we sample two spans of text from the corpus, which we refer to as “sentences” even though they are typically much longer than single sentences (but can be shorter also). The first sentence receives the A embedding and the second receives the B embedding. 50% of the time B is the actual next sentence that follows A and 50% of the time it is a random sentence, which is done for the “next sentence prediction” task. They are sampled such that the combined length is ≤ 512 tokens. The LM masking is applied after WordPiece tokenization with a uniform masking rate of 15%, and no special consideration given to partial word pieces. We train with batch size of 256 sequences (256 sequences * 512 tokens = 128,000 tokens/batch) for 1,000,000 steps, which is approximately 40 … The training loss is the sum of the mean masked LM likelihood and the mean next sentence prediction likelihood [i.e., associated with input embeddings and hence the first/second word sequences/paragraphs]. Training of BERT_BASE was performed on 4 Cloud TPUs in Pod configuration (16 TPU chips total). Training of BERT_LARGE was performed on 16 Cloud TPUs (64 TPU chips total). Each pretraining took 4 days to complete.”).
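
For illustration only, the combined training objective quoted above from § A.2 of Devlin et al. (the sum of the mean masked LM term and the mean next sentence prediction term) can be sketched as follows. This is a sketch under stated assumptions: each "likelihood" term is read here as a negative log-likelihood (cross-entropy), the probability values are made-up toy numbers, and the names `cross_entropy` and `pretraining_loss` are hypothetical rather than drawn from the reference's code.

```python
import math


def cross_entropy(probabilities, target_index):
    """Negative log-likelihood of the reference label under the prediction."""
    return -math.log(probabilities[target_index])


def pretraining_loss(mlm_predictions, mlm_targets, nsp_predictions, nsp_targets):
    """Sum of the mean masked-word loss and the mean adjacency loss."""
    mlm = [cross_entropy(p, t) for p, t in zip(mlm_predictions, mlm_targets)]
    nsp = [cross_entropy(p, t) for p, t in zip(nsp_predictions, nsp_targets)]
    return sum(mlm) / len(mlm) + sum(nsp) / len(nsp)


# Toy values (hypothetical): two masked positions over a 3-word vocabulary,
# and one adjacency prediction over {IsNext, NotNext}.
mlm_predictions = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
mlm_targets = [0, 1]          # indices of the replaced words
nsp_predictions = [[0.9, 0.1]]
nsp_targets = [0]             # 0 = adjacent (IsNext)

loss = pretraining_loss(mlm_predictions, mlm_targets, nsp_predictions, nsp_targets)

# Arithmetic from the quoted passage: 256 * 512 is exactly 131,072 tokens,
# which the passage states as 128,000 tokens/batch.
tokens_per_batch = 256 * 512
```

In this reading, the prediction result corresponds to the probability vectors and the reference information corresponds to the target indices; the loss shrinks as the predicted probability of each reference label approaches 1.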

However, Devlin et al. does not explicitly explain that the preset processing layer comprises an encoder and a decoder. Instead, Devlin et al. discloses: “Because the use of Transformers has become common and our implementation is almost identical to the original, we will omit an exhaustive background description of the model architecture and refer readers to Vaswani et al. (2017)…” (see § 3. BERT: Model Architecture).
Therefore, in the interest of compact prosecution, Examiner considers Vaswani et al., which discloses (see 3. Model Architecture): “Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 29]. Here, the encoder maps an input sequence of symbol representations (x_1, ..., x_n) to a sequence of continuous representations z = (z_1, ..., z_n). Given z, the decoder then generates an output sequence (y_1, ..., y_m) of symbols one element at a time.”
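
For illustration only, the encoder-decoder structure described in the Vaswani et al. passage (an encoder mapping input symbols to continuous representations z, and a decoder emitting output symbols one element at a time) can be sketched as follows. This toy sketch is an assumption-laden illustration and is not the Transformer: there is no attention mechanism, and the lookup table and the names `encode`, `decode`, and `toy_step` are hypothetical.

```python
EOS = "<eos>"  # hypothetical end-of-sequence marker


def encode(symbols, table):
    """Map each input symbol x_i to a continuous representation z_i."""
    return [table[s] for s in symbols]


def decode(z, max_len, step):
    """Generate output symbols one at a time, given z and the outputs so far."""
    outputs = []
    for _ in range(max_len):
        y = step(z, outputs)
        if y == EOS:
            break
        outputs.append(y)
    return outputs


def toy_step(z, outputs):
    """Toy decoding rule: read the representation at the next position."""
    i = len(outputs)
    if i >= len(z):
        return EOS
    return "A" if z[i][0] > z[i][1] else "B"


table = {"a": [1.0, 0.0], "b": [0.0, 1.0]}  # toy stand-in for a learned encoder
z = encode(["a", "b", "a"], table)
y = decode(z, max_len=10, step=toy_step)
```

The point of the sketch is only the control flow: the encoder runs once over the whole input, while the decoder is called repeatedly and sees its own earlier outputs.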

Devlin et al. and Vaswani et al. are both considered to be analogous to the claimed invention because they are in the same field of endeavor, natural language processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Devlin et al. to incorporate the teachings of Vaswani et al., such that the preset processing layer comprises an encoder and a decoder, which provides best-performing models that connect the encoder and decoder through an attention mechanism (abstract of Vaswani et al.).

As to independent claim 11, Devlin et al. in combination with Vaswani et al. teach the limitations as in claim 1, above.
Devlin et al.  further teaches:
11. An electronic device, comprising:
one or more processors (see § 4 of 1. Introduction: “The contributions of our paper are as follows: … BERT advances the state of the art for eleven NLP tasks. The code and pre-trained models are available at https://github.com/google-research/bert. [i.e., code implemented in a computer (inherently comprising one or more processors and storage device(s) with programs executed by the processor(s))].”); and
a storage device configured to store one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors (see § 4 of 1. Introduction citation as in limitation above.) to perform operations comprising:
[the limitations of claim 1, above].

As to independent claim 20, Devlin et al. in combination with Vaswani et al. teach the limitations as in claim 1, above.
Devlin et al.  further teaches:
20. A non-transitory computer readable medium storing a computer program, wherein the computer program, when executed by a processor, causes the processor to perform operations (see § 4 of 1. Introduction citation as in claim 11 above. [i.e., code implemented in a computer (inherently comprising a non-transitory computer readable medium storing a computer program executed by processor(s))]) comprising:
[the limitations of claim 1, above].

Regarding claims 2 and 12, Devlin et al. in combination with Vaswani et al.  teach the limitations as in claims 1 and 11, above.
Devlin et al. further teaches:
2 and 12. The method according to claim 1, wherein the method further comprises:
acquiring first sample information, wherein the first sample information comprises a first paragraph word sequence obtained by performing word segmentation on a first target paragraph, and a first specified attribute (see § A.1 Illustration of the Pre-training Tasks, Next Sentence Prediction: “Input = [CLS] the man went to [MASK] store [SEP] [i.e., first sample information/paragraph]; he bought a gallon [MASK] milk [SEP] [i.e., first specified attribute]; Label = IsNext” and Fig. 2 wherein as previously discussed shows the segmentation of Input.);
inputting the first sample information into the trained natural language processing model to predict correlation information, wherein the correlation information is used to indicate a correlation between the first paragraph word sequence and the first specified attribute (see § B. Detailed Experimental Setup STS-2 and CoLA and § 3.1. Pre-training BERT: Task #2: Next Sentence Prediction (NSP): “Many important downstream tasks such as Question Answering (QA) and Natural Language Inference (NLI) are based on understanding the relationship [i.e., correlation] between two sentences [i.e., first sample information (e.g., first sentence) and specified attribute (e.g., second sentence)], which is not directly captured by language modeling. In order to train a model that understands sentence relationships, we pre-train for a binarized next sentence prediction task that can be trivially generated from any monolingual corpus.”); and
training the trained natural language processing model to obtain a first model, based on predicted correlation information and correlation information for labeling the first sample information (see § 3.1. Pre-training BERT: Task #2: Next Sentence Prediction (NSP) citation above and: “… 50% of the time B is the actual next sentence that follows A (labeled as IsNext), and 50% of the time it is a random sentence from the corpus (labeled as NotNext). As we show in Figure 1, C is used for next sentence prediction (NSP). (Footnote 5: The final model achieves 97%-98% accuracy on NSP.)” and § 3.2 Fine-tuning BERT: “For each task, we simply plug in the task specific inputs and outputs into BERT and finetune [i.e., training the trained model] all the parameters end-to-end.”).

Regarding claims 9 and 19, Devlin et al. in combination with Vaswani et al.  teach the limitations as in claims 1 and 11, above.
Devlin et al. further teaches:
9 and 19. The method according to claim 1, wherein the preset processing layer comprises a plurality of cascaded preset processing layers (see § 5.3 Feature-based Approach with BERT: “Results are presented in Table 7. BERT_LARGE performs competitively with state-of-the-art methods. The best performing method concatenates the token representations from the top four hidden layers of the pre-trained Transformer, which is only 0.3 F1 behind fine-tuning the entire model. This demonstrates that BERT is effective for both finetuning and feature-based approaches.” and § C.2 Ablation for Different Masking Procedures: “For the feature-based approach, we concatenate the last 4 layers of BERT as the features, which was shown to be the best approach in Section 5.3.”); and
the inputting the word vector into a preset processing layer of the to-be-trained natural language processing model (see Fig. 1 (yellow blocks, layer of token embeddings) and Fig. 2 and § 3. BERT Model Architecture citation as in claims 1 and 11 above.), comprises:
inputting the word vector into a first preset processing layer of the plurality of cascaded preset processing layers (see Fig. 1 (yellow blocks, layer of token embeddings) and Fig. 2 and § 3. BERT Model Architecture citation as in claims 1 and 11 above.).
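
For illustration only, the cascaded arrangement recited in claims 9 and 19 (the word vector enters a first layer, and each layer's result feeds the next) can be sketched as below. This is a sketch under stated assumptions: each "layer" is a toy element-wise transform rather than a Transformer block, and the names `make_layer` and `run_cascade` are hypothetical.

```python
def make_layer(scale, shift):
    """Build a toy layer that applies scale * v + shift to every vector element."""
    def layer(vectors):
        return [[scale * v + shift for v in vec] for vec in vectors]
    return layer


def run_cascade(layers, word_vectors):
    """Feed word vectors into the first layer, then chain each output onward."""
    hidden = word_vectors
    per_layer_outputs = []
    for layer in layers:
        hidden = layer(hidden)          # output of the previous layer is the input here
        per_layer_outputs.append(hidden)
    return per_layer_outputs


layers = [make_layer(2.0, 0.0), make_layer(1.0, 1.0), make_layer(0.5, 0.0)]
outputs = run_cascade(layers, [[1.0, 2.0], [3.0, 4.0]])
```

Keeping every layer's output, as `run_cascade` does, also makes it straightforward to combine the last several layers, as in the feature-based approach quoted from § 5.3.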

Regarding claim 10, Devlin et al. in combination with Vaswani et al.  teach the limitations as in claim 1, above.
Devlin et al. further teaches:
10. The method according to claim 9, wherein, the preset processing layer comprises a plurality of processing units comprising the encoder and the decoder (see § 3. BERT: Model Architecture: “In this work, we denote the number of layers (i.e., Transformer blocks) [i.e., layers/transformer blocks: plurality of processing units] as L, the hidden size as H, and the number of self-attention heads as A… BERT Transformer uses bidirectional self-attention, while the GPT Transformer uses constrained self-attention where every token can only attend to context to its left. (Footnote 4: We note that in the literature the bidirectional Transformer is often referred to as a “Transformer encoder” while the left-context-only version is referred to as a “Transformer decoder” since it can be used for text generation.)”);
in the plurality of cascaded preset processing layers, a result of each processing unit of a previous preset processing layer is input into processing units of a posterior preset processing layer (see § 3. BERT: Model Architecture citation as in limitation above and § 3.2 Fine-tuning BERT: “Fine-tuning is straightforward since the self-attention mechanism in the Transformer [i.e., plurality of cascaded preset processing layers: associated with Transformer units of BERT] allows BERT to model many downstream tasks [i.e., associated with previous and/or posterior processing layers]— whether they involve single text or text pairs—by swapping out the appropriate inputs and outputs. For applications involving text pairs, a common pattern is to independently encode text pairs before applying bidirectional cross attention, such as Parikh et al. (2016); Seo et al. (2017). BERT instead uses the self-attention mechanism to unify these two stages, as encoding a concatenated text pair with self-attention effectively includes bidirectional cross attention between two sentences […] At the input, sentence A and sentence B from pre-training are analogous to (1) sentence pairs in paraphrasing, (2) hypothesis-premise pairs in entailment, (3) question-passage pairs in question answering, and (4) a degenerate text-∅ pair in text classification or sequence tagging. At the output, the token representations are fed into an output layer for token level tasks, such as sequence tagging or question answering, and the [CLS] representation is fed into an output layer for classification, such as entailment or sentiment analysis.”).

However, Devlin et al. does not explicitly explain that the preset processing layer comprises an encoder and a decoder. Instead, Devlin et al. discloses: “Because the use of Transformers has become common and our implementation is almost identical to the original, we will omit an exhaustive background description of the model architecture and refer readers to Vaswani et al. (2017)…” (see § 3. BERT: Model Architecture).
Therefore, in the interest of compact prosecution, Examiner considers Vaswani et al., which discloses (see 3. Model Architecture): “Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 29]. Here, the encoder maps an input sequence of symbol representations (x_1, ..., x_n) to a sequence of continuous representations z = (z_1, ..., z_n). Given z, the decoder then generates an output sequence (y_1, ..., y_m) of symbols one element at a time.”

Devlin et al. and Vaswani et al. are both considered to be analogous to the claimed invention because they are in the same field of endeavor, natural language processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Devlin et al. to incorporate the teachings of Vaswani et al., such that the preset processing layer comprises an encoder and a decoder, which provides best-performing models that connect the encoder and decoder through an attention mechanism (abstract of Vaswani et al.).

Claims 3-4 and 13-14 are rejected under 35 U.S.C. 103 as being unpatentable over Devlin et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” (https://arxiv.org/abs/1810.04805) further in view of Vaswani et al. (Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017)), as applied to claims 1 and 11 above, and further in view of Jung et al. (JP 2017076281 A).

Regarding claims 3 and 13, Devlin et al. in combination with Vaswani et al. teaches all of the limitations as in claims 1 and 11, above.
3 and 13. The method according to claim 2, wherein the method further comprises:
acquiring second sample information, wherein the second sample information comprises a second paragraph word sequence obtained by performing word segmentation on a second target paragraph, and a second specified attribute (see Fig. 4(c) “Question” [i.e., second sample information/second paragraph word sequence (i.e., “Tok 1”… “Tok N” → segmentation)] and “Paragraph” [i.e., specified attribute (i.e., “Tok 1”… “Tok M” → segmentation)].),

However, Devlin et al. in combination with Vaswani et al. does not explicitly teach, but Jung et al. does teach:
wherein an attribute matching the second specified attribute is comprised in the second paragraph word sequence, and the attribute matching the second specified attribute completely matches or partially matches the second specified attribute (see ¶ 2 of <Configuration of Text Evaluation Device According to Embodiment of the Present Invention> (page 3): “The input unit 210 receives text to be evaluated. For example, text representing a question such as “b for a is something for c” [i.e., first sample information (e.g., sentence A)] is accepted. The types of questions are: “China for Beijing is something for London” [i.e., second sample information (e.g., sentence B)] (capital), or “something for dancing, dancing, flying” This includes syntactic things such as tense.” and ¶1 of <Word similarity evaluation> (page 4): “First, word embedding is evaluated with respect to standard word similarity measures, and it is examined whether these evaluation measures can be improved by focusing on the text hierarchy. The model is trained using the Wikipedia® 2014 data set. A hierarchical softmax function is used for word prediction. Set the window size to 11.  Standard ontology rating scales including Tofel-353, MC, RG, SCWS and RW are used. Each data set is manually assigned a word pair and a similarity score between them as a correct answer. For example, “book, paper, 7.46” indicates that the similarity score between (book, paper) is 7.46. Typically, the similarity score [i.e., matching of attributes] between word embeddings is calculated using the cosine similarity. Then, a Spearman rank correlation coefficient between this score and human judgment is obtained.”);
inputting the second sample information into the trained natural language processing model, (see ¶ 1 of <Outline of Embodiment of the Present Invention>: “The purpose of this embodiment is to improve the quality of word embedding. The inventors have developed a hierarchical neural network model to model the relationship between different levels of text units. This model allows the interaction between text units to be encoded in learned word embeddings.”; ¶ 1-2 of <Configuration of word embedding learning device according to embodiment of the present invention> (page 2): “Functionally, the word embedding learning device 100 includes an input unit 10 [i.e., associated with the text received], a calculation unit 20 [i.e., associated with the model], and an output unit 90 as shown in FIG. The input unit 10 receives a plurality of learning texts and stores the plurality of texts in the text 22.”; ¶ 5 of <Optimum condition of this embodiment> (page 1-2): “As shown here, even if two words are not in the same sentence, the effect propagates to the embedding of a sentence containing one word, the embedding of a paragraph, and the embedding of a document [i.e., associated with prediction], and the hierarchy goes down in the opposite direction. Interacting remotely such that the influence propagates to the other word. Therefore, the model according to the present embodiment can also enjoy the advantages of configuring a local language model by a neural network while taking into consideration global level statistics to some extent.”; ¶ 3 of <Configuration of text evaluation device according to embodiment of the present invention> (page 3): “The calculation unit 220 includes a word vector conversion unit 222, a search unit 224, and a word embedding 226.”;  ¶ 2 of <Configuration of Text Evaluation Device According to Embodiment of the Present Invention> (page 3); and ¶1 of <Word similarity evaluation> (page 4) citations as in limitations above.) 
and predicting an attribute value of the second specified attribute in the second paragraph word sequence (see ¶ 1-2 of <Word similarity evaluation>: “First, word embedding is evaluated with respect to standard word similarity measures, and it is examined whether these evaluation measures can be improved by focusing on the text hierarchy. The model is trained using the Wikipedia® 2014 data set. A hierarchical softmax function is used for word prediction [i.e., prediction of an attribute value]. Set the window size to 11. Standard ontology rating scales including Tofel-353, MC, RG, SCWS and RW are used. Each data set is manually assigned a word pair and a similarity score between them as a correct answer. For example, “book, paper, 7.46” indicates that the similarity score between (book, paper) is 7.46. Typically, the similarity score between word embeddings is calculated using the cosine similarity. Then, a Spearman rank correlation coefficient between this score and human judgment is obtained.”); and
training the trained natural language processing model to obtain a second model, based on the predicted attribute value and an attribute value for labeling the attribute matching the second specified attribute (see ¶ 5-6 of <Optimum condition of this embodiment> (page 2): “As shown here, even if two words are not in the same sentence, the effect propagates to the embedding of a sentence containing one word, the embedding of a paragraph, and the embedding of a document, and the hierarchy goes down in the opposite direction. Interacting remotely such that the influence propagates to the other word. Therefore, the model [i.e., trained model] according to the present embodiment can also enjoy the advantages of configuring a local language model by a neural network [i.e., configuration by neural network associated with training the trained model] while taking into consideration global level statistics to some extent. In addition, based on Markov characteristics along different levels of the tree structure, the meaning of the adjacent text units at a high level such as paragraphs and sentence strings interacts while maintaining the consistency of meanings at each level. A lower level meaning can be a better representation. Such benefits are further propagated to word-level prediction, leading to improved word-level embedding [i.e., associated second model based on predicted attribute value (i.e., word prediction)].”).
Devlin et al. in combination with Vaswani et al. and Jung et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in natural language processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Devlin et al. in combination with Vaswani et al. to incorporate the teachings of Jung et al. of wherein an attribute matching the second specified attribute is comprised in the second paragraph word sequence, and the attribute matching the second specified attribute completely matches or partially matches the second specified attribute; inputting the second sample information into the trained natural language processing model, and predicting an attribute value of the second specified attribute in the second paragraph word sequence; and training the trained natural language processing model to obtain a second model, based on the predicted attribute value and an attribute value for labeling the attribute matching the second specified attribute, which provides the benefit of improving the quality of an embedded word to be learned (abstract of Jung et al.).
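
For illustration only, the similarity evaluation quoted from Jung et al. above (cosine similarity between word embeddings, then a Spearman rank correlation against human scores) can be sketched as follows. This is a minimal sketch and not the implementation of any cited reference. The embedding vectors and all human scores other than the quoted “book, paper, 7.46” example are hypothetical values chosen for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman coefficient: Pearson correlation of the two rank lists."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (std_x * std_y)


# Hypothetical 3-dimensional embeddings (not taken from any reference).
embeddings = {
    "book": [0.9, 0.1, 0.2],
    "paper": [0.8, 0.3, 0.1],
    "car": [0.1, 0.9, 0.3],
    "banana": [0.0, 0.2, 0.9],
}
# Only the (book, paper, 7.46) triple is quoted above; the rest are made up.
human_scores = [("book", "paper", 7.46), ("book", "car", 2.0), ("book", "banana", 1.0)]

model_scores = [cosine_similarity(embeddings[a], embeddings[b]) for a, b, _ in human_scores]
gold_scores = [score for _, _, score in human_scores]
print(spearman(model_scores, gold_scores))
```

A coefficient close to 1 would indicate that the ordering of the model's similarity scores agrees with the ordering of the human judgments.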

Regarding claims 4 and 14, Devlin et al. in combination with Vaswani et al. and Jung et al. teach all of the limitations as in claims 3 and 13, above.
Devlin et al. further teaches:
predicting position information, wherein the position information comprises start position information and end position information (see Fig. 4(c) “Question” and “Paragraph” [i.e., second sample information/second paragraph word sequence (i.e., “Tok 1”… “Tok N” → segmentation)] “E1”… “EM” [i.e., specified attribute] and § 4.2 SQuAD v1.1, citations as in claim 3, above: “… The training objective is the sum of the log-likelihoods of the correct start and end positions.”).

Jung et al. further teaches:
4 and 14. The method according to claim 3, wherein the predicting an attribute value of the second specified attribute in the second paragraph word sequence (see ¶ 2 of <Configuration of Text Evaluation Device According to Embodiment of the Present Invention> (page 3): “The input unit 210 receives text to be evaluated. For example, text representing a question such as “b for a is something for c” [i.e., first sample information (e.g., sentence A)] is accepted. The types of questions are: “China for Beijing is something for London” [i.e., second sample information (e.g., sentence B); word prediction for b and c (start and end positions)] …”), comprises:
predicting position information of the attribute value of the second specified attribute in the second paragraph word sequence, wherein the position information comprises start position information and end position information (see ¶ 2 of <Configuration of Text Evaluation Device According to Embodiment of the Present Invention> (page 3): “The input unit 210 receives text to be evaluated. For example, text representing a question such as “b for a is something for c” [i.e., first sample information (e.g., sentence A)] is accepted. The types of questions are: “China for Beijing is something for London” [i.e., second sample information (e.g., sentence B), prediction of word in end position “c”] …”). Although Jung et al. does not explicitly mention the start or end positions and is cited here only by way of example, it is noted, in the interest of compact prosecution, that the primary reference, Devlin et al., teaches the start and end position information, as shown above.

Devlin et al. in combination with Vaswani et al. and Jung et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in natural language processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Devlin et al. in combination with Vaswani et al. to incorporate the teachings of Jung et al. of wherein the predicting an attribute value of the second specified attribute in the second paragraph word sequence comprises: predicting position information of the attribute value of the second specified attribute in the second paragraph word sequence, wherein the position information comprises start position information and end position information, which provides the benefit of improving the quality of an embedded word to be learned (abstract of Jung et al.).
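
For illustration only, the start/end position prediction relied upon from Devlin et al. § 4.2 can be sketched as follows. The training objective quoted above is the sum of the log-likelihoods of the correct start and end positions; the span selection rule shown (highest combined score with the end not before the start) follows the same section of Devlin et al. This is a minimal sketch, not the reference implementation: the per-token scores are hypothetical inputs, and the function names are chosen for this example.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of per-token scores."""
    peak = max(scores)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - log_norm for s in scores]


def span_log_likelihood(start_scores, end_scores, gold_start, gold_end):
    """Sum of the log-likelihoods of the correct start and end positions."""
    return log_softmax(start_scores)[gold_start] + log_softmax(end_scores)[gold_end]


def best_span(start_scores, end_scores):
    """Return (start, end) maximizing start score + end score with end >= start."""
    best = None
    for i, start_score in enumerate(start_scores):
        for j in range(i, len(end_scores)):
            score = start_score + end_scores[j]
            if best is None or score > best[0]:
                best = (score, i, j)
    return best[1], best[2]


# Hypothetical per-token scores for a four-token paragraph.
start_scores = [0.1, 2.0, 0.3, 0.0]
end_scores = [0.0, 0.2, 1.5, 3.0]

print(best_span(start_scores, end_scores))
print(-span_log_likelihood(start_scores, end_scores, gold_start=1, gold_end=3))
```

The second printed value is the negative log-likelihood that training would minimize for a labeled answer spanning tokens 1 through 3.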

Claims 8 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Devlin et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” (https://arxiv.org/abs/1810.04805) further in view of Vaswani et al. (Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017)), as applied to claims 1 and 11 above, and further in view of Chen et al. (US 20220215177 A1).

Regarding claims 8 and 17, Devlin et al. in combination with Vaswani et al. teach the limitations as in claims 1 and 11, above.
Devlin et al. further teaches:
8 and 17. The method according to claim 1, wherein the inputting the word sequence into a to-be-trained natural language processing model to generate a word vector corresponding to a second word in the word sequence (see § 3. BERT: Input / Output Representations citation as in claims 1 and 11 above.), comprises:
inputting the word sequence into an embedding layer of the to-be-trained natural language processing model (see Fig. 1 (yellow blocks, layer of token embeddings) and Fig. 2 and § 3. BERT Input/Output Representations: “Sentence pairs are packed together into a single sequence. We differentiate the sentences in two ways. First, we separate them with a special token ([SEP]). Second, we add a learned embedding to every token indicating whether it belongs to sentence A or sentence B. As shown in Figure 1, we denote input embedding as E, the final hidden vector of the special [CLS] token as C ∈ ℝ^H, and the final hidden vector for the i-th input token as T_i ∈ ℝ^H.” and § 3. BERT Model Architecture: “Model Architecture BERT’s model architecture is a multi-layer [i.e., associated with embedding layer(s)] bidirectional Transformer encoder based on the original implementation”);
converting, for the second word in the word sequence, the second word into an identifier of the second word through the embedding layer, and converting the identifier of the second word into a first vector (see Fig. 1-2 “[SEP] [i.e., identifier] is a special separator token (e.g. separating questions/answers)” and (red blocks, input embeddings [i.e., associated with first vector]));
converting position information of the second word in the word sequence into a second vector through the embedding layer (see Fig. 2 (green blocks, segment embeddings [i.e., EA (first sentence/paragraph, e.g., question) or EB (e.g., second sentence/paragraph, e.g., answer)]));
determining paragraph position information indicating a paragraph in which the second word is located in the two paragraphs through the embedding layer, and converting the paragraph position information into a third vector (see Fig. 2 (gray blocks, position embeddings));

However, Devlin et al. in combination with Vaswani et al. does not explicitly teach, but Chen et al. does teach:
splicing the first vector, the second vector, and the third vector to obtain a word vector corresponding to the second word (see ¶ [0078]: “According to an embodiment of the present disclosure, the method first inputs a splicing of the default start symbol, the vector representation of the historical dialogue, and the knowledge vector to the third recurrent neural network, and updates a state of a hidden layer of the third recurrent neural network as a word vector of a first word of the reply sentence; then inputs a splicing of the word vector of the first word of the reply sentence, the vector representation of the historical dialogue, and the knowledge vector to the third recurrent neural network, and updates the state of the hidden layer of the third recurrent neural network as a word vector of a second word of the reply sentence; . . . and so on, until all word vectors in the reply sentence are generated, and the word vectors are converted into natural language to generate the reply sentence.”).

Devlin et al. in combination with Vaswani et al. and Chen et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in natural language/sentence processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Devlin et al. in combination with Vaswani et al. to incorporate the teachings of Chen et al. of splicing the first vector, the second vector, and the third vector to obtain a word vector corresponding to the second word, which provides the benefit of better reply sentences being generated, avoiding meaningless replies ([0051] of Chen et al.).
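
For illustration only, the distinction underlying the reliance on Chen et al. can be sketched as follows. Devlin et al. § 3 constructs the input representation by summing the token, segment, and position embeddings, whereas the claimed limitation recites splicing three vectors. The sketch assumes that “splicing” means concatenation, which is an interpretation made for this example and is not stated in the passages quoted above. The vectors and function names are hypothetical.

```python
def splice(first, second, third):
    """Concatenate three vectors; the result has the combined length."""
    return list(first) + list(second) + list(third)


def elementwise_sum(first, second, third):
    """Sum three equal-length vectors; the result keeps the original length."""
    if not (len(first) == len(second) == len(third)):
        raise ValueError("vectors must have the same length")
    return [a + b + c for a, b, c in zip(first, second, third)]


# Hypothetical 2-dimensional vectors for one word.
first_vector = [1.0, 2.0]   # from the word identifier
second_vector = [3.0, 4.0]  # from the word position
third_vector = [5.0, 6.0]   # from the paragraph position

print(splice(first_vector, second_vector, third_vector))
print(elementwise_sum(first_vector, second_vector, third_vector))
```

Under this assumption the spliced word vector has three times the dimension of each input vector, while the summed representation keeps the original dimension.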

Allowable Subject Matter
Claims 5-7 and 15-16 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
The following is a statement of reasons for the indication of allowable subject matter:  
Devlin et al. in combination with Vaswani et al. and Jung et al. teach all of the limitations as in claims 3 and 13, above.
However, the closest prior art of record, Devlin et al. in combination with Vaswani et al. and Jung et al., fails to teach:
The method according to claim 3/13, wherein the method further comprises:
acquiring a text word sequence obtained by performing word segmentation on a target text, and dividing the text word sequence into a plurality of paragraph word sequences;
determining paragraph word sequences related to a target attribute from the plurality of paragraph word sequences;
inputting the target attribute and the paragraph word sequences related to the target attribute into the first model, and
predicting correlation information between the target attribute and each of the paragraph word sequences related to the target attribute, wherein the correlation information comprises a correlation value;
selecting a preset number of paragraph word sequence from the plurality of paragraph word sequences related to the target attribute in a descending order of the correlation values; 
inputting the target attribute and the preset number of paragraph word sequence into the second model, and predicting an attribute value of the target attribute and a confidence level of the attribute value of the target attribute; and
determining an attribute value of the target attribute from the predicted attribute value of the target attribute, based on the correlation value and the confidence level.
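
For illustration only, the selection and determination steps recited above can be sketched as follows. The limitations do not specify how the correlation value and the confidence level are combined, so the product used here is purely an assumption for this example, as are the function names and the sample values. The sketch is not a statement of how the application implements these steps.

```python
def select_top_paragraphs(correlations, preset_number):
    """Keep the preset number of paragraphs, in descending order of correlation value.

    correlations maps a paragraph identifier to its correlation value.
    """
    ranked = sorted(correlations.items(), key=lambda item: item[1], reverse=True)
    return ranked[:preset_number]


def determine_attribute_value(candidates):
    """Pick one predicted attribute value from (value, correlation, confidence) triples.

    Assumption: candidates are scored by correlation * confidence.
    """
    return max(candidates, key=lambda c: c[1] * c[2])[0]


# Hypothetical correlation values, as a first model might output them.
correlations = {"p1": 0.2, "p2": 0.9, "p3": 0.5}
selected = select_top_paragraphs(correlations, preset_number=2)

# Hypothetical predictions, as a second model might output them for the selected paragraphs.
predictions = {"p2": ("value A", 0.4), "p3": ("value B", 0.8)}
candidates = [(predictions[pid][0], corr, predictions[pid][1]) for pid, corr in selected]

print(selected)
print(determine_attribute_value(candidates))
```

Other combinations, such as a weighted sum or a threshold on either quantity, would be equally consistent with the recited language.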

Claims 6-7 and 16, as dependent claims of claim 5, would be allowable if claim 5 is rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Keisha Y Castillo-Torres whose telephone number is (571)272-3975. The examiner can normally be reached Monday - Friday, 9:00 am - 4:00 pm (EST).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir, can be reached on (571)272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

Keisha Y. Castillo-Torres
Examiner
Art Unit 2659



/Keisha Y. Castillo-Torres/Examiner, Art Unit 2659     

/PIERRE LOUIS DESIR/Supervisory Patent Examiner, Art Unit 2659