DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Introduction
This office action is in response to communications filed on 12/23/2020. Claims 1-20 are pending, and likewise, Claims 1-20 have been examined.

Information Disclosure Statement
The information disclosure statement (IDS) submitted on 09/28/2021 is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(d):
(d) REFERENCE IN DEPENDENT FORMS.—Subject to subsection (e), a claim in dependent form shall contain a reference to a claim previously set forth and then specify a further limitation of the subject matter claimed. A claim in dependent form shall be construed to incorporate by reference all the limitations of the claim to which it refers.

The following is a quotation of pre-AIA  35 U.S.C. 112, fourth paragraph:
Subject to the following paragraph [i.e., the fifth paragraph of pre-AIA  35 U.S.C. 112], a claim in dependent form shall contain a reference to a claim previously set forth and then specify a further limitation of the subject matter claimed. A claim in dependent form shall be construed to incorporate by reference all the limitations of the claim to which it refers.

Claim 19 is rejected under 35 U.S.C. 112(d) or pre-AIA  35 U.S.C. 112, 4th paragraph, as being of improper dependent form for failing to further limit the subject matter of the claim upon which it depends, or for failing to include all the limitations of the claim upon which it depends.
Claim 19 depends on Claim 18. Claim 18 contains all the limitations of Claim 19, word for word. Claim 19 fails to limit the subject matter of the claim upon which it depends.
Applicant may cancel the claim(s), amend the claim(s) to place the claim(s) in proper dependent form, rewrite the claim(s) in independent form, or present a sufficient showing that the dependent claim(s) complies with the statutory requirements.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1, 10-13 and 20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.

	Independent Claims 1, 13 and 20 recite “obtaining a first data set”, “wherein the first data set comprises a plurality of pieces of first sample data;”, “extracting structured information from each of the plurality of pieces of first sample data as target structured information corresponding to each of the plurality of pieces of first sample data;”, “inputting the plurality of pieces of first sample data into an initial text generation model to generate predicted structured information corresponding to each of the plurality of pieces of first sample data;”, “generating a first loss value based on a difference between the predicted structured information corresponding to each of the plurality of pieces of first sample data and the corresponding target structured information;”, “and training a phrase generation ability of the initial text generation model based on the first loss value to generate the text generation model”.
	Claim 1 also recites “A method for training a text generation model, comprising”.
	Claim 13 also recites “An electronic device, comprising: at least one processor; and a storage device connected in communication with the at least one processor; wherein, the storage device stores instructions executable by the at least one processor, and when the instructions are executed by the at least one processor, the at least one processor is caused to implement”.
	Claim 20 also recites “A non-transitory computer-readable storage medium storing computer instructions, wherein when the computer instructions are executed, a computer is caused to implement a method for obtaining a document layout, the method comprising”.
	The limitations “A method for training a text generation model, comprising”, “obtaining a first data set”, “wherein the first data set comprises a plurality of pieces of first sample data;”, “extracting structured information from each of the plurality of pieces of first sample data as target structured information corresponding to each of the plurality of pieces of first sample data;”, “inputting the plurality of pieces of first sample data into an initial text generation model to generate predicted structured information corresponding to each of the plurality of pieces of first sample data;”, “generating a first loss value based on a difference between the predicted structured information corresponding to each of the plurality of pieces of first sample data and the corresponding target structured information;”, “and training a phrase generation ability of the initial text generation model based on the first loss value to generate the text generation model”, as drafted, cover a mental process, as they could be done mentally or by hand with pen and paper.
	This judicial exception is not integrated into a practical application. Claim 13 also recites “An electronic device, comprising: at least one processor; and a storage device connected in communication with the at least one processor; wherein, the storage device stores instructions executable by the at least one processor, and when the instructions are executed by the at least one processor, the at least one processor is caused to implement”. Claim 20 also recites “A non-transitory computer-readable storage medium storing computer instructions, wherein when the computer instructions are executed, a computer is caused to implement a method for obtaining a document layout, the method comprising”. All of these limitations are directed towards using a computer to perform the method, and do not impose any meaningful limits on practicing the abstract idea. Claims 1, 13 and 20 do not contain any additional limitations.
	The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. The addition of the generic computer components recited above with regard to Claims 13 and 20 does not amount to more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. Claims 1, 13 and 20 do not contain any additional limitations. The claims, as drafted, are not patent eligible.

	Dependent Claim 10 recites the additional limitations “obtaining a target scene; obtaining a second data set based on the target scene, wherein the second data set is supervised;”, “and adjusting parameters of the text generation model based on the second data set to generate a text generation model corresponding to the target scene”. These limitations cover mental processes, as they could be done mentally or by hand with pen and paper.
These judicial exceptions are not integrated into a practical application, as there are no additional limitations.
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception, as there are no additional limitations. The claim as drafted is not patent eligible.

Dependent Claim 11 recites the additional limitations of “wherein the second data set comprises a plurality of pieces of second sample data, each of the plurality of pieces of second sample data comprises source data and labeled data corresponding to the source data”, “adjusting the parameters of the text generation model based on the second data set to generate the text generation model corresponding to the target scene comprises: segmenting the labeled data corresponding to each source data to generate a labeled segment sequence corresponding to each source data;”, “inputting source data of the plurality of pieces of second sample data into the text generation model to generate a predicted segment sequence corresponding to each source data;”, “generating a third loss value based on a difference between the predicted segment sequence and the labeled segment sequence;”, “and adjusting the parameters of the text generation model based on the third loss value to generate the text generation model corresponding to the target scene”. These limitations cover mental processes, as they could be done mentally or by hand with pen and paper.
These judicial exceptions are not integrated into a practical application, as there are no additional limitations.
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception, as there are no additional limitations. The claim as drafted is not patent eligible.

Dependent Claim 12 recites the additional limitations of “wherein the target scene comprises one or more combinations of a dialogue generation scene, a machine translation scene, a question and answer scene, and a summary generation scene”. These limitations cover mental processes, as they could be done mentally or by hand with pen and paper.
These judicial exceptions are not integrated into a practical application, as there are no additional limitations.
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception, as there are no additional limitations. The claim as drafted is not patent eligible.

Claims 2 and 14, and likewise dependent Claims 3-9 and 15-19, are not rejected under 35 U.S.C. 101, because they recite processing using encoder and decoder models, which provides sufficient structure that computer implementation is necessary for practical performance.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1 is/are rejected under 35 U.S.C. 103 as being unpatentable over Sun et al. “ERNIE: Enhanced Representation through Knowledge Integration” hereinafter Sun, and further in view of Devlin et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” hereinafter Devlin.

Regarding Claim 1:
Sun teaches a method for training a text generation model, comprising(Abstract, Ln 6-8, ERNIE is designed to learn language representation enhanced by knowledge masking strategies): 
obtaining a first data set, wherein the first data set comprises a plurality of pieces of first sample data(Pg 4, 4.1 Heterogeneous, Para 1, Ln 3-4, we draw the mixed corpus Chinese Wikepedia, Baidu Baike, Baidu news and Baidu Tieba);
extracting structured information from each of the plurality of pieces of first sample data as target structured information corresponding to each of the plurality of pieces of first sample data(Pg 3, 3.2.2 Phrase-Level Masking, Para 1, Ln 4-9, For English…lexical analysis and chunking tools to get the boundary of phrases in the sentences, and.…segmentation tools to get the word/phrase information in….Chinese. Examiner points out that with regard to the “plurality of pieces of first sample data” limitation, this process is done for more than just one training example, as the data sets cited above point out. This note will not be repeated in future citations); 
inputting the plurality of pieces of first sample data into an initial text generation model to generate predicted structured information corresponding to each of the plurality of pieces of first sample data(Pg 3, 3.2.2 Phrase-Level Masking, Para 1, Ln 10-12, randomly select a few phrases in the sentence, mask and predict all the basic units in the same phrase. Pg 3, Fig 1, shows model. Previous citation to data sets shows this is happening to many individual data samples); 
and training a phrase generation ability of the initial text generation model(Pg 3, 3.2.2 Phrase-Level Masking, Para 1, Ln 8-12, basic language units as training input..… mask and predict all the basic units in the same phrase).
Sun does not teach generating a first loss value based on a difference between the predicted structured information corresponding to each of the plurality of pieces of first sample data and the corresponding target structured information; and training a phrase generation ability of the initial text generation model based on the first loss value to generate the text generation model.
In the same field of Masked Language Models, Devlin teaches generating a first loss value based on a difference between the predicted structured information corresponding to each of the plurality of pieces of first sample data and the corresponding target structured information(Pg 4174, 3.1 Pre-training BERT, Para 4, Ln 12-14, Then, Ti will be used to predict the original token with cross entropy loss. Cross entropy loss is measuring the difference); 
and training a phrase generation ability of the initial text generation model based on the first loss value to generate the text generation model(Pg 4174, 3.1 Pre-training BERT, Para 4, Ln 7-14, The training data generator ……Then, Ti will be used to predict the original token with cross entropy loss. Phrases rather than words is taught by Sun).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify Sun with the Language Model of Devlin, as it is conceptually simple, which would improve ease of implementation(Abstract, Para 2, Ln 1).
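
For illustration only, the mapping applied to Claim 1 (phrase-level masking as in Sun, scored with a cross entropy loss as in Devlin) may be pictured with the following minimal sketch. The function names, the example sentence, and the predicted probabilities are hypothetical and are not taken from either reference; a real model would produce the probabilities from a trained network.

```python
import math

MASK = "[MASK]"

def mask_phrase(tokens, start, end):
    """Mask tokens[start:end]; return the masked sequence and the targets by position."""
    targets = {i: tokens[i] for i in range(start, end)}
    masked = [MASK if start <= i < end else tok for i, tok in enumerate(tokens)]
    return masked, targets

def cross_entropy(probs, target):
    """Negative log-probability that the prediction assigns to the target token."""
    return -math.log(probs[target])

def masked_loss(predictions, targets):
    """Mean cross entropy over the masked positions only."""
    losses = [cross_entropy(predictions[i], tok) for i, tok in targets.items()]
    return sum(losses) / len(losses)

# Hypothetical sample: the phrase "a series of" is masked as one unit.
tokens = ["a", "series", "of", "fantasy", "novels"]
masked, targets = mask_phrase(tokens, 0, 3)

# Hypothetical model output: a probability distribution per masked position.
predictions = {
    0: {"a": 0.5, "the": 0.5},
    1: {"series": 0.25, "set": 0.75},
    2: {"of": 1.0},
}

# (ln 2 + ln 4 + 0) / 3 = ln 2
loss = masked_loss(predictions, targets)
```

The loss is zero only when every masked unit of the phrase is predicted with probability 1, which is the sense in which it measures a difference between the predicted and target structured information.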

Claim(s) 2-8 is/are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Sun and Devlin, as applied to Claim 1 above, and further in view of Vaswani et al. “Attention Is All You Need”, hereinafter Vaswani.

Regarding Claim 2: 
The combination of Sun and Devlin teaches the method of claim 1, and Sun teaches wherein, the initial text generation model comprises an initial encoder(Pg 3, 3.1 Transformer Encoder, Ln 1-2, ERNIE use multi-layer Transformer (Vaswani et al., 2017) as basic encoder),
generate a plurality of predicted segments; and generating the predicted structured information corresponding to each of the plurality of pieces of first sample data based on the plurality of predicted segments(Pg 3, Fig 3, shows phrases being generated based on the predicted segments that make them up).
The combination of Sun and Devlin does not teach wherein, the initial text generation model comprises an initial encoder and an initial decoder, inputting the plurality of pieces of first sample data into the initial text generation model to generate predicted structured information corresponding to each of the plurality of pieces of first sample data, comprises: inputting each of the plurality of pieces of first sample data into the initial encoder to generate a group of vector representations corresponding to each of the plurality of pieces of first sample data; inputting the group of vector representations corresponding to each of the plurality of pieces of first sample data into the initial decoder to generate a plurality of predicted segments.
In the same field of Machine Learning Language Models, Vaswani teaches wherein, the initial text generation model comprises an initial encoder and an initial decoder(Pg 2, 3 Model Architecture, Para 1, Ln 1, encoder-decoder structure, Para 2, Ln 1, The Transformer follows this overall architecture),
inputting the plurality of pieces of first sample data into the initial text generation model to generate predicted structured information corresponding to each of the plurality of pieces of first sample data, comprises: inputting each of the plurality of pieces of first sample data into the initial encoder to generate a group of vector representations corresponding to each of the plurality of pieces of first sample data(Pg 2, 3 Model Architecture, Para 1, Ln 2-4, encoder maps an input sequence of symbol representations (x1, ..., xn) to a sequence of continuous representations z = (z1, ..., zn). Given z, the decoder then generates an output sequence (y1, ..., ym) of symbols one element at a time. Pg 3, Encoder, Ln 7, produce outputs of dimension dmodel = 512); 
inputting the group of vector representations corresponding to each of the plurality of pieces of first sample data into the initial decoder to generate a plurality of predicted segments(Pg 2, 3 Model Architecture, Para 1, Ln 2-4, encoder maps an input sequence of symbol representations (x1, ..., xn) to a sequence of continuous representations z = (z1, ..., zn). Given z, the decoder then generates an output sequence (y1, ..., ym) of symbols one element at a time. This would happen for each training sample previously cited in Sun).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Sun and Devlin, with the Transformer model of Vaswani, as they improve performance and are more efficient(Abstract, Ln 6-8) and Sun uses a Transformer of Vaswani(Sun, Pg 3, 3.1 Transformer Encoder, Ln 1-2).
The combination of Sun, Devlin and Vaswani does not teach wherein a phrase generation ability of the initial encoder and the initial decoder is trained based on the first loss value.
In the same field of Masked Language Models, Devlin teaches wherein a phrase generation ability of the initial encoder and the initial decoder is trained based on the first loss value(Pg 4173, Model Architecture, Para 1, Ln 1-4, BERT’s model architecture is a multi-layer bidirectional Transformer encoder based on the original implementation described in Vaswani et al. (2017). Addition of Decoder with the combination of Vaswani, Sun teaches phrases rather than words. Pg 4174, 3.1 Pre-training BERT, Para 4, Ln 7-14, The training data generator ……Then, Ti will be used to predict the original token with cross entropy loss).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Sun, Devlin and Vaswani, with the Language Model of Devlin, as it is conceptually simple, which would improve ease of implementation (Abstract, Para 2, Ln 1).


Regarding Claim 3:
The combination of Sun, Devlin and Vaswani teaches the method of claim 2, and Sun teaches wherein the target structured information comprises a plurality of phrases(Pg 3, 3.2.2 Phrase-Level Masking, Para 1, Ln 10-12, randomly select a few phrases in the sentence, mask and predict all the basic units in the same phrase), 
and the method further comprises: obtaining positions to be masked among the plurality of phrases(Pg 3, 3.2.2 Phrase-Level Masking, Para 1, Ln 4-9, For English…lexical analysis and chunking tools to get the boundary of phrases in the sentences, and.…segmentation tools to get the word/phrase information in….Chinese); 
masking phrases above the positions(Pg 3, 3.2.2 Phrase-Level Masking, Para 1, Ln 10-12, randomly select a few phrases in the sentence, mask and predict all the basic units in the same phrase); 
inputting the target structured information after the masking into the text generation model to generate predicted phrases corresponding to the positions(Pg 3, 3.2.2 Phrase-Level Masking, Para 1, Ln 10-12, randomly select a few phrases in the sentence, mask and predict all the basic units in the same phrase); 
and training an inter-phrase relationship ability of the text generation model(Pg 1, Introduction, Para 3, Ln 11-13, In this way, the prior knowledge of phrases and entities are implicitly learned during the training procedure).
The combination of Sun, Devlin and Vaswani does not teach generating a second loss value based on the phrases above the positions and the predicted phrases corresponding to the positions and training an inter-phrase relationship ability of the text generation model based on the second loss value.
In the same field of Masked Language Models, Devlin teaches generating a second loss value based on the phrases above the positions and the predicted phrases corresponding to the positions(Pg 4174, 3.1 Pre-training BERT, Para 4, Ln 7-14, The training data generator … we replace the i-th token with (1) the [MASK] …Then, Ti will be used to predict the original token with cross entropy loss. Sun teaches the phrases being the prediction for the positions, Pg 3, Fig 1); 
and training an inter-phrase relationship ability of the text generation model based on the second loss value(Pg 4174, 3.1 Pre-training BERT, Para 4, Ln 7-14, The training data generator … we replace the i-th token with (1) the [MASK] …Then, Ti will be used to predict the original token with cross entropy loss. Pg 4171-4172, Introduction, Para 4, Ln 9-11, the objective is to predict the original vocabulary id of the masked word based only on its context. Phrases are taught by Sun as shown above).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Sun, Devlin and Vaswani, with the Language Model of Devlin, as it is conceptually simple, which would improve ease of implementation (Abstract, Para 2, Ln 1).

Regarding Claim 4: 
The combination of Sun, Devlin and Vaswani teaches the method of claim 2, and Sun teaches wherein the plurality of predicted segments comprise N predicted segments, where N is a positive integer(Pg 3, Fig 3, shows a positive number of predicted segments). 
The combination of Sun, Devlin and Vaswani does not teach inputting the group of vector representations corresponding to each of the plurality of pieces of first sample data into the initial decoder to generate the plurality of predicted segments, comprises: when predicting the ith predicted segment, generating the ith predicted segment by decoding through the initial decoder based on the group of vector representations, the first predicted segment over the (i-1)th predicted segment, and a position feature of the ith predicted segment, where i is a positive integer less than or equal to N.
In the same field of Machine Learning Language Models, Vaswani teaches inputting the group of vector representations corresponding to each of the plurality of pieces of first sample data into the initial decoder to generate the plurality of predicted segments, comprises: when predicting the ith predicted segment, generating the ith predicted segment by decoding through the initial decoder based on the group of vector representations(Pg 2, 3 Model Architecture, Para 1, Ln 2-4, encoder maps an input sequence of symbol representations (x1, ..., xn) to a sequence of continuous representations z = (z1, ..., zn). Given z, the decoder then generates an output sequence (y1, ..., ym) of symbols one element at a time. Pg 3, Encoder, Ln 7, produce outputs of dimension dmodel = 512. Each segment being what is predicted is shown in Sun, Pg 3, 3.2.1, Para 1, Ln 2-5, for English, the basic language unit is word, and for Chinese, the basic language unit is Chinese Character. 3.2.2, Para 1, Ln 11-12, mask and predict all the basic units in the same phrase), 
the first predicted segment over the (i-1)th predicted segment(Pg 2, 3 Model Architecture, Para 1, Ln 4-5, At each step the model is auto-regressive [10], consuming the previously generated symbols as additional input when generating the next. Pg 3, Fig 1, Outputs(shifted right) -> Decoder(is labeled as Decoder below in 3.1 Decoder:)), 
and a position feature of the ith predicted segment(Pg 3, Fig 1, Positional Encoding), 
where i is a positive integer less than or equal to N(Pg 2, 3 Model Architecture, Para 1, Ln 3-4, decoder then generates an output sequence (y1, ..., ym). The i-th predicted segment is going to be one of the segments from (y1, ..., ym)).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Sun, Devlin and Vaswani with the Transformer model of Vaswani, as they improve performance and are more efficient(Abstract, Ln 6-8) and Sun uses a Transformer of Vaswani(Sun, Pg 3, 3.1 Transformer Encoder, Ln 1-2).
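
For illustration only, the auto-regressive decoding relied upon for Claim 4 may be pictured with the following minimal sketch, in which the ith segment is produced from the encoder's vector representations, the first through (i-1)th segments already generated, and the position i. The names greedy_decode and toy_step are hypothetical, and toy_step is merely a stand-in for a trained decoder; neither appears in Vaswani.

```python
def greedy_decode(encoded, step_fn, n_segments):
    """Generate n_segments segments one at a time, for i = 1..N."""
    generated = []
    for i in range(1, n_segments + 1):
        # Inputs to step i: encoder vectors, segments 1..i-1, and the position i.
        segment = step_fn(encoded, tuple(generated), i)
        generated.append(segment)
    return generated

def toy_step(encoded, previous, position):
    """Hypothetical stand-in for a decoder step; it only records what it was given."""
    return f"seg{position}|prev={len(previous)}|enc={len(encoded)}"

# Two hypothetical encoder vectors for one sample, decoded into N = 3 segments.
out = greedy_decode([[0.1, 0.2], [0.3, 0.4]], toy_step, 3)
```

Each entry of out shows that step i received exactly i-1 previously generated segments, which corresponds to the "Outputs (shifted right)" input of the decoder cited above.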

Regarding Claim 5:
The combination of Sun, Devlin and Vaswani teaches the method of claim 4, and Sun teaches wherein the predicted segment comprises M characters, where M is a positive integer(Pg 3, 3.2.1, Para 1, Ln 2-5, for English, the basic language unit is word, and for Chinese, the basic language unit is Chinese Character. 3.2.2, Para 1, Ln 11-12, mask and predict all the basic units in the same phrase. Pg 3, Fig 1, shown segments with a positive number of characters.).
The combination of Sun, Devlin and Vaswani does not teach and generating the ith predicted segment comprises: when predicting the ith predicted segment, generating the M characters in the ith predicted segment simultaneously through the initial decoder.
In the same field of Machine Learning Language Models, Vaswani teaches and generating the ith predicted segment comprises: when predicting the ith predicted segment, generating the M characters in the ith predicted segment simultaneously through the initial decoder(Pg 2, 3 Model Architecture, Para 1, Ln 3-4, continuous representations z = (z1, ..., zn). Given z, the decoder then generates an output sequence (y1, ..., ym) of symbols one element at a time. Pg 3, 3.1, Encoder, Para 1, Ln 6-7, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension dmodel = 512. Pg 5, 3.4 Embeddings and Softmax, para 1, Ln 1-3, learned embeddings to convert the input tokens and output tokens to vectors of dimension dmodel.….convert the decoder output to predicted next-token probabilities. Entire tokens are predicted as one output. In Sun, tokens are an English word, Pg 3, 3.2.1, Para 1, Ln 2-5, for English, the basic language unit is word, and for Chinese, the basic language unit is Chinese Character. Words contain a positive number of letters).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Sun, Devlin and Vaswani, with the Transformer model of Vaswani, as they improve performance and are more efficient(Abstract, Ln 6-8) and Sun uses a Transformer of Vaswani(Sun, Pg 3, 3.1 Transformer Encoder, Ln 1-2).


Regarding Claim 6:
The combination of Sun, Devlin and Vaswani teaches the method of claim 3, and Sun teaches wherein the phrase generation ability and the inter-phrase relationship ability are trained in fusion(Pg 3, 3.2.2 Phrase-Level Masking, Para 1, Ln 10-12, randomly select a few phrases in the sentence, mask and predict all the basic units in the same phrase. Pg 3, Fig 3, shows model generating phrases. Pg 1, Introduction, Para 3, Ln 8-13, All of the words in the same unit are masked during word representation training.….In this way, the prior knowledge of phrases and entities are implicitly learned during the training procedure. Generation and using context are learned together in the phrase masking task).

Regarding Claim 7:
The combination of Sun, Devlin and Vaswani teaches the method of claim 6, and Sun teaches wherein the target structured information comprises a plurality of target segments corresponding to the piece of first sample data(Pg 3, Fig 3, shows a plurality of segments in the output for one sample), 
and inputting the plurality of pieces of first sample data into the initial text generation model to generate the predicted structured information corresponding to each of the plurality of pieces of first sample data, comprises: determining respective positions of the plurality of target segments in the piece of first sample data(Pg 3, 3.2.2 Phrase-Level Masking, Para 1, Ln 4-9, For English…lexical analysis and chunking tools to get the boundary of phrases in the sentences, and.…segmentation tools to get the word/phrase information in….Chinese); 
masking the plurality of target segments in the piece of first sample data based on respective positions of the plurality of target segments in the piece of first sample data(Pg 3, 3.2.2 Phrase-Level Masking, Para 1, Ln 4-9, For English…lexical analysis and chunking tools to get the boundary of phrases in the sentences, and.…segmentation tools to get the word/phrase information in….Chinese. Pg 3, 3.2.2 Phrase-Level Masking, Para 1, Ln 10-12, randomly select a few phrases in the sentence, mask and predict all the basic units in the same phrase); 
and inputting the plurality of pieces of first sample data after the masking into the initial text generation model to generate the predicted structured information corresponding to each of the plurality of pieces of first sample data(Pg 3, 3.2.2 Phrase-Level Masking, Para 1, Ln 10-12, randomly select a few phrases in the sentence, mask and predict all the basic units in the same phrase), 
wherein each of predicted segments in the predicted structured information is corresponding to a masked target segment in the piece of first sample data(Pg 3, Fig 3, shows predicted are corresponding to masked).
The combination of Sun, Devlin and Vaswani does not teach and the first loss value is determined based on a difference between each predicted segment and the corresponding masked target segment.
In the same field of Masked Language Models, Devlin teaches and the first loss value is determined based on a difference between each predicted segment and the corresponding masked target segment(Pg 4174, 3.1 Pre-training BERT, Para 4, Ln 12-14, Then, Ti will be used to predict the original token with cross entropy loss. Cross entropy loss is calculating the difference. Loss is calculated for each token individually. Also, Pg 4183, A.2 Pre-training Procedure, Para 2, Ln 11-13, The training loss is the sum of the mean masked LM likelihood and the mean next sentence prediction likelihood. Masked LM likelihood is averaged for pre-training loss).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Sun, Devlin and Vaswani, with the Language Model of Devlin, as it is conceptually simple, which would improve ease of implementation (Abstract, Para 2, Ln 1).

Regarding Claim 8:
The combination of Sun, Devlin and Vaswani teaches the method of claim 2, and Sun teaches wherein the target structured information comprises a plurality of target segments corresponding to the piece of first sample data (Pg 3, Fig 1, Shows multiple segments for a phrase, for a given sample).
The combination of Sun, Devlin and Vaswani does not teach and generating the first loss value based on the difference between the predicted structured information corresponding to each of the plurality of pieces of first sample data and the corresponding target structured information comprises: generating the first loss value based on differences between the plurality of predicted segments in the predicted structured information and the plurality of target segments in the target structured information.
In the same field of Machine Learning Language Models, Devlin teaches and generating the first loss value based on the difference between the predicted structured information corresponding to each of the plurality of pieces of first sample data and the corresponding target structured information comprises: generating the first loss value based on differences between the plurality of predicted segments in the predicted structured information and the plurality of target segments in the target structured information (Pg 4174, 3.1 Pre-training BERT, Para 4, Ln 12-14, Then, Ti will be used to predict the original token with cross entropy loss. Loss is calculated for an individual token. Pg 4183, A.2 Pre-training Procedure, Para 2, Ln 11-13, The training loss is the sum of the mean masked LM likelihood and the mean next sentence prediction likelihood. Masked LM likelihood is averaged. Cross Entropy Loss is the difference between target and prediction. Sun teaches multiple segments being part of the structured information as shown in previous claims, also Pg 3, Fig 1).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Sun, Devlin and Vaswani, with the Language Model of Devlin, as it is conceptually simple, which would improve ease of implementation (Abstract, Para 2, Ln 1).
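For illustration only, the following minimal sketch shows one way a per-token cross entropy loss can be averaged over masked positions, consistent with the general description cited above. It is not taken from any cited reference; the function name masked_cross_entropy and the toy probabilities are assumptions made for this example.

```python
import math

def masked_cross_entropy(pred_probs, target_ids, masked_positions):
    """Mean cross entropy over the masked positions only (hypothetical helper)."""
    losses = []
    for pos in masked_positions:
        # Probability the model assigned to the original (target) token.
        p_target = pred_probs[pos][target_ids[pos]]
        # Cross entropy for a one-hot target reduces to -log(p_target).
        losses.append(-math.log(p_target))
    return sum(losses) / len(losses)

# Toy data (assumed): three positions, vocabulary of three tokens.
pred_probs = [
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.25, 0.25, 0.50],
]
target_ids = [0, 1, 2]
masked_positions = [1, 2]  # only these positions contribute to the loss

loss = masked_cross_entropy(pred_probs, target_ids, masked_positions)
print(round(loss, 4))
```

The loss shrinks as the predicted probability of each original token approaches 1, which is the sense in which cross entropy measures a difference between prediction and target.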

Claim(s) 10 and 12 is/are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Sun and Devlin, as applied to claim 1 above, and further in view of Liu et al. “Text Summarization with Pretrained Encoders” hereinafter Liu.

Regarding Claim 10:
The combination of Sun and Devlin teaches the method of claim 1, and Sun teaches further comprising: obtaining a target scene(Pg 4, 4.3 Experiments, Para 1, Ln 1, ERNIE is applied to 5 Chinese NLP tasks); 
obtaining a second data set based on the target scene, wherein the second data set is supervised(Pg 4, 4.3.1 Natural Language, Para 1, Ln 1, MultiNLI corpus. The pairs are annotated with textual entailment); 
The combination of Sun and Devlin does not teach and adjusting parameters of the text generation model based on the second data set to generate a text generation model corresponding to the target scene.
In the same field of Pre-training NLP models, Liu teaches and adjusting parameters of the text generation model based on the second data set to generate a text generation model corresponding to the target scene (Pg 6, Abstractive Summarization, Para 1, Ln 1-8, In all abstractive models…. All models were trained. Pg 5, 4.1 Summarization Datasets, Para 1, Ln 1-2, We evaluated our model on three benchmark datasets. Para 3, Ln 1-2, NYT contains 110,540 articles with abstractive summaries).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Sun and Devlin, with the text summarization task of Liu, as it provides an application for the pre-training to improve performance (Pg 1, Introduction, Ln 1-2).

Regarding Claim 12:
The combination of Sun, Devlin and Liu teaches the method of claim 10, but does not teach wherein the target scene comprises one or more combinations of a dialogue generation scene, a machine translation scene, a question and answer scene, and a summary generation scene.
In the same field of Pre-training NLP models, Liu teaches wherein the target scene comprises one or more combinations of a dialogue generation scene, a machine translation scene, a question and answer scene, and a summary generation scene (Pg 4, 3.3 Abstractive Summarization, Para 1, Ln 1-2, abstractive summarization).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Sun and Devlin, with the text summarization task of Liu, as it provides an application for the pre-training to improve performance (Pg 1, Introduction, Ln 1-2).

Claim(s) 11 is/are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Sun, Devlin and Liu, as applied to claim 1 above, and further in view of Brownlee “Loss and Loss Functions for Training Deep Learning Neural Network”, hereinafter Brownlee.

Regarding Claim 11:
The combination of Sun, Devlin and Liu teaches the method of claim 10, but does not teach wherein the second data set comprises a plurality of pieces of second sample data, each of the plurality of pieces of second sample data comprises source data and labeled data corresponding to the source data, adjusting the parameters of the text generation model based on the second data set to generate the text generation model corresponding to the target scene comprises: segmenting the labeled data corresponding to each source data to generate a labeled segment sequence corresponding to each source data; inputting source data of the plurality of pieces of second sample data into the text generation model to generate a predicted segment sequence corresponding to each source data; generating a third loss value based on a difference between the predicted segment sequence and the labeled segment sequence; and adjusting the parameters of the text generation model based on the third loss value to generate the text generation model corresponding to the target scene.
In the same field of Pre-training NLP models, Liu teaches wherein the second data set comprises a plurality of pieces of second sample data(Pg 5, 4.1 Summarization Datasets, Para 1, Ln 1-2, We evaluated our model on three benchmark datasets. Para 3, Ln 1-2, NYT contains 110,540 articles with abstractive summaries), 
each of the plurality of pieces of second sample data comprises source data and labeled data corresponding to the source data (Pg 5, 4.1 Summarization Datasets, Para 3, Ln 1-2, NYT contains 110,540 articles with abstractive summaries),
adjusting the parameters of the text generation model based on the second data set to generate the text generation model corresponding to the target scene comprises: segmenting the labeled data corresponding to each source data to generate a labeled segment sequence corresponding to each source data (Pg 5, 4.1 Summarization Datasets, Para 3, Ln 8-14, documents with summaries less than 50 words were removed from the dataset.…..Sentences were split with the Stanford CoreNLP toolkit),
inputting source data of the plurality of pieces of second sample data into the text generation model to generate a predicted segment sequence corresponding to each source data (Pg 3, 2.3 Abstractive Summarization, Para 1, Ln 4-7, the source document x = [x1, ..., xn] to a sequence of continuous representations z = [z1, ..., zn], and a decoder then generates the target summary y = [y1, ..., ym] token-by-token);
and adjusting the parameters of the text generation model….to generate the text generation model corresponding to the target scene(Pg 6, Abstractive Summarization, Para 1, Ln 1-8, In all abstractive models…. All models were trained).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Sun and Devlin, with the text summarization task of Liu, as it provides an application for the pre-training to improve performance (Pg 1, Introduction, Ln 1-2).
The combination of Sun, Devlin and Liu does not specifically teach generating a third loss value based on a difference between the predicted segment sequence and the labeled segment sequence; and adjusting the parameters of the text generation model based on the third loss value.
In the same field of Machine Learning, Brownlee teaches generating a third loss value based on a difference between the predicted segment sequence and the labeled segment sequence (Pg 6, Para 2, Ln 2-3, a loss function estimates how closely the distribution of predictions made by a model matches the distribution of target variables in the training data. Segment sequences taught by Liu as shown above);
and adjusting the parameters of the text generation model based on the third loss value(Pg 1, Bullet point #1, Neural networks are trained using an optimization process that requires a loss function to calculate the model error).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Sun, Devlin and Liu, with the loss function of Brownlee, as it enables the model to be trained for a specific task (Pg 1, Bullet point #1).

Claim(s) 13 and 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Sun, and further in view of Song (US 20210224660 A1), and further in view of Devlin.

Regarding Claim 13:
Sun teaches obtaining a first data set, wherein the first data set comprises a plurality of pieces of first sample data (Pg 4, 4.1 Heterogeneous, Para 1, Ln 3-4, we draw the mixed corpus Chinese Wikepedia, Baidu Baike, Baidu news and Baidu Tieba);
extracting structured information from each of the plurality of pieces of first sample data as target structured information corresponding to each of the plurality of pieces of first sample data(Pg 3, 3.2.2 Phrase-Level Masking, Para 1, Ln 4-9, For English…lexical analysis and chunking tools to get the boundary of phrases in the sentences, and.…segmentation tools to get the word/phrase information in….Chinese. Examiner points out that with regard to the “plurality of pieces of first sample data” limitation, this process is done for more than just one training example, as the data sets cited above point out. This note will not be repeated in future citations); 
inputting the plurality of pieces of first sample data into an initial text generation model to generate predicted structured information corresponding to each of the plurality of pieces of first sample data (Pg 3, 3.2.2 Phrase-Level Masking, Para 1, Ln 10-12, randomly select a few phrases in the sentence, mask and predict all the basic units in the same phrase. Pg 3, Fig 1, shows model. Previous citation to data sets shows this is happening to many individual data samples);
and training a phrase generation ability of the initial text generation model (Pg 3, 3.2.2 Phrase-Level Masking, Para 1, Ln 8-12, basic language units as training input..… mask and predict all the basic units in the same phrase).
Sun does not teach an electronic device, comprising: at least one processor; and a storage device connected in communication with the at least one processor; wherein, the storage device stores instructions executable by the at least one processor, and when the instructions are executed by the at least one processor, the at least one processor is caused to implement.
In the same field of Phrase Mask Prediction, Song teaches an electronic device, comprising: at least one processor; and a storage device connected in communication with the at least one processor; wherein, the storage device stores instructions executable by the at least one processor, and when the instructions are executed by the at least one processor, the at least one processor is caused to implement (Para [0053], Ln 1-13, The user computing device 102 includes one or more processors 112 and a memory ….memory 114 can store data 116 and instructions 118 which are executed by the processor 112 to cause the user computing device 102 to perform operations).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify Sun with the generic computer components of Song, as it provides hardware in which the system can be realized (Para [0053], Ln 9-13).
The combination of Sun and Song does not teach generating a first loss value based on a difference between the predicted structured information corresponding to each of the plurality of pieces of first sample data and the corresponding target structured information; and training a phrase generation ability of the initial text generation model based on the first loss value to generate the text generation model.
In the same field of Masked Language Models, Devlin teaches generating a first loss value based on a difference between the predicted structured information corresponding to each of the plurality of pieces of first sample data and the corresponding target structured information(Pg 4174, 3.1 Pre-training BERT, Para 4, Ln 12-14, Then, Ti will be used to predict the original token with cross entropy loss. Cross entropy loss is measuring the difference); 
and training a phrase generation ability of the initial text generation model based on the first loss value to generate the text generation model (Pg 4174, 3.1 Pre-training BERT, Para 4, Ln 7-14, The training data generator ……Then, Ti will be used to predict the original token with cross entropy loss. Phrases rather than words is taught by Sun).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Sun and Song, with the Language Model of Devlin, as it is conceptually simple, which would improve ease of implementation (Abstract, Para 2, Ln 1).
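For illustration only, the phrase-level masking relied upon above (select phrases, mask every basic unit in each selected phrase, and retain the original units as prediction targets) can be sketched as follows. The sketch is not code from Sun or Devlin; the function name mask_phrases, the example sentence, and the phrase boundaries are assumptions, and in practice the boundaries would come from a chunking or segmentation tool.

```python
import random

MASK = "[MASK]"

def mask_phrases(tokens, phrase_spans, num_to_mask, rng):
    """Mask every token of randomly chosen phrases (hypothetical helper).

    phrase_spans holds (start, end) index pairs, end exclusive.
    Returns the masked token list and a position -> original token map.
    """
    chosen = rng.sample(phrase_spans, num_to_mask)
    masked = list(tokens)
    targets = {}
    for start, end in chosen:
        for i in range(start, end):
            targets[i] = tokens[i]  # target the model should predict
            masked[i] = MASK        # the whole phrase is hidden together
    return masked, targets

# Assumed example sentence and phrase boundaries.
tokens = ["the", "quick", "brown", "fox", "jumps",
          "over", "the", "lazy", "dog"]
phrase_spans = [(0, 4), (5, 9)]

masked, targets = mask_phrases(tokens, phrase_spans, 1, random.Random(0))
print(masked)
print(targets)
```

The targets map is what a loss function would compare against the model's predictions at the masked positions.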

Regarding Claim 20:
Sun teaches a method for obtaining a document layout, the method comprising(Abstract, Ln 6-8, ERNIE is designed to learn language representation enhanced by knowledge masking strategies)
obtaining a first data set, wherein the first data set comprises a plurality of pieces of first sample data (Pg 4, 4.1 Heterogeneous, Para 1, Ln 3-4, we draw the mixed corpus Chinese Wikepedia, Baidu Baike, Baidu news and Baidu Tieba);
extracting structured information from each of the plurality of pieces of first sample data as target structured information corresponding to each of the plurality of pieces of first sample data(Pg 3, 3.2.2 Phrase-Level Masking, Para 1, Ln 4-9, For English…lexical analysis and chunking tools to get the boundary of phrases in the sentences, and.…segmentation tools to get the word/phrase information in….Chinese. Examiner points out that with regard to the “plurality of pieces of first sample data” limitation, this process is done for more than just one training example, as the data sets cited above point out. This note will not be repeated in future citations); 
inputting the plurality of pieces of first sample data into an initial text generation model to generate predicted structured information corresponding to each of the plurality of pieces of first sample data (Pg 3, 3.2.2 Phrase-Level Masking, Para 1, Ln 10-12, randomly select a few phrases in the sentence, mask and predict all the basic units in the same phrase. Pg 3, Fig 1, shows model. Previous citation to data sets shows this is happening to many individual data samples);
and training a phrase generation ability of the initial text generation model (Pg 3, 3.2.2 Phrase-Level Masking, Para 1, Ln 8-12, basic language units as training input..… mask and predict all the basic units in the same phrase).
Sun does not teach a non-transitory computer-readable storage medium storing computer instructions, wherein when the computer instructions are executed, a computer is caused to implement.
In the same field of Phrase Mask Prediction, Song teaches a non-transitory computer-readable storage medium storing computer instructions, wherein when the computer instructions are executed, a computer is caused to implement (Para [0053], Ln 1-13, computing device 102 includes one or more processors 112 and a memory …such as RAM….memory 114 can store data 116 and instructions 118 which are executed by the processor 112 to cause the user computing device 102 to perform operations).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify Sun with the generic computer components of Song, as it provides hardware in which the system can be realized (Para [0053], Ln 9-13).
The combination of Sun and Song does not teach generating a first loss value based on a difference between the predicted structured information corresponding to each of the plurality of pieces of first sample data and the corresponding target structured information; and training a phrase generation ability of the initial text generation model based on the first loss value to generate the text generation model.
In the same field of Masked Language Models, Devlin teaches generating a first loss value based on a difference between the predicted structured information corresponding to each of the plurality of pieces of first sample data and the corresponding target structured information(Pg 4174, 3.1 Pre-training BERT, Para 4, Ln 12-14, Then, Ti will be used to predict the original token with cross entropy loss. Cross entropy loss is measuring the difference); 
and training a phrase generation ability of the initial text generation model based on the first loss value to generate the text generation model (Pg 4174, 3.1 Pre-training BERT, Para 4, Ln 7-14, The training data generator ……Then, Ti will be used to predict the original token with cross entropy loss. Phrases rather than words is taught by Sun).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Sun and Song, with the Language Model of Devlin, as it is conceptually simple, which would improve ease of implementation (Abstract, Para 2, Ln 1).

Claim(s) 14-19 is/are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Sun, Song and Devlin as applied to claim 13 above, and further in view of Vaswani.

Regarding Claim 14:
The combination of Sun, Song and Devlin teaches the electronic device of claim 13, and Sun teaches wherein the initial text generation model comprises an initial encoder(Pg 3, 3.1 Transformer Encoder, Ln 1-2, ERNIE use multi-layer Transformer (Vaswani et al., 2017) as basic encoder),
generate a plurality of predicted segments; and generating the predicted structured information corresponding to each of the plurality of pieces of first sample data based on the plurality of predicted segments (Pg 3, Fig 3, shows phrases being generated based on the predicted segments that make them up).
The combination of Sun, Song and Devlin does not teach wherein, the initial text generation model comprises an initial encoder and an initial decoder, inputting the plurality of pieces of first sample data into the initial text generation model to generate predicted structured information corresponding to each of the plurality of pieces of first sample data, comprises: inputting each of the plurality of pieces of first sample data into the initial encoder to generate a group of vector representations corresponding to each of the plurality of pieces of first sample data; inputting the group of vector representations corresponding to each of the plurality of pieces of first sample data into the initial decoder to generate a plurality of predicted segments.
In the same field of Machine Learning Language Models, Vaswani teaches wherein, the initial text generation model comprises an initial encoder and an initial decoder (Pg 2, 3 Model Architecture, Para 1, Ln 1, encoder-decoder structure, Para 2, Ln 1, The Transformer follows this overall architecture),
inputting the plurality of pieces of first sample data into the initial text generation model to generate predicted structured information corresponding to each of the plurality of pieces of first sample data, comprises: inputting each of the plurality of pieces of first sample data into the initial encoder to generate a group of vector representations corresponding to each of the plurality of pieces of first sample data (Pg 2, 3 Model Architecture, Para 1, Ln 2-4, encoder maps an input sequence of symbol representations (x1, ..., xn) to a sequence of continuous representations z = (z1, ..., zn). Given z, the decoder then generates an output sequence (y1, ..., ym) of symbols one element at a time. Pg 3, Encoder, Ln 7, produce outputs of dimension dmodel = 512);
inputting the group of vector representations corresponding to each of the plurality of pieces of first sample data into the initial decoder to generate a plurality of predicted segments(Pg 2, 3 Model Architecture, Para 1, Ln 2-4, encoder maps an input sequence of symbol representations (x1, ..., xn) to a sequence of continuous representations z = (z1, ..., zn). Given z, the decoder then generates an output sequence (y1, ..., ym) of symbols one element at a time. This would happen for each training sample previously cited in Sun); 
It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Sun, Song and Devlin, with the Transformer model of Vaswani, as it improves performance and is more efficient (Abstract, Ln 6-8) and Sun uses the Transformer of Vaswani (Sun, Pg 3, 3.1 Transformer Encoder, Ln 1-2).
The combination of Sun, Song, Devlin and Vaswani does not teach wherein a phrase generation ability of the initial encoder and the initial decoder is trained based on the first loss value.
In the same field of Masked Language Models, Devlin teaches wherein a phrase generation ability of the initial encoder and the initial decoder is trained based on the first loss value(Pg 4173, Model Architecture, Para 1, Ln 1-4, BERT’s model architecture is a multi-layer bidirectional Transformer encoder based on the original implementation described in Vaswani et al. (2017). Addition of Decoder with the combination of Vaswani, Sun teaches phrases rather than words. Pg 4174, 3.1 Pre-training BERT, Para 4, Ln 7-14, The training data generator ……Then, Ti will be used to predict the original token with cross entropy loss).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Sun, Song, Devlin and Vaswani, with the Language Model of Devlin, as it is conceptually simple, which would improve ease of implementation (Abstract, Para 2, Ln 1).

Regarding Claim 15:
The combination of Sun, Song, Devlin and Vaswani teaches the electronic device of claim 14, and Sun teaches wherein the target structured information comprises a plurality of phrases (Pg 3, 3.2.2 Phrase-Level Masking, Para 1, Ln 10-12, randomly select a few phrases in the sentence, mask and predict all the basic units in the same phrase),
and when the instructions are executed by the at least one processor, the at least one processor is caused to implement(Is taught with the combination of Song, citations in Claim 13):
obtaining positions to be masked among the plurality of phrases(Pg 3, 3.2.2 Phrase-Level Masking, Para 1, Ln 4-9, For English…lexical analysis and chunking tools to get the boundary of phrases in the sentences, and.…segmentation tools to get the word/phrase information in….Chinese); 
masking phrases above the positions(Pg 3, 3.2.2 Phrase-Level Masking, Para 1, Ln 10-12, randomly select a few phrases in the sentence, mask and predict all the basic units in the same phrase); 
inputting the target structured information after the masking into the text generation model to generate predicted phrases corresponding to the positions(Pg 3, 3.2.2 Phrase-Level Masking, Para 1, Ln 10-12, randomly select a few phrases in the sentence, mask and predict all the basic units in the same phrase); 
and training an inter-phrase relationship ability of the text generation model(Pg 1, Introduction, Para 3, Ln 11-13, In this way, the prior knowledge of phrases and entities are implicitly learned during the training procedure).
The combination of Sun, Song, Devlin and Vaswani does not teach generating a second loss value based on the phrases above the positions and the predicted phrases corresponding to the positions and training an inter-phrase relationship ability of the text generation model based on the second loss value.
In the same field of Masked Language Models, Devlin teaches generating a second loss value based on the phrases above the positions and the predicted phrases corresponding to the positions (Pg 4174, 3.1 Pre-training BERT, Para 4, Ln 7-14, The training data generator … we replace the i-th token with (1) the [MASK] …Then, Ti will be used to predict the original token with cross entropy loss. Sun teaches the phrases being the prediction for the positions, Pg 3, Fig 1);
and training an inter-phrase relationship ability of the text generation model based on the second loss value(Pg 4174, 3.1 Pre-training BERT, Para 4, Ln 7-14, The training data generator … we replace the i-th token with (1) the [MASK] …Then, Ti will be used to predict the original token with cross entropy loss. Pg 4171-4172, Introduction, Para 4, Ln 9-11, the objective is to predict the original vocabulary id of the masked word based only on its context. Phrases are taught by Sun as shown above).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Sun, Song, Devlin and Vaswani, with the Language Model of Devlin, as it is conceptually simple, which would improve ease of implementation (Abstract, Para 2, Ln 1).

Regarding Claim 16:
The combination of Sun, Song, Devlin and Vaswani teaches the electronic device of claim 14, and Sun teaches wherein the plurality of predicted segments comprise N predicted segments, where N is a positive integer(Pg 3, Fig 3, shows a positive number of predicted segments). 
The combination of Sun, Song, Devlin and Vaswani does not teach inputting the group of vector representations corresponding to each of the plurality of pieces of first sample data into the initial decoder to generate the plurality of predicted segments, comprises: when predicting the ith predicted segment, generating the ith predicted segment by decoding through the initial decoder based on the group of vector representations, the first predicted segment over the (i-1)th predicted segment, and a position feature of the ith predicted segment, where i is a positive integer less than or equal to N.
In the same field of Machine Learning Language Models, Vaswani teaches inputting the group of vector representations corresponding to each of the plurality of pieces of first sample data into the initial decoder to generate the plurality of predicted segments, comprises: when predicting the ith predicted segment, generating the ith predicted segment by decoding through the initial decoder based on the group of vector representations (Pg 2, 3 Model Architecture, Para 1, Ln 2-4, encoder maps an input sequence of symbol representations (x1, ..., xn) to a sequence of continuous representations z = (z1, ..., zn). Given z, the decoder then generates an output sequence (y1, ..., ym) of symbols one element at a time. Pg 3, Encoder, Ln 7, produce outputs of dimension dmodel = 512. Each segment being what is predicted is shown in Sun, Pg 3, 3.2.1, Para 1, Ln 2-5, for English, the basic language unit is word, and for Chinese, the basic language unit is Chinese Character. 3.2.2, Para 1, Ln 11-12, mask and predict all the basic units in the same phrase),
the first predicted segment over the (i-1)th predicted segment(Pg 2, 3 Model Architecture, Para 1, Ln 4-5, At each step the model is auto-regressive [10], consuming the previously generated symbols as additional input when generating the next. Pg 3, Fig 1, Outputs(shifted right) -> Decoder(is labeled as Decoder below in 3.1 Decoder:)), 
and a position feature of the ith predicted segment(Pg 3, Fig 1, Positional Encoding), 
where i is a positive integer less than or equal to N(Pg 2, 3 Model Architecture, Para 1, Ln 3-4, decoder then generates an output sequence (y1, ..., ym). The i-th predicted segment is going to be one of the segments from (y1, ..., ym)).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Sun, Song, Devlin and Vaswani with the Transformer model of Vaswani, as it improves performance and is more efficient (Abstract, Ln 6-8) and Sun uses the Transformer of Vaswani (Sun, Pg 3, 3.1 Transformer Encoder, Ln 1-2).
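For illustration only, the auto-regressive decoding relied upon above (each output is generated from the encoder representations, the previously generated outputs, and the position of the current output) can be sketched as the loop below. This is not code from Vaswani; decode_segments and toy_step are hypothetical names, and toy_step merely stands in for a trained decoder so that the data flow is visible.

```python
def decode_segments(encoder_vectors, step_fn, n_segments):
    """Generate n_segments outputs one at a time (hypothetical helper)."""
    generated = []
    for i in range(1, n_segments + 1):
        # Step i sees the encoder output, segments 1..i-1, and position i.
        segment = step_fn(encoder_vectors, tuple(generated), i)
        generated.append(segment)
    return generated

def toy_step(encoder_vectors, previous, position):
    # Stand-in for a trained decoder: records what each step was given.
    return f"seg{position}|prev={len(previous)}|ctx={len(encoder_vectors)}"

# Assumed encoder output: two vectors of dimension two.
encoder_vectors = [[0.1, 0.2], [0.3, 0.4]]
out = decode_segments(encoder_vectors, toy_step, 3)
print(out)
```

Each step receives exactly the i-1 earlier outputs, so the index i is always a positive integer no greater than the number of segments requested.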

Regarding Claim 17:
The combination of Sun, Song, Devlin and Vaswani teaches the electronic device of claim 16, and Sun teaches wherein the predicted segment comprises M characters, where M is a positive integer(Pg 3, 3.2.1, Para 1, Ln 2-5, for English, the basic language unit is word, and for Chinese, the basic language unit is Chinese Character. 3.2.2, Para 1, Ln 11-12, mask and predict all the basic units in the same phrase. Pg 3, Fig 1, shown segments with a positive number of characters).
The combination of Sun, Song, Devlin and Vaswani does not teach and generating the ith predicted segment comprises: when predicting the ith predicted segment, generating the M characters in the ith predicted segment simultaneously through the initial decoder.
In the same field of Machine Learning Language Models, Vaswani teaches and generating the ith predicted segment comprises: when predicting the ith predicted segment, generating the M characters in the ith predicted segment simultaneously through the initial decoder (Pg 2, 3 Model Architecture, Para 1, Ln 3-4, continuous representations z = (z1, ..., zn). Given z, the decoder then generates an output sequence (y1, ..., ym) of symbols one element at a time. Pg 3, 3.1, Encoder, Para 1, Ln 6-7, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension dmodel = 512. Pg 5, 3.4 Embeddings and Softmax, para 1, Ln 1-3, learned embeddings to convert the input tokens and output tokens to vectors of dimension dmodel.….convert the decoder output to predicted next-token probabilities. Entire tokens are predicted as one output. In Sun, tokens are an English word, Pg 3, 3.2.1, Para 1, Ln 2-5, for English, the basic language unit is word, and for Chinese, the basic language unit is Chinese Character. Words contain a positive number of letters).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Sun, Song, Devlin and Vaswani, with the Transformer model of Vaswani, as it improves performance and is more efficient (Abstract, Ln 6-8) and Sun uses the Transformer of Vaswani (Sun, Pg 3, 3.1 Transformer Encoder, Ln 1-2).

Regarding Claim 18:
The combination of Sun, Song, Devlin and Vaswani teaches the electronic device of claim 15, and Sun teaches wherein the phrase generation ability and the inter-phrase relationship ability are trained in fusion(Pg 3, 3.2.2 Phrase-Level Masking, Para 1, Ln 10-12, randomly select a few phrases in the sentence, mask and predict all the basic units in the same phrase. Pg 3, Fig 3, shows model generating phrases. Pg 1, Introduction, Para 3, Ln 8-13, All of the words in the same unit are masked during word representation training.….In this way, the prior knowledge of phrases and entities are implicitly learned during the training procedure. Generation and using context are learned together in the phrase masking task),
the target structured information comprises a plurality of target segments corresponding to the piece of first sample data(Pg 3, Fig 3, shows a plurality of segments in the output for one sample), 
and inputting the plurality of pieces of first sample data into the initial text generation model to generate the predicted structured information corresponding to each of the plurality of pieces of first sample data, comprises: determining respective positions of the plurality of target segments in the piece of first sample data (Pg 3, 3.2.2 Phrase-Level Masking, Para 1, Ln 4-9, For English…lexical analysis and chunking tools to get the boundary of phrases in the sentences, and.…segmentation tools to get the word/phrase information in….Chinese);
masking the plurality of target segments in the piece of first sample data based on respective positions of the plurality of target segments in the piece of first sample data (Pg 3, 3.2.2 Phrase-Level Masking, Para 1, Ln 4-9, For English…lexical analysis and chunking tools to get the boundary of phrases in the sentences, and.…segmentation tools to get the word/phrase information in….Chinese. Pg 3, 3.2.2 Phrase-Level Masking, Para 1, Ln 10-12, randomly select a few phrases in the sentence, mask and predict all the basic units in the same phrase);
and inputting the plurality of pieces of first sample data after the masking into the initial text generation model to generate the predicted structured information corresponding to each of the plurality of pieces of first sample data(Pg 3, 3.2.2 Phrase-Level Masking, Para 1, Ln 10-12, randomly select a few phrases in the sentence, mask and predict all the basic units in the same phrase), 
wherein each of predicted segments in the predicted structured information is corresponding to a masked target segment in the piece of first sample data (Pg 3, Fig 3, shows predicted are corresponding to masked),
The combination of Sun, Song, Devlin and Vaswani does not teach and the first loss value is determined based on a difference between each predicted segment and the corresponding masked target segment.
In the same field of Masked Language Models, Devlin teaches and the first loss value is determined based on a difference between each predicted segment and the corresponding masked target segment(Pg 4174, 3.1 Pre-training BERT, Para 4, Ln 12-14, Then, Ti will be used to predict the original token with cross entropy loss. Cross entropy loss is calculating the difference. Loss is calculated for each token individually. Also, Pg 4183, A.2 Pre-training Procedure, Para 2, Ln 11-13, The training loss is the sum of the mean masked LM likelihood and the mean next sentence prediction likelihood. Masked LM likelihood is averaged for pre-training loss).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Sun, Song, Devlin and Vaswani, with the Language Model of Devlin, as it is conceptually simple, which would improve ease of implementation (Abstract, Para 2, Ln 1).

Regarding Claim 19:
The combination of Sun, Song, Devlin and Vaswani teaches the electronic device of claim 18, and Sun teaches wherein the target structured information comprises a plurality of target segments corresponding to the piece of first sample data(Pg 3, Fig 3, shows a plurality of segments in the output for one sample), 
and inputting the plurality of pieces of first sample data into the initial text generation model to generate the predicted structured information corresponding to each of the plurality of pieces of first sample data, comprises: determining respective positions of the plurality of target segments in the piece of first sample data(Pg 3, 3.2.2 Phrase-Level Masking, Para 1, Ln 4-9, For English…lexical analysis and chunking tools to get the boundary of phrases in the sentences, and…segmentation tools to get the word/phrase information in…Chinese);
masking the plurality of target segments in the piece of first sample data based on respective positions of the plurality of target segments in the piece of first sample data(Pg 3, 3.2.2 Phrase-Level Masking, Para 1, Ln 4-9, For English…lexical analysis and chunking tools to get the boundary of phrases in the sentences, and…segmentation tools to get the word/phrase information in…Chinese. Pg 3, 3.2.2 Phrase-Level Masking, Para 1, Ln 10-12, randomly select a few phrases in the sentence, mask and predict all the basic units in the same phrase);
and inputting the plurality of pieces of first sample data after the masking into the initial text generation model to generate the predicted structured information corresponding to each of the plurality of pieces of first sample data(Pg 3, 3.2.2 Phrase-Level Masking, Para 1, Ln 10-12, randomly select a few phrases in the sentence, mask and predict all the basic units in the same phrase), 
wherein each of predicted segments in the predicted structured information is corresponding to a masked target segment in the piece of first sample data(Pg 3, Fig 3, shows predicted are corresponding to masked),
The combination of Sun, Song, Devlin and Vaswani does not teach and the first loss value is determined based on a difference between each predicted segment and the corresponding masked target segment.
In the same field of Masked Language Models, Devlin teaches and the first loss value is determined based on a difference between each predicted segment and the corresponding masked target segment(Pg 4174, 3.1 Pre-training BERT, Para 4, Ln 12-14, Then, Ti will be used to predict the original token with cross entropy loss. Cross entropy loss is calculating the difference. Loss is calculated for each token individually. Also, Pg 4183, A.2 Pre-training Procedure, Para 2, Ln 11-13, The training loss is the sum of the mean masked LM likelihood and the mean next sentence prediction likelihood. Masked LM likelihood is averaged for pre-training loss).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Sun, Song, Devlin and Vaswani, with the Language Model of Devlin, as it is conceptually simple, which would improve ease of implementation (Abstract, Para 2, Ln 1).
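For illustration only, the phrase-level masking cited from Sun and the per-token cross entropy loss cited from Devlin can be sketched as follows. This is a minimal sketch and not the implementation of either reference or of the application: the function names, the "[MASK]" symbol, the example sentence, the phrase boundaries, and the predicted probabilities are all hypothetical, and the phrase boundaries are assumed to have already been produced by a chunking or segmentation tool.

```python
import math

MASK = "[MASK]"  # hypothetical mask symbol


def mask_phrases(tokens, spans):
    """Mask every token inside each (start, end) phrase span.

    Returns the masked token list and a dict mapping each masked
    position to the original token at that position.
    """
    masked = list(tokens)
    targets = {}
    for start, end in spans:
        for i in range(start, end):
            targets[i] = tokens[i]
            masked[i] = MASK
    return masked, targets


def masked_cross_entropy(predicted_probs, targets):
    """Mean of -log p(original token) over the masked positions.

    predicted_probs maps position -> {token: probability}; in practice
    these probabilities would come from a model, here they are supplied
    by hand.
    """
    losses = [-math.log(predicted_probs[i][tok]) for i, tok in targets.items()]
    return sum(losses) / len(losses)


# Hypothetical sentence and phrase boundaries (end index is exclusive).
tokens = ["harry", "potter", "is", "a", "series", "of", "fantasy", "novels"]
spans = [(0, 2), (4, 5)]
masked, targets = mask_phrases(tokens, spans)

# Hypothetical model output: probability assigned to each original token.
probs = {i: {tok: 0.5} for i, tok in targets.items()}
loss = masked_cross_entropy(probs, targets)
```

In this sketch each masked position contributes its own loss term, and the terms are averaged, which is one way to read "a difference between each predicted segment and the corresponding masked target segment".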

Allowable Subject Matter
Claim 9 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
The following are the reasons the claim contains allowable subject matter:

Regarding Claim 9:
The combination of Sun, Devlin and Vaswani teaches the method of claim 2, but does not teach wherein inputting the group of vector representations corresponding to each of the plurality of pieces of first sample data into the initial decoder to generate the plurality of predicted segments comprises: obtaining a preset length; decoding through the initial decoder the group of vector representations corresponding to each of the plurality of pieces of first sample data, and the preset length to generate the plurality of predicted segments with the preset length; in which under a case that a length of the predicted segment is less than the preset length, the predicted segment is supplemented with a preset complementing symbol so that the length of the predicted segment is equal to the preset length.
In the same field of Machine Learning Language Models, Vaswani teaches wherein inputting the group of vector representations corresponding to each of the plurality of pieces of first sample data into the initial decoder to generate the plurality of predicted segments comprises: decoding through the initial decoder the group of vector representations corresponding to each of the plurality of pieces of first sample data(Vaswani, Pg 2, 3 Model Architecture, Para 1, Ln 2-4, encoder maps an input sequence of symbol representations (x1, ..., xn) to a sequence of continuous representations z = (z1, ..., zn). Given z, the decoder then generates an output sequence (y1, ..., ym) of symbols one element at a time. Sun teaches that it is happening for every sample and not just one.),
It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Sun, Devlin and Vaswani, with the Transformer model of Vaswani, as they improve performance and are more efficient(Abstract, Ln 6-8) and Sun uses a Transformer of Vaswani(Sun, Pg 3, 3.1 Transformer Encoder, Ln 1-2).
In the same field of Machine Learning, Gao “Ensemble Attention For Text Recognition In Natural Images” hereinafter Gao, teaches obtaining a preset length(Pg 4, B Ensemble Decoder, Para 2, Ln 6-8, ensure same length of labels); 
But Gao does not teach and the preset length to generate the plurality of predicted segments with the preset length; in which under a case that a length of the predicted segment is less than the preset length, the predicted segment is supplemented with a preset complementing symbol so that the length of the predicted segment is equal to the preset length.
While Gao does teach padding to a preset length, the padding is done to the input of the decoder(Pg 4, B Ensemble Decoder, Para 2, Ln 1), and not after the decoder. The claim limitations in Claim 2 state that “predicted segments” are the output of the decoder, so these last claim limitations of Claim 9 are interpreted as limiting the padding to be done to the decoder output, after it has been output from the decoder. For this reason the claim contains allowable subject matter over cited prior art.
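For illustration only, the distinction drawn above between padding the decoder input and padding the decoder output can be sketched as follows. This is a minimal sketch under stated assumptions: toy_decoder is a hypothetical stand-in that emits one symbol per input element, the "[PAD]" symbol and the preset length of 4 are arbitrary, and segments already at or above the preset length are returned unchanged because the claim language only addresses the shorter case.

```python
PAD = "[PAD]"  # hypothetical complementing symbol


def pad_to_length(segment, preset_length, pad_symbol=PAD):
    """Append pad_symbol until the segment reaches preset_length.

    Segments that are already long enough are returned unchanged.
    """
    missing = max(0, preset_length - len(segment))
    return list(segment) + [pad_symbol] * missing


def toy_decoder(vectors):
    """Hypothetical stand-in for a decoder: one output symbol per input."""
    return [f"tok{v}" for v in vectors]


preset_length = 4
vectors = [3, 7]

# Padding applied to the decoder output, as Claim 9 is interpreted above.
predicted = toy_decoder(vectors)
padded_output = pad_to_length(predicted, preset_length)

# Padding applied before decoding, as Gao is characterized above.
padded_input = pad_to_length(vectors, preset_length, pad_symbol=0)
decoded_from_padded_input = toy_decoder(padded_input)
```

In the first path the complementing symbols are appended after decoding and never pass through the decoder; in the second path the decoder itself receives the padded sequence.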

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Yuanxin Liu et al. “Unsupervised Pre-training for Natural Language Generation: A Literature Review”.
A review paper covering different implementations of pre-training for language generation tasks.
Wu et al. “Mask and Infill: Applying Masked Language Model to Sentiment Transfer”.
Contains phrase masking and prediction.
Zhou et al. “LIMIT-BERT : Linguistic Informed Multi-Task BERT”.
Contains phrase masking and prediction.
Joshi et al. “SpanBERT: Improving Pre-training by Representing and Predicting Spans”.
BERT variant containing Span Masking and prediction, similar to phrase masking.
Yinhan Liu et al. “RoBERTa: A Robustly Optimized BERT Pretraining Approach”.
Pre-training representation model for other NLP tasks.
Radford et al. “Language Models are Unsupervised Multitask Learners”.
Pre-training model for other NLP tasks.
Stewart et al. “Word-level Lexical Normalisation using Context-Dependent Embeddings”.
Contains padding model inputs to a preset length.
Reddy et al. “EFFECTS OF PADDING ON LSTMS AND CNNS”.
Covers different methods of model input padding.
Peleg et al. (US 20220198136 A1)
Contains pre-training involving phrase masking, and provisional for priority.
Galley et al. (US 20210192140 A1)
Pre-training for language generation task.
Saleh et al. (US 10885436 B1)
Pre-training for text summarization.
Xu et al. (US 20200372225 A1)
Pre-training for language generation.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to ALEXANDER G MARLOW whose telephone number is (571)272-4536. The examiner can normally be reached Monday - Thursday 10:00 am - 8:00 pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Richmond Dorvil, can be reached at (571)272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/ALEXANDER G MARLOW/Assistant Examiner, Art Unit 2658
/DOUGLAS GODBOLD/Primary Examiner, Art Unit 2655