DETAILED ACTION

Introduction
This office action is in response to Applicant’s submission filed on 12/10/2020. Claims 1-20 are pending in the application. As such, claims 1-20 have been examined.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority
Acknowledgment is made of applicant’s claim for foreign priority under 35 U.S.C. 119 (a)-(d). The certified copy has been filed in parent Application No. CN2020104797441, filed on 05/29/2020.
Applicant is advised of possible benefits under 35 U.S.C. 119(a)-(d) and (f), wherein an application for patent filed in the United States may be entitled to claim priority to an application filed in a foreign country.
Should applicant desire to obtain the benefit of foreign priority under 35 U.S.C. 119(a)-(d) prior to declaration of an interference, a certified English translation of the foreign application must be submitted in reply to this action.  37 CFR 41.154(b) and 41.202(e).
Failure to provide a certified translation may result in no benefit being accorded for the non-English application.  A certified translation of the parent application No. CN2020104797441 would need to be provided in order to satisfy the requirement stated above.

Information Disclosure Statement
The information disclosure statements (IDS) were submitted on 11/11/2021 and 3/14/2022.  The submissions are in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statements are being considered by the examiner.

Drawings
The drawings filed on 12/10/2020 have been accepted and considered by the Examiner.


Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/process/file/efs/guidance/eTD-info-I.jsp.
Claims 1-5, 7-11, and 13-17 are provisionally rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1, 3, 5, 7-10, 12, 14, 16-19 of U.S. Patent Application Serial No. 16951702 (hereinafter the “702” application, which is pending issuance), and are also rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1-3, 5-10, 12-20 of U.S. Patent No. 11526668 (hereinafter the “668” patent).  Although the claims at issue are not identical, they are not patentably distinct from each other because claims 1, 7 and 13 of the instant application correspond directly to claims (7, 16) of the 702 application, and to the combinations of claims (1, 8, 15) and (5, 12, 19) of the 668 patent. The rejection over the 702 application is a provisional nonstatutory double patenting rejection because the patentably indistinct claims have not in fact been patented.

Claims 2, 8 and 14 of the instant application correspond to claims (8, 17) of the 702 application, and claims (6, 13, 20) of the 668 patent.
Claims 3, 9 and 15 of the instant application correspond to the combination of claims (1, 10, 19) and (3, 12) of the 702 application, and the combination of claims (2, 9, 16) and (3, 10, 17) of the 668 patent.
Claims 4, 10 and 16 of the instant application correspond to the combination of claims (1, 10, 19) and (5, 14) of the 702 application, and claims (5, 12, 19) of the 668 patent.
Claims 5, 11 and 17 of the instant application correspond to claims (9, 18) of the 702 application, and claims (7, 14) of the 668 patent.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2, 4, 6, 7-8, 10, 12-14, 16, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Reisswig et al. (US Patent Application Publication No.: US 20200258498 A1) hereinafter as Reisswig, in view of Yin et al. (US Patent Application Publication No: US 20220180202 A1) hereinafter as Yin.
		
Regarding claim 1, Reisswig discloses: A method for training a language model, comprising: pre-training the language model using preset text language materials in a corpus; ([0026] The computing device 102 executes a target model 104 and a language model 106. The language model 106 may be a convolutional autoencoder language model 106 and may be pre-trained to a denoising task utilizing an unlabeled corpus including masked training samples. Further details of an example denoising task for training the language model 106 are provided herein with respect to FIGS. 3 and 4.)
replacing at least one word in a sample text language material with a word mask respectively to obtain a sample text language material comprising at least one word mask; ([0023] In some examples, a convolutional autoencoder language model is trained using a denoising task. A training corpus includes training samples, where each training sample includes an ordered set of strings. One or more strings from the ordered set of strings can be masked by replacing the original characters of the string with randomly selected characters. Each character in the selected string may be replaced with a randomly selected character. In examples in which the models are arranged at a word level, the selecting string for scrambling may be a word. Also, in examples in which the models are arranged at a character level, each character in a selected word can be replaced with a randomly-selected character as described herein.) 
inputting the sample text language material comprising the at least one word mask into the language model, ([0024] The resulting string can be referred to as a masked string or scrambled string. The convolutional autoencoder language model can be trained to reconstruct at least one masked string from the training samples based on the other, unmasked strings in the training samples. In this way, the convolutional autoencoder language model can be trained with unlabeled data in an unsupervised manner. Because of the masked strings, the model may be trained to infer from the context of neighboring strings how to reconstruct or fill in the masked strings. As a result, the feature vectors generated by the layers of the model may reflect the context of the input (e.g., forward and reverse context).)
and training the language model based on the word vector corresponding to each word mask until a preset training completion condition is met. ([0046] At operation 412, the training service 304 determines if more training is to be performed. Additional training may be performed, for example, if the language model 106 has an error rate that is greater than a threshold. If there is no additional training to be performed, the process may be completed at operation 414. If more training is to be performed, the training service accesses a next language model training sample at operation 402 and proceeds to utilize the next language model training sample as described herein.)
Reisswig does not explicitly disclose, but Yin discloses: and outputting a context vector of each of the at least one word mask via the language model; ([0230] Further, in a data enhancement method based on a pre-trained language model proposed in this disclosure, context information can be well encoded by using the data enhancement method in this embodiment of this disclosure, and appropriate replacement text can be generated by using the context information, so that the replacement text has smoother syntax and higher quality.)
determining a word vector corresponding to each word mask based on the context vector of the word mask and a word vector parameter matrix; ([0252] Specifically, the pre-trained language model BERT is used to predict all [Mask] symbols each time, then probabilities of candidate words corresponding to all [Mask] symbols are uniformly compared, and a candidate word with a largest probability is selected to replace a [Mask] symbol. This is repeated for m times until all [Mask] symbols are replaced. [0253] Step 704: Output the replacement text.)
Reisswig and Yin are considered analogous art because they are both in the related art of natural language processing/text processing model training. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Reisswig with the teachings of Yin, to incorporate the above-mentioned claim limitations, so that the replacement text has smoother syntax and higher quality (Yin, [0230]).
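For illustration only, the masking step recited in claim 1 (replacing at least one word in a sample text with a word mask) can be sketched as follows. This is a minimal sketch and is not taken from Reisswig, Yin, or the instant application: the function name mask_words, the "[MASK]" token string, the random choice of positions, and the sample sentence are all assumptions made for the example.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed placeholder string for a word mask


def mask_words(words, num_masks, rng):
    """Replace num_masks randomly chosen words with MASK_TOKEN.

    Returns the masked word list and a dict mapping each masked
    position to the original word, which a training loop could use
    as the prediction target for that position.
    """
    if not 0 < num_masks <= len(words):
        raise ValueError("num_masks must be between 1 and len(words)")
    # Choosing positions at random is an assumption of this sketch.
    positions = sorted(rng.sample(range(len(words)), num_masks))
    masked = list(words)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = MASK_TOKEN
    return masked, targets


# Hypothetical sample text, already split into words.
sample = "the model is trained on a large corpus".split()
masked_sample, targets = mask_words(sample, 2, random.Random(0))
print(masked_sample)
print(targets)
```

The returned targets dictionary keeps the original words so that the output of a model at each masked position can later be compared with the word that was removed.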

Regarding claim 2, Reisswig in view of Yin discloses: The method according to claim 1,  Reisswig additionally discloses:  wherein the replacing at least one word in a sample text language material with a word mask respectively comprises: performing word segmentation on the sample text language material, and replacing each of the at least one word in the sample text language material with one word mask based on the word segmentation result. ([0026] The computing device 102 executes a target model 104 and a language model 106. The language model 106 may be a convolutional autoencoder language model 106 and may be pre-trained to a denoising task utilizing an unlabeled corpus including masked training samples. Further details of an example denoising task for training the language model 106 are provided herein with respect to FIGS. 3 and 4. [segmentation here means that only some of the words in the sentences are masked])

Regarding claim 4, Reisswig in view of Yin discloses: The method according to claim 1,  Reisswig additionally discloses: wherein the word vector parameter matrix is configured as a pre-trained word vector parameter matrix or an initialized word vector parameter matrix; ([0028] Each node 120A, 120B, 120N, 122A, 122B, 122N may apply a feature filter to its node input. The feature filter includes a vector and/or matrix that is convolved with the respective node input to generate a respective node output. [where the word vector parameter matrix corresponds to the language models])
Yin also additionally discloses: the training the language model based on the word vector corresponding to each word mask until a preset training completion condition is met comprises: training the language model and the initialized word vector parameter matrix based on the word vector corresponding to the word mask until the preset training completion condition is met. ([0128] In a process of training a deep neural network, because it is expected that output of the deep neural network is as close as possible to a value that is actually expected to be predicted, a current predicted value of the network may be compared with a target value that is actually expected, and then a weight vector at each layer of the neural network is updated based on a difference between the current predicted value and the target value (there is usually an initialization process before the first update, namely, a parameter is preconfigured for each layer of the deep neural network). For example, if the predicted value of the network is large, the weight vector is adjusted to lower the predicted value until the deep neural network can predict the target value that is actually expected or a value close to the target value that is actually expected. Therefore, “how to obtain, through comparison, the difference between the predicted value and the target value” needs to be predefined. This is the loss function or an objective function. The loss function and the objective function are important equations used to measure the difference between the predicted value and the target value. The loss function is used as an example. A higher output value (loss) of the loss function indicates a larger difference. Therefore, training of the deep neural network is a process of minimizing the loss as much as possible.)

Regarding claim 6, Reisswig in view of Yin discloses: The method according to claim 1,  Reisswig additionally discloses: further comprising: after the preset training completion condition is met, performing a natural language processing task with the trained language model to obtain a processing result; ([0005] FIG. 2 is a flowchart showing one example of a process flow that can be executed by the computing device of FIG. 1 to execute a natural language processing task.  Also see [0031] and [0036-0039] for details on the natural language processing task using the trained language models.)
Yin also additionally discloses: and finely tuning parameter values in the language model according to the difference between the processing result and annotated result information. ([0128] In a process of training a deep neural network, because it is expected that output of the deep neural network is as close as possible to a value that is actually expected to be predicted, a current predicted value of the network may be compared with a target value that is actually expected, and then a weight vector at each layer of the neural network is updated based on a difference between the current predicted value and the target value (there is usually an initialization process before the first update, namely, a parameter is preconfigured for each layer of the deep neural network). For example, if the predicted value of the network is large, the weight vector is adjusted to lower the predicted value until the deep neural network can predict the target value that is actually expected or a value close to the target value that is actually expected. Therefore, “how to obtain, through comparison, the difference between the predicted value and the target value” needs to be predefined. This is the loss function or an objective function. The loss function and the objective function are important equations used to measure the difference between the predicted value and the target value. The loss function is used as an example. A higher output value (loss) of the loss function indicates a larger difference. Therefore, training of the deep neural network is a process of minimizing the loss as much as possible. [0137] For another subsequent task, feature extraction or task fine-tuning may be performed based on the model to fulfill a specific task.  Also see [0138], which also discusses fine-tuning of the language model to improve results.)
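For illustration only, the loss-driven update described in the quoted passage of Yin ([0128]), in which a weight is adjusted based on the difference between a predicted value and a target value until the loss is small, can be sketched as follows. This is a minimal sketch under stated assumptions and is not Yin's implementation: the single scalar weight, the mean squared error loss, plain gradient descent, the loss threshold used as the completion condition, and all names and data are hypothetical.

```python
def train_until_converged(weight, inputs, targets,
                          learning_rate=0.1,
                          loss_threshold=1e-6,
                          max_steps=1000):
    """Fit prediction = weight * x by gradient descent on mean squared error.

    Training stops when the loss falls below loss_threshold (the assumed
    completion condition) or after max_steps updates.
    """
    loss = float("inf")
    for _ in range(max_steps):
        predictions = [weight * x for x in inputs]
        # Difference between the current predicted value and the target value.
        errors = [p - t for p, t in zip(predictions, targets)]
        loss = sum(e * e for e in errors) / len(errors)
        if loss < loss_threshold:
            break
        # Gradient of the mean squared error with respect to the weight.
        gradient = 2 * sum(e * x for e, x in zip(errors, inputs)) / len(errors)
        # A prediction that is too large gives a positive gradient here,
        # so the weight is lowered, and vice versa.
        weight -= learning_rate * gradient
    return weight, loss


# Hypothetical annotated data generated by target = 2 * x.
final_weight, final_loss = train_until_converged(0.0, [1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(final_weight, final_loss)
```

Fine-tuning in this sense would start from an already trained weight rather than 0.0; the update rule itself is unchanged.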

Regarding claim 7, Reisswig discloses: An electronic device, comprising: at least one processor; and a memory connected with the at least one processor communicatively; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform a method for training a language model, which comprises: ([0095] The example computer system 800 includes a processor 802 (e.g., a central processing unit (CPU), a graphics processing unit (GPU), or both), a main memory 804, and a static memory 806, which communicate with each other via a bus 808.  [0053] Example 1 is a computerized system for analyzing text, the computerized system comprising: at least one programmable processor; and a machine-readable medium having instructions stored thereon that, when executed by the at least one programmable processor, cause the at least one programmable processor to execute operations comprising:)
The remaining limitations of the claim recite the same elements as claim 1; thus, the rationale applied in the rejection of claim 1 also applies.
	
Regarding claim 13, Reisswig discloses: A non-transitory computer-readable storage medium with computer instructions stored thereon, wherein the computer instructions are used for causing a computer to perform a method for training a language model, which comprises: pre-training the language model using preset text language materials in a corpus: ([0069] Example 17 is a non-transitory machine-readable medium comprising instructions stored thereon that, when executed by at least one processor, cause the at least one processor to execute operations comprising: training an autoencoder language model using a plurality of language model training samples, ...)
The remaining limitations of the claim recite the same elements as claim 1; thus, the rationale applied in the rejection of claim 1 also applies.

Regarding claims 8 and 14, although different in scope from claim 2 and each other, they recite elements of the method of claim 2 as an electronic device and non-transitory computer-readable medium.  Thus, the analysis in rejecting claim 2 is equally applicable to claims 8 and 14.

Regarding claims 10 and 16, although different in scope from claim 4 and each other, they recite elements of the method of claim 4 as an electronic device and non-transitory computer-readable medium.  Thus, the analysis in rejecting claim 4 is equally applicable to claims 10 and 16.

Regarding claims 12 and 18, although different in scope from claim 6 and each other, they recite elements of the method of claim 6 as an electronic device and non-transitory computer-readable medium.  Thus, the analysis in rejecting claim 6 is equally applicable to claims 12 and 18.


Claims 3, 9 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Reisswig, in view of Yin, and further in view of Yang (CN 110399454 A) with reference to the English machine translation provided, hereinafter as Yang.

Regarding claim 3, Reisswig in view of Yin discloses: The method according to claim 1,  Yin additionally discloses: wherein the determining a word vector corresponding to each word mask based on the context vector of the word mask and a word vector parameter matrix comprises: multiplying the context vector of each word mask by the word vector parameter matrix to obtain probability values of the word mask corresponding to a plurality of word vectors; (In para [0166-0167], it is suggested that these vectors and matrices can be multiplied in the vector calculation unit.)
Reisswig in view of Yin does not explicitly, but Yang discloses: normalizing the probability values of the word mask corresponding to the word vectors, so as to obtain a plurality of normalized probability values of the word mask corresponding to the word vectors; ([pg. 5, last para] between semantic prediction result of the word vector obtained by calculation and the real term of the loss function, the loss function can be in accordance with the set probability and labeling to generate a negative log-likelihood of loss, thereby firstly, based on semantic forecasting result of the word vector after normalization, constructing each masked word vector wi of the probability vector, dimension of each probability vector is vocabulary in the total semantic concept, one word in each one-dimensional representative vocabulary set, taking the value of the vector element masked word vector wi is the probability of the words;)
and determining the word vector corresponding to the word mask based on the normalized probability values of the word mask corresponding to the word vectors. ([pg. 5 last para to pg. 6, first para] then, according to the probability and the annotations to generate a negative log-likelihood loss, for updating the encoding layer parameters, so that text to obtain better semantic representation, the problem therefore, calculating loss in the most accurate reference system, solves the problem that the words more difficult to study. In the figure, based on the logistic regression best multi-reference semantic representation best logit i is normalized and loss function calculating to obtain each masked acceptation probability vector probs i of the word vector wi.)
Reisswig, Yin and Yang are considered analogous art because they are all in the related art of natural language processing/text processing model training. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Reisswig, in view of Yin, with the teachings of Yang, to incorporate the above-mentioned claim limitations, so as to better provide the semantics of the text representation (Yang, summary of the invention).
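For illustration only, the three steps recited in claim 3 (multiplying the context vector by the word vector parameter matrix, normalizing the resulting values, and determining the word vector from the normalized values) can be sketched as follows. This is a minimal sketch under stated assumptions and is not drawn from the cited references or the instant application: softmax is assumed as the normalization, the highest normalized probability is assumed as the selection rule, the matrix is assumed to hold one row per vocabulary word, and the vocabulary, matrix values, and function names are hypothetical.

```python
import math


def softmax(scores):
    """Normalize raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_masked_word(context_vector, word_matrix, vocabulary):
    """Pick the word vector for one word mask.

    word_matrix has one row (word vector) per vocabulary entry.
    """
    # Step 1: multiply the context vector by the word vector parameter matrix.
    scores = [sum(c * w for c, w in zip(context_vector, row))
              for row in word_matrix]
    # Step 2: normalize the scores (softmax is an assumption of this sketch).
    probs = softmax(scores)
    # Step 3: select the word vector with the highest normalized probability.
    best = max(range(len(probs)), key=probs.__getitem__)
    return vocabulary[best], word_matrix[best], probs


# Hypothetical three-word vocabulary with two-dimensional word vectors.
vocabulary = ["cat", "dog", "car"]
word_matrix = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
context_vector = [0.2, 0.9]

word, vector, probs = predict_masked_word(context_vector, word_matrix, vocabulary)
print(word, vector, probs)
```

With these hypothetical values the raw scores are 0.2, 0.9 and -1.1, so the second row of the matrix is selected.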

Regarding claims 9 and 15, although different in scope from claim 3 and each other, they recite elements of the method of claim 3 as an electronic device and non-transitory computer-readable medium.  Thus, the analysis in rejecting claim 3 is equally applicable to claims 9 and 15.

Claims 5, 11 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Reisswig, in view of Yin, and further in view of applicant supplied reference, Sun et al. (Zhang, Z., Han, X., Liu, Z., Jiang, X., Sun, M., & Liu, Q. (2019). ERNIE: Enhanced language representation with informative entities. arXiv preprint arXiv:1905.07129.) hereinafter as Sun.
Regarding claim 5, Reisswig in view of Yin discloses: The method according to claim 1, Reisswig in view of Yin does not explicitly, but Sun discloses: wherein the language model comprises an enhanced representation from knowledge Integration (ERNIE) model. ([sect I, Introduction] In this paper, we propose a model called ERNIE (enhanced representation through knowledge integration) by using knowledge masking strategies.)
Reisswig, Yin and Sun are considered analogous art because they are all in the related art of natural language processing/text processing model training. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Reisswig, in view of Yin, with the teachings of Sun, to incorporate the above-mentioned claim limitations, because the method outperforms BERT over all of these tasks.  Also, both the knowledge integration and pre-training on heterogeneous data enable the model to obtain better language representation (Sun, Conclusion).

Regarding claims 11 and 17, although different in scope from claim 5 and each other, they recite elements of the method of claim 5 as an electronic device and non-transitory computer-readable medium.  Thus, the analysis in rejecting claim 5 is equally applicable to claims 11 and 17.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Shaib et al. (US Patent Application Publication No: US 20210183484 A1) hereinafter as Shaib.  Shaib discloses a method and system for predicting text using a language model.
Liu et al. (US Patent Application Publication No: US 20210142164 A1) hereinafter as Liu.  Liu discloses a method and system for processing masked text using a language model.
Sun et al. (Sun, H., Tan, X., Gan, J. W., Zhao, S., Han, D., Liu, H., ... & Liu, T. Y. (2019, December). Knowledge distillation from bert in pre-training and fine-tuning for polyphone disambiguation. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) (pp. 168-175). IEEE.) hereinafter as Sun.  Sun teaches a two stage knowledge distillation method using a light-weight BERT in pre-training and fine-tuning for polyphone disambiguation.
Cui et al. (Cui, Y., Che, W., Liu, T., Qin, B., Wang, S., & Hu, G. (2020). Revisiting pre-trained models for Chinese natural language processing. arXiv preprint arXiv:2004.13922.) hereinafter as Cui.  Cui teaches an effective model called MacBERT for performing NLP tasks, especially Chinese NLP tasks.
Diao et al. (Diao, S., Bai, J., Song, Y., Zhang, T., & Wang, Y. (2019). ZEN: Pre-training Chinese text encoder enhanced by n-gram representations. arXiv preprint arXiv:1911.00720.) hereinafter as Diao.  Diao teaches a technique of pre-training text encoders enhanced by N-gram model.
Luo et al. (Luo, R., Xu, J., Zhang, Y., Ren, X., & Sun, X. (2019). Pkuseg: A toolkit for multi-domain chinese word segmentation. arXiv preprint arXiv:1906.11455.) hereinafter as Luo.  Luo teaches a model for word segmentation named PKUSEG, which supports POS tagging and model training for multi-domain word segmentation.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Phillip H Lam whose telephone number is (571)272-1721. The examiner can normally be reached 10 AM-6 PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached on (571) 272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/PHILIP H LAM/Examiner, Art Unit 2656                                                             

	/HUYEN X VO/            Primary Examiner, Art Unit 2656