DETAILED ACTION
This Office Action is in response to the correspondence filed by the applicant on 6/8/2020.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority
Receipt is acknowledged of certified copies of papers submitted under 35 U.S.C. 119(a)-(d), which papers have been placed of record in the file.

Information Disclosure Statement
The Information Disclosure Statements (IDS) filed on 10/16/2020, 2/19/2021, 10/12/2021, and 6/10/2022 have been accepted and considered in this Office Action and are in compliance with the provisions of 37 CFR 1.97.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 2-5, 7-10, and 12-15 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor, or for pre-AIA the applicant, regards as the invention.
Claims 2, 7, and 12 recite the limitation, “wherein the label of the target sample”.  There is insufficient antecedent basis for this limitation in bold in the claims.

Claims 3, 8, and 13 recite the limitation, “replacing a target word of the sample in the sample set”.  There is insufficient antecedent basis for this limitation in bold in the claims.

Claims 4, 9, and 14 recite the limitation, “updating a target word of the sample in the sample set”.  There is insufficient antecedent basis for this limitation in bold in the claims.

Claims 5, 10, and 15 recite the limitation, “for the sample of the sample set”.  There is insufficient antecedent basis for this limitation in bold in the claims.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-15 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The independent claims 1, 6, and 11 recite, “acquiring a sample set, wherein samples in the sample set are unlabeled sentences; inputting a plurality of target samples in the sample set into a pre-trained first natural language processing model, respectively, to obtain prediction results output from the pre-trained first natural language processing model; determining the obtained prediction results as labels of the target samples in the plurality of target samples, respectively; and training a to-be-trained second natural language processing model, based on the plurality of target samples and the labels of the target samples to obtain a trained second natural language processing model, wherein parameters in the first natural language processing model are more than parameters in the second natural language processing model.”
These limitations, as drafted, are processes that, under their broadest reasonable interpretation, cover performance of the limitations in the mind but for the recitation of generic computer components. That is, other than reciting “at least one processor; and a memory …” in claims 6 and 11, nothing in the claim elements precludes the steps from practically being performed in the mind. For example, a person can read sentences, group them into classes, and teach others using the grouped sentences.
If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas.  Accordingly, the claim recites an abstract idea.
This judicial exception is not integrated into a practical application. In particular, the claims only recite additional elements – “at least one processor; and a memory …”. The additional elements are recited at a high level of generality (i.e., as a generic processor performing a generic computer function of the recited steps) such that they amount to no more than mere instructions to apply the exception using a generic computer component.  Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.  The claim is directed to an abstract idea.
The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element of using a processor to perform the recited steps amounts to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept.  The claim is not patent eligible.
Regarding the dependent claims, Claims 2, 7, and 12 recite a probability associated with a type; Claims 3, 8, and 13 recite replacing a target word with a specified identifier; Claims 4, 9, and 14 recite updating a target word with another word; Claims 5, 10, and 15 recite intercepting a segment of a target length.
Even though the disclosed invention is described in the specification as improving computer technology, the claims provide no meaningful limitations such that this improvement is realized. Therefore, the claims do not amount to significantly more than the abstract idea itself.
Accordingly, the limitations of the Claims, whether considered individually or as an ordered combination, are not sufficient to add significantly more to improve technological functionality. As such, Claims 1-15 are rejected under 35 U.S.C. 101 as being directed to non-statutory subject matter.

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale or otherwise available to the public before the effective filing date of the claimed invention.



Claims 1-15 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by TANG (Tang R, Lu Y, Liu L, Mou L, Vechtomova O, Lin J. Distilling task-specific knowledge from bert into simple neural networks. arXiv preprint arXiv:1903.12136. 2019 Mar 28.).

REGARDING CLAIM 1, TANG discloses a method for processing data, the method comprising: 
acquiring a sample set, wherein samples in the sample set are unlabeled sentences (TANG Pg. 1 2nd Col –“To facilitate effective knowledge transfer, however, we often require a large, unlabeled dataset. The teacher model provides the probability logits and estimated labels for these unannotated samples, and the student network learns from the teacher’s outputs.”; Pg. 3 1st Col Section 3.1 – “For the teacher network, we use the pretrained, fine-tuned BERT (Devlin et al., 2018) model, a deep, bidirectional transformer encoder that achieves state of the art on a variety of language understanding tasks. From an input sentence (pair), BERT computes a feature vector h ϵ Rd, upon which we build a classifier for the task.”; Pg. 4 1st Col – “When distilling with a labeled dataset, the one-hot target t is simply the ground-truth label. When distilling with an unlabeled dataset, we use the predicted label by the teacher, i.e., ti = 1 if i = argmax y(B) and 0 otherwise.”); 
inputting a plurality of target samples in the sample set into a pre-trained first natural language processing model, respectively, to obtain prediction results output from the pre-trained first natural language processing model (TANG Pg. 1 2nd Col –“To facilitate effective knowledge transfer, however, we often require a large, unlabeled dataset. The teacher model provides the probability logits and estimated labels for these unannotated samples, and the student network learns from the teacher’s outputs.”; Pg. 3 1st Col Section 3.1 – “For the teacher network, we use the pretrained, fine-tuned BERT (Devlin et al., 2018) model, a deep, bidirectional transformer encoder that achieves state of the art on a variety of language understanding tasks. From an input sentence (pair), BERT computes a feature vector h ϵ Rd, upon which we build a classifier for the task.”; Pg. 4 1st Col – “When distilling with a labeled dataset, the one-hot target t is simply the ground-truth label. When distilling with an unlabeled dataset, we use the predicted label by the teacher, i.e., ti = 1 if i = argmax y(B) and 0 otherwise.”); 
determining the obtained prediction results as labels of the target samples in the plurality of target samples, respectively (TANG Pg. 1 2nd Col –“To facilitate effective knowledge transfer, however, we often require a large, unlabeled dataset. The teacher model provides the probability logits and estimated labels for these unannotated samples, and the student network learns from the teacher’s outputs.”; Pg. 3 1st Col Section 3.1 – “For the teacher network, we use the pretrained, fine-tuned BERT (Devlin et al., 2018) model, a deep, bidirectional transformer encoder that achieves state of the art on a variety of language understanding tasks. From an input sentence (pair), BERT computes a feature vector h ϵ Rd, upon which we build a classifier for the task. For single-sentence classification, we directly build a softmax layer, i.e., the predicted probabilities are y(B) = softmax(Wh), where W ϵ Rk x d is the softmax weight matrix and k is the number of labels.”; Pg. 3 2nd Col – “The discrete probability output of a neural network is given by 
    [Equation image in the original; softmax output of TANG Section 3.2: y~_i = softmax(z) = exp(w_i^T h) / sum_j exp(w_j^T h)]
 where wi denotes the ith row of softmax weight W, and z is equivalent to wTh.  ”; Pg. 4 1st Col – “When distilling with a labeled dataset, the one-hot target t is simply the ground-truth label. When distilling with an unlabeled dataset, we use the predicted label by the teacher, i.e., ti = 1 if i = argmax y(B) and 0 otherwise.”; Note that the discrete probabilities are calculated for k classes, the predicted label for the teacher model is determined by ti = 1 if i = argmax y(B)  i.e., the class with the maximum probability, and the predicted label is denoted with “1” while other labels are denoted with “0”.); and 
training a to-be-trained second natural language processing model, based on the plurality of target samples and the labels of the target samples to obtain a trained second natural language processing model (TANG Pg. 1 2nd Col – “To facilitate effective knowledge transfer, however, we often require a large, unlabeled dataset. The teacher model provides the probability logits and estimated labels for these unannotated samples, and the student network learns from the teacher’s outputs.”; Pgs. 3-4 Section 3.2 – “The distillation approach accomplishes knowledge transfer at the output level; that is, the student network learns to mimic a teacher network’s behavior given any data point.  …  Training on logits makes learning easier for the student model since the relationship learned by the teacher model across all of the targets are equally emphasized (Ba and Caruana, 2014).”), wherein parameters in the first natural language processing model are more than parameters in the second natural language processing model (TANG Pg. 1 2nd Col – “In this paper, we propose a simple yet effective approach that transfers task-specific knowledge from BERT to a shallow neural architecture—in particular, a bidirectional long short-term memory network (BiLSTM). Our motivation is twofold: we question whether a simple architecture actually lacks representation power for text modeling, and we wish to study effective approaches to transfer knowledge from BERT to a BiLSTM. Concretely, we leverage the knowledge distillation approach (Ba and Caruana, 2014; Hinton et al., 2015), where a larger model serves as a teacher and a small model learns to mimic the teacher as a student. This approach is model agnostic, making knowledge transfer possible between BERT and a different neural architecture, such as a single-layer BiLSTM, in our case.”; Pgs. 5-6 Section 4 – “For BERT, we use the large variant BERTLARGE (described below) as the teacher network, starting with the pretrained weights and following the original, task-specific fine-tuning procedure (Devlin et al., 2018). …. For our models, we feed the original dataset together with the synthesized examples to the task-specific, fine-tuned BERT model to obtain the predicted logits. We denote our distilled BiLSTM trained on soft logit targets as BiLSTMSOFT, which corresponds to choosing α = 0 in Section 3.2. Preliminary experiments suggest that using only the distillation objective works best.”; Pg. 6 Section 5.2 – “As shown in Table 2, our single-sentence model uses 98 and 349 times fewer parameters than ELMo and BERTLARGE, respectively, and is 15 and 434 times faster”).
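For illustration only, the labeling step quoted above (a softmax over the teacher's logits, followed by ti = 1 if i = argmax y(B) and 0 otherwise) can be sketched in Python as follows. The function names and the example logits are hypothetical; they are not taken from TANG or from the application, and the sketch does not train any model.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert raw scores (logits) into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot_pseudo_label(probs: List[float]) -> List[int]:
    """Return t with t[i] = 1 for the most probable class and 0 otherwise."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return [1 if i == best else 0 for i in range(len(probs))]


# Hypothetical teacher logits for one unlabeled sentence with k = 3 classes.
teacher_logits = [0.2, 2.5, -1.0]
probs = softmax(teacher_logits)      # soft label (class probabilities)
label = one_hot_pseudo_label(probs)  # hard label used when the data is unlabeled
```

In this sketch, probs corresponds to the teacher's predicted probabilities and label corresponds to the one-hot predicted label described in the quoted passages.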

REGARDING CLAIM 2, TANG discloses the method according to claim 1, wherein the label of the target sample is used to indicate a probability that the target sample belongs to any one of at least two types (TANG Pg. 1 2nd Col –“To facilitate effective knowledge transfer, however, we often require a large, unlabeled dataset. The teacher model provides the probability logits and estimated labels for these unannotated samples, and the student network learns from the teacher’s outputs.”; Pg. 3 1st Col Section 3.1 – “From an input sentence (pair), BERT computes a feature vector h ϵ Rd, upon which we build a classifier for the task. For single-sentence classification, we directly build a softmax layer, i.e., the predicted probabilities are y(B) = softmax(Wh), where W ϵ Rk x d is the softmax weight matrix and k is the number of labels.”;  Pgs. 3-4 Section 3.2 – “In particular, Ba and Caruana (2014) posit that, in addition to a one-hot predicted label, the teacher’s predicted probability is also important. In binary sentiment classification, for example, some sentences have a strong sentiment polarity, whereas others appear neutral. If we use only the teacher’s predicted one-hot label to train the student, we may lose valuable information about the prediction uncertainty.  The discrete probability output of a neural network is given by 
    [Equation image in the original; softmax output of TANG Section 3.2: y~_i = softmax(z) = exp(w_i^T h) / sum_j exp(w_j^T h)]
 where wi denotes the ith row of softmax weight W, and z is equivalent to wTh. The argument of the softmax function is known as logits. Training on logits makes learning easier for the student model since the relationship learned by the teacher model across all of the targets are equally emphasized (Ba and Caruana, 2014).  … When distilling with a labeled dataset, the one-hot target t is simply the ground-truth label. When distilling with an unlabeled dataset, we use the predicted label by the teacher, i.e., ti = 1 if i = argmax y(B) and 0 otherwise.”).

REGARDING CLAIM 3, TANG discloses the method according to claim 1, wherein the method further comprises: 
replacing a target word of the sample in the sample set with a specified identifier (TANG Pg. 4 Section 3.3 – “In our work, we propose a set of heuristics for task-agnostic data augmentation: we use the original sentences in the small dataset as blueprints, and then modify them with our heuristics, a process analogous to image distortion. Specifically, we randomly perform the following operations. Masking. With probability pmask, we randomly replace a word with [MASK], which corresponds to an unknown token in our models and the masked word token in BERT. Intuitively, this rule helps to clarify the contribution of each word toward the label, e.g., the teacher network produces less confident logits for “I [MASK] the comedy” than for “I loved the comedy.””), wherein, in the sample containing the specified identifier, a number of target word accounts for a target ratio or a target number of a number of words in the sample (TANG Pg. 4 Section 3.3 – “Our data augmentation procedure is as follows: given a training example {w1, …  wn}, we iterate over the words, drawing from the uniform distribution Xi ~ UNIFORM [0, 1] for each wi. If Xi < pmask, we apply masking to wi.”; Note that masking a target word wi is based on a uniform distribution. Thus, the ratio of masked words is equivalent to the pmask. In other words, if the pmask is set to 0.1, 10% of the words will be masked.); and 
adding the sample containing the specified identifier as a new sample of the sample set (TANG Pg.4 2nd Col – “After iterating through the words, with probability png, we apply n-gram sampling to this entire synthetic example. The final synthetic example is appended to the augmented, unlabeled dataset.”).
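
For illustration only, the masking operation quoted above can be sketched in Python as follows. The function name mask_words and the seeded random generator are assumptions made for this sketch and do not appear in TANG. Because each word is masked independently with probability pmask, the fraction of masked words equals pmask on average and may differ for an individual sentence.

```python
import random
from typing import List, Optional


def mask_words(words: List[str], p_mask: float,
               rng: Optional[random.Random] = None) -> List[str]:
    """Replace each word with "[MASK]" independently with probability p_mask."""
    rng = rng or random.Random()
    augmented = []
    for word in words:
        x = rng.random()  # X_i drawn uniformly from [0, 1)
        augmented.append("[MASK]" if x < p_mask else word)
    return augmented


sentence = "I loved the comedy".split()
# A seeded generator is used only so that the sketch is repeatable.
synthetic = mask_words(sentence, p_mask=0.1, rng=random.Random(0))
# The synthetic example would then be appended to the unlabeled sample set.
sample_set = [sentence, synthetic]
```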

REGARDING CLAIM 4, TANG discloses the method according to claim 1, wherein the method further comprises: 
updating a target word of the sample in the sample set to another word with a same part of speech (TANG Pg. 4 Section 3.3 – “In our work, we propose a set of heuristics for task-agnostic data augmentation: we use the original sentences in the small dataset as blueprints, and then modify them with our heuristics, a process analogous to image distortion. Specifically, we randomly perform the following operations.  … POS-guided word replacement. With probability ppos, we replace a word with another of the same POS tag. To preserve the original training distribution, the new word is sampled from the unigram word distribution re-normalized by the part-of-speech (POS) tag. This rule perturbs the semantics of each example, e.g., “What do pigs eat?” is different from “How do pigs eat?””), wherein, in the updated sample, a number of target word accounts for a target ratio or a target number of a number of words in the sample (TANG Pg. 4 Section 3.3 – “Our data augmentation procedure is as follows: given a training example {w1, …  wn}, we iterate over the words, drawing from the uniform distribution Xi ~ UNIFORM [0, 1] for each wi. If Xi < pmask, we apply masking to wi. If pmask ≤ Xi < pmask + ppos, we apply POS-guided word replacement. We treat masking and POS-guided swapping as mutually exclusive: once one rule is applied, the other is disregarded.”; Note that POS-guided word replacement of a target word wi is based on a uniform distribution. Thus, the ratio of the replacement is equivalent to the width of the interval pmask ≤ Xi < pmask + ppos, i.e., (pmask + ppos) - pmask = ppos. In other words, if the pmask is set to 0.1 and ppos is set to 0.3, then on average 30% of the words (i.e., those with 0.1 ≤ Xi < 0.4) will be replaced by a POS-guided word.); and 
adding the updated sample as a new sample of the sample set (TANG Pg.4 2nd Col – “After iterating through the words, with probability png, we apply n-gram sampling to this entire synthetic example. The final synthetic example is appended to the augmented, unlabeled dataset.”).
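
For illustration only, the mutually exclusive rule quoted above (masking when Xi < pmask, POS-guided replacement when pmask ≤ Xi < pmask + ppos) partitions the unit interval into three parts whose widths give the average share of words receiving each operation. The following Python sketch uses hypothetical function names that do not appear in TANG, and it only selects the operation; it does not perform the part-of-speech lookup.

```python
from typing import Dict


def choose_operation(x: float, p_mask: float, p_pos: float) -> str:
    """Map a uniform draw x in [0, 1) to the operation applied to one word."""
    if x < p_mask:
        return "mask"
    if x < p_mask + p_pos:
        return "pos_replace"
    return "keep"


def expected_fractions(p_mask: float, p_pos: float) -> Dict[str, float]:
    """Average share of words per operation, i.e., the width of each interval."""
    return {
        "mask": p_mask,                  # width of [0, p_mask)
        "pos_replace": p_pos,            # width of [p_mask, p_mask + p_pos)
        "keep": 1.0 - p_mask - p_pos,    # remainder of the unit interval
    }


# Example values used in the discussion above.
fractions = expected_fractions(p_mask=0.1, p_pos=0.3)
```

With pmask = 0.1 and ppos = 0.3, a draw of 0.25 falls in the POS-guided replacement interval, and the average share of replaced words is ppos.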

REGARDING CLAIM 5, TANG discloses the method according to claim 1, wherein the method further comprises: 
for the sample of the sample set, intercepting a segment with a target length (TANG Pg. 4 Section 3.3 – “In our work, we propose a set of heuristics for task-agnostic data augmentation: we use the original sentences in the small dataset as blueprints, and then modify them with our heuristics, a process analogous to image distortion. Specifically, we randomly perform the following operations.  … n-gram sampling. With probability png, we randomly sample an n-gram from the example, where n is randomly selected from {1, 2, …, 5}. This rule is conceptually equivalent to dropping out all other words in the example, which is a more aggressive form of masking.”); and 
adding the intercepted segment as a new sample of the sample set  (TANG Pg.4 2nd Col – “After iterating through the words, with probability png, we apply n-gram sampling to this entire synthetic example. The final synthetic example is appended to the augmented, unlabeled dataset.”).
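
For illustration only, the n-gram sampling operation quoted above can be sketched in Python as follows. The function name sample_ngram is hypothetical, and capping n at the sentence length for sentences shorter than five words is an assumption of this sketch that is not stated in the quoted passage.

```python
import random
from typing import List, Optional


def sample_ngram(words: List[str], rng: Optional[random.Random] = None,
                 max_n: int = 5) -> List[str]:
    """Return a contiguous segment of n words, with n chosen from 1..max_n."""
    if not words:
        raise ValueError("cannot sample an n-gram from an empty sentence")
    rng = rng or random.Random()
    # Assumption: n is capped at the sentence length for short sentences.
    n = rng.randint(1, min(max_n, len(words)))
    start = rng.randint(0, len(words) - n)  # randint includes both endpoints
    return words[start:start + n]


sentence = "the quick brown fox jumps over the lazy dog".split()
segment = sample_ngram(sentence, rng=random.Random(1))
# The intercepted segment would then be added to the sample set as a new sample.
sample_set = [sentence, segment]
```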

REGARDING CLAIM 6, TANG discloses an apparatus for processing data, the apparatus comprising: at least one processor; and a memory storing instructions, wherein the instructions when executed by the at least one processor, cause the at least one processor to perform operations (TANG Pg. 6 Section 5.2 – “For our inference speed and parameter analysis, we use the open-source PyTorch implementations for BERT2 and ELMo (Gardner et al., 2017). On a single NVIDIA V100 GPU, we perform model inference with a batch size of 512 on all 67350 sentences of the SST-2 training set.”), the operations comprising: performing the steps of Claim 1; thus, it is rejected under the same rationale.

Claim 7 is similar to the method of Claim 2; thus, it is rejected under the same rationale.

Claim 8 is similar to the method of Claim 3; thus, it is rejected under the same rationale.

Claim 9 is similar to the method of Claim 4; thus, it is rejected under the same rationale.

Claim 10 is similar to the method of Claim 5; thus, it is rejected under the same rationale.

REGARDING CLAIM 11, TANG discloses a non-transitory computer readable storage medium, storing a computer program thereon, the program, when executed by a processor, causes the processor to perform operations (TANG Pg. 6 Section 5.2 – “For our inference speed and parameter analysis, we use the open-source PyTorch implementations for BERT2 and ELMo (Gardner et al., 2017). On a single NVIDIA V100 GPU, we perform model inference with a batch size of 512 on all 67350 sentences of the SST-2 training set.”), the operations comprising: performing the steps of Claim 1; thus, it is rejected under the same rationale.

Claim 12 is similar to the method of Claim 2; thus, it is rejected under the same rationale.

Claim 13 is similar to the method of Claim 3; thus, it is rejected under the same rationale.

Claim 14 is similar to the method of Claim 4; thus, it is rejected under the same rationale.

Claim 15 is similar to the method of Claim 5; thus, it is rejected under the same rationale.


Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
LIU (Liu X, He P, Chen W, Gao J. Improving multi-task deep neural networks via knowledge distillation for natural language understanding. arXiv preprint arXiv:1904.09482. 2019 Apr 20.) also discloses a knowledge distillation algorithm for transferring the knowledge from a large model to a lighter model (Figs. 1-2) for natural language processing.

LAI (US 2021/0182662 A1) also discloses a method for knowledge distillation for natural language processing, wherein a large model (teacher)’s output is used to train a smaller model (student), and LAI further discloses masking training data.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to JONATHAN C KIM whose telephone number is (571)272-3327. The examiner can normally be reached Monday to Friday 8:00 AM thru 4:00 PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew C Flanders, can be reached at 571-272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/JONATHAN C KIM/Primary Examiner, Art Unit 2655