Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
Invitation to Participate in DSMER Pilot Program
The present application satisfies the criteria for participation set forth in the Federal Register Notice entitled “Deferred Subject Matter Eligibility Response (DSMER) Pilot Program.” Therefore, the examiner invites applicant to participate in the DSMER pilot program. 
An applicant who accepts the invitation to participate in this pilot program must still file a reply to every Office action mailed in this application, but may defer presenting arguments or amendments in response to subject matter eligibility (SME) rejection(s) until the earlier of final disposition of the application, or the withdrawal or obviation of all other outstanding non-SME rejections. A final disposition for purposes of this pilot program occurs upon the earliest of: mailing of a notice of allowance; mailing of a final Office action; filing of a notice of appeal; filing of a request for continued examination; or abandonment of the application. Other than applicant’s ability to defer responding to SME rejections, participation in the DSMER pilot program does not alter the normal examination process (e.g., as outlined in MPEP 700), and applicant must still respond to all non-SME rejections when replying to Office actions. 
Further information about the pilot program, including an explanation of the criteria for receiving an invitation, and the conditions of participation, is provided in the Federal Register Notice announcing the program, which is available on the pilot program website https://www.uspto.gov/patents/initiatives/patent-application-initiatives/deferred-subject-matter-eligibility-response.
Applicant has two choices with respect to this invitation:
(1) Applicant may elect to participate in the DSMER pilot program. To effect this choice, applicant MUST accept this invitation by filing a completed request form PTO/SB/456 with a timely response to this Office action. The DSMER Pilot request form must be signed in accordance with 37 CFR § 1.33(b) by a person having authority to prosecute the application, and must be submitted via the USPTO’s patent electronic filing systems (EFS-Web or Patent Center). The form is available on the pilot program website https://www.uspto.gov/patents/initiatives/patent-application-initiatives/deferred-subject-matter-eligibility-response. If the form is properly completed and timely received, the application will be entered into the pilot program.
(2) Applicant may decline to participate in the pilot program. No action is required from applicant to effect this choice, because if applicant does not timely file a properly completed form PTO/SB/456, the application will not be entered into the pilot program.
Claim Rejections - 35 USC § 101
35 U.S.C. §101 reads as follows: 
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-20 are rejected under 35 USC 101 as being directed to non-statutory subject matter.
Claim 1 recites a data processing system (i.e., a machine). Claim 8 recites a method (i.e., a process). Claim 15 recites a non-transitory computer-readable medium (i.e., a manufacture).
To distinguish ineligible claims that merely recite a judicial exception from eligible claims that require an implementation of a judicial exception, the Supreme Court uses a two-step framework: Step One (Step 2A), determine whether the claims at issue are directed to one of those patent-ineligible concepts; and Step Two (Step 2B), if so, ask “what else is there in the claims?” to determine whether the additional elements transform the nature of the claim into a patent eligible application. Alice Corp. Pty. Ltd. v. CLS Bank Int’l., 134 S. Ct. 2347, 2355 (2014).
Step One (Step 2A) is a two prong test that requires the determination of whether the claims at issue are directed to an enumerated patent ineligible concept. See MPEP 2106.04. 
Step 2A Prong (1) requires identifying the specific limitations in the claim under examination (individually or in combination) that the examiner believes recite an abstract idea and determining whether the identified limitations fall within the enumerated subject matter groupings of abstract ideas. See MPEP 2106.04(a).
The enumerated patent-ineligible concepts comprise:
(a) Mathematical Concepts – mathematical relationships, mathematical formulas or equations, mathematical calculations;
(b) Certain methods of organizing human activity – fundamental economic principles / practices (including hedging, insurance, mitigating risk); commercial or legal interactions (including agreements in the form of contracts; legal obligations; advertising, marketing or sales activities or behaviors; business relations); managing personal behavior or relationships or interactions between people (including social activities, teaching, and following rules / instructions); and
(c) Mental processes – concepts performed in the human mind (including an observation, evaluation, judgment, opinion). See MPEP 2106.04(a).
If the claim recites an enumerated patent ineligible concept, then Prong (2) of Step One (Step 2A) requires the determination of whether the claim integrates the patent ineligible concept into a practical application. This requires identifying whether there are any additional elements recited in the claim beyond the judicial exceptions and evaluating those additional elements, individually and in combination, to determine whether they integrate the exception into a practical application, using one or more of the considerations laid out by the Supreme Court and the Federal Circuit. See MPEP 2106.04(d).
Under Step 2B, if the claim does not integrate the ineligible concept into a practical application and is therefore directed to a judicial exception, the examiner evaluates whether the claim provides an inventive concept by determining whether the additional elements, individually and in ordered combination, amount to significantly more than the exception itself. See MPEP 2106.04.
Step 2A Prong (1)
The “directed to” inquiry does not ask whether the claims involve a patent ineligible concept but, considered in light of the specification, whether the claim as a whole is directed to excluded subject matter or directed to an improvement to computer functionality. Enfish L.L.C. v. Microsoft Corp., 822 F.3d 1327, 1335 (Fed. Cir. 2016).
Therefore, Prong (1) of Step 2A requires identifying specific limitations in the claims that recite (“describe” or “set forth”) an abstract idea and determining whether the identified limitations fall within the enumerated subject matter groupings of abstract ideas. See MPEP 2106.04 (“Thus, it is sufficient for this analysis for the examiner to identify that the claimed concept (the specific claim limitation(s) that the examiner believes may recite an exception) aligns with at least one judicial exception”).
Under Prong (1), Claim 8 recites a method for training a text-to-content recommendation machine-learning (ML) model, the method comprising: 
(1) training a first ML model using a first training data set; 
utilizing the trained first ML model to infer information about the data contained in the first training data set; 
collecting the inferred information to generate a second training data set; and 
(2) utilizing the first training data set and the second training data set to train a second ML model, wherein the second ML model is a text-to-content recommendation ML model.
Claim 1 recites a data processing system comprising: a processor; and a memory in communication with the processor, the memory comprising executable instructions that, when executed by the processor, cause the data processing system to perform functions of claim 8. 
Claim 15 recites a non-transitory computer readable medium on which are stored instructions that, when executed, cause a programmable device to implement the method of claim 8. 
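For illustration only, the four steps recited in claim 8 can be sketched as the following data flow. The sketch is not drawn from the application or from any cited reference: the bag-of-words model, the function names (train, infer, recommend), and the two-example data set are hypothetical stand-ins chosen only to keep the example short and runnable.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Accumulate per-label word weights from (text, {label: weight}) pairs."""
    model = defaultdict(Counter)
    for text, label_weights in examples:
        for label, weight in label_weights.items():
            for word in text.split():
                model[label][word] += weight
    return model

def infer(model, text):
    """Return a probability distribution over labels (softmax of word scores)."""
    scores = {label: sum(math.log1p(counts[word]) for word in text.split())
              for label, counts in model.items()}
    top = max(scores.values())
    exps = {label: math.exp(score - top) for label, score in scores.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}

def recommend(model, text):
    """Pick the content label with the highest inferred probability."""
    probs = infer(model, text)
    return max(probs, key=probs.get)

# Hypothetical first training data set: text paired with a content label.
first_set = [
    ("happy birthday party", {"cake": 1.0}),
    ("rainy cold day", {"umbrella": 1.0}),
]

# Step 1: train a first model on the first training data set.
first_model = train(first_set)

# Steps 2 and 3: infer information about the first set and collect it
# as a second training data set (probabilistic "soft" labels).
second_set = [(text, infer(first_model, text)) for text, _ in first_set]

# Step 4: train the second (text-to-content recommendation) model on both sets.
second_model = train(first_set + second_set)

print(recommend(second_model, "birthday party"))
```

The second data set here carries a full probability distribution per example rather than a single label, which is one possible reading of "inferred information"; the claim language itself does not require that form.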
With regard to claims 1, 8, and 15, individually and considered in light of the specification, US 2021/0264106 A1 states at ¶38: “Once the teacher models 330 are trained and the advanced model 320 is finetuned, the output of the teacher models 330 and the advanced model 350 may be used along with the production data set 310 and the annotated data set 320 to train a light-weight student model 350. In one implementation, the student model 350 may be a shallow neural network model (e.g., having one or two layers). Thus, the student model is configured to distill knowledge from the deep neural network models and the pretrained NLP model to provide a shallow neural network student model at minimal loss of accuracy”.
Further, US 2021/0264106 A1 states at ¶39: “The resulting student model may be able to provide similar or better results than more complex models because of the process of teacher-student training. This is because, in deep neural networks, the normal training objective is to maximize the average log probability of the correct answers. However, during the training process small probabilities are assigned to incorrect answers. Even when these probabilities are very small, some of them are much larger than others. The relative probabilities of incorrect answers contain important information about the differences among incorrect answers. Accordingly, a set of accurate teacher models may be trained to learn the small probabilities among incorrect labels of a training data set. This learned information may then be used to train a student model that learns from both the original labels and the soft labels generated by the teacher models to improve its accuracy without adding significant model parameters”.
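The point quoted from ¶39, that small probabilities assigned to incorrect answers can still differ greatly from one another, can be shown with a short worked example. The logits below are hypothetical values, not taken from the application, and the softmax function is the standard one rather than anything the specification defines.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for three labels; label 0 is the correct answer.
teacher_logits = [9.0, 4.0, 1.0]

hard_label = [1.0, 0.0, 0.0]          # original label: no information about labels 1 and 2
soft_label = softmax(teacher_logits)  # teacher output: small but unequal probabilities

# Both incorrect labels receive small probabilities, yet label 1 is far
# more likely than label 2: the ratio equals exp(4.0 - 1.0).
ratio = soft_label[1] / soft_label[2]
print(soft_label, ratio)
```

The hard label treats both incorrect answers identically, while the soft label preserves their relative ordering, which is the additional information a student model could learn from.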
Accordingly, the first machine learning model and the second machine learning model correspond to mathematical relationship models that learn the mathematical relationship between log probabilities and natural language labels (whether correct or incorrect). This is akin to the device profile in Digitech, where employing mathematical algorithms to manipulate existing information to generate additional information was held not patent eligible. Digitech Image Techs., LLC v. Elecs. for Imaging, Inc., 758 F.3d 1344, 1351 (Fed. Cir. 2014).
In ordered combination, the steps of claims 1, 8, and 15 for training / generating the first machine learning model correspond to a mathematical relationship model that learns average log probabilities among incorrect labels of a training data set in order to thereafter generate or train the second machine learning model, which learns from both the original labels (maximized average log probabilities of the correct answer) and the soft labels generated by the teacher model (the small average log probabilities of the incorrect labels / answers). This corresponds to a process that employs mathematical algorithms to manipulate existing information (trained first machine learning model / log probabilities model) to generate additional information (second machine learning model / log probabilities model). Thus, claims 1, 8, and 15 describe patent ineligible subject matter enumerated under category (a) Mathematical Concepts – mathematical relationships, mathematical formulas or equations, mathematical calculations.
Step 2A Prong (2)
Under Prong (2) of Step 2A, the goal is to determine whether the claim is directed to the recited exception by evaluating whether the claim as a whole integrates the recited judicial exception into a practical application of the exception. See MPEP 2106.04II(A). 
In particular, evaluating integration into a practical application requires identifying whether there are any additional elements recited in the claim beyond the judicial exception and evaluating those additional elements, individually and in combination, to determine whether they integrate the exception into a practical application, using one or more of the considerations laid out by the Supreme Court and the Federal Circuit (“CAFC”). See MPEP 2106.04(d). 
The Supreme Court held that when a claim containing an abstract idea (e.g., mathematical formula) implements or applies that abstract idea (e.g., math formula) in a structure or process which, when considered as a whole, is performing a function which the patent laws were designed to protect (e.g., transforming or reducing an article to a different state or thing), then the claim satisfies the requirements of §101. Diamond v. Diehr, 450 U.S. 175, 192 (1981); MPEP 2106.04(d)I (“Implementing a judicial exception with, or using a judicial exception in conjunction with, a particular machine or manufacture that is integral to the claim, as discussed in MPEP 2106.05(b)”). See also Gottschalk v. Benson, 409 U.S. 63, 70 (1972) (“Transformation and reduction of an article "to a different state or thing" is the clue to the patentability of a process claim that does not include particular machines”).
In particular, the Supreme Court and the CAFC distinguished computer-functionality improvements from uses of existing computers as tools in aid of processes focused on abstract ideas. Electric Power Grp., L.L.C. v. Alstom SA, 830 F.3d 1350, 1354 (Fed. Cir. 2016) (“…we relied on the distinction made in Alice between, on one hand, computer-functionality improvement and, on the other, uses of existing computers as tools in aid of processes focused on “abstract ideas”…”).
In one example, the CAFC applied Alice inquiry to ask whether the focus of the claims is on the specific asserted improvement in computer capabilities (i.e., the self-referential table for a computer database) or instead, on a process that qualifies as an abstract idea for which computers are invoked merely as a tool. Enfish L.L.C. v. Microsoft Corp., 822 F.3d 1327, 1335-36 (Fed. Cir. 2016).  
In Enfish, the claims were specifically directed to a self-referential table for a computer database. Id. at 1337. In particular, the claim language required a four step algorithm specifically directed to a self-referential table for a computer database that improves upon prior art information search and retrieval systems by employing a flexible, self-referential table to store data. Id. at 1336-37. The CAFC determined that the plain focus of the claims was on an improvement to computer functionality itself (i.e., the self-referential table for a computer database), not on economic or other tasks for which a computer is used in its ordinary capacity. Id. at 1335-36.
Therefore, the focus of the claims is on a specific asserted improvement in computer capabilities (i.e., the self-referential table for a computer database), not on economic or other tasks for which a computer is used in its ordinary capacity. Id. at 1336. See also MPEP 2106.04(d)I (“an improvement in the functioning of a computer or an improvement to other technology or technical field, as discussed in MPEP 2106.04(d)(1) and 2106.05(a)”).
In another example, in Diehr, the claims involved a method for curing rubber by using the Arrhenius equation, constantly measuring the actual temperature inside a mold and feeding the temperature measurements into a computer to repeatedly recalculate the cure time to open the press. Diehr, 450 U.S. at 178-79. Since the Supreme Court viewed the claims not as an attempt to patent a mathematical formula, but as directed to an industrial process for molding of rubber products, the claims were statutory. Id. at 192-93.
The key here, as noted by the CAFC, is that the Supreme Court in Diehr looked to how the claims "used that equation in a process designed to solve a technological problem in `conventional industry practice.'" McRO, Inc. v. Bandai Namco Games America, Inc., 837 F.3d 1299, 1312 (Fed. Cir. 2016). When looked at as a whole, "the claims in Diehr were patent eligible because they improved an existing technological process, not because they were implemented on a computer." Id. at 1312-13.
On the other hand, the CAFC held that selecting information for collection, analysis, and display by content or source does nothing significant to differentiate a process from ordinary mental processes. Electric Power Grp., 830 F.3d at 1355. There, the claims specified what information in the power-grid field it is desirable to gather, analyze, and display in “real time,” but they did not require performing the claimed functions of gathering, analyzing, and displaying in real time by use of anything but entirely conventional, generic technology. The claims therefore failed to state an inventive concept. Id. at 1356.
Finally, the Supreme Court held that mere recitation of a generic computer cannot transform a patent-ineligible abstract idea into a patent-eligible invention. Alice, 134 S. Ct. at 2358. For example, in Alice, the Supreme Court held that data processing systems with data storage unit and transmission units were purely functional and generic and such recitation of hardware failed to offer any meaningful limitation beyond generally linking the use of a method to a particular technological environment. Id. at 2360. See MPEP 2106.04(d)I (“Generally linking the use of a judicial exception to a particular technological environment or field of use, as discussed in MPEP 2106.05(h)”).
Individually, claims 1, 8, and 15 required a data processing system, one or more processors, and computer readable non-transitory storage medium for performing the aforementioned machine learning models training processes (1) and (2) that employed mathematical algorithms to manipulate existing information (trained first machine learning model with maximized average log probabilities of correct natural language label and small log probabilities of incorrect natural language labels) to generate additional information (training the second machine learning model). 
The data processing system, the processor, and the computer readable non-transitory storage medium required by the claims are akin to the recitation of purely functional and generic hardware (i.e., data processing system and data storage unit) in Alice that failed to offer any meaningful limitation beyond generally linking the use of a method to a particular technological environment (i.e., text to content recommendation / natural language processing). 
As an ordered combination, the utilization of data processing system, processor, computer readable non-transitory storage media failed to integrate the machine learning models for text to content recommendation of claims 1, 8, and 15 into a practical application because the claims merely used computer components as tools to perform an abstract idea; i.e., to generate the second machine learning model for calculating log probabilities of natural language labels. 
Specifically, unlike the industrial process for curing rubber in Diehr, the claimed steps for generating machine learning models to calculate log probabilities of natural language labels are not a structure or process which, when considered as a whole, is performing a function which the patent laws were designed to protect. 
Further, the claims do not use computer components (processors and non-transitory computer readable storage medium) to solve a technological problem or to improve an existing technological process. Instead of being applied in a process designed to solve a technological problem like the conventional industry practice in Diehr or to a specific asserted improvement in computer capabilities in Enfish, the components were generic machinery invoked as tools to generate / train mathematical machine learning models to calculate log probabilities for natural language labels.
Finally, even if the collection of inferred information to generate a second training data set and thereafter to train the second machine learning model (which in turn calculates log probabilities for natural language labels) yields desirable information, the ordered combination amounts to collection and analysis of information similar to that in Electric Power Grp.
Therefore, as an ordered combination, claims 1, 8, and 15 do not integrate the recited abstract idea into a practical application and the claims are instead directed toward patent ineligible mathematical relationships set forth in machine learning models.
Step 2B Inventive Concept
The Guideline states that if the additional elements do not integrate the exception into a practical application, then the claim is directed to the recited judicial exception and requires further analysis under Step 2B, where it may still be eligible if it amounts to an “inventive concept”. See MPEP 2106.04II(A) and MPEP 2106.05.
Further, an inventive concept can be found in the non-conventional and non-generic arrangement of known conventional pieces. BASCOM Global Internet Servs. v. AT&T Mobility, 827 F.3d 1341, 1350 (Fed. Cir. 2016).
In BASCOM, the CAFC held that filtering content is an abstract idea because it is a longstanding, well-known method of organizing human behavior similar to concepts previously found to be abstract. BASCOM, 827 F.3d at 1348. However, the CAFC determined that the claims did not merely recite filtering content along with the requirement to perform it on the internet or on a set of generic computer components, nor did the claims preempt all ways of filtering content on the internet. Id. at 1350.
Rather, the inventive concept described and claimed was the installation of a filtering tool at a specific location, remote from the end-users, with customizable filtering features specific to each end user that gives the filtering tool both the benefits of a filter on a local computer and the benefits of a filter on an internet service provider “ISP” server. Id. By taking a prior art filter solution (one size fits all filter at internet service provider “ISP” server) and making it more dynamic and efficient (providing individualized filtering at the ISP server), the claimed invention improves the performance of the computer system itself. Id. at 1351. 
On the other hand, implementation via computers does not offer a meaningful limitation beyond generally linking the use of an abstract idea to a particular technological environment. Alice, 134 S. Ct. at 2360 (“Nearly every computer will include a “communications controller” and “data storage unit” capable of performing the basic calculation, storage, and transmission functions required by the method claims”). Intellectual Ventures I L.L.C. v. Capital One Bank, 792 F.3d 1363, 1370-71 (Fed. Cir. 2015) (“Steps that do nothing more than spell out what it means to “apply it on a computer” cannot confer patent-eligibility”).
Similarly, limiting an abstract idea to one field of use does not convert an otherwise ineligible concept into an inventive concept. Intellectual Ventures I L.L.C. v. Erie Indem. Co., 850 F.3d 1315, 1328 (Fed. Cir. 2017). Neither does adding computer functionality to increase the speed or efficiency of the process confer patent eligibility on an otherwise abstract idea. Intellectual Ventures I, 792 F.3d at 1367 (citing Bancorp Servs., LLC v. Sun Life Insurance Co. of Can., 687 F.3d 1266, 1278 (Fed. Cir. 2012) (“The fact that the required calculations could be performed more efficiently via a computer does not materially alter the patent eligibility of the claimed subject matter”)).
For example, in Intellectual Ventures I, the claims generally relate to customizing web page content as a function of navigation history and information known about the user via an interactive interface or a selectively tailored medium by which a web site user communicates with a web site information provider. Intellectual Ventures I, 792 F.3d at 1369. The CAFC held that the claim relates to an abstract concept of customizing information based on information known about the user and navigation history. Id.
Further, the claim provided no inventive concept to support patent eligibility because the interactive interface simply describes a generic web server with attendant software, tasked with providing web pages to and communicating with the user’s computer. Id. at 1370. Such required use of a software brain tasked with tailoring information and providing it to the user provides no additional limitation beyond applying an abstract idea, restricted to the internet, on a generic computer. Id. at 1371.
In the instant application, the individual recitation of data processing system, processor, computer readable non-transitory storage medium in claims 1, 8, and 15 merely invokes generic machinery rather than focusing on any particular technological device. The method and the data processing system for generating machine learning models to calculate log probabilities for natural language labels do not offer a meaningful limitation beyond generally linking the use of an abstract idea (log probabilities / mathematical relationships between correct and incorrect natural language labels) to a conventional computer environment.
As an ordered combination, unlike BASCOM, which described an unconventional combination to provide both the benefits of a filter on a conventional local computer and the benefits of a filter on the conventional ISP server, the utilization of generic processors and computer readable non-transitory storage medium merely performs its conventional, established function of using a computer as a tool to generate mathematical relationship / machine learning models.
Therefore, Claims 1, 8, and 15 do not supply an inventive concept.
Dependent claims failed to integrate the abstract idea into a practical application or provide an inventive concept. 
In particular, claims 2-7, 9-14, and 16-20 correspond to training a plurality of first machine learning / log probabilities models, the first machine learning model being NLP and deep neural model, and what data are collected to train / generate the log probabilities mathematical relationships of the first machine learning model and the second machine learning model.
Thus, dependent claims recited no particular means or method of applying the generated machine learning models / log probabilities mathematical relationship models to improve a specific asserted technology in the field of natural language processing. Likewise, the dependent claims failed to provide any inventive concept beyond employing mathematical algorithms to manipulate existing information to generate additional information. 
For the above reasons, Claims 1-20 are patent ineligible.
Claim Rejections - 35 USC § 103
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 103 that form the basis for the rejections under this section made in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1-2, 4-5, 7-10, 12-13, 15, 17-18, and 20 are rejected under 35 USC 103 as being unpatentable over Liu et al. (US 2021/0142164 A1) in view of Bojja et al. (US 2017/0185581 A1).
Regarding Claims 1, 8, and 15, Liu discloses a data processing system (Fig. 1) comprising: 
a processor (¶24, processor 110); and 
a memory in communication with the processor, the memory comprising executable instructions that, when executed by the processor (¶25, memory 120 stores software executed by computing device 100), cause the data processing system to perform functions of: 
training a first machine-learning (ML) model using a first training data set (Fig. 2, ¶28 and ¶39, teacher module 130 implements a pre-trained neural network model used for natural language processing tasks such as natural language inference, sentiment classification, etc.; ¶34 and ¶40, pre-training the neural network model on a large-scale unlabeled corpus and then fine-tuning it with in-domain labeled data for a supervised downstream task; ¶42, dataset for training the teacher model: GLUE, SST-2, QQP and MRPC, STS-B, MNLI, QNLI, and RTE); 
utilizing the trained first ML model to infer information about the data contained in the first training data set (¶¶64-66, considering one text classification problem, teacher model 200 first predicts teacher logits zT or raw prediction vectors / log probability values); 
collecting the inferred information to generate a second training data set (¶¶59-60, dividing training dataset into smaller batches b; ¶62, for multi-task distillation, training of student model 400 is performed over epochs where for each epoch, all of the batch datasets Dt are merged into one dataset: combined dataset D; see Fig. 6, for bt in D do 3. Predict logits zT from teacher model); and 
utilizing the first training data set (Fig. 6 and see ¶58, training of student model is performed over a number of epochs (i.e., when an entire dataset is passed both forward and backward through the neural network model only once)) and the second training data set to train a second ML model (¶66, during training of student model, for each batch bt in D, teacher model 200 first predicts teacher logits zT and student model 400 updates its bottom shared layer and the upper task specific layers according to the teacher logits).
Liu does not disclose wherein the second ML model is a text-to-content recommendation ML model.
Bojja teaches it is well known to use machine learning models to implement a text-to-content recommendation machine learning model (Abstract, ¶5, ¶¶46-47, ¶104, using trained neural network to predict likely next word / phrase that will be entered by the user and suggest one or more emoji that relate to the predicted next word or phrase).
It would have been obvious to one ordinarily skilled in the art before the effective filing date of the invention to implement the machine learning models of Liu for natural language processing tasks such as natural language inference and sentiment classification (Liu ¶32) as a content recommendation machine learning model to recommend or map words in a communication to corresponding emoji (Bojja ¶55, NLP module 312 analyzes content for sentiment and identifies emoji that match the content).
Further regarding Claim 15, Liu further discloses a non-transitory computer readable medium on which are stored instructions that, when executed, cause a programmable device to perform the method of claim 8 and the data processing system of claim 1 (¶25, memory 120 stores software executed by computing device 100).
Regarding claims 2, 4-5, 9-10, and 17-18, Liu discloses training a plurality of first ML models wherein the plurality of first ML models include a pretrained natural language processing (NLP) model (¶34 and ¶57, first, use pre-trained BERT model to initialize parameters of teacher model 200’s shared layers) and a deep neural network (DNN) model (¶57 and Fig. 2, MT-DNN model is used to initialize the teacher model 200; ¶75, MT-DNN model is used as teacher model 200).
Regarding Claims 7, 12, and 20, Liu discloses wherein the second training data set includes soft labeled data (¶66, during training of the student model, for each batch bt, the teacher model 200 first predicts teacher logits zT1).
Regarding claim 13, Liu discloses determining a weighted sum of data in the first training data set and data in the second training data set; and utilizing the weighted sum to train the second ML model (¶¶49-51, during training, student model receives text input sequences and generate vector matrix for each sequence; ¶53, integrate sequence vectors and context vectors obtained from bi-attention mechanism 408 of ¶52; ¶54, apply pooling on the integrated outputs as weighted summations of each sequence set forth in equations (8)-(9)).
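For illustration only, a weighted sum of two equal-length data vectors can be sketched as follows. This is a generic sketch under stated assumptions and does not reproduce equations (8)-(9) of Liu or any implementation in the application: the function name `weighted_sum`, the single mixing `weight`, and the sample vectors (a one-hot label and a soft label) are hypothetical.

```python
def weighted_sum(first, second, weight):
    """Return weight * first + (1 - weight) * second, element by element.

    `first` and `second` are equal-length lists of floats; `weight` is a
    hypothetical mixing coefficient in the range [0, 1].
    """
    if len(first) != len(second):
        raise ValueError("inputs must have the same length")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    return [weight * a + (1.0 - weight) * b for a, b in zip(first, second)]

# Made-up example: mix a one-hot label with a soft label.
hard_label = [1.0, 0.0]
soft_label = [0.6, 0.4]
mixed = weighted_sum(hard_label, soft_label, 0.5)
```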
Claims 3 and 16 are rejected under 35 USC 103(a) as being unpatentable over Liu et al. (US 2021/0142164 A1) in view of Bojja et al (US 2017/0185581 A1) as applied to claims 1 and 15, in further view of Hedge et al. (US 2020/0387782 A1).
Liu teaches knowledge distillation for a language model to train a student model / second machine learning model that is smaller than the teacher / first machine learning model (¶28).
Liu does not suggest that the student model / second machine learning model is a shallow neural network model.
Hedge teaches that a knowledge distillation framework corresponds to the transfer of relevant information from a complex deep neural network called a teacher network to a simpler shallow network called a student network (¶44).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to implement the smaller student model / second machine learning model as a shallow neural network as part of the knowledge distillation framework to transfer relevant information from the complex DNN teacher neural network to the simpler shallow student neural network (Hedge, ¶44).
Claims 6, 11, 14, and 19 are rejected under 35 USC 103(a) as being unpatentable over Liu et al. (US 2021/0142164 A1) in view of Bojja et al (US 2017/0185581 A1) as applied to claims 1, 8, and 15, in further view of Lai et al. (US 2021/0182662 A1).
Regarding Claim 14, Liu does not teach receiving hyper parameters for each of the first and the second ML models.
Lai teaches using a pre-trained neural network model to train a student neural network model (Abstract) by receiving hyper parameters for each of the pre-trained neural network model (teacher model) and the student neural network models (¶25, loss function generation module of model training system generates a loss function L3 based on T, a temperature hyperparameter of the teacher and the student models and the function H, which is a cross entropy function).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to receive hyper parameters for each of the first and the second ML models as part of the distillation objective to minimize mean squared error between student network logits and teacher’s logits (Liu, ¶67), especially when computing cross entropy loss for text classification tasks (Liu, ¶46; Lai, ¶¶24-26, using cross entropy function, student model and teacher model logits, and temperature hyperparameter of the teacher and student models to generate loss function).
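For illustration only, the two loss formulations referenced above (mean squared error between student and teacher logits, and a cross entropy computed with a shared temperature hyperparameter) can be sketched as follows. This is a minimal sketch under stated assumptions and is not the loss function L3 of Lai or the distillation objective of Liu as actually implemented: the function names, the default temperature of 2.0, and the sample logits are hypothetical.

```python
import math

def log_softmax(logits, temperature):
    """Log-probabilities of logits divided by a temperature (hypothetical helper)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_norm for s in scaled]

def mse_logit_loss(student_logits, teacher_logits):
    """Mean squared error between student and teacher logits."""
    diffs = [(s - t) ** 2 for s, t in zip(student_logits, teacher_logits)]
    return sum(diffs) / len(diffs)

def soft_cross_entropy(student_logits, teacher_logits, temperature=2.0):
    """Cross entropy H(p_teacher, p_student), both softened by the same temperature."""
    teacher_log_probs = log_softmax(teacher_logits, temperature)
    student_log_probs = log_softmax(student_logits, temperature)
    return -sum(math.exp(t) * s for t, s in zip(teacher_log_probs, student_log_probs))

# Made-up logits for a single example with three classes.
teacher = [2.0, 0.5, -1.0]
student = [1.5, 0.7, -0.8]
loss_mse = mse_logit_loss(student, teacher)
loss_ce = soft_cross_entropy(student, teacher, temperature=2.0)
```

In this sketch a higher temperature flattens both distributions, so the student is penalized for mismatching the teacher's relative ordering of less likely classes as well as the top class.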
Regarding Claims 6, 11, and 19, Liu discloses wherein: the plurality of first ML models include a DNN model (¶57 and ¶75, MT-DNN model is used as teacher model 200), the pretrained advanced NLP model is trained using labeled full sentence data in the first training data set (¶43, teacher model 200 performs training NLU tasks such as single sentence classification (e.g., CoLA, SST-2); note Corpus of Linguistic Acceptability or CoLA comprises examples drawn from books and journal articles on linguistic theory where each example is a sequence of words annotated with whether it is a grammatical English sentence [2]; see also STS-B, MRPC, etc.).
Liu does not teach that the DNN model is trained using labeled unordered words [3] in the first training data set.
Lai teaches training data set for training teacher models using labeled unordered words such as sentences with masked tokens (¶16).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to train the DNN / teacher model using labeled unordered words in the first training data set in order to effectively train the student model to serve its purpose (Lai, ¶15; i.e., by training the teacher model to predict words corresponding to masked tokens per ¶16).
Conclusion
Prior art made of record and not relied upon is considered pertinent to applicant's disclosure: 
US 2018/0158552 A1 teaches knowledge distillation in healthcare domain for using a complex / deep neural network or ensemble of network models as a teacher model to train a student / mimic model (such as a shallow neural network or a single network model).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to examiner Richard Z. Zhu whose telephone number is 571-270-1587 or examiner’s supervisor King Poon whose telephone number is 571-272-7440. Examiner Richard Zhu can normally be reached on M-Th, 0730-1700.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/RICHARD Z ZHU/
Primary Examiner, Art Unit 2675
06/27/2022

[1] See Abbasi et al., “Modeling Teacher-Student Techniques in Deep Neural Networks for Knowledge Distillation”, V. Knowledge Distillation: “In the base TS models, soft-labels(also known as logits) are considered as distilled knowledge”.
[2] https://www.tensorflow.org/datasets/catalog/glue
[3] Interpreted in view of the specification “This may be a large-scale data set that includes unordered queries containing masked words”, US 2021/0264106 A1 at ¶46.