DETAILED ACTION

	Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 03/08/2022 has been entered.
 
Response to Arguments
Applicant’s amendments and remarks filed 03/08/2022 have been fully considered by the examiner. 
Regarding applicant’s arguments directed to claims rejected under 35 U.S.C. § 112(a), the rejection made in the previous Office action has been withdrawn in light of the amended limitations.

Applicant’s arguments regarding the rejection of claims under 35 U.S.C. 103 are directed to subject matter in the amended claims not previously examined by the examiner. Therefore, applicant’s arguments are rendered moot. The examiner refers to the rejection under 35 U.S.C. 103 in the current Office action for more details.

Claim Rejections - 35 USC § 112-Indefiniteness
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 8-9 and 21-22 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention.

Regarding claim 8, the claim recites the limitation “model has a size…; wherein the first size is larger than the second size,” which renders the claim indefinite. Specifically, with regard to machine learning models, the notion of a size depends on the model type, on the memory storage associated with the model, or on some combination of both; it is unclear what standard is used to determine the “size” of a model and how to determine when one model is larger than another without further information. The specification refers to a process for selecting training data for training a model and does not provide additional information from which one of ordinary skill in the art would ascertain the intended scope of the claim limitation or the context in which the model size is ascertained for making the claimed determination. The specification does not provide a standard for measuring model size and determining when one model is larger than another, and one of ordinary skill in the art could not otherwise ascertain the scope of the claim (e.g., by a standard that is recognized in the art for measuring the claimed larger size of two models); thus, the claim limitation is considered indefinite. See MPEP 2173.05(b)(I):
“Terms of degree are not necessarily indefinite. "Claim language employing terms of degree has long been found definite where it provided enough certainty to one of skill in the art when read in the context of the invention." Interval Licensing LLC v. AOL, Inc., 766 F.3d 1364, 1370, 112 USPQ2d 1188, 1192-93 (Fed. Cir. 2014) (citing Eibel Process Co. v. Minnesota & Ontario Paper Co., 261 U.S. 45, 65-66 (1923) (finding ‘substantial pitch’ sufficiently definite because one skilled in the art ‘had no difficulty … in determining what was the substantial pitch needed’ to practice the invention)). Thus, when a term of degree is used in the claim, the examiner should determine whether the specification provides some standard for measuring that degree. Hearing Components, Inc. v. Shure Inc., 600 F.3d 1357, 1367, 94 USPQ2d 1385, 1391 (Fed. Cir. 2010); Enzo Biochem, Inc., v. Applera Corp., 599 F.3d 1325, 1332, 94 USPQ2d 1321, 1326 (Fed. Cir. 2010); Seattle Box Co., Inc. v. Indus. Crating & Packing, Inc., 731 F.2d 818, 826, 221 USPQ 568, 574 (Fed. Cir. 1984). If the specification does not provide some standard for measuring that degree, a determination must be made as to whether one of ordinary skill in the art could nevertheless ascertain the scope of the claim (e.g., a standard that is recognized in the art for measuring the meaning of the term of degree). For example, in Ex parte Oetiker, 23 USPQ2d 1641 (Bd. Pat. App. & Inter. 1992), the phrases "relatively shallow," "of the order of," "the order of about 5mm," and "substantial portion" were held to be indefinite because the specification lacked some standard for measuring the degrees intended.
During prosecution, an applicant may also overcome an indefiniteness rejection by providing evidence that the meaning of the term of degree can be ascertained by one of ordinary skill in the art when reading the disclosure….
Even if the specification uses the same term of degree as in the claim, a rejection is proper if the scope of the term is not understood when read in light of the specification. While, as a general proposition, broadening modifiers are standard tools in claim drafting in order to avoid reliance on the doctrine of equivalents in infringement actions, when the scope of the claim is unclear a rejection under 35 U.S.C. 112(b)  or pre-AIA  35 U.S.C. 112, second paragraph, is proper. See In re Wiggins, 488 F. 2d 538, 541, 179 USPQ 421, 423 (CCPA 1973).

When relative terms are used in claims wherein the improvement over the prior art rests entirely upon size or weight of an element in a combination of elements, the adequacy of the disclosure of a standard is of greater criticality.” (emphasis added)

	Claim 9 recites the use of a model size to determine that another model has the same size. Specifically, with regard to machine learning models, the notion of a size depends on the model type, on the memory storage associated with the model, or on some combination of both; it is unclear what standard is used to determine the “size” of a model and how to determine when one model is the same size as another without further information. The specification refers to a process for selecting training data for training a model and does not provide additional information from which one of ordinary skill in the art would ascertain the intended scope of the claim limitation or the context in which the model size is ascertained for making the claimed determination. The specification does not provide a standard for measuring model size and determining when one model is the same size as another, and one of ordinary skill in the art could not otherwise ascertain the scope of the claim (e.g., by a standard that is recognized in the art for measuring the claimed same size of two models); thus, the claim limitation is considered indefinite. See MPEP 2173.05(b)(I).
	Claims 21-22 recite similar elements to those recited in claims 8-9 respectively, and are thus rejected under the same rationale.
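For illustration only, the ambiguity noted above can be sketched as follows. The two measures shown (number of parameters and bytes of storage) and the example models are hypothetical and are not taken from the specification or the cited references; the sketch merely shows that two plausible standards for "size" can rank the same pair of models in opposite orders.

```python
from array import array

def parameter_count(weights):
    """Size measured as the total number of stored parameters."""
    return sum(len(w) for w in weights)

def storage_bytes(weights):
    """Size measured as the memory occupied by the stored parameters."""
    return sum(len(w) * w.itemsize for w in weights)

# Hypothetical models: A stores 1000 single-precision values,
# B stores 600 double-precision values.
model_a = [array("f", [0.0] * 1000)]
model_b = [array("d", [0.0] * 600)]

# Which model is "larger" depends on the standard chosen.
a_larger_by_parameters = parameter_count(model_a) > parameter_count(model_b)
a_larger_by_storage = storage_bytes(model_a) > storage_bytes(model_b)
```

Under the parameter-count measure model A is the larger model, while under the storage measure model B is the larger model, which is why a stated standard matters for the claimed comparison.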

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1-2, 4, 9-15, 17, and 22-26 are rejected under 35 U.S.C. 103 as being unpatentable over US Patent No.  8,122,026 to Laroco, Jr. et al. (hereinafter “Lar”), in view of US Patent No. 10,885,900 to Li et al. (hereinafter “Li”) and in further view of Rao et al. (US 2019/0188584, hereinafter ‘Rao’).
	
As per claim 1, Lar teaches A method comprising: generating, by data processing hardware, a base model by training with a first dataset of data pairs (in Abstract: A system and method for disambiguating references to enti­ties in a document. In one embodiment, an iterative process is used to disambiguate references to entities in documents. An initial model is used to identify documents referring to an entity based on features contained in those documents [claimed first set of data pairs as the set used for identification, including an entity and set of documents features]. The occurrence of various features in these documents is mea­sured…; And claimed processor, in 2:55-61: FIG. 1 shows components used to manage facts in a fact repository 115. Data processing system 106 includes one or more importers 108, one or more janitors 110, a build engine 112, a service engine 114, and a fact repository 115 (also called simply a "repository"). Each of the foregoing are implemented, in one embodiment, as software modules ( or programs) executed by processor 116. Importers 108 operate to process documents received from the document hosts, read the data content of documents, and extract facts (as operationally and programmatically defined within the data processing system 106) from such documents….13:51-67: The present invention also relates to an apparatus for per­forming the operations herein. This apparatus may be spe­cially constructed for the required purposes, or it may com­prise a general-purpose computer selectively activated … Furthermore, the computers referred to in the specification may include a single processor or may be architectures 65 employing multiple processor designs for increased comput­ing capability.);
 generating, by the data processing hardware, an adapted model by training the base model on a second dataset of data pairs; (in Abstract: …An initial model is used to identify documents [claimed second set of data pairs, including an entity and set of identified of whole set of documents features] referring to an entity based on features contained in those documents … From the number occurrences of features in these documents, a second model is constructed. The second model is used to identify documents referring to the entity based on features contained in the documents; and as depicted in Fig 4; And in 9:9-24: For example, a model can specify the probability that a document refers to a particular entity given a set of features in the document. The first model is used to identify a first set of documents likely to refer to the entity [claimed second dataset]. The disambiguation engine 310 determines 404 a subse­quent model based on the features of the documents that are identified 402 as referring to the entity; And claimed hardware in 2:55-61 & 13:51-67 as noted above)
for each respective data pair of a third dataset of data pairs: determining, by the data processing hardware, a divergence between the base model and the adapted model and determining, by the data processing hardware, a respective contrastive score for the respective data pair using the determined divergence, the respective contrastive score indicative of a probability of quality of the respective data pair; (in 9:52-66: The number of iterations through the loop illustrated in FIG. 4 can depend on a variety of factors. For example, the disambiguation engine 310 can iterate a predetermined number of times. As other examples, the disambiguation engine 310 can iterate until the model has converged on a stable condition [claimed determined divergence of the base model as the first model iteration and the model as the subsequent model iteration at a first iteration], a predetermined resource budget has been consumed, the improvement provided by an iteration falls below some improvement threshold [claimed determined divergence over each model iteration from a base model to an updated model as the claimed adapted model], an iteration does not introduces a number of new features that falls below some threshold, the Kullback-Leibler divergence [claimed contrastive score indicative of a probability of quality of the respective data pair] between the probability distribution over entity assignments to documents in subsequent iterations falls below some threshold, and so on...; And in 9:9-24: For example, a model can specify the probability [alternatively, claimed contrastive score indicative of a probability of quality of the respective data pair; Examiner notes that the contrastive score is any measure that provides a predictive result associated with the modeled data] that a document refers to a particular entity given a set of features in the document.
The first model is used to identify a first set of documents likely to refer to the entity [claimed second dataset]. The disambiguation engine 310 determines 404 a subse­quent model based on the features of the documents that are identified 402 as referring to the entity. For example, analysis of the documents identified 402 as referring to the entity can provide insights into the characteristics of documents that 15 refer to the entity. These insights can be codified into a sub­sequent model. Further discussion of methods for determin­ing 404 a subsequent model is included below. The disambiguation engine 310 identifies 406 documents referring to the entity using the determined 404 model. The 20 determined 404 model is used to identify 406 a second set of documents likely to refer to the entity [third set of data]. In one embodiment, the disambiguation engine 310 returns to determining 404 a sub­sequent model based on the identified 406 documents…); 
and training, by the data processing hardware, using the … data pairs of the third dataset and the respective contrastive scores, a target model based … data pairs of the third dataset. (in claim 1: responsive to determining that the second set of features are associated with the entity, identifying a third set of documents based on a third model and the second set of features, the third set of documents each comprising a sufficient number of features in common with the second set of features to identify a document referring to the entity according to the third model [claimed training, by the data processing hardware, using the … data pairs of the third dataset and the respective contrastive scores, a target model based … data pairs of the third dataset]; And the use of claimed scores in determining subsequent iterations, in 9:52-66: The number of iterations through the loop illustrated in FIG. 4 can depend on a variety of factors. For example, the disambiguation engine 310 can iterate a predetermined num­ber of times. As other examples, the disambiguation engine 55 310 can iterate until the model has converged on a stable condition [claimed determined divergence of the subsequent model iteration], a predetermined resource budget has been con­sumed, the improvement provided by an iteration falls below some improvement threshold, an iteration does not introduces a number of new features that falls below some threshold, the Kullback-Leibler divergence between the probability distri­bution over entity assignments  to documents in subsequent iterations falls below some threshold, and so on...; And in 9:9-24: For example, a model can specify the probability that a document refers to a particular entity given a set of features in the document. The first model is used to identify a first set of documents likely to refer to the entity [claimed second dataset]. 
The disambiguation engine 310 determines 404 a subse­quent model based on the features of the documents that are identified 402 as referring to the entity. For example, analysis of the documents identified 402 as referring to the entity can provide insights into the characteristics of documents that 15 refer to the entity. These insights can be codified into a sub­sequent model. Further discussion of methods for determin­ing 404 a subsequent model is included below. The disambiguation engine 310 identifies 406 documents referring to the entity using the determined 404 model. The 20 determined 404 model is used to identify 406 a second set of documents likely to refer to the entity [third set of data]. In one embodiment, the disambiguation engine 310 returns to determining 404 a sub­sequent model based on the identified 406 documents…)
	While Lar teaches the use of Kullback-Leibler divergence to indicate a probability of quality in data using sequential models, Lar does not expressly teach the Kullback-Leibler divergence as a score.
Li expressly teaches the Kullback-Leibler divergence as a score (in 7:27-41: The posterior distribution of the teacher model 150 is referred to as PT(s|xT) and the posterior distribution of the student model 160 is referred to as PS(s|xS) where xT and xS represent the parallel inputs from different domains to the teacher model 150 and student model 160 respectively and s represents the senones (or phonemes) that have been analyzed. Using the above definitions, a divergence score of the Kullback-Leibler divergence [claimed contrastive score as Kullback-Leibler divergence score] between the two speech recognition models determined by the output comparator 170 may be calculated according to FORMULA 1,...).
The Li and Lar references would have been recognized by those of ordinary skill in the art as useful for applicant’s purpose of developing natural language information processing systems and methods using sequential modeling algorithms.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the prior art for training sequential models using Kullback-Leibler divergence as a score to indicate the probability distribution associated with the data set pairs, as disclosed by Li, with the method for generating sequential models using probability distributions and Kullback-Leibler divergence, as disclosed by Lar.
One of ordinary skill in the art would have been motivated to combine the disclosed methods of Li and Lar in order to use the divergence score in optimization of sequential models over different training epochs (Li, 7:41-52). Doing so would provide a supervisory training signal for training the sequential models without the need to label the respective training datasets (Li, 7:55-64).
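For illustration only, a Kullback-Leibler divergence used as a score between two posterior distributions can be sketched as follows. This is the standard discrete definition, not a reproduction of Li's FORMULA 1 (which is not quoted in this action); the function name and the example posteriors are hypothetical.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i) for discrete distributions.

    Assumes p and q have the same length, each sums to 1, and q_i > 0
    wherever p_i > 0. Terms with p_i == 0 contribute nothing.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical posteriors over three output classes for the same input.
teacher_posterior = [0.7, 0.2, 0.1]
student_posterior = [0.5, 0.3, 0.2]

# The divergence serves as a score: 0 when the two models agree exactly,
# larger as their output distributions differ more.
score = kl_divergence(teacher_posterior, student_posterior)
```

The score is zero when the two distributions are identical and grows as they diverge, which is the property relied upon when a divergence is treated as a score.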

	While Lar and Li teach the process for iteratively updating models to compute a final model using a plurality of datasets as disclosed above, Lar and Li do not expressly teach the use of sorted data sets as claimed in the limitation(s):
	  sorting, by the data processing hardware, the data pairs of the third dataset based on the respective contrastive scores; and training, by the data processing hardware, using the sorted data pairs of the third dataset and the respective contrastive scores, a target model based on an order of the sorted data pairs of the third dataset.
	Rao teaches the use of sorted data sets as claimed in the limitation(s):
	  sorting, by the data processing hardware, the data pairs of the third dataset based on the respective contrastive scores; and training, by the data processing hardware, using the sorted data pairs of the third dataset and the respective contrastive scores, a target model based on an order of the sorted data pairs of the third dataset. (Rao teaches, as depicted in Fig. 1D, iteratively training and evaluating a predictive score of the previous model with an adaptive current model, as the claimed contrastive score, for selecting a third data set for training an output model, as the claimed target model; wherein the selected variables in a subsequent data set (e.g., including the second and third datasets as subsequently selected datasets) depicted in Fig. 1D are selected using a sorting process, in 0108: FIG. 2D shows the example cross-correlation results in an overall correlation coefficients table. In the coefficient table of FIG. 2D, highly correlated pairs (x-y cells) of inputs (measurements and derived values of the dataset) with values greater than a pre-defined threshold [claimed based on the respective contrastive scores; Examiner notes capturing predictive measurements and dataset correlation values as claimed respective contrastive scores] are identified (in cross pattern) and put into a high-correlation tag group [claimed sorting, by the data processing hardware, the data pairs of the third dataset based on the respective contrastive scores]. FIGS. 2E-2F illustrates more details of example correlation function curves over a pre-defined time window (240 min or 4 hours).
An application in the plant then performs preliminary feature selection (step 125 of method 100) and secondary feature selection (step 130 of method 100) on the dataset to reduce the dataset to only 31 selected inputs (of process variables' measurements and derived feature variables' values) to train/test the failure model [claimed training, by the data processing hardware, using the sorted data pairs of the third dataset and the respective contrastive scores, a target model based on an order of the sorted data pairs of the third dataset] for a C2 Splitter with Hydrate Formation problem. Further, the plant application builds and executes a PLS model using only the 31 inputs selected from a total of over 1000 variables as the failure model inputs and a historical process failure (alert) event as the failure model output.)
	The Rao, Li, and Lar references would have been recognized by those of ordinary skill in the art as useful for applicant’s purpose of developing information processing systems and methods using feature selection and sequential modeling algorithms.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the prior art for training sequential models using feature selection techniques for selecting and sorting data subsets for subsequent model training, as disclosed by Rao, with the method for generating sequential models using probability distributions and Kullback-Leibler divergence, as collectively disclosed by Li and Lar.
One of ordinary skill in the art would have been motivated to combine the disclosed methods of Rao, Li, and Lar in order to use cross-correlation for feature selection in building reduced data sets for training predictive models (Rao, Abstract). Doing so would provide a method for removing bad quality data using a multi-step feature selection process for building and training predictive models (Rao, Abstract).
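For illustration only, the claimed sorting step can be sketched as ordering data pairs by their scores before they are presented for training. The function name, the example pairs, and the choice of descending order are assumptions made for this sketch; neither the claim language quoted above nor the cited references specify them.

```python
def order_by_score(data_pairs, scores):
    """Return the data pairs sorted by score, highest score first.

    Assumes one score per data pair; the descending order is an
    assumption made for this sketch.
    """
    ranked = sorted(zip(scores, data_pairs), key=lambda item: item[0], reverse=True)
    return [pair for _, pair in ranked]

# Hypothetical (source, target) data pairs with one score each.
pairs = [("a", "x"), ("b", "y"), ("c", "z")]
pair_scores = [0.2, 0.9, 0.5]

# A training loop would then consume the pairs in this order.
training_order = order_by_score(pairs, pair_scores)
```

Sorting on the score alone (via the `key` argument) keeps the ordering stable for pairs with equal scores and avoids comparing the pairs themselves.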

As per claim 14, Lar teaches a system comprising: data processing hardware; and memory hardware in communication with the data processing hardware, the memory hardware storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations comprising… (Claimed processing hardware and memory, in 2:55-61: FIG. 1 shows components used to manage facts in a fact repository 115. Data processing system 106 includes one or more importers 108, one or more janitors 110, a build engine 112, a service engine 114, and a fact repository 115 (also called simply a "repository"). Each of the foregoing are implemented, in one embodiment, as software modules ( or programs) executed by processor 116. Importers 108 operate to process documents received from the document hosts, read the data content of documents, and extract facts (as operationally and programmatically defined within the data processing system 106) from such documents….13:51-67: The present invention also relates to an apparatus for per­forming the operations herein. This apparatus may be spe­cially constructed for the required purposes, or it may com­prise a general-purpose computer selectively activated reconfigured by a computer program stored in the computer. 55 Such a computer program may be stored in a computer read­able storage medium, such as, but is not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, mag­netic-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic 60 or optical cards, application specific integrated circuits (ASICs), or any type of media suitable for storing electronic instructions, and each coupled to a computer system bus. Furthermore, the computers referred to in the specification may include a single processor or may be architectures 65 employing multiple processor designs for increased comput­ing capability.)
	The remaining claim limitations of claim 14 are analogous to the claim limitations of claim 1 and are therefore rejected with the same rationale applied against claim 1. 
	
As per claim 2, the combination of Lar, Li, and Rao teaches the method of claim 1 and the rejection of claim 1 is incorporated. Lar further teaches wherein training the target model further comprises using the data pairs of the third dataset that satisfy a threshold contrastive score. (in claim 1: responsive to determining that the second set of features are associated with the entity, identifying a third set of documents based on a third model and the second set of features, the third set of documents each comprising a sufficient number of features in common with the second set of features to identify a document referring to the entity according to the third model [claimed a target model using the data pairs of the third dataset and the respective contrastive scores]; And the use of claimed scores in determining subsequent iterations, in 9:52-66: The number of iterations through the loop illustrated in FIG. 4 can depend on a variety of factors. For example, the disambiguation engine 310 can iterate a predetermined number of times. As other examples, the disambiguation engine 310 can iterate until the model has converged on a stable condition [claimed determined divergence of the subsequent model iteration], a predetermined resource budget has been consumed, the improvement provided by an iteration falls below some improvement threshold, an iteration does not introduces a number of new features that falls below some threshold, the Kullback-Leibler divergence between the probability distribution over entity assignments to documents in subsequent iterations falls below some threshold [claimed using data pairs of the third dataset satisfying a threshold contrastive score], and so on...)
	As per claim 15, the claim is a system claim analogous to claim 2 and is therefore rejected with the same rationale applied against claim 2.
	
As per claim 4, the combination of Lar, Li, and Rao teaches the method of claim 1 and the rejection of claim 1 is incorporated. Lar further teaches wherein training the target model further comprises: generating a plurality of data batches, wherein each data batch comprises at least one data pair and wherein a probability that a select data pair is included in a select data batch is based on the respective contrastive score of the select data pair and wherein the probability increases as the respective contrastive score increases; and training the target model using each data batch. (increasing the probability of selecting a pair for inclusion in training, in 8:54-59: In one embodiment, if the probability 312 is above some threshold, the document 302A is associated with the entity [claimed generating a plurality of data batches, wherein each data batch comprises at least one data pair and wherein a probability that a select data pair is included in a select data batch is based on the respective contrastive score of the select data pair and wherein the probability increases as the respective contrastive score increases]. For example, a link to the document 302A can be stored as a fact with an object ID indicating that the fact is associated with the entity. Advantageously, a fact is available in the fact repository citing a document 302A that refers to the entity; And in 9:20-63: … The determined 404 model is used to identify 406 a second set of documents likely to refer to the entity. In one embodiment, the disambiguation engine 310 returns to determining 404 a subsequent model based on the identified 406 documents.
The disambiguation engine 310 iterates through the loop illustrated in the method, advantageously converging towards increasingly accurate models [claimed wherein the respective contrastive score of the select data pair and wherein the probability increases as the respective contrastive score increases]… In one embodiment, the disambiguation engine 310 returns a list of documents with a probability of referring to the entity above some threshold… As other examples, the disambiguation engine 310 can iterate until the model has converged on a stable condition, a predetermined resource budget has been consumed, the improvement provided [claimed wherein the respective contrastive score of the select data pair and wherein the probability increases as the respective contrastive score increases] by an iteration falls below some improvement threshold, an iteration does not introduces a number of new features that falls below some threshold, the Kullback-Leibler divergence between the probability distribution over entity assignments to documents in subsequent iterations falls below some threshold, and so on. Examiner notes: each epoch/training iteration is interpreted as the plurality of data batches; and adding higher probability pairs increases the mean of a probability distribution so that it does not fall below some improvement threshold)
	As per claim 17, the claim is a system claim analogous to claim 4 and is therefore rejected with the same rationale applied against claim 4.
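For illustration only, the batch-generation limitation of claim 4 (a pair's probability of inclusion in a batch increasing with its score) can be sketched with score-weighted sampling. The function name, the use of sampling with replacement, the fixed seed, and the example data are assumptions made for this sketch and are not drawn from the claims or the cited references.

```python
import random

def sample_batches(data_pairs, scores, num_batches, batch_size, seed=0):
    """Draw batches in which a pair's chance of selection grows with its score.

    Assumes scores are non-negative and not all zero. Sampling is with
    replacement, so a pair may appear more than once in a batch.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    return [
        rng.choices(data_pairs, weights=scores, k=batch_size)
        for _ in range(num_batches)
    ]

# Hypothetical pairs: "high" carries nine times the score of "low".
batches = sample_batches(["low", "high"], [0.1, 0.9], num_batches=200, batch_size=5)
selected = [pair for batch in batches for pair in batch]
```

Across the 1000 draws, the higher-scored pair is selected far more often than the lower-scored pair, which is the relationship the limitation describes.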

As per claim 9, the combination of Lar, Li, and Rao teach the method of claim 1 and the rejection of claim 1 is incorporated. Lar further teaches: determining, by the data processing hardware, that a first size corresponding to the target model and a second size corresponding to the base model are the same; and when the first size is the same as the second size: (claimed size as the determined model including 3 parameters including a set of rules, model input features, and probability output, in 9:3-17: The disambiguation engine 310 identifies 402 documents referring to an entity using a first model. A model is a set of rules specifying at least one combination of features sufficient for identifying a document referring to a particular entity. For example, a model can specify the probability that a document refers to a particular entity given a set of features in the document. The first model is used to identify a first set of documents likely to refer to the entity.  The disambiguation engine 310 determines 404 a subse­quent model based on the features of the documents that are identified 402 as referring to the entity. For example, analysis of the documents identified 402 as referring to the entity can provide insights into the characteristics of documents that 15 refer to the entity. These insights can be codified into a sub­sequent model [claimed a first size comprising a first number of parameters of the target model and a second size comprising a second number of parameters of the base model are the same; and when the first size is the same as the second size to produce subsequent model coded with the same parameter size]…; Examiner notes that model parameters are broadly elements used to determine and processing information received by a model)
 replacing, by the data processing hardware, the base model with the adapted model; replacing, by the data processing hardware, the adapted model with the target model; (claimed replacement as the iterative process depicted in Fig. 4 where the next iteration replaces base with adapted and adaptive with target to process the distribution from the previous iteration; And use of claimed model in subsequent iterations, in 9:52-66: The number of iterations through the loop illustrated in FIG. 4 [including claimed model replacement in a subsequent iteration] can depend on a variety of factors. For example, the disambiguation engine 310 can iterate a predetermined num­ber of times. As other examples, the disambiguation engine 55 310 can iterate until the model has converged on a stable condition, a predetermined resource budget has been con­sumed, the improvement provided by an iteration falls below some improvement threshold, an iteration does not introduces a number of new features that falls below some threshold, the Kullback-Leibler divergence between the probability distri­bution over entity assignments  to documents in subsequent iterations falls below some threshold, and so on… ; And in 10:23-35:In one embodiment, determining 404 a subsequent model includes analyzing the available documents and the first set of documents [claimed replacing, by the data processing hardware, the base model with the adapted model; replacing, by the data processing hardware, the adapted model with the target model]. 
The disambiguation engine 310 counts the number of occurrences of features in the available documents, the number of occurrences of features in the first set of documents, the total number of available documents, and the number of documents in the first set…; using the target model as the adapted model used to produce a new model from the previous iteration, in 12:1-23: The disambiguation engine 310 identifies 406 documents referring to the entity 502 using the determined 404 model. In the new model, the feature "Aug. 29, 1958" is given more weight than the feature "USA". Using the new model, document 508 is identified 406 as referring to the entity 502, because the document 508 contains several features considered highly indicative of references to the entity 502. On the other hand, using the new model, document 512 is not identified 406 as referring to the entity 502, because the document 508 contains only a few features, and those features are not considered highly indicative of references to the entity 502. Advantageously, documents not referring to the entity 502 are not included in the set of documents referring to the entity, despite being included in the set of documents identified using the first model. Furthermore, additional documents (not shown) that were not identified 402 as referring to the entity 502 using the first model can be identified 406 as referring to the entity 406 using the subsequent model. Such documents may contain features not considered indicative of reference to the entity under the first (or previous) model, but that are considered indicative of reference to the entity under the subsequent model [claimed replacing, by the data processing hardware, the base model with the adapted model; replacing, by the data processing hardware, the adapted model with the target model].)
determining, by the data processing hardware, the respective contrastive score for each data pair of a fourth dataset of data pairs using the base model and the replaced adapted model; and training, by the data processing hardware, a subsequent target model using the data pairs of the fourth dataset and the respective contrastive scores. (for the subsequent iteration, in claim 1: responsive to determining that the second set of features are associated with the entity, identifying a third set of documents based on a third model and the second set of features, the third set of documents each comprising a sufficient number of features in common with the second set of features to identify a document referring to the entity according to the third model [claimed a subsequent target model using the data pairs of the fourth dataset and the respective contrastive scores]; and the use of claimed scores in determining subsequent iterations, in 9:52-66: The number of iterations through the loop illustrated in FIG. 4 can depend on a variety of factors. For example, the disambiguation engine 310 can iterate a predetermined number of times. As other examples, the disambiguation engine 310 can iterate until the model has converged on a stable condition, a predetermined resource budget has been consumed, the improvement provided by an iteration falls below some improvement threshold, an iteration does not introduces a number of new features that falls below some threshold, the Kullback-Leibler divergence [claimed the respective contrastive score for each data pair of a fourth dataset of data pairs using the base model and the replaced adapted model] between the probability distribution over entity assignments to documents in subsequent iterations falls below some threshold, and so on...; and in 9:9-24: … The disambiguation engine 310 determines 404 a subsequent model based on the features of the documents that are identified 402 as referring to the entity.
For example, analysis of the documents identified 402 as referring to the entity can provide insights into the characteristics of documents that refer to the entity. These insights can be codified into a subsequent model. Further discussion of methods for determining 404 a subsequent model is included below. The disambiguation engine 310 identifies 406 documents referring to the entity using the determined 404 model. The determined 404 model is used to identify 406 a second set of documents likely to refer to the entity. In one embodiment, the disambiguation engine 310 returns to determining 404 a subsequent model based on the identified 406 documents [claimed fourth data set of the previous iteration]…)
	As per claim 22, the claim is a system claim analogous to claim 9 and is therefore rejected with the same rationale applied against claim 9.

As per claim 10, the combination of Lar, Li, and Rao teaches the method of claim 1 and the rejection of claim 1 is incorporated. While Lar teaches the selection of data batches for training each sequential model over a training iteration/epoch, as noted above (Lar in 9:4-27 & Li in 5:6-43), Lar does not expressly teach the claim 10 limitations.
Li does expressly teach the claim 10 limitations. Li further teaches wherein the first dataset comprises random data (in 5:6-13: Machine learning techniques train models to accurately make predictions on data fed into the models (e.g., what was said by a user in a given utterance; whether a noun is a person, place, or thing; what the weather will be like tomorrow). During a learning phase, the models are developed against a training dataset of inputs (e.g., sample A, sample B, sample C) [claimed the first dataset comprises random data] to optimize the models to correctly predict the output for a given input…; and in 5:31-43: In another example, for an unsupervised learning phase, a model is developed to cluster the dataset into n groups, and is evaluated over several epochs in how consistently it places a given input into a given group and how reliably it produces the n desired clusters across each epoch.; Examiner note: these groups of separate data sets are interpreted as random data sets.)
The Li and Lar references would have been recognized by those of ordinary skill in the art as useful for applicant’s purpose in developing natural language information processing systems and methods using sequential modeling algorithms.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the prior art for training sequential models using randomly selected groups of data, as disclosed by Li, with the method for generating sequential models using probability distributions and Kullback-Leibler divergence over sampled subsets, as disclosed by Lar.
One of ordinary skill in the art would have been motivated to combine the disclosed methods of Li and Lar in order to determine a supervisory training signal for optimizing sequential models over different training epochs (Li, 7:41-52); doing so will provide a supervisory training signal for training the sequential models without the need to label the respective training datasets (Li, 7:55-64).

As per claim 23, the claim is a system claim analogous to claim 10 and is therefore rejected with the same rationale applied against claim 10.

As per claim 11, the combination of Lar, Li, and Rao teaches the method of claim 10 and the rejection of claim 10 is incorporated. Lar further teaches wherein the second dataset comprises data that is cleaner than the … data of the first dataset (in 9:9-24: For example, a model can specify the probability that a document refers to a particular entity given a set of features in the document. The first model is used to identify a first set of documents likely to refer to the entity [claimed second dataset wherein the second dataset comprises data that is cleaner than the … data of the first dataset]. The disambiguation engine 310 determines 404 a subsequent model based on the features of the documents that are identified 402 as referring to the entity. For example, analysis of the documents identified 402 as referring to the entity can provide insights into the characteristics of documents that refer to the entity.).
	Lar does not expressly teach the data of the first dataset as random data. Li teaches the data of the first dataset as random data (in 5:6-13: Machine learning techniques train models to accurately make predictions on data fed into the models (e.g., what was said by a user in a given utterance; whether a noun is a person, place, or thing; what the weather will be like tomorrow). During a learning phase, the models are developed against a training dataset of inputs (e.g., sample A, sample B, sample C) [claimed the first dataset comprises random data] to optimize the models to correctly predict the output for a given input…)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Li and Lar for the same reasons disclosed above.
	As per claim 24, the claim is a system claim analogous to claim 11 and is therefore rejected with the same rationale applied against claim 11.
	
	As per claim 12, the combination of Lar, Li, and Rao teaches the method of claim 1 and the rejection of claim 1 is incorporated. Lar further teaches wherein the respective contrastive score comprises a Kullback-Leibler (KL) divergence. (in 9:52-66: The number of iterations through the loop illustrated in FIG. 4 can depend on a variety of factors. For example, the disambiguation engine 310 can iterate a predetermined number of times. As other examples, the disambiguation engine 310 can iterate until the model has converged on a stable condition [claimed determined divergence of the subsequent model iteration], a predetermined resource budget has been consumed, the improvement provided by an iteration falls below some improvement threshold, an iteration does not introduces a number of new features that falls below some threshold, the Kullback-Leibler divergence [claimed wherein the respective contrastive score comprises a Kullback-Leibler (KL) divergence] between the probability distribution over entity assignments to documents in subsequent iterations falls below some threshold, and so on...)
	Li teaches the Kullback-Leibler divergence as a score. (in 7:27-41: The posterior distribution of the teacher model 150 is referred to as PT(s|xT) and the posterior distribution of the student model 160 is referred to as PS(s|xS), where xT and xS represent the parallel inputs from different domains to the teacher model 150 and student model 160 respectively and s represents the senones (or phonemes) that have been analyzed. Using the above definitions, a divergence score of the Kullback-Leibler divergence [claimed respective contrastive score as Kullback-Leibler divergence score] between the two speech recognition models determined by the output comparator 170 may be calculated according to FORMULA 1,...)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Li and Lar for the same reasons disclosed above.
	As per claim 25, the claim is a system claim analogous to claim 12 and is therefore rejected with the same rationale applied against claim 12.
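For illustration only, and not as a reproduction of FORMULA 1 of Li or of any code in the cited references, the following minimal Python sketch shows how a Kullback-Leibler divergence between two discrete posterior distributions could be computed and used as a score, and how such a score could be compared against a threshold as a stopping condition. The function name, the example distributions, and the threshold value are hypothetical and are assumed here only to make the arithmetic concrete.

```python
import math


def kl_divergence(p, q):
    """Return KL(p || q) in nats for two discrete distributions.

    Both inputs are sequences of probabilities over the same outcomes.
    Terms with p_i == 0 contribute nothing; q_i == 0 with p_i > 0 gives infinity.
    """
    if len(p) != len(q):
        raise ValueError("distributions must cover the same outcomes")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue
        if q_i == 0.0:
            return math.inf
        total += p_i * math.log(p_i / q_i)
    return total


# Hypothetical posteriors over three classes (values chosen for illustration).
teacher_posterior = [0.7, 0.2, 0.1]
student_posterior = [0.5, 0.3, 0.2]

# The divergence serves as a score for how far the two outputs differ.
score = kl_divergence(teacher_posterior, student_posterior)

# Hypothetical stopping rule: stop iterating once the score falls below a threshold.
THRESHOLD = 0.01  # assumed value, not taken from any reference
should_stop = score < THRESHOLD
```

In this sketch, identical distributions give a score of zero, and larger scores indicate greater disagreement between the two outputs.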
	
As per claim 13, the combination of Lar, Li, and Rao teaches the method of claim 1 and the rejection of claim 1 is incorporated. Lar further teaches wherein each dataset comprises sentence language pairs. (entity and feature pairs in textual language as claimed pairs, in 7:59-65: Features 308 are associated with the entity to which it is desired to disambiguate references. A feature is any property that can be represented in or by a document 302. For example, if the entity is Bob Dylan, features of the entity Bob Dylan could include, for example, the text "Bob Dylan", an image of Bob Dylan, an audio clip of a song by Bob Dylan, a sentence about [wherein each dataset comprises sentence language pairs] Bob Dylan, and so on…)
	As per claim 26, the claim is a system claim analogous to claim 13 and is therefore rejected with the same rationale applied against claim 13.

Claims 5-7 and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Lar in view of Li and Rao, and in further view of "Dynamic Data Selection for Neural Machine Translation" to Wees et al. (hereinafter "Wees").
	As per claim 5, the combination of Lar, Li, and Rao teaches the method of claim 4 and the rejection of claim 4 is incorporated. Lar teaches the selection of data batches for training each sequential model over a training iteration/epoch, as noted above (Lar in 9:4-27 & Li in 5:6-43), and Rao teaches sorting the selected number of data pairs based on the respective contrastive scores; and removing, from the data batch, a removal ratio of the selected number of data pairs with lowest contrastive scores (in 0108: FIG. 2D shows the example cross-correlation results (step 120 of method 100) in an overall correlation coefficients table. In the coefficient table of FIG. 2D, highly correlated pairs (x-y cells) of inputs (measurements and derived values of the dataset) with values greater than a pre-defined threshold [claimed based on the respective contrastive scores; Examiner notes capturing predictive measurements and dataset correlation values as claimed respective contrastive scores] are identified (in cross pattern) and put into a high-correlation tag group [claimed sorting the selected number of data pairs based on the respective contrastive scores; and removing, from the data batch, a removal ratio of the selected number of data pairs with lowest contrastive scores]. FIGS. 2E-2F illustrates more details of example correlation function curves over a pre-defined time window (240 min or 4 hours). An application in the plant then performs preliminary feature selection (step 125 of method 100) and secondary feature selection (step 130 of method 100) on the dataset to reduce [and removing, from the data batch, a removal ratio of the selected number of data pairs with lowest contrastive scores] the dataset to only 31 selected inputs (of process variables' measurements and derived feature variables' values) to train/test the failure model for a C2 Splitter with Hydrate Formation problem. Further, the plant application builds and executes a PLS model using only the 31 inputs selected from a total of over 1000 variables as the failure model inputs and a historical process failure (alert) event as the failure model output.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Rao, Li, and Lar for the same reasons disclosed above.

Lar, Li, and Rao do not expressly teach the claim 5 limitations.
Wees does expressly teach claim 5 limitations. Wees teaches wherein generating the plurality of data batches comprises: determining a selection ratio for each data batch (Wees Pg. 4 Sec. 3; “For example, if we start with the complete bitext (α = 1), select the top 60% (β = 0.6) every second epoch (η = 2), then we run epochs 1 and 2 with a subset of size |G|, epochs 3 and 4 with a subset of size 0.6 · |G|, epochs 5 and 6 with a subset of size 0.36 · |G|, and so on. For every size n, the actual selection contains the top n sentences pairs of G.”);
	determining a batch size for each data batch, wherein the batch size is based on the selection ratio and a number of data pairs in the third dataset (Wees Pg. 4 Sec. 3; “…the selection size n is a function of epoch i...” see eq. 5)
	selecting a number of data pairs from the third dataset that corresponds with the determined batch size (Wees Pg. 4 Sec. 3; “For every size n, the actual selection contains the top n sentences pairs of G.”)
	sorting the selected number of data pairs based on the respective contrastive scores (Wees Pg. 2 Sec. 3; “Finally, we rank all sentence pairs s ∈ G according to their CEDs, and then select only the top n sentence pairs with the lowest CED”);
	removing, from the data batch, a removal ratio of the selected number of data pairs with lowest contrastive scores (Wees Pg. 4 Sec. 3 “0  ≤ β  ≤ 1 is the retention rate, i.e., the fraction of data to be kept in each new selection…For every size n, the actual selection contains the top n sentences pairs of G.”),
	the removal ratio comprising an inverse of the selection ratio (Examiner note: It’s implicit that the removal ratio is the inverse of the selection ratio, in Pg. 4 Sec. 3 “0  ≤ β  ≤ 1 is the retention rate, i.e., the fraction of data to be kept in each new selection…For every size n, the actual selection contains the top n sentences pairs of G.”).
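For illustration only, the following minimal Python sketch works through the arithmetic of the Wees example quoted above (α = 1, β = 0.6, η = 2). It assumes the selection size follows n = α · |G| · β^⌊(i − 1)/η⌋ for 1-indexed epoch i, a form chosen here because it reproduces the quoted sizes |G|, 0.6 · |G|, and 0.36 · |G|; it is not a reproduction of Wees eq. 5 or of any code in the reference. The function names and the toy scores are hypothetical.

```python
def selection_size(epoch, total_pairs, alpha=1.0, beta=0.6, eta=2):
    """Number of sentence pairs kept at a 1-indexed epoch.

    Assumed schedule: n = alpha * |G| * beta ** floor((epoch - 1) / eta).
    """
    if epoch < 1:
        raise ValueError("epoch is 1-indexed")
    shrink_steps = (epoch - 1) // eta
    return int(round(alpha * total_pairs * beta ** shrink_steps))


def select_top(scored_pairs, n):
    """Rank pairs by ascending score (lowest CED first) and keep the top n."""
    ranked = sorted(scored_pairs, key=lambda item: item[1])
    return ranked[:n]


# Hypothetical bitext of 1000 pairs; selection sizes for epochs 1 through 6.
sizes = [selection_size(epoch, 1000) for epoch in range(1, 7)]

# Hypothetical (pair_id, CED score) values for a five-pair example.
toy_pairs = [("a", 0.9), ("b", -0.4), ("c", 0.1), ("d", -1.2), ("e", 0.5)]
kept = select_top(toy_pairs, selection_size(3, len(toy_pairs)))

# With beta as the retention rate, the fraction dropped at each shrink step.
dropped_fraction = 1.0 - 0.6
```

Under this reading, each shrink step retains a fraction β of the previous selection and drops the remaining 1 − β, so the kept and dropped fractions are determined by the same parameter.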
The Wees, Rao, Li, and Lar references would have been recognized by those of ordinary skill in the art as useful for applicant’s purpose in developing natural language information processing systems and methods using sequential modeling algorithms.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the prior art for training sequential models on sampled datasets using a gradual fine-tuning approach, as disclosed by Wees, with the method for generating sequential models using sampled subsets over a training iteration/epoch, as collectively disclosed by Rao, Li, and Lar.
One of ordinary skill in the art would have been motivated to combine the disclosed methods of Wees, Rao, Li, and Lar in order to use a gradual fine-tuning method for selecting data sets for training sequential models (Wees, Sec. 5.2: Right Col., 2nd and 3rd paras.); doing so will improve performance over each training iteration/epoch (Wees, Sec. 5.2: Right Col., 2nd and 3rd paras.).

	As per claim 18, the claim is a system claim analogous to claim 5 and is therefore rejected with the same rationale applied against claim 5.

As per claim 6, the combination of Lar, Li, Rao, and Wees teaches the method of claim 5 and the rejection of claim 5 is incorporated. While Lar teaches the selection of data batches for training each sequential model over a training iteration/epoch, as noted above (Lar in 9:4-27 & Li in 5:6-43), Lar and Li do not expressly teach the claim 6 limitations.
Wees does expressly teach the claim 6 limitations. Wees further teaches wherein the selection ratio decreases over training time (Examiner note: Wees Figure 1b shows the selection ratio decreases after each epoch.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Wees, Rao, Li, and Lar for the same reasons disclosed above.
	As per claim 19, the claim is a system claim analogous to claim 6 and is therefore rejected with the same rationale applied against claim 6.

As per claim 7, the combination of Lar, Li, Rao, and Wees teaches the method of claim 6 and the rejection of claim 6 is incorporated. While Lar teaches the selection of data batches for training each sequential model over a training iteration/epoch, as noted above (Lar in 9:4-27 & Li in 5:6-43), Lar and Li do not expressly teach the claim 7 limitations.
Wees does expressly teach the claim 7 limitations. Wees further teaches wherein the batch size is equal to a fixed batch size divided by the selection ratio. (Wees Pg. 3 Sec. 3: “we gradually decrease the training data size, starting from G and keeping only the top n sentence pairs for the duration of n epochs, where the top n pairs are defined by their CEDs scores.”
Examiner note: one way to decrease the data size is to multiply (the inverse of dividing) the batch size by the selection ratio, as shown in Fig. 1b.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Wees, Rao, Li, and Lar for the same reasons disclosed above.
As per claim 20, the claim is a system claim analogous to claim 7 and is therefore rejected with the same rationale applied against claim 7.
	
Claims 8 and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Lar in view of Li and Rao, and in further view of US Pub. No. 2016/0078339 to Li et al. (hereinafter "Li_2").
As per claim 8, the combination of Lar, Li, and Rao teaches the method of claim 1 and the rejection of claim 1 is incorporated. While Lar teaches the selection of data batches for training each sequential model over a training iteration/epoch, as noted above (Lar in 9:4-27 as depicted in Fig. 4; Li in 5:6-43 and as depicted in Fig. 1), Lar and Li do not expressly teach the claim 8 limitations.
 
	Li_2 does expressly teach the claim 8 limitations. Li_2 teaches wherein the target model has a first size and the base model has a second size, and wherein the first size is larger than the second size. (for a training iteration, teacher DNN as claimed target model and student DNN as claimed base model, where the size of the target model is larger than the base model as depicted in Fig. 3: Turning now to FIG. 3, aspects of a system 300 for learning a smaller student DNN from a larger teacher DNN are illustratively provided, in accordance with an embodiment of the invention. Example system 300 includes teacher DNN 302 and a smaller student DNN 301, which is depicted as having fewer nodes on each of its layers 341. As described previously, in one embodiment of the invention teacher DNN 302 comprises a trained DNN model, which may be trained according to standard techniques known to one of ordinary skill in the art (such as the technique described in connection to FIG. 2). In another embodiment, a teacher DNN may be trained such as described in connection to the trainer component 126 of FIG. 1. In either case, it is assumed that there is a good teacher (i.e. a trained teacher DNN) from which to learn the student DNN. Further, student DNN 301 and teacher DNN 302 may be embodied as a CD-DNN-HMM having a number of hidden layers 341 and 342, respectively. In the embodiment shown in FIG. 3, student DNN 301 has output distribution 351, and teacher DNN 302 has output distribution 302 of the same size, although the student DNN 301 itself is smaller than teacher DNN 302.; and 0044: As previously described, some embodiments may determine convergence using a threshold, wherein the distribution 351 of the student DNN 301 is determined to have converged with the distribution 352 of the teacher DNN 302 where the error is below a specified threshold, which may be pre-determined and also may be based on the specific application of the DNNs (or the type of data 310 used by the DNNs) or the size of the student DNN. For example, it is expected that a student DNN that has close to the same number of parameters as the teacher DNN will reach better convergence (lower error signal and thus higher accuracy) than a student DNN that is much smaller than the teacher DNN. System 300 may also determine convergence or otherwise stop iterating where it is determined that the error signal is no longer getting smaller over subsequent iterations. In other words, the student has learned all that it can from the teacher, for the available data.)
	
The Li_2, Li, and Lar references would have been recognized by those of ordinary skill in the art as useful for applicant’s purpose in developing natural language information processing systems and methods using sequential modeling algorithms.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the prior art for training sequential models on sampled datasets to learn from a target teacher model, as disclosed by Li_2, with the method for generating sequential models using sampled subsets over a training iteration/epoch, as collectively disclosed by Li and Lar.
One of ordinary skill in the art would have been motivated to combine the disclosed methods of Li_2, Li, and Lar in order to use an iterative process for training sequential models on unlabeled training data by using a supervised signal (Li_2, Abstract); doing so will help to provide an accurate signal to minimize the divergence of the output from the two models and produce a trained model for processing applications on resource-limited devices (Li_2, Abstract).
	As per claim 21, the claim is a system claim analogous to claim 8 and is therefore rejected with the same rationale applied against claim 8. 

	
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to OLUWATOSIN ALABI whose telephone number is (571)272-0516. The examiner can normally be reached Monday-Friday, 8:00am-5:00pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael Huntley, can be reached at (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/OLUWATOSIN O ALABI/Examiner, Art Unit 2129