DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.


Response to Amendment
This Office Action is in response to applicant’s communication filed 19 October 2022, in response to the Office Action mailed 2 August 2022.  The applicant’s remarks and any amendments to the claims or specification have been considered, with the results that follow.


Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1-2, 6-9, 13-16, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Rodriguez-Galiano, et al., (An assessment of the effectiveness of a random forest classifier for land-cover classification, Jan 2012, pgs. 93-104), in view of Maughan (US 2017/0330109), and further in view of Liu (US 2002/0174088).

As per claim 1, Rodriguez-Galiano teaches a system comprising: a significance component that: determines whether data fields of data records of a dataset are deemed to be significant based on a significance function [an importance measure may be determined for each variable, which may then be assigned a rank based upon the value (pg. 93, abstract; pgs. 98-99, section 4.3; etc.)], labels a first set of the data fields that are determined to be significant with a first indication of being a significant data field [an importance measure may be determined for each variable, which may then be assigned a rank based upon the value (pg. 93, abstract; pgs. 98-99, section 4.3; etc.)], and labels a second set of the data fields that are determined not to be significant with a second indication of being a non-significant data field [an importance measure may be determined for each variable, which may then be assigned a rank based upon the value (pg. 93, abstract; pgs. 98-99, section 4.3; etc.); where the different importance measure/rank values are first and second indications of significant/non-significant data fields]; a training component that trains, using the dataset, a modified random forest model based on a training process that employs the first indication of being the significant data field and the second indication of being the non-significant data field [the system includes training a random forest model on a number of datasets, including utilizing the variable importance measures (pgs. 97-99, sections 4.1-4.4; etc.)]; a sampling component that, during the training process: generates decision trees of the modified random forest model, wherein each decision tree is generated using a different group of data records of the dataset, and wherein each decision tree is generated using a different group of data fields of the data fields [sampling is performed to create the reference datasets used (pg. 95, section 3.2; pg. 96, section 3.3; pg. 99, sections 4.4-4.5; etc.) and the random forest increases the diversity of the trees by growing them from different training data subsets created by bagging or bootstrap aggregating, which includes randomly resampling the original dataset for the bootstrap aggregating and selecting samples of the dataset for bagging (pg. 96, section 3.3, etc.)]; and a runtime component that imputes, during an analysis of a new data record using the modified random forest model, other data values for respective data fields of the second set that are missing data values in the new data record [an algorithm is used by the random forest models to infer missing data (imputation) in the datasets (pg. 93, abstract; pg. 96, section 3.3; etc.) where sampling is performed to create the reference datasets used (pg. 95, section 3.2; pg. 99, sections 4.4-4.5; etc.)].
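For illustration only, the labeling of data fields from importance measures discussed above can be sketched as follows. This sketch is not taken from Rodriguez-Galiano or from the application; the function name label_fields, the threshold parameter, and the example scores are hypothetical assumptions.

```python
def label_fields(importance, threshold):
    """Map each field name to a significance label based on its importance score.

    Assumption: a field counts as significant when its score meets the threshold.
    """
    return {
        field: "significant" if score >= threshold else "non-significant"
        for field, score in importance.items()
    }


# Hypothetical importance scores for three data fields.
scores = {"field_a": 0.42, "field_b": 0.07, "field_c": 0.31}
labels = label_fields(scores, threshold=0.25)
```

Under these assumed scores, field_a and field_c receive the first indication and field_b receives the second indication.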
While Rodriguez-Galiano teaches a machine learning system with various components, and determining importance measures of variables (see above), it does not talk about the actual physical implementation, and thus does not teach a memory that stores computer executable components; a processor, operably coupled to the memory, and that executes computer executable components stored in the memory, wherein the computer executable components comprise the significance and training components.  Furthermore, while Rodriguez-Galiano teaches inferring missing data values (see above), it does not explicitly teach an imputation component that imputes, during the training process, data values only for ones of the second set of data fields that are missing data values in data records of the dataset; and that the runtime component selects, during the analysis, only one or more decision trees of the modified random forest model that respectively have sampled data fields that all have corresponding data fields of the new data record that have data values.
Maughan teaches a memory that stores computer executable components; a processor, operably coupled to the memory, and that executes computer executable components stored in the memory, wherein the computer executable components comprise the significance and training [the system includes hardware embodiments including processors and connected memories, as well as modules implemented in hardware and/or software (paras. 0022-27, etc.)]; an imputation component that imputes, during the training process, data values only for ones of the second set of data fields that are missing data values in data records of the dataset [the retrain module may include an imputation/estimation component for imputing missing values, and may only impute values for missing data for features that are less significant based on a significant indication (paras. 0078, 0097, etc.)]; and a runtime component that imputes, during an analysis of a new data record using the modified random forest model, other data values for respective data fields of the second set that are missing data values in the new data record [the retrain module may include an imputation/estimation component for imputing missing values, and may only impute values for missing data for features that are less significant based on a significant indication (paras. 0078, 0097, etc.); for the modified random forest model of Rodriguez-Galiano, above].
Rodriguez-Galiano and Maughan are analogous art, as they are within the same field of endeavor, namely machine learning.
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to implement a system including hardware components and to only impute values for features indicated as less significant, as taught by Maughan, for the training of the models using importance indications for data in the system taught by Rodriguez-Galiano.
Maughan provides motivation as [the system requires various hardware and software components and modules but they may take various forms including programmable hardware, software instructions, etc. (paras. 0022-30, etc.) and a user may not want to use results where data has been imputed if the associated data is marked as significant, whereas they may risk using results where less significant data is missing (para. 0097)].
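For illustration only, imputing values solely for data fields labeled as non-significant can be sketched as follows. This is not Maughan's implementation; the function name impute_non_significant, the use of None to mark a missing value, and the choice of a mean as the fill value are assumptions made for the sketch.

```python
from statistics import mean


def impute_non_significant(records, significant_fields):
    """Fill missing (None) values with the column mean, skipping significant fields."""
    all_fields = {field for record in records for field in record}
    fill_values = {}
    for field in all_fields - set(significant_fields):
        observed = [r[field] for r in records if r.get(field) is not None]
        if observed:
            fill_values[field] = mean(observed)
    # Build new records so the input dataset is left unmodified.
    return [
        {
            field: fill_values[field] if value is None and field in fill_values else value
            for field, value in record.items()
        }
        for record in records
    ]


# Hypothetical dataset: "age" is labeled significant, "income" is not.
records = [
    {"age": 40, "income": None},
    {"age": None, "income": 50.0},
    {"age": 60, "income": 70.0},
]
imputed = impute_non_significant(records, significant_fields={"age"})
```

In this sketch the missing "income" value is filled, while the missing "age" value is left missing because that field is labeled significant.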
Liu teaches a runtime component that selects, during the analysis, only one or more decision trees of the modified random forest model that respectively have sampled data fields that all have corresponding data fields of the new data record that have data values [during runtime use of the decision trees only the trees that are based on a subset of variables that do not include missing data are selected and used (paras. 0012, 0039, 0054, etc.)].
Rodriguez-Galiano and Liu are analogous art, as they are within the same field of endeavor, namely using decision trees for classification.
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to only use, by the classifier, the decision trees that do not have variables with missing data, as taught by Liu, for the decision trees used for classification in a system which may include missing data, in the system taught by Rodriguez-Galiano.
Liu provides motivation as [If information is missing from the record, the first classification tree is used initially because it may be possible that the missing information is not needed to predict class membership. However, if the missing information is needed, a classification tree that is based on a subset of variables that does not include the missing information is selected and used for predicting class membership (para. 0012, etc.)].
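For illustration only, the runtime selection of decision trees whose sampled data fields all have values in the new data record can be sketched as follows. This is not Liu's implementation; the representation of a tree as a dictionary with a "fields" entry, the function name select_usable_trees, and the use of None for a missing value are assumptions.

```python
def select_usable_trees(trees, record):
    """Return only the trees whose sampled fields all have values in the record."""
    return [
        tree
        for tree in trees
        if all(record.get(field) is not None for field in tree["fields"])
    ]


# Hypothetical forest: each tree records the fields it was sampled with.
trees = [
    {"name": "t1", "fields": ("age", "income")},
    {"name": "t2", "fields": ("age", "region")},
    {"name": "t3", "fields": ("region",)},
]
new_record = {"age": 35, "income": None, "region": "north"}
usable = select_usable_trees(trees, new_record)
```

Here tree t1 is excluded because it depends on "income", which is missing from the new record.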

As per claim 2, Rodriguez-Galiano/Maughan/Liu teaches wherein the imputation component imputes the data values using an imputation function selected from a group consisting of a weighted average function, an average function, a median function, a mean function, a random guess function, a zero-value replacement function, a regression estimation function, and a Bayesian function [an algorithm is used by the random forest models to infer missing data (imputation) in the datasets (Rodriguez-Galiano: pg. 93, abstract; pg. 96, section 3.3; etc.) where the retrain module may include an imputation/estimation component for imputing missing values, using various average and standard deviation functions or a machine learning model including regression, etc. (Maughan: paras. 0038-48, 0057-59, 0078, 0097, etc.)].

As per claim 6, Rodriguez-Galiano/Maughan/Liu teaches wherein the new data record comprises one or more data fields from the first set [an algorithm is used by the random forest models to infer missing data (imputation) in the datasets (Rodriguez-Galiano: pg. 93, abstract; pg. 96, section 3.3; etc.) where sampling is performed to create the reference datasets used (Rodriguez-Galiano: pg. 95, section 3.2; pg. 99, sections 4.4-4.5; etc.) where the retrain module may include an imputation/estimation component for imputing missing values, using various average and standard deviation functions or a machine learning model including regression, etc. (Maughan: paras. 0038-48, 0057-59, 0078, 0097, etc.)].

As per claim 7, Rodriguez-Galiano/Maughan/Liu teaches wherein the runtime component further: generates, during the analysis, predictions respectively from the one or more decision trees using the new data record; and performs, during the analysis, an ensemble operation on the predictions to generate a final prediction result [the system uses an ensemble learning technique called random forests for classification (Rodriguez-Galiano: pg. 94, section 1; etc.) where an algorithm is used by the random forest models to infer missing data (imputation) in the datasets being used (Rodriguez-Galiano: pg. 93, abstract; pg. 96, section 3.3; etc.)].
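For illustration only, an ensemble operation over per-tree predictions can be sketched as a majority vote. Majority voting is an assumed choice of ensemble operation, and the function name ensemble_predict and the stub trees are hypothetical; none of this is drawn from the cited references.

```python
from collections import Counter


def ensemble_predict(trees, record):
    """Collect one prediction per tree and return the most frequent one."""
    predictions = [tree(record) for tree in trees]
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical stub trees, each a callable mapping a record to a class label.
stub_trees = [
    lambda r: "class_a" if r["x"] < 0.3 else "class_b",
    lambda r: "class_b" if r["y"] > 500 else "class_a",
    lambda r: "class_b",
]
final = ensemble_predict(stub_trees, {"x": 0.2, "y": 800})
```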

As per claim 8, see the rejection of claim 1, above.

As per claim 9, see the rejection of claim 2, above.

As per claim 13, see the rejection of claim 6, above.

As per claim 14, see the rejection of claim 7, above.

As per claim 15, see the rejection of claim 1, above, wherein Rodriguez-Galiano/Maughan/Liu also teaches a computer program product facilitating the training, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processor to cause the processor to perform the steps [aspects of the present disclosure may take the form of a computer program product embodied in one or more computer readable storage media having computer readable program code embodied thereon (Maughan: paras. 0022-27, etc.)].

As per claim 16, Rodriguez-Galiano/Maughan/Liu teaches wherein the imputation component imputes the data values using an imputation function selected from a group of functions consisting of a weighted average function, an average function, a median function, a mean function, a most common value function, a zero-value replacement function, a regression estimation function, and a Bayesian function [an algorithm is used by the random forest models to infer missing data (imputation) in the datasets (Rodriguez-Galiano: pg. 93, abstract; pg. 96, section 3.3; etc.) where the retrain module may include an imputation/estimation component for imputing missing values, using various average and standard deviation functions or a machine learning model including regression, etc. (Maughan: paras. 0038-48, 0057-59, 0078, 0097, etc.)].

As per claim 20, see the rejection of claims 6-7, above.


Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Rodriguez-Galiano, Maughan, and Liu as applied to claim 1 above, and further in view of Ma (US 2007/0185727).

As per claim 3, Rodriguez-Galiano/Maughan/Liu teaches the system of claim 1, as described above.
While Rodriguez-Galiano/Maughan/Liu teaches imputing missing values using various functions (see above) it does not explicitly teach wherein the imputation component imputes the data values using a most common value function.
Ma teaches wherein the imputation component imputes the data values using a most common value function [the facility imputes missing values using the median value in the same column for continuous variables, or the mode (i.e., most frequent) value for categorical values (paras. 0046, 0075, etc.)].
Rodriguez-Galiano/Maughan/Liu and Ma are analogous art, as they are within the same field of endeavor, namely imputing missing values for a machine learning model including a random forest.
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to use the mode for imputing missing values, as taught by Ma, for the imputation of missing values in Rodriguez-Galiano/Maughan/Liu.
Because both Rodriguez-Galiano/Maughan/Liu and Ma teach imputing missing values, it would have been obvious to one of ordinary skill in the art to use the mode for imputing missing values, as taught by Ma, for the imputation of missing values in Rodriguez-Galiano/Maughan/Liu, to achieve the predictable result of trying different functions to find the most accurate values for those missing.
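For illustration only, imputation using the median for continuous values and the most common value (mode) for categorical values can be sketched as follows. This is not Ma's implementation; the function name impute_column, the use of None for a missing value, and the rule for deciding that a column is numeric are assumptions.

```python
from statistics import median, mode


def impute_column(values):
    """Fill None entries with the median (numeric column) or the mode (otherwise)."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to learn a fill value from
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in observed
    )
    fill = median(observed) if is_numeric else mode(observed)
    return [fill if v is None else v for v in values]


# Hypothetical columns, one continuous and one categorical.
continuous = impute_column([1.0, None, 3.0, 10.0])
categorical = impute_column(["a", "b", None, "b"])
```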


Claims 4, 5, 11, 12, 18, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Rodriguez-Galiano, Maughan, and Liu as applied to claims 1-3, 8-10, and 15-17 above, and further in view of Wei (US 2018/0046475).

As per claim 4, Rodriguez-Galiano/Maughan/Liu teaches the system of claim 3, as described above.
While Rodriguez-Galiano/Maughan/Liu teaches filling in some missing data (see above) it does not explicitly teach wherein the sampling component further: filters out, during the training process, from a first group of data records of the different groups of data records, a data record of the data records having a data field from the first set and the data field is missing a data value.
Wei teaches wherein the sampling component further: filters out, during the training process, from a first group of data records of the different groups of data records, a data record of the data records having a data field from the first set and the data field is missing a data value [preprocessing can additionally or alternatively include filtering out noise, missing data and outliers, and/or down-sampling or over-sampling data (para. 0070, etc.)].
Rodriguez-Galiano and Wei are analogous art, as they are within the same field of endeavor, namely training models including random forest models.
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to include the preprocessing of the training data for training the random forest, as taught by Wei, in the data for training the random forest in the system of Rodriguez-Galiano/Maughan/Liu.
Wei provides motivation as [the preprocessing steps improve the quality and balance of the dataset, improving the data and thus the model(s) (para. 0070)].
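For illustration only, filtering out data records that are missing a value in a significant data field can be sketched as follows. This is not Wei's implementation; the function name filter_records and the use of None for a missing value are assumptions.

```python
def filter_records(records, significant_fields):
    """Keep only records that have a value for every significant field."""
    return [
        record
        for record in records
        if all(record.get(field) is not None for field in significant_fields)
    ]


# Hypothetical group of records in which "age" is labeled significant.
group = [
    {"age": 40, "income": None},
    {"age": None, "income": 50.0},
    {"age": 60, "income": 70.0},
]
kept = filter_records(group, significant_fields={"age"})
```

The record missing "age" is dropped, while the record missing only the non-significant "income" field is retained.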

As per claim 5, Rodriguez-Galiano/Maughan/Liu/Wei teaches wherein the training component further: generates, during the training process, a decision tree of the modified random forest model based on the first group of data records [the system includes generating (Rodriguez-Galiano: pg. 96, sections 3.3-3.3.1) and training a random forest model on a number of datasets, including utilizing the variable importance measures (Rodriguez-Galiano: pgs. 97-99, sections 4.1-4.4; etc.)].

As per claim 11, see the rejection of claim 4, above.

As per claim 12, see the rejection of claim 5, above.

As per claim 18, see the rejection of claim 4, above.

As per claim 19, see the rejection of claim 5, above.


Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Rodriguez-Galiano, Maughan, and Liu as applied to claim 8 above, and further in view of Brunner (US 2017/0249434).

As per claim 10, Rodriguez-Galiano/Maughan/Liu teaches the computer-implemented method of claim 8, as described above.
While Rodriguez-Galiano/Maughan/Liu teaches imputing missing values using various functions (see above) it does not explicitly teach wherein the imputing of the data values employs a regression estimation function.
Brunner teaches wherein the imputing of the data values employs a regression estimation function [variable estimation via regression models may be used to estimate missing values (para. 0238, etc.)].
Rodriguez-Galiano/Maughan/Liu and Brunner are analogous art, as they are within the same field of endeavor, namely imputing missing values for a machine learning model including a random forest.
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to use the regression estimation for imputing missing values, as taught by Brunner, for the imputation of missing values in Rodriguez-Galiano/Maughan/Liu.
Because both Rodriguez-Galiano/Maughan/Liu and Brunner teach imputing missing values, it would have been obvious to one of ordinary skill in the art to use the regression estimation for imputing missing values, as taught by Brunner, for the imputation of missing values in Rodriguez-Galiano/Maughan/Liu, to achieve the predictable result of trying different functions to find the most accurate values for those missing.  Brunner provides further motivation as [using the regression model allows estimating values that will improve the model fitting without introduction of bias (para. 0238, etc.)].
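For illustration only, a regression estimation function for imputing a missing value can be sketched with an ordinary least-squares line fitted on the complete records. This is not Brunner's implementation; the single-predictor model, the function names fit_line and impute_by_regression, and the field names "x" and "y" are assumptions.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def impute_by_regression(rows, x_key, y_key):
    """Fill missing y values with predictions from a line fitted on complete rows."""
    complete = [r for r in rows if r[x_key] is not None and r[y_key] is not None]
    slope, intercept = fit_line(
        [r[x_key] for r in complete], [r[y_key] for r in complete]
    )
    return [
        {**r, y_key: slope * r[x_key] + intercept}
        if r[y_key] is None and r[x_key] is not None
        else dict(r)
        for r in rows
    ]


# Hypothetical records in which y is roughly twice x and one y value is missing.
rows = [
    {"x": 1, "y": 2},
    {"x": 2, "y": 4},
    {"x": 3, "y": None},
    {"x": 4, "y": 8},
]
filled = impute_by_regression(rows, "x", "y")
```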


Claim 17 is rejected under 35 U.S.C. 103 as being unpatentable over Rodriguez-Galiano, Maughan, and Liu as applied to claim 15 above, and further in view of Morris (US 2009/0326976).

As per claim 17, Rodriguez-Galiano/Maughan/Liu teaches the computer program product of claim 15, as described above.
While Rodriguez-Galiano/Maughan/Liu teaches imputing missing values using various functions (see above) it does not explicitly teach wherein the imputation of the data values employs a random guess function.
Morris teaches wherein the imputation of the data values employs a random guess function [an imputation method randomly generates data to represent missing variables (paras. 0033-34, etc.)].
Rodriguez-Galiano/Maughan/Liu and Morris are analogous art, as they are within the same field of endeavor, namely imputing missing values for a machine learning model including a random forest.
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to use the random guess function for imputing missing values, as taught by Morris, for the imputation of missing values in Rodriguez-Galiano/Maughan/Liu.
Because both Rodriguez-Galiano/Maughan/Liu and Morris teach imputing missing values, it would have been obvious to one of ordinary skill in the art to use the random guess function for imputing missing values, as taught by Morris, for the imputation of missing values in Rodriguez-Galiano/Maughan/Liu, to achieve the predictable result of creating missing values without extra calculations (i.e., that does not rely on reading other variable data).
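For illustration only, a random guess function for imputation can be sketched as a random draw from the values already observed in the same data field. This is not Morris's implementation; drawing from observed values (rather than from some other distribution), the function name impute_random, and the use of None for a missing value are assumptions.

```python
import random


def impute_random(values, rng):
    """Replace each None with a value drawn at random from the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # no observed values to draw from
    return [rng.choice(observed) if v is None else v for v in values]


# A seeded generator keeps the hypothetical example reproducible.
column = [5, None, 7, None, 9]
guessed = impute_random(column, random.Random(0))
```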


Response to Arguments
Applicant's arguments filed 19 October 2022 have been fully considered but they are not persuasive.

Applicant argues that the cited art does not teach a sampling component that, during the training process: generates decision trees of the modified random forest model, wherein each decision tree is generated using a different group of data records of the dataset, and wherein each decision tree is generated using a different group of data fields of the data fields.  Applicant argues that Rodriguez-Galiano discloses mechanisms for selecting data records from a dataset for inclusion in the training and testing datasets and how random variables are split at each node of a tree, and not what variables are used for the entire tree.
However, Rodriguez-Galiano teaches that sampling is performed to create the reference datasets used (pg. 95, section 3.2; pg. 96, section 3.3; pg. 99, sections 4.4-4.5; etc.) and the random forest increases the diversity of the trees by growing them from different training data subsets created by bagging or bootstrap aggregating, which includes randomly resampling the original dataset for the bootstrap aggregating and selecting samples of the dataset for bagging (pg. 96, section 3.3, etc.).  This explicitly includes that “a RF increases the diversity of the trees by making them grow from different training data sub-sets created through bagging or bootstrap aggregating” and “Furthermore, when the RF makes a tree grow, it uses the best split of a random subset of input features or predictive variables in the division of every node” (pg. 96, section 3.3).  This describes how a different dataset is selected for creating each decision tree, and how random sets of features/variables are chosen at each node from the dataset for the tree.  While it is theoretically possible that the random selection could end up with the same selection for multiple trees, it explicitly describes them as “different” data sub-sets, which is within the broadest reasonable interpretation of “generates decision trees of the modified random forest model, wherein each decision tree is generated using a different group of data records of the dataset, and wherein each decision tree is generated using a different group of data fields of the data fields”.  The examiner has included several further references that also describe bagging and bootstrapping to create decision trees in a random forest from different (such as by using random selection with subsampling/re-sampling, as also described in Rodriguez-Galiano) selections of data from a dataset, noted below.
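For illustration only, the bagging and random feature selection discussed above can be sketched as follows. The sketch draws one bootstrap sample of data records and one random group of data fields per tree, which is a simplification: Rodriguez-Galiano, as quoted, describes the random subset of features being drawn at every node. The function name sample_tree_inputs and its parameters are hypothetical, and, consistent with the discussion above, random selection does not guarantee that no two trees receive the same selection.

```python
import random


def sample_tree_inputs(n_records, fields, n_trees, fields_per_tree, seed=0):
    """Draw a bootstrap sample of record indices and a random field group per tree."""
    rng = random.Random(seed)
    plans = []
    for _ in range(n_trees):
        # Bootstrap: sample record indices with replacement, same size as the dataset.
        rows = [rng.randrange(n_records) for _ in range(n_records)]
        # Random group of fields for this tree, sampled without replacement.
        cols = sorted(rng.sample(fields, fields_per_tree))
        plans.append({"rows": rows, "fields": cols})
    return plans


# Hypothetical dataset of 10 records with four data fields, used to plan 3 trees.
field_names = ["f1", "f2", "f3", "f4"]
plans = sample_tree_inputs(
    n_records=10, fields=field_names, n_trees=3, fields_per_tree=2
)
```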


Conclusion
The following is a summary of the treatment and status of all claims in the application as recommended by M.P.E.P. 707.07(i): claims 1-20 are rejected.

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Mishra (US 10,733,515) – discloses a system for missing data imputation using various methods, and including imputing missing values from subsets of the training/verification data.
Jung (US 2008/0082271), Shaughnessy (US 7,271,736), Kerlikowske (US 2012/0003639), and Lang (US 2005/0010106) – disclose various systems including using a Chi-square test for significance determinations.
Cowan (US 2012/0053994) – discloses using decision trees to impute missing data values.
Saffari et al. (On-line Random Forests, Oct 2009, pgs. 1393-1400) and Kulkarni et al. (Pruning of Random Forest classifiers: A survey and future directions, July 2012, pgs. 64-68) – describe generating decision trees for a random forest using bagging/bootstrapping that includes selecting different data sets to create each tree.

The examiner requests, in response to this Office action, that support be shown for language added to any original claims on amendment and any new claims. That is, indicate support for newly added claim language by specifically pointing to page(s) and line number(s) in the specification and/or drawing figure(s). This will assist the examiner in prosecuting the application.

When responding to this office action, Applicant is advised to clearly point out the patentable novelty which he or she thinks the claims present, in view of the state of the art disclosed by the references cited or the objections made. He or she must also show how the amendments avoid such references or objections.  See 37 CFR 1.111(c).

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to GEORGE GIROUX whose telephone number is (571)272-9769. The examiner can normally be reached M-F 10am-6pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Omar Fernandez Rivas, can be reached on 571-272-2589. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/GEORGE GIROUX/Primary Examiner, Art Unit 2128