DETAILED ACTION
This is the first office action regarding application number 16/254,033, filed January 12, 2018.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

Specification
The disclosure is objected to because of the following informalities:
Paragraphs [159]-[164] are marked as descriptive text for Figure 11, but the text and the figures do not correspond with each other. For example, paragraph [163] indicates that element 1104 describes the following: “The example method of Figure 11 also includes generating a failure metric for at least one phase in the ML pipeline (1104).” However, Figure 11 element 1104 has the description “Apply Transform(s) to Rows to Data Set”. Another example is found in paragraph [164], where element 1106 is referenced to indicate “The example method of Figure 11 further includes providing an indication of the determined inadequacy of the target data set (1106).” However, Figure 11 element 1106 has the description “Remove Non-Selected Columns from Data Set”. Appropriate correction is required.
Paragraphs [159]-[164] are marked as descriptive text for Figure 11, but two elements of the figure, (1108) and (1110), are not described within these paragraphs or elsewhere in the specification. Appropriate correction is required.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1-3, 6-10, 13-15, and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Ghanta et al., U.S. PGPUB 2020/0034665, filed 7/30/2018 [hereafter referred to as Ghanta] in view of Maag et al., U.S. PGPUB 2017/0220403, published 8/3/2017 [hereafter referred to as Maag].
Regarding Claim 1, Ghanta teaches
A system comprising: 
a memory containing a target data set ([Ghanta paragraphs [0018]-[0019]: memory/storage devices containing program code for modules for a machine learning system, as well as serving as storage for a data set (“a memory containing a target data set”).] [Ghanta paragraph [0041]: training, validation, test sets, inference data set are types of a target data set (“… labels may be required to determine the suitability of the machine learning model, e.g., the accuracy or predictive performance of the machine learning model, to an inference data set during the inference phase. The predictive performance is usually evaluated on either the training data set or a separate validation or test set where both the feature and label information is available …”).] [Ghanta paragraph [0079]: an error data set, which is another type of a target data set (“The resulting output of the validation of the first machine learning algorithm/model, in one embodiment, comprises an error data set. The error data set, in certain embodiments, includes values indicating the prediction error of the first machine learning algorithm/model on the validation data set (e.g., a rate, a score, or other value that indicates how often the first machine learning algorithm/model accurately predicted a label for the validation data set).”).]); 
a software application configured to apply a machine learning (ML) pipeline to an input data set ([Ghanta paragraphs [0018]-[0019]: memory/storage devices containing program code for modules for a machine learning system.] [Ghanta Figure 1, element 104; paragraphs [0043]-[0044]: the ML management apparatus (managing pipelines within a machine learning system; see Ghanta paragraphs [0035]-[0036]) consisting of sub-modules that may be installed and deployed on various hardware (“a software application”), where the various hardware represent computing devices.] [Ghanta Figure 2A, elements 202, 204, 206a-c; paragraph [0056]: ML management apparatus selecting a training pipeline to generate a machine learning model with the input data (“apply a machine learning (ML) pipeline to an input data set”) (“In one embodiment, the machine learning system 200 includes physical and/or logical groupings of the machine learning pipelines 202, 204, 206a-c based on a desired objective, result, problem, and/or the like. For instance, the ML management apparatus 104 may select a training pipeline 204 for generating a machine learning model configured for the desired objective and one or more inference pipelines 206a-c that are configured to analyze the desired objective by processing input data 210 associated with the desired objective using the analytic engines for which the selected inference pipelines 206a-c are configured for and the machine learning model.”).]), 
wherein the ML pipeline includes … an ML model building phase ([Ghanta Figure 2A, elements 202, 204, 206a-c; paragraph [0053]: machine learning pipelines performing various operations including algorithm training/inference and validations (“wherein the ML pipeline includes … an ML model building phase”) (“As used herein, machine learning pipelines 202, 204, 206a-c comprise various machine learning features, components, objects, modules, and/or the like to perform various machine learning operations such as algorithm training/inference, feature engineering, validations, scoring, and/or the like.”).]), 
…
wherein the ML model building phase generates an ML model from the conditioned data set ([Ghanta Figure 2A, element 204; paragraph [0077]: ML pipeline contains a training pipeline which uses a training data set (“conditioned data set”) to train a machine learning model (“wherein the ML model building phase generates an ML model from the conditioned data set”) (“In one embodiment, the primary training module 302 trains the first machine learning model for the first machine learning algorithm on a training data set. For instance, the primary training module 302 may receive, read, access, and/or the like a training data set and provide the training data set to a training pipeline 204 to train the machine learning model.”).]), and 
wherein the software application is additionally configured to generate a failure metric for at least one phase in the ML pipeline (“generate a failure metric for at least one phase in the ML pipeline”, where the types of failure metrics include: [Ghanta paragraph [0094]: “The primary validation module 304 compares the predictions made by the first machine learning algorithm/model to the true label of the validation data to calculate primary algorithm error values.”] [Ghanta paragraph [0085]: “The second machine learning algorithm may be configured to predict a suitability of the first machine learning algorithm/model for analyzing an inference data set. As used herein, the suitability may comprise a value such as a health score that describes the efficacy, accuracy, effectiveness, or the like of the predictions that the first machine learning algorithm/model generates for the inference data set.”] [Ghanta paragraph [0089]: “In further embodiments, the secondary validation module 308 analyzes other statistics, such as training statistics, to determine the suitability of the second machine learning algorithm in accurately assessing the effectiveness of the first machine learning algorithm. The other statistics may include confidence metrics, accuracy metrics, precision metrics, and/or the like.”] [Ghanta paragraph [0099]: “In one embodiment, the analysis module 310 may use additional data (e.g., in addition to the metrics/health scores in Table 1) to determine whether the first machine learning algorithm/model is suitable for the inference data. For instance, the analysis module 310 may receive or access data deviation information (e.g., as described in U.S. patent application Ser. No. 16/001,904, which is incorporated by reference herein in its entirety) to determine whether and how much the inference data differs from the training data that was used to train the first machine learning model.”].); and 
a computing device configured to ([Ghanta Figure 1, element 104; paragraphs [0043]-[0044]: the ML management apparatus consisting of sub-modules that may be installed and deployed on various hardware, where the various hardware represent computing devices (“a computing device”) (“The ML management apparatus 104, including its various sub-modules, may be located on one or more information handling devices 102 in the system 100, one or more servers 108, one or more network devices, and/or the like. … In various embodiments, the ML management apparatus 104 may be embodied as a hardware appliance that can be installed or deployed on an information handling device 102, on a server 108, or elsewhere on the data network 106.”).]): 
obtain, from the memory, the target data set ([Ghanta paragraph [0041]: training, validation, test sets, inference data set are examples of a target data set (“… labels may be required to determine the suitability of the machine learning model, e.g., the accuracy or predictive performance of the machine learning model, to an inference data set during the inference phase. The predictive performance is usually evaluated on either the training data set or a separate validation or test set where both the feature and label information is available …”).] [Ghanta paragraph [0079]: an error data set, which is another type of a target data set (“The resulting output of the validation of the first machine learning algorithm/model, in one embodiment, comprises an error data set. The error data set, in certain embodiments, includes values indicating the prediction error of the first machine learning algorithm/model on the validation data set (e.g., a rate, a score, or other value that indicates how often the first machine learning algorithm/model accurately predicted a label for the validation data set).”).] [Ghanta paragraph [0077]: (“obtain, from the memory, the target data set”) (“ … the primary training module 302 trains the first machine learning model for the first machine learning algorithm on a training data set. For instance, the primary training module 302 may receive, read, access, and/or the like a training data set and provide the training data set to a training pipeline 204 to train the machine learning model. In such an embodiment, the training data set includes labels that allow the first machine learning model to "learn" from the data to perform predictions on an inference data set that does not include labels.”); see also: Ghanta paragraph [0078]: using a validation data set; Ghanta paragraph [0085]: using an inference data set and an error data set.]); 
apply the ML pipeline to the target data set, wherein applying the ML pipeline results in at least one of 
generation of an ML model from the target data set ([Ghanta Figure 2A, elements 202, 204, 206a-c; paragraph [0056]: the ML management apparatus selecting a training pipeline to generate a machine learning model with the input data (“apply the ML pipeline to the target data set, wherein applying the ML pipeline results in at least one of … generation of an ML model from the target data set”) (“In one embodiment, the machine learning system 200 includes physical and/or logical groupings of the machine learning pipelines 202, 204, 206a-c based on a desired objective, result, problem, and/or the like. For instance, the ML management apparatus 104 may select a training pipeline 204 for generating a machine learning model configured for the desired objective and one or more inference pipelines 206a-c that are configured to analyze the desired objective by processing input data 210 associated with the desired objective using the analytic engines for which the selected inference pipelines 206a-c are configured for and the machine learning model.”).]) 
or 
determination of an inadequacy of the target data set, wherein determining an inadequacy of the target data set comprises 
(i) determining that generation of the ML model failed or that ML model generation would result in a deficient ML model, and 
(ii) determining that the target data set is inadequate in a manner related to the determined failure metric; and 
provide an indication of the determined inadequacy of the target data set ([Ghanta Table 1; Figure 3, elements 310, 312; Figure 5, elements 514, 516, 518, 520; paragraphs [0100]-[0102]: after performing analysis from the analysis module to determine the suitability of the first machine learning model (see Ghanta paragraphs [0098]-[0099]), triggering the action module to perform various actions (including generating notification/messages, changing the first machine learning model algorithm or target data set, retraining the first machine learning model, updating suitability thresholds), where each of these actions represents an indication that the target data set does not meet the suitability criteria (“provide an indication of the determined inadequacy of the target data set”) (“… For instance, the action module 312 may select or trigger selection of a different training data set for retraining the first machine learning model. … For instance, the action module 312 may select or trigger selection of a machine learning model that has been trained on different training data, which may be more suitable or similar to the inference data set. … For instance, the action module 312 may generate a notification, message, or the like that includes a recommendation for a different machine learning algorithm that may be more suitable for the inference data set based on the characteristics or the inference data set. … For instance, the action module 312 may update or trigger updating suitability thresholds, e.g., the thresholds used to determine whether the first machine learning algorithm is suitable for the inference data set, to be more flexible or stringent.”).]).  

However, Ghanta does not teach
wherein the ML pipeline includes a data pre-processing phase …, 
wherein the data pre-processing phase generates a conditioned data set from the input data set …
Maag teaches
wherein the ML pipeline includes a data pre-processing phase ([Maag paragraph [0005]: a data pipeline system for transforming data (“wherein the ML pipeline includes a data pre-processing phase”) obtained from data sources into data in a format expected by the data sinks (“One purpose of a data pipeline system is to execute data transformation steps on data obtained from data sources to provide the data in format expected by the data sinks. A data transformation step may be defined as a set of computer commands or instructions which, when executed by the data pipeline system, transforms one or more input datasets to produce one or more output or "target" datasets.”).]) …, 
wherein the data pre-processing phase generates a conditioned data set from the input data set ([Maag paragraph [0164]: data pre-processing includes schema validation tests that transform data from the original data source format into the transformed data sink format (“wherein the data pre-processing phase generates a conditioned data set from the input data set”) (“Schema validation is the process of inspecting data to ensure that the data actually adheres to the format defined by the schema. Schemas in relational database may also define other constructs as well, such as relationships, views, indexes, packages, procedures, functions, queues, triggers, types, sequences, and so forth. …  In some embodiments, the schema(s) indicating the format of the data stored by the data sources 320 and the schema representing the data format expected by the data sinks 330 are used to implement the transformations performed by the pipelines 410. For instance, the logic defined by each pipeline may represent the steps or algorithm required to transform data from the data source format into the data sink format.”).]) …
Both Ghanta and Maag are analogous art since both teach data pipelines for machine learning systems.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the ML pipeline of Ghanta and incorporate a data pre-processing phase of Maag as a way to perform data transformation and data validation on an input dataset. The motivation to combine is taught in Maag, since a data pre-processing phase provides functionality to transform input data into conditioned data, as well as functionality to validate the transformed data before it is further processed downstream by another entity (e.g., by a model building phase), ensuring that it meets certain schema requirements, thus improving the quality of the target data set for use in a machine learning pipeline to build more accurate machine learning models ([Maag paragraph [0164]: “Schema validation is the process of inspecting data to ensure that the data actually adheres to the format defined by the schema. Schemas in relational database may also define other constructs as well, such as relationships, views, indexes, packages, procedures, functions, queues, triggers, types, sequences, and so forth. However, schemas other than relational database schemas also exist, such as XML schemas. In some embodiments, the schema(s) indicating the format of the data stored by the data sources 320 and the schema representing the data format expected by the data sinks 330 are used to implement the transformations performed by the pipelines 410. For instance, the logic defined by each pipeline may represent the steps or algorithm required to transform data from the data source format into the data sink format. If the transformation is performed properly, the data after transformation should be able to pass validation with respect to the schema of the data sink. However, if errors occur during the transformation, the validation might fail if the transformed data is improperly formatted.”]).
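For illustration only, the transform-then-validate behavior quoted from Maag paragraph [0164] can be sketched as below. This is a minimal sketch under stated assumptions, not code taken from Maag or Ghanta: the sink schema, the field names (age, height_cm), and the helper names (transform, validate, preprocess) are hypothetical. It shows a pre-processing phase that converts source rows into the sink format and reports a fault for any row that fails the transformation or the sink-schema check.

```python
# Hypothetical sink schema: field name -> expected Python type (an assumption).
SINK_SCHEMA = {"age": int, "height_cm": float}


def transform(row):
    """Convert a source row of strings into the types the sink expects."""
    return {"age": int(row["age"]), "height_cm": float(row["height_cm"])}


def validate(row, schema):
    """Return a list of schema violations for one transformed row."""
    errors = []
    for field, expected_type in schema.items():
        value = row.get(field)
        if value is None:
            errors.append(f"{field}: NULL in non-NULL column")
        elif not isinstance(value, expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}")
        elif value < 0:
            errors.append(f"{field}: negative value in non-negative column")
    return errors


def preprocess(rows):
    """Transform each row, then keep only rows that pass sink validation."""
    conditioned, faults = [], []
    for index, row in enumerate(rows):
        try:
            candidate = transform(row)
        except (KeyError, TypeError, ValueError) as exc:
            # The transformation itself failed, so nothing can be validated.
            faults.append((index, f"transform failed: {exc!r}"))
            continue
        errors = validate(candidate, SINK_SCHEMA)
        if errors:
            faults.append((index, "; ".join(errors)))
        else:
            conditioned.append(candidate)
    return conditioned, faults
```

In this sketch the conditioned rows would be passed to a downstream model building phase, while the faults list stands in for the validation failures described in the quoted passage.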
Regarding Claim 2, Ghanta in view of Maag teaches
The system of claim 1, 
wherein applying the ML pipeline to the target data set comprises terminating the ML pipeline in response to determining, based on the determined failure metric, that the target data set is inadequate in a manner related to the determined failure metric ([Ghanta Table 1; Figure 3, elements 310, 312; Figure 5, elements 514, 516, 518, 520; paragraphs [0100]-[0102]: after performing analysis from the analysis module to determine the suitability of the first machine learning model (see Ghanta paragraphs [0098]-[0099]), triggering the action module to perform various actions (including generating notification/messages, changing the first machine learning model algorithm or target data set, retraining the first machine learning model, updating suitability thresholds), where each of these actions represents an indication that the target data set does not meet the suitability criteria (“based on the determined failure metric, that the target data set is inadequate in a manner related to the determined failure metric”) (“… For instance, the action module 312 may select or trigger selection of a different training data set for retraining the first machine learning model. … For instance, the action module 312 may select or trigger selection of a machine learning model that has been trained on different training data, which may be more suitable or similar to the inference data set. … For instance, the action module 312 may generate a notification, message, or the like that includes a recommendation for a different machine learning algorithm that may be more suitable for the inference data set based on the characteristics or the inference data set. … For instance, the action module 312 may update or trigger updating suitability thresholds, e.g., the thresholds used to determine whether the first machine learning algorithm is suitable for the inference data set, to be more flexible or stringent.”).] 
[Ghanta Figure 5, elements 514, 516, 518, 520; paragraph [0109]: triggered actions from the action module, after which the method ends (“terminating the ML pipeline in response to determining, … ”) (“…the analysis module 310 determines 514 whether the predicted suitability of the first machine learning algorithm/model satisfies a predetermined suitability threshold. If so, the method 500 ends. Otherwise, the action module 312 triggers one or more actions associated with the first machine learning algorithm. For instance, the action module 312 may trigger retraining 516 the first machine learning model with different training data, may trigger switching 518 the first machine learning model to a different machine learning model that is trained using different training data, may recommend 520 different machine learning algorithms for analyzing the inference data set, may update 522 suitability thresholds, and/or the like, and the method 500 ends.”).]).  
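As a reading aid, the claim 2 limitation (terminating the pipeline once a failure metric shows the data set to be inadequate) can be sketched as below. This is a hedged illustration of the claim language only and is not asserted to be the flow of Ghanta Figure 5. The Phase structure, the PipelineTerminated exception, the threshold values, and the use of an arithmetic mean as a stand-in for a trained model are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable


class PipelineTerminated(Exception):
    """Raised when a phase's failure metric exceeds its limit."""

    def __init__(self, phase_name, score):
        super().__init__(f"{phase_name}: failure metric {score:.2f} exceeds limit")
        self.phase_name = phase_name
        self.score = score


@dataclass
class Phase:
    name: str
    failure_metric: Callable[[Any], float]  # computed on the phase input
    limit: float                            # hypothetical threshold
    run: Callable[[Any], Any]               # produces the next phase's input


def apply_pipeline(data, phases):
    """Run phases in order; stop at the first one whose metric is over limit."""
    for phase in phases:
        score = phase.failure_metric(data)
        if score > phase.limit:
            raise PipelineTerminated(phase.name, score)
        data = phase.run(data)
    return data


def empty_fraction(values):
    """Fraction of entries that are empty (None); 1.0 for an empty input."""
    return sum(v is None for v in values) / len(values) if values else 1.0


# Hypothetical two-phase pipeline; the "model" is just the mean of the data.
PHASES = [
    Phase("pre-processing", empty_fraction, 0.5,
          lambda vs: [v for v in vs if v is not None]),
    Phase("model building", lambda vs: 0.0 if len(vs) >= 2 else 1.0, 0.0,
          lambda vs: sum(vs) / len(vs)),
]
```

The exception carries the phase name and the metric value, so a caller could use it to report an inadequacy that is tied to the specific failure metric.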
Regarding Claim 3, Ghanta in view of Maag teaches
The system of claim 1, 
wherein the target data set is arranged in columns and rows ([Maag paragraph [0099]: a data source consisting of a relational database that provides rows of data, where rows of data in a relational database represent a record or entry (“wherein the target data set is arranged in … rows”) (“Each of the data sources 320 may provide different data, possibly even in different data formats. As just one simple example, one data source 320 (e.g., 320A) may be a relational database server that provides rows of data … ”).] [Maag paragraph [0164]: target data set represented in tabular format (“wherein the target data set is arranged in columns …”) (“… relational database schemas typically define tables of data, where each table is defined to include a number of columns (or fields), each tied to a specific type of data, such as strings, integers, doubles, floats, bytes, and so forth.”).]), 
wherein the columns define fields of the target data set and the rows define entries in the target data set ([Maag paragraph [0099]: a data source consisting of a relational database that provides rows of data, where rows of data in a relational database represent a record or entry (“wherein the … rows define entries in the target data set”) (“Each of the data sources 320 may provide different data, possibly even in different data formats. As just one simple example, one data source 320 (e.g., 320A) may be a relational database server that provides rows of data … ”).] [Maag paragraph [0164]: target data set represented in tabular format (“wherein the columns define fields of the target data set…”) (“… relational database schemas typically define tables of data, where each table is defined to include a number of columns (or fields), each tied to a specific type of data, such as strings, integers, doubles, floats, bytes, and so forth.”).]), and 
wherein generating a failure metric for at least one phase in the ML pipeline comprises determining, for a particular one of the columns of the target data set, at least one of 
(i) that the particular column is empty ([Maag paragraph [0167]: performing schema validation tests on the target data set, including indicating fault/warning for cases where columns which are defined as non-NULL contain NULL values (“wherein generating a failure metric for at least one phase in the ML pipeline comprises determining, for a particular one of the columns of the target data set, at least one of (i) that the particular column is empty”) (“Configuration points for schema validation tests may include the schema that should be compared against the pre-transformation and/or post-transformation data, the pipeline and/or data sets from which to collect the data, how often the tests 700 should be performed, criteria for determining whether a violation is a "fault" or "potential fault" (or "warning"), valid values for certain columns/fields (e.g. ensuring columns which are defined as non-NULL do not contain NULL values, that non-negative columns do not contain numbers which are negative, etc.) and so forth.”).]); 
(ii) that more than a threshold amount of the entries in the particular column are empty; 
(iii) that fewer than a threshold amount of the entries in the particular column are not empty; 
(iv) that the particular column contains a single unique value; 
or 
(v) that the values of the particular column are skewed beyond a threshold amount.  
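The five alternative column conditions recited in claim 3 can be made concrete with the following minimal sketch. It is offered only to illustrate the claim language and is not drawn from the application, Ghanta, or Maag. The function name, the default thresholds, the use of None to mark an empty entry, and the choice of the population moment coefficient of skewness are assumptions.

```python
from statistics import mean, pstdev


def column_failure_metrics(values, max_empty_fraction=0.5,
                           min_non_empty=1, max_abs_skew=2.0):
    """Return the names of the failure conditions that one column triggers."""
    non_empty = [v for v in values if v is not None]
    if not non_empty:
        return ["empty_column"]                      # condition (i)

    failures = []
    if (len(values) - len(non_empty)) / len(values) > max_empty_fraction:
        failures.append("too_many_empty_entries")    # condition (ii)
    if len(non_empty) < min_non_empty:
        failures.append("too_few_non_empty_entries") # condition (iii)
    if len(set(non_empty)) == 1:
        failures.append("single_unique_value")       # condition (iv)

    # Skewness is only meaningful for numeric data with some spread.
    numeric = [v for v in non_empty if isinstance(v, (int, float))]
    if len(numeric) >= 3:
        spread = pstdev(numeric)
        if spread > 0:
            center = mean(numeric)
            skew = sum((x - center) ** 3 for x in numeric) / (
                len(numeric) * spread ** 3)
            if abs(skew) > max_abs_skew:
                failures.append("skewed")            # condition (v)
    return failures
```

For example, a column holding one non-empty entry and three empty entries would trigger both the empty-fraction condition and the single-unique-value condition under the default thresholds assumed here.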
Regarding Claim 6, Ghanta in view of Maag teaches
The system of claim 1, 
wherein the target data set is arranged in columns and rows ([Maag paragraph [0099]: a data source consisting of a relational database that provides rows of data, where rows of data in a relational database represent a record or entry (“wherein the target data set is arranged in … rows”) (“Each of the data sources 320 may provide different data, possibly even in different data formats. As just one simple example, one data source 320 (e.g., 320A) may be a relational database server that provides rows of data … ”).] [Maag paragraph [0164]: target data set represented in tabular format (“wherein the target data set is arranged in columns …”) (“… relational database schemas typically define tables of data, where each table is defined to include a number of columns (or fields), each tied to a specific type of data, such as strings, integers, doubles, floats, bytes, and so forth.”).]), 
wherein the columns define fields of the target data set and the rows define entries in the target data set ([Maag paragraph [0099]: a data source consisting of a relational database that provides rows of data, where rows of data in a relational database represent a record or entry (“wherein the … rows define entries in the target data set”) (“Each of the data sources 320 may provide different data, possibly even in different data formats. As just one simple example, one data source 320 (e.g., 320A) may be a relational database server that provides rows of data … ”).] [Maag paragraph [0164]: target data set represented in tabular format (“wherein the columns define fields of the target data set…”) (“… relational database schemas typically define tables of data, where each table is defined to include a number of columns (or fields), each tied to a specific type of data, such as strings, integers, doubles, floats, bytes, and so forth.”).]), 
wherein the ML model building phase comprises generating an ML model to predict a particular column of the target data set ([Ghanta Figure 2A, elements 200, 206a-c; paragraph [0040]: training a machine learning model on a training data set comprising of three columns of feature data (Age, Sex, Height) so that the trained model can use an inference data set (“target data set”) that contains only two columns of data (Age, Height) to predict the third column (Sex) (“wherein the ML model building phase comprises generating an ML model to predict a particular column of the target data set”) (“In certain embodiments of machine learning systems 200, there is a training phase, for generating the machine learning model, and an inference phase for analyzing an inference data set using the machine learning model. The output from the inference phase may be one or more predictive "labels" determined as a function of one or more features of the inference data set. For example, if the training data set comprises three columns of feature data-Age, Sex, and Height-that are used to train the machine learning model, and the inference data comprises two columns of feature data-Age and Height-the output from an inference pipeline 206 using the machine learning model may be a "label" describing the predicted Sex (M/F) based on the given inference data.”).]) …
However, Ghanta in view of Maag does not teach
wherein generating a failure metric for at least one phase in the ML pipeline comprises determining that the values of the particular column are skewed beyond a threshold amount.  
Gebremariam teaches
wherein generating a failure metric for at least one phase in the ML pipeline comprises determining that the values of the particular column are skewed beyond a threshold amount ([Gebremariam Figure 5, elements 500, 502; col. 10, lines 3-18: processing input data by receiving an input dataset consisting of rows and columns.] [Gebremariam Figure 5, elements 504, 506, 508; Table 1; col.12, lines 22-42: processing input data based on receiving a column of input and associated parameter policies applied to the input (see Gebremariam col.12-col.13 Table 1), including threshold metrics related to skewness or kurtosis (“skewed beyond a threshold amount”) (“In an operation 506, a third indicator may be received that indicates a plurality of variables v_i of the input dataset to analyze for each observation vector x_i read from a row of the input dataset. For example, the third indicator indicates a list of input variables to analyze by name, column number, etc. The name may be matched to a column header included in the first row of the input dataset. … In an operation 508, a fourth indicator may be received that indicates a plurality of policy parameter values. The plurality of policy parameter values is used to define how the plurality of variables v_i are grouped. Each policy parameter value of the plurality of policy parameter values may have a predefined default value that may be used when a user does not specify a value for the policy parameter using the fourth indicator.”).] [Gebremariam Figure 5, element 510; col. 14, lines 19-21: analyzing the input dataset based on a plurality of policy parameter values, including those related to data skewness (“determining that the values of the particular column are skewed beyond a threshold amount”) (“Referring again to FIG. 5, in an operation 510, a request to analyze the input dataset based on the plurality of policy parameter values is sent to controller device 104.”).] 
[Gebremariam Figure 5, element 512; col.14, lines 33-37: after analyzing input data, generating and storing results of the data analysis performed on the input data set, where generating the results represents generating metrics related to data skewness (“wherein generating a failure metric for at least one phase of the ML pipeline … ”) (“In an operation 512, data analysis results are received. For example, variable statistical metrics and variable grouping data may be received from controller device 104 and stored in data analysis results 223 on computer-readable medium 208.”).]).  
Ghanta in view of Maag, and Gebremariam, are analogous art since each teaches data analysis and transformation in a pre-processing phase within a machine learning workflow.
It would have been obvious for a person having ordinary skill in the art before the effective filing date of the invention to take the data pre-processing phase of Ghanta in view of Maag and incorporate the parameter policy analysis step of Gebremariam as a way to optimize the data pre-processing phase by performing data analysis and generating metrics for data skewness as a one-pass approach without generating intermediate datasets. The motivation to combine is taught in Gebremariam, as data pre-processing is an essential phase of a machine learning workflow. Providing notifications of data skewness during the data pre-processing phase will help identify potential data quality issues that can be addressed early, allowing for improved prediction performance. Furthermore, optimizing the data analysis on the input data set will provide significant time and system resource savings by avoiding the generation and storage of intermediate datasets, thereby speeding up the machine learning workflow process ([Gebremariam col.1 lines 17-30: “Quantifying data-quality issues using statistical data quality metrics such as missing rate, cardinality, etc. is the first task in predictive modelling of a dataset. As a result, variable (feature) transformation aimed at increasing model performance is a significant part of a predictive modelling workflow. However, high dimensionality precludes an interactive variable-by-variable analysis and transformation. To handle this issue of scale (high dimensionality), practitioners consider data quality issues iteratively. For example, variables with a high-rate of missing values can be identified and addressed. Variables with a high-skew can then be identified and addressed. 
However, this approach precludes the effective utilization of prescriptions that can treat multiple data quality problems at the same time.”] [Gebremariam col.22, lines 42-60: “Quantifying data-quality issues of the input dataset is an important first task in predictive modelling. … Though the default values are usually effective for most input datasets, it may be beneficial to experiment with different values for the policy parameters. This helps to identify variable that have borderline values for specific statistical metrics. These variables can further be explored individually for a better understanding and more robust classification.”] [Gebremariam col.39, lines 51-64: “Predictive modelling practitioners such as data scientists and statisticians, spend a significant part of their time in the data preprocessing (feature transformation and generation) phase. Data transformation application 224, controller data transformation application 324, and worker data transformation application 424 transform the input dataset without generating intermediate datasets, which saves significant computer memory for large datasets and saves computer memory, computing time, and communication time for distributed datasets. Additionally, the user can specify any number of transformation flows with one or more phases that can be executed in parallel saving significant user time, computer memory, computing time, and communication time.”]).
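For illustration only, and not as a characterization of Gebremariam or of the claimed invention, the determination that "the values of the particular column are skewed beyond a threshold amount" can be sketched as follows. The function names, the choice of the moment coefficient of skewness, and the example threshold of 2.0 are assumptions made for this sketch; neither the claims nor the cited references prescribe them.

```python
def sample_skewness(values):
    """Moment coefficient of skewness: g1 = m3 / m2**1.5 (assumed metric)."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three values to estimate skewness")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # a constant column has no skew
    return m3 / m2 ** 1.5


def is_skewed_beyond(values, threshold):
    """Hypothetical failure metric: True when |skewness| exceeds the threshold."""
    return abs(sample_skewness(values)) > threshold


symmetric = [1, 2, 3, 4, 5]
long_tailed = [1, 1, 1, 1, 1, 1, 1, 1, 1, 100]
print(is_skewed_beyond(symmetric, 2.0))    # False
print(is_skewed_beyond(long_tailed, 2.0))  # True
```

A check of this kind can be computed in a single pass over a column, which is consistent with the one-pass rationale quoted above.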
Regarding Claim 7, Ghanta in view of Maag teaches
The system of claim 1, 
wherein the ML pipeline additionally includes a utility validation phase ([Ghanta Figure 2A, elements 202, 204, 206a-206c; paragraph [0064]: a machine learning system comprising a set of training, policy, and inference pipelines (“ML pipeline”) (“In one embodiment, the policy pipeline 202 maintains a mapping of the pipelines 204, 206a-c that comprise the logical grouping of pipelines 204, 206a-c. The policy pipeline may further adjust various settings or features of the pipelines 204, 206a-c in response to user input, feedback or events generated by the pipelines 204, 206a-c, and/or the like. For example, if an inference pipeline 206a generates machine learning results that are inaccurate, the policy pipeline 202 may receive a message from the inference pipeline 202 that indicates the results are inaccurate, and may direct the training pipeline 204 to generate a new machine learning model for the inference pipeline 206a.”).] [Ghanta paragraph [0075]: “utility validation phase” in an ML pipeline (“wherein the ML pipeline additionally includes a utility validation phase”) (“The ML management apparatus 104, in one embodiment, includes one or more of a primary training module 302, a primary validation module 304, a secondary training module 306, a secondary validation module 308, an analysis module 310, and an action module 312…”).]), 
wherein the utility validation phase comprises: 
generating first and second ML models from the conditioned data set ([Ghanta paragraphs [0077]-[0079]: using a primary training module to train a first machine learning model (“generating first … ML models from the conditioned data set”), with a primary validation module validating the first machine learning model using a validation data set (see Ghanta paragraph [0081]) and producing an error data set (“In one embodiment, the primary training module 302 trains the first machine learning model for the first machine learning algorithm on a training data set. … In one embodiment, the primary validation module 304 is configured to validate the first machine learning algorithm/model using a validation data set. … The resulting output of the validation of the first machine learning algorithm/model, in one embodiment, comprises an error data set. The error data set, in certain embodiments, includes values indicating the prediction error of the first machine learning algorithm/model on the validation data set (e.g., a rate, a score, or other value that indicates how often the first machine learning algorithm/model accurately predicted a label for the validation data set).”).] [Ghanta paragraphs [0085]-[0086]: a secondary training module training a second machine learning model based on the error data set (“generating … second ML models from the conditioned data set”) (“In one embodiment, the secondary training module 306 is configured to train a second machine learning model for a second machine learning algorithm using the error data set described above. The second machine learning algorithm may be configured to predict a suitability of the first machine learning algorithm/model for analyzing an inference data set. … In one embodiment, the second machine learning algorithm is different than the first machine learning algorithm. 
For example, if the first machine learning algorithm is a linear regression algorithm, the second machine learning algorithm may comprise a logistic regression algorithm.”).]), 
wherein the first ML model corresponds to the ML model generated during the ML model building phase ([Ghanta paragraphs [0077]-[0079]: using a primary training module to train a first machine learning model (“wherein the first ML model corresponds to the ML model generated during the ML model building phase”), with a primary validation module validating the first machine learning model using a validation data set (see Ghanta paragraph [0081]) and producing an error data set (“In one embodiment, the primary training module 302 trains the first machine learning model for the first machine learning algorithm on a training data set. … In one embodiment, the primary validation module 304 is configured to validate the first machine learning algorithm/model using a validation data set. … The resulting output of the validation of the first machine learning algorithm/model, in one embodiment, comprises an error data set. The error data set, in certain embodiments, includes values indicating the prediction error of the first machine learning algorithm/model on the validation data set (e.g., a rate, a score, or other value that indicates how often the first machine learning algorithm/model accurately predicted a label for the validation data set).”).]), and 
wherein the second ML model has fewer trainable parameters than the first ML model ([Ghanta paragraph [0087]: secondary training module performing training on a subset of features from the error data set (“wherein the second ML model has fewer trainable parameters than the first ML model”) (“In one embodiment, the secondary training module 306 enhances the error data set by including additional data to supplement the prediction error data. For instance, the secondary training module 306 may include data for additional features such as features of the data set itself (e.g., the secondary training module 306 may select all or a subset of the available features of the error data set itself), …”).]); and 
comparing the predictive ability of the first ML model and the second ML model ([Ghanta paragraph [0092]: analysis module analyzing health scores/metrics produced by the first and second machine learning algorithms (see Ghanta paragraph [0094], [0085], [0089], [0099]) to compare whether the first model is a suitable algorithm for generating predictions (“comparing the predictive ability of the first ML model and the second ML model”) (“The analysis module 310, in one embodiment, is configured to determine whether the first machine learning algorithm/model is a suitable algorithm/model for generating predictions for the inference data set based on the predictions that the second machine learning algorithm generates. … For example, the analysis module 310 may determine whether the various metrics/health scores each satisfy a threshold value, if a percentage of the metrics/health scores satisfy threshold values, of if a calculated combination of various health scores (e.g., an average) satisfies a threshold. If so, then the analysis module 310 may determine that the first machine learning algorithm/model is generating accurate predictions for the inference data set. In some embodiments, the health scores/values may include prediction confidence values, data deviation values, AB testing values, canary values, and/or the like.”).]).  
Regarding Claim 8, Ghanta in view of Maag teaches
The system of claim 7, 
wherein applying the ML pipeline to the target data set comprises terminating the ML pipeline in response to determining, based on comparing the predictive ability of the first ML model and the second ML model, that the predictive ability of the first ML model fails to exceed the predictive ability of the second ML model by more than a threshold amount ([Ghanta paragraph [0092]: analysis module analyzing health scores/metrics produced by the first and second machine learning algorithms (see Ghanta paragraph [0094], [0085], [0089], [0099]) to compare whether the first model is a suitable algorithm for generating predictions (“based on comparing the predictive ability of the first ML model and the second ML model”) (“The analysis module 310, in one embodiment, is configured to determine whether the first machine learning algorithm/model is a suitable algorithm/model for generating predictions for the inference data set based on the predictions that the second machine learning algorithm generates. … For example, the analysis module 310 may determine whether the various metrics/health scores each satisfy a threshold value, if a percentage of the metrics/health scores satisfy threshold values, of if a calculated combination of various health scores (e.g., an average) satisfies a threshold. If so, then the analysis module 310 may determine that the first machine learning algorithm/model is generating accurate predictions for the inference data set. In some embodiments, the health scores/values may include prediction confidence values, data deviation values, AB testing values, canary values, and/or the like.”).] 
[Ghanta Table 1; Figure 3, elements 310, 312; paragraph [0098]-[0099]: an analysis module comparing health scores/metrics and data deviation metrics against a threshold to determine the suitability of the first machine learning model (“… that the predictive ability of the first ML model fails to exceed the predictive ability of the second ML model by more than a threshold amount”) (“In one embodiment, the analysis module 310 may determine whether the suitability score based on the metrics/ health scores in Table 1 satisfies a threshold to determine (1) whether the second machine learning algorithm/model is a good fit for validating the predictive performance of the first machine learning algorithm/model … if it determines that the trained model is not generating accurate predictions, the ML management apparatus 104 can react accordingly as described below with reference to the action module 312. … In one embodiment, the analysis module 310 may use additional data (e.g., in addition to the metrics/health scores in Table 1) to determine whether the first machine learning algorithm/model is suitable for the inference data. For instance, the analysis module 310 may receive or access data deviation information to determine whether and how much the inference data differs from the training data that was used to train the first machine learning model. If the data deviation scores do not deviate beyond a predefined threshold, then the second machine learning algorithm/ model may be used to determine the predictive performance of the first machine learning algorithm/model on the inference data because the first machine learning algorithm/model is suitable for the inference data set (e.g., the training data set and the inference data set are sufficiently similar or complementary). 
Otherwise, if the data deviation scores indicate that the inference data set is not similar enough to the training data set so that the first machine learning algorithm/model would likely not generate accurate predictions for the inference data set, the analysis module 310 may trigger one or more of the actions described below.”).] [Ghanta Figure 5, elements 514, 516, 518, 520; paragraph [0109]: triggered actions from the action module within the ML management apparatus representing indications that the first machine learning model or that the target data set is deficient in some way, with the implication that each action first stops the current ML model generation with the current target data set before the action is applied (“terminating the ML pipeline, in response to determining, … ”) (“…the analysis module 310 determines 514 whether the predicted suitability of the first machine learning algorithm/model satisfies a predetermined suitability threshold. If so, the method 500 ends. Otherwise, the action module 312 triggers one or more actions associated with the first machine learning algorithm. For instance, the action module 312 may trigger retraining 516 the first machine learning model with different training data, may trigger switching 518 the first machine learning model to a different machine learning model that is trained using different training data, may recommend 520 different machine learning algorithms for analyzing the inference data set, may update 522 suitability thresholds, and/or the like, and the method 500 ends.”).]).  
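For illustration only, the comparison and termination condition recited in Claims 7 and 8 can be sketched as follows. This sketch is not drawn from Ghanta. It assumes a two-parameter least-squares line as the first ML model, a one-parameter mean predictor as the second ML model with fewer trainable parameters, and mean squared error as the measure of predictive ability; all function names are hypothetical, and for brevity the models are scored on the data used to fit them rather than on a held-out set.

```python
def fit_mean(ys):
    """Second (smaller) model: one trainable parameter, the mean of the targets."""
    mean = sum(ys) / len(ys)
    return lambda x: mean


def fit_line(xs, ys):
    """First model: two trainable parameters fitted by ordinary least squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def mse(model, xs, ys):
    """Mean squared error; a lower value means better predictive ability."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def should_terminate(first_error, second_error, threshold):
    """True when the first model does not beat the second by more than threshold."""
    return (second_error - first_error) <= threshold


xs = [0, 1, 2, 3]
informative = [1, 3, 5, 7]    # target depends on x
uninformative = [4, 4, 4, 4]  # target carries no signal from x

for ys in (informative, uninformative):
    first, second = fit_line(xs, ys), fit_mean(ys)
    print(should_terminate(mse(first, xs, ys), mse(second, xs, ys), 1.0))
# prints False, then True
```

In the second case the larger model offers no improvement over the smaller one, which is the condition under which the claim recites terminating the pipeline.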
Regarding Claim 9, Ghanta teaches
A method comprising: 
obtaining a target data set ([Ghanta paragraph [0041]: training, validation, test, and inference data sets are each a type of target data set (“… labels may be required to determine the suitability of the machine learning model, e.g., the accuracy or predictive performance of the machine learning model, to an inference data set during the inference phase. The predictive performance is usually evaluated on either the training data set or a separate validation or test set where both the feature and label information is available …”).] [Ghanta paragraph [0079]: an error data set, which is another type of target data set (“The resulting output of the validation of the first machine learning algorithm/model, in one embodiment, comprises an error data set. The error data set, in certain embodiments, includes values indicating the prediction error of the first machine learning algorithm/model on the validation data set (e.g., a rate, a score, or other value that indicates how often the first machine learning algorithm/model accurately predicted a label for the validation data set).”).] [Ghanta paragraph [0077]: using a training data set to train a first machine learning model (“obtaining a target data set”) (“ … the primary training module 302 trains the first machine learning model for the first machine learning algorithm on a training data set. For instance, the primary training module 302 may receive, read, access, and/or the like a training data set and provide the training data set to a training pipeline 204 to train the machine learning model. In such an embodiment, the training data set includes labels that allow the first machine learning model to "learn" from the data to perform predictions on an inference data set that does not include labels.”); see also: Ghanta paragraph [0078]: using a validation data set; Ghanta paragraph [0085]: using an inference data set and an error data set.]); 
applying a machine learning (ML) pipeline to the target data set ([Ghanta Figure 2A, elements 202, 204, 206a-c; paragraph [0056]: the ML management apparatus (managing pipelines within a machine learning system; see Ghanta paragraphs [0035]-[0036]) selecting a training pipeline to generate a machine learning model with the input data (“applying a machine learning (ML) pipeline to the target data set”) (“In one embodiment, the machine learning system 200 includes physical and/or logical groupings of the machine learning pipelines 202, 204, 206a-c based on a desired objective, result, problem, and/or the like. For instance, the ML management apparatus 104 may select a training pipeline 204 for generating a machine learning model configured for the desired objective and one or more inference pipelines 206a-c that are configured to analyze the desired objective by processing input data 210 associated with the desired objective using the analytic engines for which the selected inference pipelines 206a-c are configured for and the machine learning model.”).]), 
wherein the ML pipeline includes … an ML model building phase ([Ghanta Figure 2A, elements 202, 204, 206a-c; paragraph [0053]: machine learning pipelines performing various operations including algorithm training/inference and validations (“wherein the ML pipeline includes … an ML model building phase”) (“As used herein, machine learning pipelines 202, 204, 206a-c comprise various machine learning features, components, objects, modules, and/or the like to perform various machine learning operations such as algorithm training/inference, feature engineering, validations, scoring, and/or the like.”).]), 
…
wherein the ML model building phase generates an ML model from the conditioned data set ([Ghanta Figure 3, elements 302, 306; paragraph [0075]: “The ML management apparatus 104, in one embodiment, includes one or more of a primary training module 302, a primary validation module 304, a secondary training module 306, a secondary validation module 308, an analysis module 310, and an action module 312…”).] [Ghanta Figure 2A, element 204; paragraph [0077]: ML pipeline contains a training pipeline which uses a training data set (“conditioned data set”) to train a machine learning model (“wherein the ML model building phase generates an ML model from the conditioned data set”) (“In one embodiment, the primary training module 302 trains the first machine learning model for the first machine learning algorithm on a training data set. For instance, the primary training module 302 may receive, read, access, and/or the like a training data set and provide the training data set to a training pipeline 204 to train the machine learning model.”).]),
generating a failure metric for at least one phase in the ML pipeline (“generating a failure metric for at least one phase in the ML pipeline”, where the types of failure metrics include: [Ghanta paragraph [0094]: “The primary validation module 304 compares the predictions made by the first machine learning algorithm/model to the true label of the validation data to calculate primary algorithm error values.”] [Ghanta paragraph [0085]: “The second machine learning algorithm may be configured to predict a suitability of the first machine learning algorithm/model for analyzing an inference data set. As used herein, the suitability may comprise a value such as a health score that describes the efficacy, accuracy, effectiveness, or the like of the predictions that the first machine learning algorithm/model generates for the inference data set.”] [Ghanta paragraph [0089]: “In further embodiments, the secondary validation module 308 analyzes other statistics, such as training statistics, to determine the suitability of the second machine learning algorithm in accurately assessing the effectiveness of the first machine learning algorithm. The other statistics may include confidence metrics, accuracy metrics, precision metrics, and/or the like.”] [Ghanta paragraph [0099]: “In one embodiment, the analysis module 310 may use additional data (e.g., in addition to the metrics/health scores in Table 1) to determine whether the first machine learning algorithm/model is suitable for the inference data. For instance, the analysis module 310 may receive or access data deviation information (e.g., as described in U.S. patent application Ser. No. 16/001,904, which is incorporated by reference herein in its entirety) to determine whether and how much the inference data differs from the training data that was used to train the first machine learning model.”].), 
wherein applying the ML pipeline results in at least one of 
generation of an ML model from the target data set ([Ghanta Figure 2A, elements 202, 204, 206a-c; paragraph [0056]: the ML management apparatus selecting a training pipeline to generate a machine learning model with the input data (“wherein applying the ML pipeline results in at least one of … generation of an ML model from the target data set”) (“In one embodiment, the machine learning system 200 includes physical and/or logical groupings of the machine learning pipelines 202, 204, 206a-c based on a desired objective, result, problem, and/or the like. For instance, the ML management apparatus 104 may select a training pipeline 204 for generating a machine learning model configured for the desired objective and one or more inference pipelines 206a-c that are configured to analyze the desired objective by processing input data 210 associated with the desired objective using the analytic engines for which the selected inference pipelines 206a-c are configured for and the machine learning model.”).]) 
or 
determination of an inadequacy of the target data set, wherein determining an inadequacy of the target data set comprises 
(i) determining that generation of the ML model failed or that ML model generation would result in a deficient ML model, and 
(ii) determining that the target data set is inadequate in a manner related to the generated failure metric; and 
providing an indication of the determined inadequacy of the target data set ([Ghanta Table 1; Figure 3, elements 310, 312; Figure 5, elements 514, 516, 518, 520; paragraphs [0100]-[0102]: after performing analysis from the analysis module to determine the suitability of the first machine learning model (see Ghanta paragraphs [0098]-[0099]), triggering the action module to perform various actions (including generating notification/messages, changing the first machine learning model algorithm or target data set, retraining the first machine learning model, updating suitability thresholds), where each of these actions represents an indication that the target data set does not meet the suitability criteria (“providing an indication of the determined inadequacy of the target data set”) (“… For instance, the action module 312 may select or trigger selection of a different training data set for retraining the first machine learning model. … For instance, the action module 312 may select or trigger selection of a machine learning model that has been trained on different training data, which may be more suitable or similar to the inference data set. … For instance, the action module 312 may generate a notification, message, or the like that includes a recommendation for a different machine learning algorithm that may be more suitable for the inference data set based on the characteristics or the inference data set. … For instance, the action module 312 may update or trigger updating suitability thresholds, e.g., the thresholds used to determine whether the first machine learning algorithm is suitable for the inference data set, to be more flexible or stringent.”).]).  
However, Ghanta does not teach
wherein the ML pipeline includes a data pre-processing phase …, 
wherein the data pre-processing phase generates a conditioned data set from the input data set …
Maag teaches
wherein the ML pipeline includes a data pre-processing phase ([Maag paragraph [0005]: a data pipeline system for transforming data (“wherein the ML pipeline includes a data pre-processing phase”) obtained from data sources into data in a format expected by the data sinks (“One purpose of a data pipeline system is to execute data transformation steps on data obtained from data sources to provide the data in format expected by the data sinks. A data transformation step may be defined as a set of computer commands or instructions which, when executed by the data pipeline system, transforms one or more input datasets to produce one or more output or "target" datasets.”).]) …, 
wherein the data pre-processing phase generates a conditioned data set from the input data set ([Maag paragraph [0164]: data pre-processing includes schema validation tests that transform data from the original data source format into the transformed data sink format (“wherein the data pre-processing phase generates a conditioned data set from the input data set”) (“Schema validation is the process of inspecting data to ensure that the data actually adheres to the format defined by the schema. Schemas in relational database may also define other constructs as well, such as relationships, views, indexes, packages, procedures, functions, queues, triggers, types, sequences, and so forth. …  In some embodiments, the schema(s) indicating the format of the data stored by the data sources 320 and the schema representing the data format expected by the data sinks 330 are used to implement the transformations performed by the pipelines 410. For instance, the logic defined by each pipeline may represent the steps or algorithm required to transform data from the data source format into the data sink format.”).]) …
Ghanta and Maag are analogous art since both teach data pipelines for machine learning systems.
([Maag paragraph [0164]: “Schema validation is the process of inspecting data to ensure that the data actually adheres to the format defined by the schema. Schemas in relational database may also define other constructs as well, such as relationships, views, indexes, packages, procedures, functions, queues, triggers, types, sequences, and so forth. However, schemas other than relational database schemas also exist, such as XML schemas. In some embodiments, the schema(s) indicating the format of the data stored by the data sources 320 and the schema representing the data format expected by the data sinks 330 are used to implement the transformations performed by the pipelines 410. For instance, the logic defined by each pipeline may represent the steps or algorithm required to transform data from the data source format into the data sink format. If the transformation is performed properly, the data after transformation should be able to pass validation with respect to the schema of the data sink. However, if errors occur during the transformation, the validation might fail if the transformed data is improperly formatted.”]).
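For illustration only, the overall arrangement recited in Claim 9 (a data pre-processing phase, an ML model building phase, a failure metric, and an indication of inadequacy) can be sketched as follows. This sketch is not drawn from Ghanta or Maag. The row-dropping pre-processing step, the fraction-of-rows-dropped failure metric, the mean-of-targets "model", the 0.5 limit, and all names are assumptions chosen to keep the example small.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PipelineResult:
    model: Optional[float]      # the generated model, if any
    failure_metric: float       # fraction of rows rejected in pre-processing
    indication: Optional[str]   # message describing the inadequacy, if any


def preprocess(rows):
    """Data pre-processing phase: keep only rows with no missing values."""
    return [r for r in rows if all(v is not None for v in r)]


def build_model(conditioned):
    """Model building phase: a trivial model that predicts the mean target."""
    targets = [r[-1] for r in conditioned]
    return sum(targets) / len(targets)


def apply_pipeline(rows, max_dropped_fraction=0.5):
    """Return either a model or an indication of inadequacy tied to the metric."""
    conditioned = preprocess(rows)
    dropped = 1.0 - len(conditioned) / len(rows) if rows else 1.0
    if dropped > max_dropped_fraction or not conditioned:
        message = f"target data set inadequate: {dropped:.0%} of rows unusable"
        return PipelineResult(None, dropped, message)
    return PipelineResult(build_model(conditioned), dropped, None)


good = apply_pipeline([(1, 2.0), (2, 4.0), (None, 6.0), (4, 8.0)])
bad = apply_pipeline([(None, 1.0), (None, 2.0), (3, None), (4, 4.0)])
print(good.indication is None, bad.model is None)  # True True
```

The second call stops before model building and reports an inadequacy that is expressed in terms of the same metric that triggered it.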
Regarding Claim 10, Ghanta in view of Maag teaches
The method of claim 9, 
wherein applying the ML pipeline to the target data set comprises terminating the ML pipeline in response to determining, based on the determined failure metric, that the target data set is inadequate in a manner related to the determined failure metric (This claim is similar in scope as Claim 2, and hence is rejected under similar rationale).  
Regarding Claim 13, Ghanta teaches
An article of manufacture including a non-transitory computer-readable medium ([Ghanta paragraphs [0110]-[0113]]), having stored thereon program instructions ([Ghanta paragraphs [0019]-[0021]]) that, upon execution by a computing system, cause the computing system to perform operations ([Ghanta paragraph [0022]]) comprising: 
obtaining a target data set (This claim limitation is similar in scope as the corresponding claim limitation in Claim 9, and hence is rejected under similar rationale); 
applying a machine learning (ML) pipeline to the target data set (This claim limitation is similar in scope as the corresponding claim limitation in Claim 9, and hence is rejected under similar rationale), 
wherein the ML pipeline includes a data pre-processing phase and an ML model building phase (This claim limitation is similar in scope as the corresponding claim limitation in Claim 9, and hence is rejected under similar rationale), 
wherein the data pre-processing phase generates a conditioned data set from the input data set (This claim limitation is similar in scope as the corresponding claim limitation in Claim 9, and hence is rejected under similar rationale), 
wherein the ML model building phase generates an ML model from the conditioned data set (This claim limitation is similar in scope as the corresponding claim limitation in Claim 9, and hence is rejected under similar rationale); 
generating a failure metric for at least one phase in the ML pipeline (This claim limitation is similar in scope as the corresponding claim limitation in Claim 9, and hence is rejected under similar rationale), 
wherein applying the ML pipeline results in at least one of 
generation of an ML model from the target data set (This claim limitation is similar in scope as the corresponding claim limitation in Claim 9, and hence is rejected under similar rationale) 
or 
determination of an inadequacy of the target data set, 
wherein determining an inadequacy of the target data set comprises 
(i) determining that generation of the ML model failed or that ML model generation would result in a deficient ML model, and 
(ii) determining that the target data set is inadequate in a manner related to the determined failure metric; and 
providing an indication of the determined inadequacy of the target data set (This claim limitation is similar in scope as the corresponding claim limitation in Claim 9, and hence is rejected under similar rationale).  
Regarding Claim 14, Ghanta in view of Maag teaches
The article of manufacture of claim 13, 
wherein applying the ML pipeline to the target data set comprises terminating the ML pipeline in response to determining, based on the determined failure metric, that the target data set is inadequate in a manner related to the determined failure metric (This claim is similar in scope as Claims 2 and 10, and hence is rejected under similar rationale).  
Regarding Claim 15, Ghanta in view of Maag teaches
The article of manufacture of claim 13, 
wherein the target data set is arranged in columns and rows (This claim limitation is similar in scope as the corresponding claim limitation in Claim 3, and hence is rejected under similar rationale), 
wherein the columns define fields of the target data set and the rows define entries in the target data set (This claim limitation is similar in scope as the corresponding claim limitation in Claim 3, and hence is rejected under similar rationale), and 
wherein generating a failure metric for at least one phase in the ML pipeline comprises determining, for a particular one of the columns of the target data set, at least one of 
(i) that the particular column is empty (This claim limitation is similar in scope as the corresponding claim limitation in Claim 3, and hence is rejected under similar rationale); 
(ii) that more than a threshold amount of the entries in the particular column are empty; 
(iii) that fewer than a threshold amount of the entries in the particular column are not empty; 
(iv) that the particular column contains a single unique value; 
or 
(v) that the values of the particular column are skewed beyond a threshold amount.  
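For illustration only, the column-level conditions (i)-(v) recited above could be checked along the following lines. This sketch is not drawn from the claims, the specification, or the cited references: the function name, the default threshold values, the treatment of None and the empty string as "empty", and the use of the share held by the most common value as a skew measure are all assumptions.

```python
from collections import Counter


def column_failure_metrics(values, empty_threshold=0.5, min_non_empty=2,
                           skew_threshold=0.95):
    """Return labels for the column-level conditions that hold for one column.

    Assumptions (hypothetical, for illustration): None and "" count as empty,
    and skew is measured as the share of non-empty entries taken by the most
    common value.
    """
    total = len(values)
    non_empty = [v for v in values if v is not None and v != ""]
    failures = []

    # (i) the particular column is empty
    if not non_empty:
        return ["empty_column"]

    # (ii) more than a threshold amount of the entries are empty
    if (total - len(non_empty)) / total > empty_threshold:
        failures.append("too_many_empty_entries")

    # (iii) fewer than a threshold amount of the entries are not empty
    if len(non_empty) < min_non_empty:
        failures.append("too_few_non_empty_entries")

    # (iv) the column contains a single unique value
    counts = Counter(non_empty)
    if len(counts) == 1:
        failures.append("single_unique_value")

    # (v) the values are skewed beyond a threshold amount
    if counts.most_common(1)[0][1] / len(non_empty) > skew_threshold:
        failures.append("skewed_values")

    return failures
```

Under these assumptions, a column holding four distinct non-empty values yields no labels, while a column consisting entirely of empty entries yields only the label for condition (i).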
Regarding Claim 18, Ghanta in view of Maag teaches
The article of manufacture of claim 13, 
wherein the target data set is arranged in columns and rows (This claim limitation is similar in scope as the corresponding claim limitation in Claim 6, and hence is rejected under similar rationale), 
wherein the columns define fields of the target data set and the rows define entries in the target data set (This claim limitation is similar in scope as the corresponding claim limitation in Claim 6, and hence is rejected under similar rationale), 
wherein the ML model building phase comprises generating an ML model to predict a particular column of the target data set (This claim limitation is similar in scope as the corresponding claim limitation in Claim 6, and hence is rejected under similar rationale), and 
wherein generating a failure metric for at least one phase in the ML pipeline comprises determining that the values of the particular column are skewed beyond a threshold amount (This claim limitation is similar in scope as the corresponding claim limitation in Claim 6, and hence is rejected under similar rationale).  
Regarding Claim 19, Ghanta in view of Maag teaches
The article of manufacture of claim 13, 
wherein the ML pipeline additionally includes a utility validation phase (This claim limitation is similar in scope as the corresponding claim limitation in Claim 7, and hence is rejected under similar rationale), wherein the utility validation phase comprises: 
generating first and second ML models from the conditioned data set (This claim limitation is similar in scope as the corresponding claim limitation in Claim 7, and hence is rejected under similar rationale), 
wherein the first ML model corresponds to the ML model generated during the ML model building phase (This claim limitation is similar in scope as the corresponding claim limitation in Claim 7, and hence is rejected under similar rationale), and 
wherein the second ML model has fewer trainable parameters than the first ML model (This claim limitation is similar in scope as the corresponding claim limitation in Claim 7, and hence is rejected under similar rationale); and 
comparing the predictive ability of the first ML model and the second ML model (This claim limitation is similar in scope as the corresponding claim limitation in Claim 7, and hence is rejected under similar rationale).  
Regarding Claim 20, Ghanta in view of Maag teaches
The article of manufacture of claim 19, 
wherein applying the ML pipeline to the target data set comprises terminating the ML pipeline in response to determining, based on comparing the predictive ability of the first ML model and the second ML model, that the predictive ability of the first ML model fails to exceed the predictive ability of the second ML model by more than a threshold amount (This claim is similar in scope as Claim 8, and hence is rejected under similar rationale).  
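For illustration only, the comparison recited in Claims 19 and 20 could be sketched as follows. This sketch is not drawn from the claims or the cited references: the use of accuracy as the measure of predictive ability, the function names, and the default margin are assumptions.

```python
def accuracy(predictions, labels):
    """Share of predictions that match the labels (assumed measure of predictive ability)."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def should_terminate(first_score, second_score, margin=0.02):
    """Return True when the first ML model fails to exceed the second,
    simpler ML model by more than `margin` (a hypothetical threshold)."""
    return first_score - second_score <= margin
```

Under these assumptions, a first model scoring 0.90 against a second model scoring 0.89 would lead to termination, while scores of 0.95 and 0.80 would not.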
Claims 4, 11, and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Ghanta et al., U.S. PGPUB 2020/0034665, filed 7/30/2018 [hereafter referred as Ghanta] in view of Maag et al., U.S. PGPUB 2017/0220403, published 8/3/2017 [hereafter referred as Maag] as applied to Claims 1, 9, and 13; in further view of Dirac et al., U.S. PGPUB 2015/0379424, published 12/31/2015 [hereafter referred as Dirac '424].
Regarding Claim 4, Ghanta in view of Maag as applied to Claim 1 teaches
The system of claim 1.
However, Ghanta in view of Maag does not teach
wherein the particular column contains one of 
(i) word vectors that describe, in a semantically-encoded vector space, the meaning of respective words, 
or 
(ii) paragraph vectors that describe, in a semantically-encoded vector space, the meaning of respective multi-word samples of text.  
Dirac ‘424 teaches
wherein the particular column contains one of 
(i) word vectors that describe, in a semantically-encoded vector space, the meaning of respective words ([Dirac ‘424 paragraph [0035]: performing data pre-processing operations on input data containing data records containing variables of data types such as text, and using natural language processing to perform feature processing (“Some machine learning workflows, which may correspond to a sequence of API requests from a client 164, may include the extraction and cleansing of input data records from raw data repositories 130 (e.g., repositories indicated in data source definitions 150) by input record handlers 160 of the MLS, as indicated by arrow 114. … The input data may comprise data records that include variables of any of a variety of data types, such as, for example text, … The output produced by the input record handlers may be fed to feature processors 162 (as indicated by arrow 115), where a set of transformation operations may be performed 162 in accordance with recipes 152 using another set of resources from pool 185. Any of a variety of feature processing approaches may be used depending on the problem domain: e.g., … natural language processing …”).] [Dirac ‘424 paragraph [0079]: performing the text data transformations by determining the root words to be included in an n-gram for use in a machine learning algorithm (“word vectors that describe, in a semantically-encoded vector space, the meaning of respective words”) (“… a recipe language defined by the MLS enables users to easily and concisely specify transformations to be performed on specified sets of data records to prepare the records for use for model training and prediction. … In at least one embodiment, a pipeline of successive transformations to be performed starting with a given input data set may be indicated within a single recipe. 
In one embodiment, the MLS may perform parameter optimization for one or more recipes---e.g., the MLS may automatically vary such transformation properties as the sizes of quantile bins or the number of root words to be included in an n-gram in an attempt to identify a more useful set of independent variables to be used for a particular machine learning algorithm.”).]), 
or 
(ii) paragraph vectors that describe, in a semantically-encoded vector space, the meaning of respective multi-word samples of text.  
Both Ghanta in view of Maag and Dirac ‘424 are analogous art since both teach data pre-processing phases in machine learning systems.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the data pre-processing phase of Ghanta in view of Maag and incorporate the text pre-processing steps of Dirac ‘424 as a way to handle text pre-processing in an input data set. The motivation to combine is taught in Dirac ‘424, as text data transformations that convert the data into root word vector representations/n-grams allow the machine learning system to perform automated parameter explorations for text data, thus improving upon the functionality of the machine learning system ([Dirac ‘424 paragraph [0093]: “For many types of feature processing transformation operations, such as creating quantile bins for numeric data attributes, generating ngrams, or removing sparse or infrequent words from documents being analyzed, parameters may typically have to be selected, such as the sizes/boundaries of the bins, the lengths of the ngrams, the removal criteria for sparse words, and so on. The values of such parameters (which may also be referred to as hyper-parameters in some environments) may have a significant impact on the predictions that are made using the recipe outputs. Instead of requiring MLS users to manually submit requests for each parameter setting or each combination of parameter settings, in some embodiments the MLS may support automated parameter exploration.”] [Dirac ‘424 paragraph [0094]: “Automated parameter exploration may also be used for selection dimensionality values for a vector representation of a text document (e.g., in accordance with the Latent Dirichlet Allocation (LDA) technique) or other natural language processing techniques. In some cases, the client may also indicate the criteria to be used to terminate exploration of the parameter value space, e.g., to arrive at acceptable parameter values. In at least some embodiments, the client may be given the option of letting the MLS decide the acceptance criteria to be used-such an option may be particularly useful for non-expert users. 
In one implementation, the client may indicate limits on resources or execution time for parameter exploration. In at least one implementation, the default setting for an auto-tune setting for at least some output transformations may be "true", e.g., a client may have to explicitly indicate that auto-tuning is not to be performed in order to prevent the MLS from exploring the parameter space for the transformations.”]).
Regarding Claim 11, Ghanta in view of Maag as applied to Claim 9 teaches
The method of claim 9, 
wherein the particular column contains one of 
(i) word vectors that describe, in a semantically-encoded vector space, the meaning of respective words (This claim limitation is similar in scope as the corresponding claim limitation in Claim 4, and hence is rejected under similar rationale), 
or 
(ii) paragraph vectors that describe, in a semantically-encoded vector space, the meaning of respective multi-word samples of text.  
Regarding Claim 16, Ghanta in view of Maag as applied to Claim 13 teaches
The article of manufacture of claim 13, 
wherein the particular column contains one of 
(i) word vectors that describe, in a semantically-encoded vector space, the meaning of respective words (This claim limitation is similar in scope as the corresponding claim limitation in Claim 11, and hence is rejected under similar rationale), 
or 
(ii) paragraph vectors that describe, in a semantically-encoded vector space, the meaning of respective multi-word samples of text.  
Claims 5, 12, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Ghanta et al., U.S. PGPUB 2020/0034665, filed 7/30/2018 [hereafter referred as Ghanta] in view of Maag et al., U.S. PGPUB 2017/0220403, published 8/3/2017 [hereafter referred as Maag] as applied to Claims 1, 9, and 13; in further view of Dirac et al., U.S. PGPUB 2015/0379430, published 12/31/2015 [hereafter referred as Dirac '430].
Regarding Claim 5, Ghanta in view of Maag as applied to Claim 1 teaches
The system of claim 1. 
However, Ghanta in view of Maag does not teach
wherein the data pre-processing phase of the ML pipeline includes removing duplicate entries from the input data set to generate the conditioned data set, and 
wherein generating a failure metric for at least one phase in the ML pipeline comprises determining that the target data set comprises less than a threshold amount of unique entries.  
Dirac ‘430 teaches
wherein the data pre-processing phase of the ML pipeline includes removing duplicate entries from the input data set to generate the conditioned data set ([Dirac ‘430 Figure 74, elements 7410, 7035; paragraphs [0362]-[0363]: using a probabilistic duplicate detector to remove duplicate entries from a data set (“wherein the data pre-processing phase of the ML pipeline includes removing duplicate entries from the input data set to generate a conditioned data set”) (“FIG. 74 illustrates an example of probabilistic duplicate detection within a given machine learning data set, according to at least some embodiments. … When the (K+1)th observation record of the data set is encountered, the probabilistic duplicate detector 7035 may use the alternate representation 7430 to determine whether the record represents a duplicate of an already-processed observation record of the same data set 7410. The newly encountered OR may be classified as a possible duplicate, or as a confirmed non-duplicate … In other embodiments, the duplicate detector may take other actions, such as simply notifying the client regarding the number of probably duplicates, or the duplicate detector may initiate the removal of the probable duplicates from the data set 7[4]10.”).]), and 
wherein generating a failure metric for at least one phase in the ML pipeline comprises determining that the target data set comprises less than a threshold amount of unique entries ([Dirac ‘430 Figure 75, elements 7501, 7504; paragraphs [0364]-[0365]: generating warnings and alerts (“failure metric”) based on reaching certain thresholds (discard if < 2%; warning if 5-10%; alerts if > 10% of target data set), where a detection of more than a threshold amount of duplicate entries is interpreted as another way of stating a detection of less than a certain threshold amount of unique entries (“wherein generating a failure metric for at least one phase in the ML pipeline comprises determining that the target data set comprises less than a threshold amount of unique entries”) (“FIG. 75 is a flow diagram illustrating aspects of operations that may be performed at a machine learning service that implements duplicate detection of observation records, according to at least some embodiments. … The MLS may also determine respective responsive actions to be taken if various levels of duplication are identified (element 7504) in the depicted embodiment. Examples of such actions may include transmitting warning or alert messages to the client … In at least one embodiment, in response to the identification of potential or likely duplicates within a data set, the MLS may suspend, abandon or cancel a machine learning job which involves the use of the data set or is otherwise associated with the data set. Different responses may be selected for respective duplication levels in some embodiments---e.g., a warning may be generated if the fraction of duplicates is estimated to be between 5% and 10%, while duplicates may simply be discarded if they are collectively less than 2% of the target data set. MLS clients may specify the types of actions they want taken for different extents of possible duplication in some embodiments.”).]).  
Both Ghanta in view of Maag and Dirac ‘430 are analogous art since both teach data pre-processing phases in machine learning systems.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the data pre-processing phase of Ghanta in view of Maag and incorporate the duplicate detection step of Dirac ‘430 as a way to detect and remove duplicates from an input data set. The motivation to combine is taught in Dirac ‘430, as machine learning systems are computationally intensive, and removing duplicate data will avoid the unnecessary deployment of resources ([Dirac ‘430 paragraph [0132]: “Some of the types of operations requested by MLS clients may be resource-intensive. For example, ingesting a terabyte-scale data set (e.g., in response to a client request to create a data store) or generating statistics on such a data set may take hours or days, depending on the set of resources deployed and the extent of parallelism used. Given the asynchronous manner in which client requests are handled in at least some embodiments, clients may sometimes end up submitting the same request multiple times.  … If, in response to such a duplicate submission, the MLS actually schedules another potentially large job, resources may be deployed unnecessarily and the client may in some cases be billed twice for a request that was only intended to be serviced once. Accordingly, in order to avoid such problematic scenarios, in at least one embodiment one or more of the programmatic interfaces supported by the MLS may be designed to be idempotent, such that the re-submission of a duplicate request by the same client does not have negative consequences.”]).
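For illustration only, the relationship relied upon above (more than a threshold amount of duplicate entries being equivalent to less than a threshold amount of unique entries) could be sketched as follows. This sketch uses exact matching of rows rather than the probabilistic duplicate detector described in Dirac ‘430; the function names and the default threshold are assumptions and are not taken from the claims or the cited references.

```python
def duplicate_fraction(rows):
    """Fraction of rows that exactly repeat an earlier row.

    Rows must be hashable (e.g., tuples). Exact matching is an assumption
    made for this sketch.
    """
    if not rows:
        return 0.0
    return (len(rows) - len(set(rows))) / len(rows)


def too_few_unique_entries(rows, min_unique_fraction=0.9):
    """Failure metric: the share of unique entries falls below a hypothetical threshold."""
    unique_fraction = 1.0 - duplicate_fraction(rows)
    return unique_fraction < min_unique_fraction
```

Because the unique share is computed as one minus the duplicate share, a duplicate fraction above 10% and a unique fraction below 90% describe the same condition under these assumptions.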
Regarding Claim 12, Ghanta in view of Maag as applied to Claim 9 teaches
The method of claim 9, 
wherein the data pre-processing phase of the ML pipeline includes removing duplicate entries from the input data set to generate the conditioned data set (This claim limitation is similar in scope as the corresponding claim limitation in Claim 5, and hence is rejected under similar rationale), and 
wherein generating a failure metric for at least one phase in the ML pipeline comprises determining that the target data set comprises less than a threshold amount of unique entries (This claim limitation is similar in scope as the corresponding claim limitation in Claim 5, and hence is rejected under similar rationale).  
Regarding Claim 17, Ghanta in view of Maag as applied to Claim 13 teaches
The article of manufacture of claim 13, 
wherein the data pre-processing phase of the ML pipeline includes removing duplicate entries from the input data set to generate the conditioned data set (This claim limitation is similar in scope as the corresponding claim limitation in Claim 12, and hence is rejected under similar rationale), and 
wherein generating a failure metric for at least one phase in the ML pipeline comprises determining that the target data set comprises less than a threshold amount of unique entries (This claim limitation is similar in scope as the corresponding claim limitation in Claim 12, and hence is rejected under similar rationale).  

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to WILLIAM WAI YIN KWAN whose telephone number is 303-297-4332.  The examiner can normally be reached on Monday-Friday 8:00am - 4:30pm PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li Zhen can be reached on 571-272-3768.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR 




/WILLIAM WAI YIN KWAN/Examiner, Art Unit 2121


/Li B. Zhen/Supervisory Patent Examiner, Art Unit 2121