DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant's arguments filed on 03/01/2021 have been fully considered but they are not persuasive. 
Applicant’s arguments with respect to claims 1-3, 5-10, 12-16, and 18-20 have been considered but are deemed to be moot because the arguments are directed to amended claim limitations that have not been previously examined. 
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-3 are rejected under 35 U.S.C. 103 as being unpatentable over Kuncheva et al. (“PCA Feature Extraction for Change Detection in Multidimensional Unlabeled Data”) in view of Lu et al. (“Concept drift detection via competence models”).
Regarding Claim 1,
	Kuncheva et al. teaches a method comprising: 
	receiving, by a sidecar learning model, operational input data submitted to a predictive learning model, the sidecar learning model trained on a same training data used to train the predictive learning model (p. 71, Figure 3 
    [image: media_image1.png]
and p. 70, section II, paragraph 1 “Fig. 3 shows the two major scenarios for change detection.
When the labels of the data are available straight after classification, or even with some delay, the classification error can be monitored directly. When substantial increase is found, change is signaled. Most of the existing change detection methods and criteria are developed under this assumption. Within the second scenario, labels are not available, and the question is whether the incoming data distribution matches the training one. The two scenarios share a distribution modeling block in the diagram. The modeling is sometimes implicit, and is included in the calculation of the change detection criterion. Compared to the multidimensional case, approximating distributions in the 1-D case can be much more accurate and useful. This explains the greater interest in the 1-D case. Methods such as hidden Markov models, Gaussian mixture modeling, Parzen windows, kernel-based approximation, and martingales have been proposed for this task.”  teaches unlabeled data [operational data] submitted to feature extractor [predictive learning model] and a distribution modeling block [sidecar learning model] used to determine an indicator of change,
and teaches labeled data [training data] submitted to classifier [predictive learning model] and distribution modeling block [sidecar learning model] used to determine indicator of change).
	Kuncheva et al. does not appear to explicitly teach determining a deviation of the operational input data from the training data by comparing the operational input data to the training data; generating by the sidecar learning model, a drift signal that characterizes the deviation of the operational input data from the training data; and based on the drift signal exceeding a predetermined threshold, retraining the predictive learning model based on the operational input data. 
	Lu et al. teaches determining a deviation of the operational input data from the training data by comparing the operational input data to the training data (p. 13, section 2.1, paragraph 1 “Concept drift detection can be formulated as follows. Suppose there is a CBR system listening to a data stream where each new observation is represented by $c_i = (x_i, y_i)$, where $x_i = (x_{i1}, x_{i2}, \ldots x_{in}) \in X$ is the feature vector, $y_i \in Y$ is the target label. As it is unrealistic to store the full history of the stream, we base our concept drift detection algorithm on a two-sliding-window paradigm. Both windows contain a number of successive data points. We assume data points within each window are independent random samples taken from two unknown, multi-dimensional, non-parametric distributions F and F’, respectively. We then define the null hypothesis H0, which asserts that F and F’ are identical. The goal is to design a proper statistical test that is able to not only refuse H0, if it is not true, but also highlight some local regions of the problem space where H0 does not hold and quantify the difference between F and F’. When H0 is true, the probability of making an error (where the test says that F and F’ are different when in fact they are not) should be, at most, α, where α is a user-supplied parameter” teaches comparing two data sets from two different sliding windows [comparing the operational input data to the training data] by quantifying a difference between their data distributions [determining a deviation of the operational input data from the training data]); 
generating by the sidecar learning model, a drift signal that characterizes the deviation of the operational input data from the training data (p. 13, section 2.1, paragraph 1 “Concept drift detection can be formulated as follows. Suppose there is a CBR system listening to a data stream where each new observation is represented by $c_i = (x_i, y_i)$, where $x_i = (x_{i1}, x_{i2}, \ldots x_{in}) \in X$ is the feature vector, $y_i \in Y$ is the target label. As it is unrealistic to store the full history of the stream, we base our concept drift detection algorithm on a two-sliding-window paradigm. Both windows contain a number of successive data points. We assume data points within each window are independent random samples taken from two unknown, multi-dimensional, non-parametric distributions F and F’, respectively. We then define the null hypothesis H0, which asserts that F and F’ are identical. The goal is to design a proper statistical test that is able to not only refuse H0, if it is not true, but also highlight some local regions of the problem space where H0 does not hold and quantify the difference between F and F’. When H0 is true, the probability of making an error (where the test says that F and F’ are different when in fact they are not) should be, at most, α, where α is a user-supplied parameter” teaches calculating the probability of making an error [drift signal]); and 
based on the drift signal exceeding a predetermined threshold, retraining the predictive learning model based on the operational input data (p. 14, section 2.2, paragraph 1 “ Gama et al. … presented a Drift Detection Method (DDM) that traces and controls the online error-rate of the learning algorithm. Treating the error of a set of examples as a random variable from Bernoulli trails, the probability for the number of errors in a sample of n examples can be generalized as Binomial distribution. A significant increase in the error of the algorithm suggests that the class distribution is changing. Their method declares a new concept if the error reaches the warning level, and a new model is learnt when the drift level is exceeded”  teaches a new model being learned [retraining predictive model] when drift level exceeds a warning level [based on the drift signal exceeding a predetermined threshold] based on the set of examples from Bernoulli trials [based on the operational input data]).
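For illustration of the cited passage only, the error-rate monitoring attributed to Gama et al. can be sketched as below. This is a minimal sketch under stated assumptions, not code from either reference: the class name `DriftMonitor`, the 2-sigma warning level, the 3-sigma drift level, and the 30-sample warm-up are assumed values chosen for the example. The quoted passage states only that a warning level and a drift level exist.

```python
import math


class DriftMonitor:
    """DDM-style monitor over a stream of 0/1 prediction errors (illustrative)."""

    def __init__(self, warning_sigmas=2.0, drift_sigmas=3.0, min_samples=30):
        # All three defaults are assumptions made for this sketch.
        self.warning_sigmas = warning_sigmas
        self.drift_sigmas = drift_sigmas
        self.min_samples = min_samples
        self.n = 0
        self.errors = 0
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, is_error):
        """Consume one outcome and return 'stable', 'warning', or 'drift'."""
        self.n += 1
        self.errors += int(bool(is_error))
        # Error rate and its standard deviation under a binomial model.
        p = self.errors / self.n
        s = math.sqrt(p * (1.0 - p) / self.n)
        if self.n < self.min_samples:
            return "stable"
        # Remember the best (lowest) error level seen so far.
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s
        level = p + s
        if level > self.p_min + self.drift_sigmas * self.s_min:
            return "drift"  # a new model would be learnt at this point
        if level > self.p_min + self.warning_sigmas * self.s_min:
            return "warning"
        return "stable"
```

In this sketch a caller would retrain the predictive model on recent data when `update` returns "drift"; the retraining step itself is outside the example.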
Kuncheva et al. and Lu et al. are considered analogous art because they are directed to approaches to detecting and handling concept drift due to degradation in classifier performance.
	In view of the teachings of Kuncheva et al. it would have been obvious for a person of ordinary skill in the art to apply the teachings of Lu et al. at the time the application was filed in order to use concept drift to effectively assist decision makers to perform smarter maintenance operations on a case-based reasoning system at an appropriate time (cf. Lu et al., p. 12, section 1, paragraphs 5-6, “ … Knowing whether concept drift happens could help to recognize obsolete cases that conflict with current concepts and distinguish noise cases from novel cases. Moreover, developing a detection method that is able to explain where and how concept drifts could facilitate further decision capabilities and be suitable for handling local concept drift problems.”). The Examiner notes that a person of ordinary skill in the art would find a suggestion to perform this type of analysis since Kuncheva et al. discloses this as a necessary activity for the taught invention (cf. Kuncheva et al., p. 70, section I, paragraph 7 “… we propose that principal component analysis (PCA) can be used as a general method for feature extraction to improve change detection from multidimensional unlabeled incoming data.”).
Regarding Claim 2, 
	Kuncheva et al. in view of Lu et al. teaches the method of claim 1. 
	Kuncheva et al. does not appear to explicitly teach the method … further comprising: receiving the training data; modeling, by the sidecar learning model, a joint distribution of the training data; and wherein determining the deviation of the operational input data from the training data comprises comparing the joint distribution of the training data to the joint distribution of the operational data.
 	Lu et al. teaches the method … further comprising: receiving the training data (p. 13, section 2.1, paragraph 2 “ … As is well-known, one of the challenges in the spam filtering domain is to handle concept drift problems. In a case-based spam filtering system where emails are continuously classified, the problem arises of how we can benefit from the available feedback (new cases) and improve the accuracy of the system. Treating the newest emails as an independent training set, e.g., emails received during the last month, we can detect whether there is a concept drift between our existing case-base that is assumed to follow an unknown distribution … , and the most recent emails, which are assumed to follow an unknown distribution … The correct maintenance can accordingly be carried out when a drift has been reported”  teaches a spam filtering system receiving new emails [receiving the training data] ); 
	modeling, by the sidecar learning model, a joint distribution of the training data; and wherein determining the deviation of the operational input data from the training data comprises comparing the joint distribution of the training data to the joint distribution of the operational data (p. 13, section 2.1, paragraph 1 “Concept drift detection can be formulated as follows. Suppose there is a CBR system listening to a data stream where each new observation is represented by $c_i = (x_i, y_i)$, where $x_i = (x_{i1}, x_{i2}, \ldots x_{in}) \in X$ is the feature vector, $y_i \in Y$ is the target label. As it is unrealistic to store the full history of the stream, we base our concept drift detection algorithm on a two-sliding-window paradigm. Both windows contain a number of successive data points. We assume data points within each window are independent random samples taken from two unknown, multi-dimensional, non-parametric distributions F and F’, respectively. We then define the null hypothesis H0, which asserts that F and F’ are identical. The goal is to design a proper statistical test that is able to not only refuse H0, if it is not true, but also highlight some local regions of the problem space where H0 does not hold and quantify the difference between F and F’. When H0 is true, the probability of making an error (where the test says that F and F’ are different when in fact they are not) should be, at most, α, where α is a user-supplied parameter” teaches using a hypothesis test to determine if the two distributions F and F’ are identical [comparing the joint distribution of the training data to the joint distribution of the operational data]).
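The two-window hypothesis test described in the quoted passage can be illustrated with the sketch below. It is an assumption-laden example, not the competence-model test of Lu et al.: it substitutes a one-dimensional Kolmogorov-Smirnov statistic with a permutation p-value, and the function names, the 500 permutations, and the fixed seed are hypothetical choices. Only the null hypothesis (F and F’ are identical) and the user-supplied parameter α come from the quotation.

```python
import random


def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of two 1-D samples."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    gap = 0.0
    for v in sorted(set(a) | set(b)):
        while i < len(a) and a[i] <= v:
            i += 1
        while j < len(b) and b[j] <= v:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap


def two_window_test(reference, recent, alpha=0.05, n_permutations=500, seed=0):
    """Return (statistic, p_value, reject_h0) for H0: both windows share a distribution."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    observed = ks_statistic(reference, recent)
    pooled = list(reference) + list(recent)
    n = len(reference)
    hits = 0
    for _ in range(n_permutations):
        # Under H0 the window labels are exchangeable, so shuffle them.
        rng.shuffle(pooled)
        if ks_statistic(pooled[:n], pooled[n:]) >= observed:
            hits += 1
    p_value = (hits + 1) / (n_permutations + 1)
    return observed, p_value, p_value <= alpha
```

Here `reference` plays the role of the older window (training data) and `recent` the newer window (operational data); rejecting H0 corresponds to reporting a drift.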
Kuncheva et al. and Lu et al. are combinable for the same rationale as set forth above with respect to claim 1.
Regarding Claim 3,
	Kuncheva et al. in view of Lu et al. teaches the method of claim 2.
Kuncheva et al. further teaches wherein the sidecar learning model comprises one of a Gaussian mixture model, a self organizing map, an auto-encoding neural network, and a Mahalanobis-Taguchi system (p. 71, Figure 3 
    [image: media_image1.png]
and p. 70, section II, paragraph 1 “Fig. 3 shows the two major scenarios for change detection.
When the labels of the data are available straight after classification, or even with some delay, the classification error can be monitored directly. When substantial increase is found, change is signaled. Most of the existing change detection methods and criteria are developed under this assumption. Within the second scenario, labels are not available, and the question is whether the incoming data distribution matches the training one. The two scenarios share a distribution modeling block in the diagram. The modeling is sometimes implicit, and is included in the calculation of the change detection criterion. Compared to the multidimensional case, approximating distributions in the 1-D case can be much more accurate and useful. This explains the greater interest in the 1-D case. Methods such as hidden Markov models, Gaussian mixture modeling, Parzen windows, kernel-based approximation, and martingales have been proposed for this task.”  teaches the distribution modeling block [sidecar learning model] comprising a Gaussian mixture model).
Any limitation that recites “or” has been interpreted as requiring one of the alternatives and not all of the alternatives. 
Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Kuncheva et al. (“PCA Feature Extraction for Change Detection in Multidimensional Unlabeled Data”) in view of Lu et al. (“Concept drift detection via competence models”) and in further view of Lindstrom et al. (“Drift Detection using Uncertainty Distribution Divergence”).
Regarding Claim 5,
Kuncheva et al. in view of Lu et al. teaches the method of claim 1.
Kuncheva et al. in view of Lu et al. does not appear to explicitly teach the method … further comprising automatically generating the sidecar learning model.
Lindstrom et al. teaches the method … further comprising automatically generating the sidecar learning model (p. 7, Figure 1 
    [image: media_image2.png]
and p. 7, section 3, paragraph 2 “Figure 1 shows an overview of the CDBD process. At a high level CDBD monitors an indicator for the occurrence of concept drift and when it triggers the classifier is rebuilt using recent data …” teaches measuring for drift [automatically generating the sidecar learning model] once a batch of new data is classified).
Kuncheva et al., Lu et al., and Lindstrom et al. are considered analogous art because they are directed to approaches to detecting and handling concept drift due to degradation in classifier performance.
	In view of the teachings of Kuncheva et al. in view of Lu et al. it would have been obvious for a person of ordinary skill in the art to apply the teachings of Lindstrom et al. at the time the application was filed in order to identify triggered detection approaches that do not require labelled instances to detect concept drift, thus reducing dependency on classification of test examples (cf. Lindstrom et al., p. 2, section 1, paragraph 5 “ … CDBD is a concept drift handling approach that explicitly detects changes without requiring the true classes of test instances. CDBD compares the distribution of classifier output confidences in a batch of test examples to a reference distribution constructed from training data, and uses this comparison to generate a measure of concept drift. When this measure is above a given threshold, concept drift is deemed to have taken place, and the classifier is updated. CDBD only requires labelled data to update the classifier once concept drift has been identified, and so using CDBD can significantly reduce the overall amount of labelled data required to keep a classification model up to date …”). The Examiner notes that a person of ordinary skill in the art would find a suggestion to perform this type of analysis since Kuncheva et al. discloses this as a necessary activity for the taught invention (cf. Kuncheva et al., p. 70, section I, paragraph 7 “… we propose that principal component analysis (PCA) can be used as a general method for feature extraction to improve change detection from multidimensional unlabeled incoming data.”).
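The comparison described in the Lindstrom et al. quotation (batch confidence distribution versus a reference distribution, with a threshold on the resulting measure) can be sketched as follows. This is an illustrative sketch only: the histogram binning, the add-one smoothing, the use of a Kullback-Leibler divergence as the measure, the 0.1 threshold, and all function names are assumptions made for the example and are not drawn from the quoted text.

```python
import math


def confidence_histogram(confidences, bins=10, smoothing=1.0):
    """Smoothed, normalized histogram of confidences assumed to lie in [0, 1]."""
    counts = [smoothing] * bins
    for c in confidences:
        idx = max(0, min(int(c * bins), bins - 1))  # clamp into a valid bin
        counts[idx] += 1
    total = sum(counts)
    return [count / total for count in counts]


def drift_measure(reference_conf, batch_conf, bins=10):
    """KL divergence of the batch histogram from the reference histogram."""
    p = confidence_histogram(batch_conf, bins)
    q = confidence_histogram(reference_conf, bins)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def drift_detected(reference_conf, batch_conf, threshold=0.1, bins=10):
    """True when the measure is above the (assumed) threshold."""
    return drift_measure(reference_conf, batch_conf, bins) > threshold
```

No labels are needed to compute the measure; labelled data would be required only for the classifier update that follows a detection, which is the property the quotation emphasizes.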
Claims 6-8 are rejected under 35 U.S.C. 103 as being unpatentable over Kuncheva et al. (“PCA Feature Extraction for Change Detection in Multidimensional Unlabeled Data”) in view of Lu et al. (“Concept drift detection via competence models”) and in further view of Kirby et al. (US 2009/0043547 A1).
Regarding Claim 6,
Kuncheva et al. in view of Lu et al. teaches the method of claim 1.
Kuncheva et al. in view of Lu et al. does not appear to explicitly teach wherein generating the drift signal that characterizes the deviation of the operational input data from the training data further comprises generating an alert that indicates the operational input data deviates from the training data by a predetermined criteria.
	Kirby et al. teaches wherein generating the drift signal that characterizes the deviation of the operational input data from the training data further comprises generating an alert that indicates the operational input data deviates from the training data by a predetermined criteria (paragraph 0367, “Output from the model generator 2128 and/or the approximation model analyzer 2136 may include various statistics and/or graphs related to how well the generated model conforms to the training data. In particular, one or more of the various graphs and/or statistics illustrated in FIGS. 4A-4D, 5A-5D, 6A-6C, 7, 10A-10B, 11A, 11B, 12A-12C, 13A, 13B, 14A-14D, 15A-15C, 16, 17, 18, 19, and/or 20” teaches the alert being represented as a graph). 
Kuncheva et al., Lu et al., and Kirby et al. are considered analogous art because they are directed to approximation methods for performing non-linear classification of unstructured datasets. 
	In view of the teachings of Kuncheva et al. in view of Lu et al. it would have been obvious for a person of ordinary skill in the art to apply the teachings of Kirby et al. at the time the application was filed in order to model diverse data sets without making any changes to the programmatic embodiment (cf. Kirby et al., paragraph 0177 “ … no adjustments or parameter settings were made to the programmatic embodiment based on the data set being approximated. Hence, the present approximation method and system approaches a black-box methodology for nonlinear function approximation. This feature of the present approximation method and system permits the advancement of a variety of other processes, e.g., the representation of data on manifolds as graphs of functions …, pattern classification …, as well as the low-dimensional modeling of dynamical systems …”). The Examiner notes that a person of ordinary skill in the art would find a suggestion to perform this type of analysis since Kuncheva et al. discloses this as a necessary activity for the taught invention (cf. Kuncheva et al., p. 70, section I, paragraph 7 “… we propose that principal component analysis (PCA) can be used as a general method for feature extraction to improve change detection from multidimensional unlabeled incoming data.”).
Regarding Claim 7,
Kuncheva et al. in view of Lu et al. teaches the method of claim 1.
Kuncheva et al. in view of Lu et al. does not appear to explicitly teach the method … further comprising generating a confidence signal that identifies a confidence level of the predictive learning model to the operational input data based on the drift signal. 
Kirby et al. teaches the method … further comprising generating a confidence signal that identifies a confidence level of the predictive learning model to the operational input data based on the drift signal (paragraph 0365 “Accordingly, once the generator 2128 has completed an instance of training, the generator 2128 outputs the data defining the generated model to the approximation models database 2132 together with various other types of information such as the final ACF, the final RMSE, a confidence level …, etc. Moreover, the user may be notified of such results.” teaches outputting model data such as a confidence level [confidence signal] to the user).
Kuncheva et al., Lu et al., and Kirby et al. are combinable for the same rationale as set forth above with respect to claim 6.
Regarding Claim 8,
Kuncheva et al. in view of Lu et al. teaches the method of claim 1.
Kuncheva et al. in view of Lu et al. does not appear to explicitly teach the method … further comprising presenting, in a user interface, a real-time graph that depicts the deviation of the operational input data from the training data.
Kirby et al. teaches the method … further comprising presenting, in a user interface, a real-time graph that depicts the deviation of the operational input data from the training data (paragraph 0367, “Output from the model generator 2128 and/or the approximation model analyzer 2136 may include various statistics and/or graphs related to how well the generated model conforms to the training data. In particular, one or more of the various graphs and/or statistics illustrated in FIGS. 4A-4D, 5A-5D, 6A-6C, 7, 10A-10B, 11A, 11B, 12A-12C, 13A, 13B, 14A-14D, 15A-15C, 16, 17, 18, 19, and/or 20” teaches outputs from the model generator including graphs [real-time graph] related to how well the model conforms to the training data).
Kuncheva et al., Lu et al., and Kirby et al. are combinable for the same rationale as set forth above with respect to claim 6.




Claims 9-10 and 15-16 are rejected under 35 U.S.C. 103 as being unpatentable over Brand et al. (US 2016/0071027 A1) in view of Kuncheva et al. (“PCA Feature Extraction for Change Detection in Multidimensional Unlabeled Data”) and in further view of Lu et al.  (“Concept drift detection via competence models”).
Regarding Claim 9,
	Brand et al. teaches a computing device, comprising: a memory and a processor device coupled to the memory (paragraphs 0154-0155, “[0154] … The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data … [0155] … The processor and memory can be … incorporated in, special purpose logic circuitry” teaches a computer [computing device] comprising a memory and processor being collocated in special purpose logic circuitry [a processor device coupled to the memory]) to:
… determine a deviation of the operational input data from the training data by comparing the operational input data to the training data; generate, by the sidecar learning model, a drift signal that characterizes the deviation of the operational input data from the training data (Figure 9 
and paragraph 0146, “The concept drift engine 930 receives parameters 912 and 922, and determines a difference between the parameters. In determining a difference, the concept drift engine 930 can determine a sum of a difference between each particular parameter of the parameters 912 and 922 … “ teaches the concept drift engine [sidecar learning model] comparing the parameters of the first system [operational input data] to the parameters of the second system by calculating a difference [deviation] between the parameters of the first system [operational input data] and the parameters of the second system [training data]).
Brand et al. does not appear to explicitly teach … receive, by a sidecar learning model, operational input data submitted to a predictive learning model, the sidecar learning model trained on a same training data used to train the predictive learning model; and … based on the drift signal exceeding a predetermined threshold, retrain the predictive learning model based on the operational input data. 
Kuncheva et al. teaches … receive, by a sidecar learning model, operational input data submitted to a predictive learning model, the sidecar learning model trained on a same training data used to train the predictive learning model (p. 71, Figure 3 
and p. 70, section II, paragraph 1 “Fig. 3 shows the two major scenarios for change detection.
When the labels of the data are available straight after classification, or even with some delay, the classification error can be monitored directly. When substantial increase is found, change is signaled. Most of the existing change detection methods and criteria are developed under this assumption. Within the second scenario, labels are not available, and the question is whether the incoming data distribution matches the training one. The two scenarios share a distribution modeling block in the diagram. The modeling is sometimes implicit, and is included in the calculation of the change detection criterion. Compared to the multidimensional case, approximating distributions in the 1-D case can be much more accurate and useful. This explains the greater interest in the 1-D case. Methods such as hidden Markov models, Gaussian mixture modeling, Parzen windows, kernel-based approximation, and martingales have been proposed for this task.”  teaches unlabeled data [operational data] submitted to feature extractor [predictive learning model] and a distribution modeling block [sidecar learning model] used to determine an indicator of change, 
and teaches labeled data [training data] submitted to classifier [predictive learning model] and distribution modeling block [sidecar learning model] used to determine indicator of change).
Brand et al. and Kuncheva et al. are considered analogous art because they are directed to approaches to detecting and handling concept drift due to degradation in classifier performance.
	In view of the teachings of Brand et al., it would have been obvious for a person of ordinary skill in the art to apply the teachings of Kuncheva et al. at the time the application was filed in order to use principal component analysis to extract features from data and improve drift detection from multidimensional unlabeled incoming data (cf. Kuncheva et al., pp. 69-70, section I, paragraph 6 “Given the context-dependent nature of concept change, feature extraction …”). The Examiner notes that a person of ordinary skill in the art would find a suggestion to perform this type of analysis since Brand et al. discloses this as a necessary activity for the taught invention (cf. Brand et al., paragraph 0019, “… a system may need to flag trending words in social media, and the earlier this flagging can be done the better. This is done by having two identical algorithms simultaneously model the data based on incoming events, where the only difference between the two algorithms is that they are given distinct learning-rate parameters. By comparing the output of the two algorithms and measuring the “concept drift” between their two models, trends and changes in trends can be detected. At any given time point T, the algorithms model the system based on the data they have seen, i.e., it captures the state of the system at time T-A. Varying the learning rate results in a different A, larger for a stable model, smaller for an agile model. Comparing the two results is akin to taking the derivative, and thus akin to measuring the rate of change. A developer may implement the same system but with more copies of the basic algorithm in order to measure higher-order derivatives. A second-derivative, for example, is a useful metric for detecting sudden jumps and separating them from gradual changes.”).
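For illustration only, the two-learning-rate comparison described in the quoted paragraph 0019 of Brand et al. can be sketched as follows. The sketch assumes that each of the two "identical algorithms" is a simple exponential moving average; the names `ema_update` and `learning_rate_gap` and the two rate values are hypothetical and do not appear in Brand et al.

```python
def ema_update(estimate, observation, learning_rate):
    """Move the running estimate toward the new observation."""
    return estimate + learning_rate * (observation - estimate)


def learning_rate_gap(stream, stable_rate=0.05, agile_rate=0.5):
    """Run two otherwise identical models with distinct learning rates.

    Returns, for each observation after the first, the difference between
    the agile model and the stable model. A gap that stays near zero
    suggests no trend; a large gap suggests the data is changing.
    """
    stable = agile = stream[0]
    gaps = []
    for value in stream[1:]:
        stable = ema_update(stable, value, stable_rate)
        agile = ema_update(agile, value, agile_rate)
        gaps.append(agile - stable)
    return gaps


if __name__ == "__main__":
    # A stream that jumps from 0 to 10 halfway through.
    print(learning_rate_gap([0.0] * 5 + [10.0] * 5))
```

On a constant stream both models agree and the gap stays at zero; after a jump the agile model moves first, so the gap opens and then slowly closes as the stable model catches up.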
Lu et al. further teaches … based on the drift signal exceeding a predetermined threshold, retrain the predictive learning model based on the operational input data (p. 14, section 2.2, paragraph 1 “Gama et al. … presented a Drift Detection Method (DDM) that traces and controls the online error-rate of the learning algorithm. Treating the error of a set of examples as a random variable from Bernoulli trials, the probability for the number of errors in a sample of n examples can be generalized as Binomial distribution. A significant increase in the error of the algorithm suggests that the class distribution is changing. Their method declares a new concept if the error reaches the warning level, and a new model is learnt when the drift level is exceeded” teaches a new model being learned [retraining predictive model] when drift level exceeds a warning level [based on the drift signal exceeding a predetermined threshold] based on the set of examples from Bernoulli trials [based on the operational input data]).
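For illustration only, the warning-level and drift-level behavior summarized in the quoted passage can be sketched as follows. The quoted passage does not state the numeric levels; the sketch assumes the commonly cited formulation of DDM, in which the warning level is the minimum recorded error rate plus two standard deviations and the drift level is the minimum plus three. The class name `DriftDetector`, the 30-sample warm-up, and the strict comparisons are assumptions of this sketch.

```python
import math


class DriftDetector:
    """Tracks an online error rate and flags warning and drift levels."""

    def __init__(self, min_samples=30):
        self.min_samples = min_samples  # assumed warm-up period
        self.n = 0
        self.errors = 0
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, is_error):
        """Record one prediction outcome; return 'stable', 'warning' or 'drift'."""
        self.n += 1
        self.errors += int(is_error)
        p = self.errors / self.n               # observed error rate
        s = math.sqrt(p * (1 - p) / self.n)    # binomial standard deviation
        if self.n < self.min_samples:
            return "stable"
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s      # remember the best point so far
        if p + s > self.p_min + 3 * self.s_min:
            return "drift"                     # a new model would be learnt here
        if p + s > self.p_min + 2 * self.s_min:
            return "warning"
        return "stable"


if __name__ == "__main__":
    detector = DriftDetector()
    outcomes = [i % 10 == 0 for i in range(1, 101)] + [True] * 100
    statuses = [detector.update(o) for o in outcomes]
    print(statuses.index("drift") + 1)  # sample count at which drift is first flagged
```

In the usage example the error rate holds near 10% for 100 samples and then rises sharply, so the drift level is exceeded shortly after the change.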
Regarding Claim 10,
	Brand et al. in view of Kuncheva et al. and in further view of Lu et al. teaches the computing device of claim 9.
Brand et al. in view of Kuncheva et al. does not appear to explicitly teach wherein the processor device is further to receive the training data; model, by the sidecar learning model, a joint distribution of the training data; and wherein to determine the deviation of the operational input data from the training data, the processor device is further to compare the joint distribution of the training data to the joint distribution of the operational data.
 	Lu et al. teaches wherein the processor device is further to receive the training data (p. 13, section 2.1, paragraph 2 “ … As is well-known, one of the challenges in the spam filtering domain is to handle concept drift problems. In a case-based spam filtering system where emails are continuously classified, the problem arises of how we can benefit from the available feedback (new cases) and improve the accuracy of the system. Treating the newest emails as an independent training set, e.g., emails received during the last month, we can detect whether there is a concept drift between our existing case-base that is assumed to follow an unknown distribution … , and the most recent emails, which are assumed to follow an unknown distribution … The correct maintenance can accordingly be carried out when a drift has been reported”  teaches a spam filtering system receiving new emails [receive the training data] ); 
	model, by the sidecar learning model, a joint distribution of the training data; and wherein to determine the deviation of the operational input data from the training data, the processor device is further to compare the joint distribution of the training data to the joint distribution of the operational data (p. 13, section 2.1, paragraph 1 “Concept drift detection can be formulated as follows. Suppose there is a CBR system listening to a data stream where each new observation is represented by $c_i = (x_i, y_i),$ where $x_i = (x_i^1, x_i^2, \ldots, x_i^n) \in X$ is the feature vector, $y_i \in Y$ is the target label. As it is unrealistic to store the full history of the stream, we base our concept drift detection algorithm on a two-sliding-window paradigm. Both windows contain a number of successive data points. We assume data points within each window are independent random samples taken from two unknown, multi-dimensional, non-parametric distributions F and F’, respectively. We then define the null hypothesis H0, which asserts that F and F’ are identical. The goal is to design a proper statistical test that is able to not only refuse H0, if it is not true, but also highlight some local regions of the problem space where H0 does not hold and quantify the difference between F and F’. When H0 is true, the probability of making an error (where the test says that F and F’ are different when in fact they are not) should be, at most, α, where α is a user-supplied parameter” teaches hypothesis testing using data drawn from a multi-dimensional distribution F [model, by the sidecar learning model, a joint distribution of the training data], and 
teaches using a hypothesis test to determine if the two distributions F and F’ are identical [compare the joint distribution of the training data to the joint distribution of the operational data]).
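For illustration only, the two-window null-hypothesis test described in the quoted passage can be sketched as follows. This is not the competence-model test of Lu et al.; the sketch substitutes a generic permutation test on the distance between the window means, and the names `mean_gap`, `permutation_p_value`, and `drift_detected`, the permutation count, and the seed are all hypothetical.

```python
import math
import random


def mean_vector(window):
    """Component-wise mean of a window of equal-length feature vectors."""
    dims = len(window[0])
    return [sum(point[d] for point in window) / len(window) for d in range(dims)]


def mean_gap(window_a, window_b):
    """Test statistic: Euclidean distance between the two window means."""
    return math.dist(mean_vector(window_a), mean_vector(window_b))


def permutation_p_value(window_a, window_b, n_permutations=999, seed=0):
    """Estimate how often a random relabeling gives a gap at least as large."""
    rng = random.Random(seed)
    observed = mean_gap(window_a, window_b)
    pooled = list(window_a) + list(window_b)
    split = len(window_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if mean_gap(pooled[:split], pooled[split:]) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)


def drift_detected(window_a, window_b, alpha=0.05):
    """Reject H0 (F and F' identical) when the p-value is at most alpha."""
    return permutation_p_value(window_a, window_b) <= alpha


if __name__ == "__main__":
    old = [(i * 0.1, 0.0) for i in range(10)]
    new = [(10.0 + i * 0.1, 10.0) for i in range(10)]
    print(drift_detected(old, new))
```

Here `alpha` plays the role of the user-supplied parameter α in the quoted passage: it bounds the chance of reporting a difference when the two windows come from the same distribution.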
Brand et al., Kuncheva et al., and Lu et al. are combinable for the same rationale as set forth above with respect to claim 9.
Regarding Claim 15,
	Claim 15 is substantially similar to claim 9 and therefore is rejected on the same ground as claim 9.  Claim 15 is directed to a “computer program product” that corresponds to the computing device of claim 9. 
	Brand et al. teaches a computer program product stored on a non-transitory computer-readable storage medium that includes instructions (paragraph 0148, “ … Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory program carrier for execution by, or to control the operation of data processing apparatus … “ ).
Regarding Claim 16,
	Claim 16 is substantially similar to claim 10 and therefore is rejected on the same ground as claim 10.  Claim 16 is directed to a “computer program product” that corresponds to the computing device of claim 10. 
Brand et al. teaches a computer program product stored on a non-transitory computer-readable storage medium that includes instructions (paragraph 0148, “ … Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory program carrier for execution by, or to control the operation of data processing apparatus … “ ).

Claims 12 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Brand et al. (US 2016/0071027 A1) in view of Kuncheva et al. (“PCA Feature Extraction for Change Detection in Multidimensional Unlabeled Data”) and in view of Lu et al.  (“Concept drift detection via competence models”) and in further view of Lindstrom et al. (“Drift Detection using Uncertainty Distribution Divergence”).
Regarding Claim 12,
	Brand et al. in view of Kuncheva et al. and in further view of Lu et al. teaches the computing device of claim 9.
Brand et al. in view of Kuncheva et al. and in further view of Lu et al. does not appear to explicitly teach wherein the processor device is further to: receive a request to train the predictive learning model; and in response to the request, automatically generate the sidecar learning model.
Lindstrom et al. teaches wherein the processor device is further to: receive a request to train the predictive learning model; and in response to the request, automatically generate the sidecar learning model (p. 7, Figure 1 
 and p. 7, section 3, paragraph 2 “Figure 1 shows an overview of the CDBD process. At a high level CDBD monitors an indicator for the occurrence of concept drift and when it triggers the classifier is rebuilt using recent data …”  teaches measuring for drift [automatically generating the sidecar learning model] once batch of new data is classified; 
p. 8, section 3, paragraph 5 “The trigger in CDBD is the rule, or rules, which use the indicator to determine that concept drift has occurred and the classifier should be updated … “ teaches the classifier requiring updating [train the predictive learning model] based on determination by an indicator that concept drift has occurred [receiving a request]).
Brand et al., Kuncheva et al., Lu et al., and Lindstrom et al. are considered analogous art because they are directed to approaches to detecting and handling concept drift due to degradation in classifier performance.
	In view of the teachings of Brand et al. in view of Kuncheva et al. and in further view of Lu et al., it would have been obvious for a person of ordinary skill in the art to apply the teachings of Lindstrom et al. at the time the application was filed in order to identify triggered detection approaches that do not require labelled instances to detect concept drift, thus reducing dependency on classification of test examples (cf. Lindstrom et al., p. 2, section 1, paragraph 5 “ … CDBD is a concept drift handling approach that explicitly detects changes without requiring the true classes of test instances. CDBD compares the distribution of classifier output confidences in a batch of test examples to a reference distribution constructed from training data, and uses this comparison to generate a measure of concept drift. When this measure is above a given threshold, concept drift is deemed to have taken place, and the classifier is updated. CDBD only requires labelled data to update the classifier once concept drift has been identified, and so using CDBD can significantly reduce the overall amount of labelled data required to keep a classification model up to date …”). The Examiner notes that a person of ordinary skill in the art would find a suggestion to perform this type of analysis since Brand et al. discloses this as a necessary activity for the taught invention (cf. Brand et al., paragraph 0019, “… a system may need to flag trending words in social media, and the earlier this flagging can be done the better. This is done by having two identical algorithms simultaneously model the data based on incoming events, where the only difference between the two algorithms is that they are given distinct learning-rate parameters. By comparing the output of the two algorithms and measuring the “concept drift” between their two models, trends and changes in trends can be detected. At …”).
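For illustration only, the comparison of classifier output confidences against a reference distribution, as described in the quoted passage of Lindstrom et al., can be sketched as follows. The quoted passage does not name the divergence measure; the sketch assumes a fixed-bin histogram and the total variation distance as a stand-in, and the names `histogram`, `total_variation`, and `needs_update`, the bin count, and the threshold value are hypothetical.

```python
def histogram(confidences, bins=10):
    """Normalized histogram of confidences, each assumed to lie in [0, 1]."""
    counts = [0] * bins
    for c in confidences:
        index = min(int(c * bins), bins - 1)  # a confidence of 1.0 goes in the last bin
        counts[index] += 1
    total = len(confidences)
    return [count / total for count in counts]


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def needs_update(reference_confidences, batch_confidences, threshold=0.3):
    """Deem drift to have occurred when the measure exceeds the threshold."""
    reference = histogram(reference_confidences)
    batch = histogram(batch_confidences)
    return total_variation(reference, batch) > threshold


if __name__ == "__main__":
    reference = [0.95] * 50 + [0.85] * 50   # confident on training-like data
    batch = [0.55] * 100                    # much less confident on new data
    print(needs_update(reference, batch))
```

No true labels of the batch are consulted; only the confidences the classifier emits are compared, which is the property the quoted passage emphasizes.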
Regarding Claim 18,
	Claim 18 is substantially similar to claim 12 and therefore is rejected on the same ground as claim 12.  Claim 18 is directed to a “computer program product” that corresponds to the computing device of claim 12. 
Brand et al. teaches a computer program product stored on a non-transitory computer-readable storage medium that includes instructions (paragraph 0148, “ … Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory program carrier for execution by, or to control the operation of data processing apparatus … “ ).
Claims 13-14 and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Brand et al. (US 2016/0071027 A1) in view of Kuncheva et al. (“PCA Feature Extraction for Change Detection in Multidimensional Unlabeled Data”) and in view of Lu et al.  (“Concept drift detection via competence models”) and in further view of Kirby et al. (US 2009/0043547 A1).
Regarding Claim 13,
Brand et al. in view of Kuncheva et al. and in further view of Lu et al. teaches the computing device of claim 9.
Brand et al. in view of Kuncheva et al. and in further view of Lu et al. does not appear to explicitly teach wherein the processor device is further to generate a confidence signal that identifies a confidence level of the predictive learning model to the operational input data based on the drift signal.
Kirby et al. teaches wherein the processor device is further to generate a confidence signal that identifies a confidence level of the predictive learning model to the operational input data based on the drift signal (paragraph 0365 “Accordingly, once the generator 2128 has completed an instance of training, the generator 2128 outputs the data defining the generated model to the approximation models database 2132 together with various other types of information such as the final ACF, the final RMSE, a confidence level …, etc. Moreover, the user may be notified of such results.” teaches outputting model data such as a confidence level [confidence signal] to the user).
Brand et al., Kuncheva et al., Lu et al., and Kirby et al. are considered analogous art because they are directed to approaches to detecting and handling concept drift due to degradation in classifier performance.
	In view of the teachings of Brand et al. in view of Kuncheva et al. and in further view of Lu et al., it would have been obvious for a person of ordinary skill in the art to apply the teachings of Kirby et al. at the time the application was filed in order to model diverse data sets without making any changes to embodiments of the classification method and system, thus accelerating classification of the data set being modeled (cf. Kirby et al., paragraph 0177 “ … no adjustments or parameter settings were made to the programmatic embodiment based on the …”). The Examiner notes that a person of ordinary skill in the art would find a suggestion to perform this type of analysis since Brand et al. discloses this as a necessary activity for the taught invention (cf. Brand et al., paragraph 0019, “… a system may need to flag trending words in social media, and the earlier this flagging can be done the better. This is done by having two identical algorithms simultaneously model the data based on incoming events, where the only difference between the two algorithms is that they are given distinct learning-rate parameters. By comparing the output of the two algorithms and measuring the “concept drift” between their two models, trends and changes in trends can be detected. At any given time point T, the algorithms model the system based on the data they have seen, i.e., it captures the state of the system at time T-A. Varying the learning rate results in a different A, larger for a stable model, smaller for an agile model. Comparing the two results is akin to taking the derivative, and thus akin to measuring the rate of change. A developer may implement the same system but with more copies of the basic algorithm in order to measure higher-order derivatives. A second-derivative, for example, is a useful metric for detecting sudden jumps and separating them from gradual changes.”).
Regarding Claim 14,
	Brand et al. in view of Kuncheva et al. and in further view of Lu et al. teaches the computing device of claim 9.
Brand et al. in view of Kuncheva et al. and in further view of Lu et al. does not appear to explicitly teach wherein the processor device is further to present, in a user interface, a real-time graph that depicts the deviation of the operational input data from the training data.
Kirby et al. teaches wherein the processor device is further to present, in a user interface, a real-time graph that depicts the deviation of the operational input data from the training data (paragraph 0367, “Output from the model generator 2128 and/or the approximation model analyzer 2136 may include various statistics and/or graphs related to how well the generated model conforms to the training data. In particular, one or more of the various graphs and/or statistics illustrated in FIGS. 4A-4D, 5A-5D, 6A-6C, 7, 10A-10B, 11A, 11B, 12A-12C, 13A, 13B, 14A-14D, 15A-15C, 16, 17, 18, 19, and/or 20” teaches outputs from the model generator including graphs [real-time graph] related to how well the model conforms to the training data).
Brand et al., Kuncheva et al., Lu et al., and Kirby et al. are combinable for the same rationale as set forth above with respect to claim 13.
Regarding Claim 19,
	Claim 19 is substantially similar to claim 13 and therefore is rejected on the same ground as claim 13.  Claim 19 is directed to a “computer program product” that corresponds to the computing device of claim 13. 
Brand et al. teaches a computer program product stored on a non-transitory computer-readable storage medium that includes instructions (paragraph 0148, “ … Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory program carrier for execution by, or to control the operation of data processing apparatus … “ ).
Regarding Claim 20,
	Claim 20 is substantially similar to claim 14 and therefore is rejected on the same ground as claim 14.  Claim 20 is directed to a “computer program product” that corresponds to the computing device of claim 14. 
Brand et al. teaches a computer program product stored on a non-transitory computer-readable storage medium that includes instructions (paragraph 0148, “ … Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory program carrier for execution by, or to control the operation of data processing apparatus … “ ).
Conclusion 
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kamran Afshar, can be reached on 571-272-7796.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/CHIAKA CHUKWUMA OKOROH/Examiner, Art Unit 2125
/MICHAEL J HUNTLEY/Primary Examiner, Art Unit 2116