DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
The present application was filed on September 26, 2019.
Claims 1-20 are presented for examination and are pending.

Information Disclosure Statement
The information disclosure statement(s) (IDS) was/were submitted on September 26, 2019. The submission(s) is/are in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement(s) are being considered by the examiner.

Drawings
The drawings filed on September 26, 2019 are accepted.

Claim Interpretation
“A computer program product comprising: one or more compute readable storage media and program instructions…” as recited in independent claim 9 and respective dependent claims 10-16 and 
“…one or more computer readable storage media…” as recited in independent claim 17 and respective dependent claims 18-20
are interpreted to be non-transitory, as mentioned by Paragraph [107] of the Specification below: 
“A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, 

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claim(s) 1, 2, 4, 5, 9, 10, 12, 13, 17, 18, and 20 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Salman et al. (“Machine Learning for Anomaly Detection and Categorization in Multi-Cloud Environments”).

Regarding Claim 1, 
Salman teaches: 
A computer-implemented method comprising: (Page 98, Section 2B: “To build the UNSW dataset, packets were generated using IXIA PerfectStorm tool for the realistic modern normal activities and the synthetic contemporary attack behaviors in the network traffic. Then, tcpdump files were collected and used to extract 49 features. Those features were extracted using Argus and Bro network monitoring tools. The collected data were divided into training and testing sets to be used for learning and prediction of attack behaviors, respectively. Statistics about the nine types of UNSW attacks along with normal data are summarized in Table I.” and Page 99, Section 2B: “In addition, we use a feature selection scheme to reduce the number of features while building the machine learning model. This has resulted in better performance in term of anomaly detection and prediction accuracy of anomalous traffic (presented in Section III-B). Also, we compare the RF technique against the linear regression (LR) technique to demonstrate a better performance by the RF.” teaches building machine learning models using data from the UNSW dataset (a dataset built using computer-implemented tools), which suggests a computer-based implementation)

obtaining, by one or more processors, a first set of training samples, one of the training samples including values of a plurality of performance indicators of a target system observed at a historical point in time; (Page 98, Section 1: “We have considered a new and publicly available dataset given by UNSW [10, 11]. We use supervised learning to build anomaly detection models and demonstrate their detection efficiency.” teaches obtaining the UNSW dataset (training samples) that is used to build the anomaly detection models; Page 99, Table 1 teaches that the UNSW dataset contains values of anomalies/intrusions within a computing system (anomalies/intrusions are associated with and indicate the performance of a computing system))

determining, by one or more processors, whether the first set of training samples are qualified for training a prediction model, the prediction model predicting an operational status of the target system at a target point in time based on values of the plurality of performance indicators observed at the target point in time; and (Page 99, Section 3A: “The original UNSW dataset included 49 features extracted from all the collected traces of the network traffic… We have used best-first feature selection technique [31] to reduce the number of features. In this technique, the user defines a criterion to finalize the optimal number of features. Such criterion could be: maximizing overall accuracy, minimizing the false negative or false positive rate, or minimizing the classification error for a particular attack type. In the first iteration, the algorithm chooses the best feature among all that achieves the selected criterion. This feature is saved to a set of optimal features which is initially empty. In the next iteration, each of the other features is used along with the optimal subset to build the learning models. The feature with which the model performs the best is added to the subset. The algorithm keeps appending the best feature to the subset in each iteration.” teaches performing feature selection to determine if the set of features from the UNSW dataset (training samples) is qualified to train the prediction model, based on a user-defined criterion; Page 100 Section 3B: “After selecting the optimal set of features, we used supervised machine learning techniques to build the anomaly detection models. We specifically used LR [32] and RF [33] due to their simple learning models and better performance with the UNSW dataset.” teaches that random forest and linear regression models are trained after performing feature selection; the models are anomaly detection models (which predict the operational state of a computing system))

in response to determining that the first set of training samples are qualified for training
the prediction model, training, by one or more processors, the prediction model based on the
first set of training samples. (Page 100 Section 3B: “After selecting the optimal set of features, we used supervised machine learning techniques to build the anomaly detection models. We specifically used LR [32] and RF [33] due to their simple learning models and better performance with the UNSW dataset.” teaches that random forest and linear regression models (prediction models) are trained after performing feature selection and determining an optimal (qualified) set of features)
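
For illustration only, the best-first feature selection quoted above from Salman, Section 3A, can be sketched as a greedy loop. This is a minimal sketch under stated assumptions, not the implementation used in the reference: the function name best_first_select, the feature names, and the scoring function toy_score are hypothetical, and the sketch stops as soon as no candidate improves on the current subset.

```python
from typing import Callable, List, Sequence


def best_first_select(
    features: Sequence[str],
    score: Callable[[List[str]], float],
) -> List[str]:
    """Greedy forward selection: keep adding the best feature until no gain."""
    selected: List[str] = []          # optimal subset, initially empty
    best_score = float("-inf")
    remaining = list(features)
    while remaining:
        # Score every remaining candidate together with the current subset.
        scored = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(scored, key=lambda pair: pair[0])
        if top_score <= best_score:
            break                     # no improvement, so stop
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected


# Hypothetical criterion: two features are useful, and every added
# feature costs a small penalty (a stand-in for a user-defined criterion).
def toy_score(subset: List[str]) -> float:
    useful = {"feat_b": 0.6, "feat_c": 0.3}
    return sum(useful.get(f, 0.0) for f in subset) - 0.05 * len(subset)


chosen = best_first_select(["feat_a", "feat_b", "feat_c", "feat_d"], toy_score)
```

In this toy run the loop selects feat_b, then feat_c, and stops when adding either remaining feature would lower the score.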

Regarding Claim 2, 
Salman teaches: 
The method of claim 1, 

Salman further teaches: 
wherein determining whether the first set of training samples are qualified for training the prediction model comprises: detecting, by one or more processors, a data characteristic associated with the operational status of the target system from the first set of training samples; and (Page 99, Section 3A: “We have used best-first feature selection technique [31] to reduce the number of features. In this technique, the user defines a criterion to finalize the optimal number of features. Such criterion could be: maximizing overall accuracy, minimizing the false negative or false positive rate, or minimizing the classification error for a particular attack type. In the first iteration, the algorithm chooses the best feature among all that achieves the selected criterion.” teaches using a user defined criterion (data characteristic) to determine if the selected features are optimal (qualified) for training the model, the data characteristic is associated with anomaly detection (operational status) because the selected features are used to train anomaly detection models)

in response to the data characteristic being detected from the first set of training
samples, determining, by one or more processors, that the first set of training samples are
qualified for training the prediction model. (Page 99, Section 3A: “We have used best-first feature selection technique [31] to reduce the number of features. In this technique, the user defines a criterion to finalize the optimal number of features. Such criterion could be: maximizing overall accuracy, minimizing the false negative or false positive rate, or minimizing the classification error for a particular attack type. In the first iteration, the algorithm chooses the best feature among all that achieves the selected criterion. This feature is saved to a set of optimal features which is initially empty. In the next iteration, each of the other features is used along with the optimal subset to build the learning models. The feature with which the model performs the best is added to the subset. The algorithm keeps appending the best feature to the subset in each iteration.” teaches determining if the selected set of features are optimal (qualified) for training the model based on a user defined criterion (data characteristic))

Regarding Claim 4, 
Salman teaches: 
The method of claim 1, 

Salman further teaches: 
wherein training the prediction model comprises: generating, by one or more processors, an objective function for training the prediction model; and (Page 100, Section 3B: “With the experimental results obtained, we observe that the anomaly detection error rate is as low as 1% with RF scheme and 11 features. With LR, the minimum error rate that could be obtained is 4.5% with 18 optimal features.” and Equation 3: 

[Image media_image1.png: Salman, Page 100, Equation 3]

teaches generating the error rate (objective function) for the trained RF and LR models)

determining, by one or more processors, at least one parameter of the prediction model
such that the objective function is minimized. (Page 99, Section 3A: “The algorithm keeps appending the best feature to the subset in each iteration. It stops when the next iteration does not produce an improved result compared to the current subset or the complete set. Thus, it is guaranteed that the selected subset gives the least error in prediction and hence may be considered as an optimal subset of features.” teaches that features are selected to minimize potential error; because the model is trained based on the selected features, the model is trained (parameters are determined) to minimize the overall error rate (objective function))

Regarding Claim 5, 
Salman teaches: 
The method of claim 4, 

Salman further teaches: 
wherein generating the objective function comprises: estimating, by one or more processors, a first ratio of training samples that are predicted to be indicative of an abnormal status of the target system to the first set of training samples; (Page 100, Equation 3: 

[Image media_image1.png: Salman, Page 100, Equation 3]

teaches determining the false positives (first ratio of training samples predicted to be indicative of an abnormal status to the set of training samples) of the model)

determining, by one or more processors, a second ratio of training samples that indicate
the abnormal status of the target system to the first set of training samples; and (Page 100, Equation 3: 

[Image media_image1.png: Salman, Page 100, Equation 3]

teaches determining the false negatives (second ratio of training samples that indicate the abnormal status to the set of training samples) of the model)


generating, by one or more processors, the objective function based on the first and second ratios. (Page 100, Equation 3: 

[Image media_image1.png: Salman, Page 100, Equation 3]

teaches that the overall error rate (objective function) is determined based on the false positives (first ratio) and false negatives (second ratio))
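
For illustration only, an overall error rate built from false positives and false negatives can be sketched as below. Equation 3 of Salman is cited only as an image and is not reproduced in this text, so the form used here, (false positives + false negatives) / number of samples, is an assumption, and the function name error_rate and the sample labels are hypothetical.

```python
from typing import Sequence


def error_rate(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Overall error as (false positives + false negatives) / samples.

    Labels: 1 = anomalous, 0 = normal. This form is an assumption and
    is not copied from the cited Equation 3.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # False positive: normal sample predicted as anomalous.
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    # False negative: anomalous sample predicted as normal.
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return (fp + fn) / len(y_true)


labels = [0, 0, 1, 1, 0, 1, 0, 0]   # hypothetical ground truth
preds = [0, 1, 1, 0, 0, 1, 0, 0]    # hypothetical model output
rate = error_rate(labels, preds)    # one FP and one FN out of eight
```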

Regarding Claim 9,
Claim 9 recites A computer program product… performing limitations that are similar to claim 1, thus is rejected with the same rationale applied against claim 1.

Regarding Claim 10,
Claim 10 recites The computer program product of claim 9… performing limitations that are similar to claim 2, thus is rejected with the same rationale applied against claim 2.

Regarding Claim 12,
Claim 12 recites The computer program product of claim 9… performing limitations that are similar to claim 4, thus is rejected with the same rationale applied against claim 4.

Regarding Claim 13,
Claim 13 recites The computer program product of claim 12… performing limitations that are similar to claim 5, thus is rejected with the same rationale applied against claim 5.

Regarding Claim 17,
Claim 17 recites A computer system… performing limitations that are similar to claim 1, thus is rejected with the same rationale applied against claim 1.

Regarding Claim 18,
Claim 18 recites The computer system of claim 17… performing limitations that are similar to claim 2, thus is rejected with the same rationale applied against claim 2.

Regarding Claim 20,
Claim 20 recites The computer system of claim 17… performing limitations that are similar to claim 4, thus is rejected with the same rationale applied against claim 4.



Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 6-8 and 14-16 is/are rejected under 35 U.S.C. 103 as being unpatentable over Salman in view of Hu et al. (“AdaBoost-Based Algorithm for Network Intrusion Detection”).

Regarding Claim 6, 
Salman teaches: 
The method of claim 1, 

Salman does not appear to explicitly teach: 
generating, by one or more processors, a plurality of model instances for the prediction model;

determining, by one or more processors, respective weights of the plurality of model instances; and

combining, by one or more processors, the plurality of model instances into an optimized prediction model based on the weights of the plurality of model instances.

However, Hu teaches: 
generating, by one or more processors, a plurality of model instances for the prediction model; (Page 579, Section 3B: “In the AdaBoost algorithm, weak classifiers are selected iteratively from a number of candidate weak classifiers and are combined linearly to form a strong classifier for classifying the network data. Let H = {hf} be the set of constructed weak classifiers.” teaches generating a set (plurality) of weak classifiers (model instances))

determining, by one or more processors, respective weights of the plurality of model instances; and (Page 580, Section 3B: 

[Image media_image2.png: Hu, Page 580, Section 3B]

teaches that a strong classifier is created by the summation of each weak classifier h multiplied by a weight alpha; Page 580, Section 3B: 

[Image media_image3.png: Hu, Page 580, Section 3B]

teaches determining alpha (weight applied to each classifier) based on the weighted classification errors)

combining, by one or more processors, the plurality of model instances into an optimized prediction model based on the weights of the plurality of model instances. (Page 579, Section 2: “A strong classifier is obtained by combining the weak classifiers. The strong classifier has higher classification accuracy than each weak classifier.” teaches combining the weak classifiers (plurality of model instances) into a strong classifier (optimized prediction model), Page 580, Section 3B: 

[Image media_image2.png: Hu, Page 580, Section 3B]

teaches that the strong classifier is created based on the summation of each weak classifier multiplied by the weight alpha)

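Salman and Hu are analogous art because they are directed to anomaly detection. 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to take the linear regression and random forest models of Salman and combine these models using Hu’s AdaBoost algorithm with a motivation to correct misclassifications made by weak classifiers and be less susceptible to overfitting (Hu, Page 578).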
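
For illustration only, the weighted combination of weak classifiers into a strong classifier described in the passages cited from Hu can be sketched as a sign of a weighted sum. The formula images from Hu are not reproduced in this text, so the sketch assumes the standard AdaBoost form; the name strong_classify, the threshold stumps, and the alpha values are hypothetical.

```python
from typing import Callable, Sequence

WeakClassifier = Callable[[float], int]  # returns +1 or -1


def strong_classify(
    x: float,
    weak: Sequence[WeakClassifier],
    alphas: Sequence[float],
) -> int:
    """Return the sign of the alpha-weighted sum of weak classifier votes."""
    total = sum(a * h(x) for h, a in zip(weak, alphas))
    return 1 if total >= 0 else -1


# Hypothetical weak classifiers: threshold stumps on one feature value.
stumps = [
    lambda x: 1 if x > 2.0 else -1,
    lambda x: 1 if x > 5.0 else -1,
    lambda x: 1 if x > 8.0 else -1,
]
weights = [0.2, 0.9, 0.4]  # hypothetical per-classifier weights (alpha)

result_low = strong_classify(3.0, stumps, weights)   # votes: +1, -1, -1
result_high = strong_classify(6.0, stumps, weights)  # votes: +1, +1, -1
```

The second stump carries the largest weight, so its vote decides both examples.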


Regarding Claim 7, 
The combination of Salman and Hu teaches: 
The method of claim 6, 

Hu further teaches: 
wherein the plurality of model instances comprise a first model instance, and wherein determining respective weights of the plurality of model instances comprises: determining, by one or more processors, respective weights of the first set of training samples; (Page 580, Section 3C: “The initial weights (wi(1) (i = 1, . . . , n)) reflect the importance degrees of the samples and influence the sum of the weighted errors for the strong classifier. Usually, the initial weights are chosen to be equal: wi(1) = 1/n (i = 1, . . . , n). In the AdaBoost theory, the uniform initial weights have a strong influence on the mean of the classification errors. This is not very suitable for intrusion detection because it is necessary to reduce the false-alarm rate rather than the mean error: In real applications, almost all behaviors are normal. A high false-alarm rate wastes resources, as each alarm must be checked. In the following, we propose adjustable initial weights to make a tradeoff between the false-alarm and detection rates… Correspondingly, the initial weights are defined as follows:”

[Image media_image4.png: Hu, Page 580, Section 3C]
teaches determining weights for the set of training samples)

determining, by one or more processors, a first set of prediction results based on the first set of training samples by using the first model instance; and (Page 580, Section 3B: 

[Image media_image5.png: Hu, Page 580, Section 3B]

teaches determining the weighted classification errors (first set of prediction results) for each weak classifier (model instance))

determining, by one or more processors, a first weight of the first model instance based on the weights of the first set of training samples and the first set of prediction results. (Page 580, Section 3B: 

[Image media_image2.png: Hu, Page 580, Section 3B]

teaches that alpha is a weight applied to each weak classifier in the set of weak classifiers; Page 580, Section 3B: 

[Image media_image3.png: Hu, Page 580, Section 3B]

teaches that alpha is calculated based on the sum of weighted classification errors (set of prediction results); Page 580, Section 3B: 

[Image media_image6.png: Hu, Page 580, Section 3B]



The combination of claim 6 has already incorporated the AdaBoost algorithm, therefore already incorporating the details of the weights of the models required by claim 7. 
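
For illustration only, the steps mapped to claim 7 (sample weights, prediction results of a first model instance, and a first model weight) can be sketched as below. The uniform initial weights w_i = 1/n follow the passage quoted from Hu, Section 3C; the alpha formula is the standard AdaBoost form and is an assumption, since Hu's formula images are not reproduced in this text. All function names and sample data are hypothetical.

```python
import math
from typing import List, Sequence


def uniform_weights(n: int) -> List[float]:
    """Equal initial sample weights, w_i = 1/n."""
    return [1.0 / n] * n


def weighted_error(
    weights: Sequence[float], y_true: Sequence[int], y_pred: Sequence[int]
) -> float:
    """Sum of the weights of the misclassified samples."""
    return sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)


def model_weight(eps: float) -> float:
    """Standard AdaBoost weight, 0.5 * ln((1 - eps) / eps), for 0 < eps < 1."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    return 0.5 * math.log((1.0 - eps) / eps)


y = [1, 1, -1, -1]                    # hypothetical labels
first_predictions = [1, -1, -1, -1]   # first model: one mistake out of four
w = uniform_weights(len(y))
eps1 = weighted_error(w, y, first_predictions)
alpha1 = model_weight(eps1)           # depends on sample weights and predictions
```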

Regarding Claim 8, 
The combination of Salman and Hu teaches: 
The method of claim 7, 

Hu further teaches: 
wherein the plurality of model instances further comprise a second model instance, and wherein determining respective weights of the plurality of model instances comprises: (Page 579, Section 3B: “In the AdaBoost algorithm, weak classifiers are selected iteratively from a number of candidate weak classifiers and are combined linearly to form a strong classifier for classifying the network data. Let H = {hf} be the set of constructed weak classifiers.” teaches a set (plurality) of weak classifiers (model instances))

updating, by one or more processors, the weights of the first set of training samples
based on the first weight of the first model instance; (Page 580, Section 3B: 

[Image media_image7.png: Hu, Page 580, Section 3B]

teaches updating the weights, w (weights of the first set of training samples) based on alpha (weight applied to the weak classifier (first weight of the first model instance)); Page 579, Section 3B: “Let
{w1, . . . , wi, . . . , wn} be the sample weights that reflect the importance degrees of the samples and, in statistical terms, represent an estimation of the sample distribution.” teaches that the weights are applied to the training samples)

determining, by one or more processors, a second set of prediction results based on the first set of training samples by using the second model instance; and (Page 580, Section 3B: 

[Image media_image5.png: Hu, Page 580, Section 3B]

teaches determining the weighted classification errors (set of prediction results) for each weak classifier (model instance))

determining, by one or more processors, a second weight of the second model instance based on the updated weights of the first set of training samples and the second set of prediction results. (Page 580, Section 3B: 

[Image media_image2.png: Hu, Page 580, Section 3B]

Page 580, Section 3B: 

[Image media_image8.png: Hu, Page 580, Section 3B]

teaches that alpha depends on both the updated weights applied to the training samples and the weighted classification errors (prediction results))

The combination of claim 6 has already incorporated the AdaBoost algorithm, therefore already incorporating the details of the second model instance required by claim 8. 
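
For illustration only, the steps mapped to claim 8 (updating the sample weights from the first model weight, then deriving a second model weight) can be sketched as below. The update rule shown, w_i * exp(-alpha * y_i * h(x_i)) followed by renormalization, is the standard AdaBoost form and is an assumption, since Hu's formula images are not reproduced in this text. The function name update_weights and the sample data are hypothetical.

```python
import math
from typing import List, Sequence


def update_weights(
    weights: Sequence[float],
    y_true: Sequence[int],
    y_pred: Sequence[int],
    alpha: float,
) -> List[float]:
    """Scale each weight by exp(-alpha * y * h(x)), then renormalize."""
    raw = [
        w * math.exp(-alpha * t * p)
        for w, t, p in zip(weights, y_true, y_pred)
    ]
    total = sum(raw)
    return [r / total for r in raw]


y = [1, 1, -1, -1]                 # hypothetical labels
w1 = [0.25, 0.25, 0.25, 0.25]      # uniform initial sample weights
h1 = [1, -1, -1, -1]               # first model misses the second sample
alpha1 = 0.5 * math.log(3.0)       # from a weighted error of 0.25

# Misclassified samples gain weight; correctly classified ones lose weight.
w2 = update_weights(w1, y, h1, alpha1)

h2 = [1, 1, 1, -1]                 # second model misses the third sample
eps2 = sum(w for w, t, p in zip(w2, y, h2) if t != p)
alpha2 = 0.5 * math.log((1.0 - eps2) / eps2)
```

After the update, the sample the first model missed carries half of the total weight, so a second model that classifies it correctly earns a larger weight than the first.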

Regarding Claim 14,
Claim 14 recites The computer program product of claim 9… performing limitations that are similar to claim 6, thus is rejected with the same rationale applied against claim 6.

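Regarding Claim 15,
Claim 15 recites The computer program product of claim 14… performing limitations that are similar to claim 7, thus is rejected with the same rationale applied against claim 7.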

Regarding Claim 16,
Claim 16 recites The computer program product of claim 15… performing limitations that are similar to claim 8, thus is rejected with the same rationale applied against claim 8.


Claims 3, 11, and 19 is/are rejected under 35 U.S.C. 103 as being unpatentable over Salman in view of Hu, further in view of Shin (“Graphs and ML: Multiple Linear Regression”).

Regarding Claim 3, 
Salman teaches: 
The method of claim 2, 

Salman further teaches: 
[the linear regression] reflecting the data characteristic (Page 99: “We conducted an experiment to demonstrate the enhancement in the performance of learning models with the reduced number of features. The selection criterion was the minimum testing error using both the LR and RF learning models.” teaches that the user defined criterion (data characteristic) is used to evaluate the linear regression model)

in response to [performing linear regression,] determining, by one or more processors, that the data characteristic is detected from the first set of training samples. (Page 100: “After selecting the optimal set of features, we used supervised machine learning techniques to build the anomaly detection models. We specifically used LR [32] and RF [33] due to their simple learning models and better performance with the UNSW dataset. Both algorithms have been widely adopted in machine learning, especially for the development of the IDS [7].” teaches performing linear regression; Page 99: “We conducted an experiment to demonstrate the enhancement in the performance of learning models with the reduced number of features. The selection criterion was the minimum testing error using both the LR and RF learning models.” teaches that the user defined criterion (data characteristic) is used to evaluate the linear regression model)

Salman does not appear to explicitly teach: 
wherein detecting the data characteristic from the first set of training samples comprises:
representing, by one or more processors, the first set of training samples as a set of points in a space having a number of dimensions, one of the plurality of performance indicators corresponding to one of the number of dimensions;
determining, by one or more processors, whether a geometric representation of a predetermined shape is available in the space for fitting the set of points [by performing linear regression]
[performing linear regression to determine that the geometric representation is available in the space]


However, Hu teaches: 
wherein detecting the data characteristic from the first set of training samples comprises:
representing, by one or more processors, the first set of training samples as a set of points in a space having a number of dimensions, one of the plurality of performance indicators corresponding to one of the number of dimensions; (Page 579, Section 3B: “Let the set of training sample data be {(x1, y1), . . . , (xi, yi), . . . , (xn, yn)}, where xi denotes the ith feature vector; yi ∈ {+1,−1} is the label of the ith feature vector, denoting whether the feature vector represents a normal behavior or not; and n is the size of the data set.” teaches that the training samples are points having 2 dimensions (x and y); y is the label denoting whether there is abnormal behavior (performance indicator) of a computing system)

Salman and Hu are analogous art because they are directed to anomaly detection. 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to take the linear regression and random forest models of Salman and combine these models using Hu’s AdaBoost algorithm with a motivation to correct misclassifications made by weak classifiers and be less susceptible to overfitting (Hu, Page 578). 


The combination of Salman and Hu does not appear to explicitly teach: 
determining, by one or more processors, whether a geometric representation of a predetermined shape is available in the space for fitting the set of points [by performing linear regression]
[performing linear regression to determine that the geometric representation is available in the space]

However, Shin teaches: 
determining, by one or more processors, whether a geometric representation of a predetermined shape is available in the space for fitting the set of points [by performing linear regression] (Pages 3-4: “If p = 2, these (x1, x2, y) data points lie in a 3-D coordinate system (with x, y, and z axes) and multiple linear regression finds the plane that best fits the data points.”

[Image media_image9.png: figure from Shin, Pages 3-4]

teaches performing linear regression to determine if a plane (geometric representation of a predetermined shape) fits within the data points (available in the space))

[performing linear regression to determine that the geometric representation is available in the space] (Pages 3-4: “If p = 2, these (x1, x2, y) data points lie in a 3-D coordinate system (with x, y, and z axes) and multiple linear regression finds the plane that best fits the data points.” teaches performing linear regression to determine if a plane (geometric representation of a predetermined shape) fits within the data points (available in the space))
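
For illustration only, fitting a plane to (x1, x2, y) points by multiple linear regression, as in the passage quoted from Shin, can be sketched with ordinary least squares solved through the normal equations. This is a minimal sketch under stated assumptions: the function names solve3 and fit_plane, the sample points, and the residual tolerance used to decide whether a plane "fits" are all hypothetical and do not come from any cited reference.

```python
from typing import List, Sequence, Tuple


def solve3(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a 3x3 linear system by Gaussian elimination with pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("points do not determine a unique plane")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):      # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_plane(
    points: Sequence[Tuple[float, float, float]]
) -> Tuple[List[float], float]:
    """Least-squares fit of y = b0 + b1*x1 + b2*x2.

    Returns the coefficients and the largest absolute residual.
    """
    rows = [(1.0, x1, x2) for x1, x2, _ in points]
    ys = [y for _, _, y in points]
    # Normal equations: (A^T A) beta = A^T y.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    coeffs = solve3(ata, aty)
    residual = max(
        abs(y - sum(c * v for c, v in zip(coeffs, r))) for r, y in zip(rows, ys)
    )
    return coeffs, residual


# Hypothetical points lying exactly on the plane y = 1 + 2*x1 - x2.
data = [(0.0, 0.0, 1.0), (1.0, 0.0, 3.0), (0.0, 1.0, 0.0), (2.0, 3.0, 2.0)]
coeffs, residual = fit_plane(data)
plane_fits = residual < 1e-6          # tolerance is an illustrative choice
```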

Salman, Hu, and Shin are analogous art because they are directed to machine learning model analysis. 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to use Shin’s method of performing linear regression by fitting a 

Regarding Claim 11,
Claim 11 recites The computer program product of claim 10… performing limitations that are similar to claim 3, thus is rejected with the same rationale applied against claim 3.

Regarding Claim 19,
Claim 19 recites The computer system of claim 18… performing limitations that are similar to claim 3, thus is rejected with the same rationale applied against claim 3.

Conclusion
The prior art made of record but not relied upon is considered pertinent to the applicant’s disclosure: 
Hu et al. (“Online Adaboost-Based Parameterized Methods for Dynamic Distributed Network Intrusion Detection”) teaches using AdaBoost (ensemble of weak classifiers) for anomaly detection in cloud-based systems. 
Chen et al. (“Combining Incremental Hidden Markov Model and Adaboost Algorithm for Anomaly Intrusion Detection”) teaches using an ensemble of AdaBoost and HMM for anomaly detection. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHOUN ABRAHAM whose telephone number is (571)272-8144. The examiner can normally be reached Mon - Fri 08:00-16:30.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kamran Afshar can be reached on (571) 272-7796. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/S.J.A./Examiner, Art Unit 2125
/BRIAN M SMITH/Primary Examiner, Art Unit 2122