DETAILED ACTION
Claims 1-20 are currently pending in application 16/394120, filed on 25 April 2019. All references cited in the IDS have been considered.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Objections
Each of Claims 4, 11, and 18 is objected to because of the following informalities:  
Each of claims 4, 11, and 18 recites “the loss residual being determine as …” which should read instead “the loss residual being determined as …”.
Appropriate correction is required.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.



Claims 1-20 are rejected under 35 U.S.C. 101 because the claims are directed to an abstract idea and because the claims as a whole, considering all claim elements both individually and in combination, do not amount to significantly more than the abstract idea. See Alice Corporation Pty. Ltd. v. CLS Bank International, et al., 573 U.S. (2014).
Step 1: Is the claim to a process, machine, manufacture, or composition of matter? Yes—claim 1 recites a method which is a process. Claims 8 and 15 recite a product and system, respectively.
Step 2A, prong one: Does claim 1 recite an abstract idea, law of nature or natural phenomenon? Yes: the limitations of “selecting a… model for application”, “defining an … architecture comprising a neural network”, “recording a loss value at each iteration to provide a plurality of loss values…”, “calculating a penalty score using at least a portion of the plurality of loss values, the penalty score being based on a loss span penalty PLS, a convergence penalty Pc, and a fluctuation penalty PF”, “comparing the penalty score P to a threshold penalty score to affect a comparison”, and “selectively employing the trained autoencoder for ...”, as drafted, are mental steps of selecting a model, defining a neural network, recording a loss value, calculating a penalty score from various penalties, comparing the penalty score to a threshold, and selectively employing the trained model based on the comparison. Under the broadest reasonable interpretation, the limitations cover processes that can be performed in the mind or with pencil and paper. In addition (and alternatively), the limitation “calculating a penalty score using at least a portion of the plurality of loss values, the penalty score being based on a loss span penalty PLS, a convergence penalty Pc, and a fluctuation penalty PF” and the training of the model are mathematical steps for computing a score and determining the trained model, and they fall within the mathematical concepts group. Therefore, claim 1 recites an abstract idea.
Step 2A, prong two: Does the claim recite additional elements that integrate the judicial exception into a practical application? No: the judicial exception is not integrated into a practical application. Although the claim recites “A computer-implemented method”, “machine-learning (ML)”, “autoencoder”, and “anomaly detection”, the computer is recited at a high level of generality such that it amounts to no more than mere instructions to apply the exception using a generic computer component. Further, the elements “machine-learning (ML)”, “autoencoder”, and “anomaly detection” are recited at a high level of generality that merely generally links the judicial exception to a particular, respective, technological environment, and they do not impose a meaningful limitation on the judicial exception.
Step 2B: Does the claim recite additional elements that amount to significantly more than the judicial exception? No: the only limitation on the performance of the described method is that it must be computer implemented, with other limitations reciting “machine-learning (ML)”, “autoencoder”, and “anomaly detection”. These elements are insufficient to transform a judicial exception into a patentable invention because the recited elements are considered insignificant extra-solution activity (generic computer system, processing resources, links the judicial exception to a particular, respective, technological environment). The claim thus recites computing components only at a high level of generality such that it amounts to no more than mere instructions to apply the exception using generic computer components; mere instructions to apply an exception using a generic computer component cannot provide an inventive concept.
Taken alone, the additional elements do not amount to significantly more than the above-identified judicial exception (the abstract idea). Looking at the limitations as an ordered combination adds nothing that is not already present when looking at the elements taken individually. There is no indication that the combination of elements improves the functioning of a computer or improves any other technology. Their collective functions merely provide conventional computer implementation.
For the reasons above, claim 1 is rejected as being directed to non-patentable subject matter under §101. This rejection applies equally to independent claims 8 and 15, which recite a computer product and a system, respectively. 
As to dependent claims 5, 6, 12, 13, 19, and 20, additional limitations are recited that fall under Step 2A, prong one as mental steps:
Claims 5, 12, 19: … “randomly divided…” (pen and paper)
Claims 6, 13, 20: “wherein defining …comprises providing a number of hidden layers of the neural network, and a size of each hidden layer” (pen and paper)
In addition, it is noted that claims 2-5, 9-12, and 16-19 recite additional limitations that fall under Step 2A, prong one as mathematical steps in the mathematical concepts group:
Claims 2, 9, 16: “wherein the loss span penalty PLS is calculated as a minimum of a smoothed loss divided by a maximum of the smoothed loss, the smoothed loss being determined based on the plurality of loss values”  
Claims 3, 10, 17: “wherein determining the convergence penalty Pc comprises: selecting an interval of iterations, over which loss values in the plurality of loss are each below a threshold loss; determining a number of iterations in the interval of iterations; and calculating the convergence penalty Pc as the quotient of the number of iterations and a total number of iterations in training of the autoencoder”.  
Claims 4, 11, 18: “wherein the fluctuation penalty PF is determined as a difference between a maximum of a loss residual and the minimum of the loss residual, the loss residual being determine as a difference between a smoothed loss and the plurality of loss values.”
Claims 5, 12, 19: … “randomly divided…”
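For illustration only, the three quantities recited in claims 2-4 can be sketched in Python as follows. The trailing moving-average smoothing, the window size, the reading of the “interval of iterations” as the longest contiguous run of below-threshold iterations, and all function names are assumptions made for this sketch; the claim language quoted above does not specify them, and the way the three penalties are combined into the penalty score P is not stated here and is therefore not shown.

```python
from typing import List, Sequence


def moving_average(values: Sequence[float], window: int = 3) -> List[float]:
    """Trailing moving average; one assumed way to obtain a 'smoothed loss'."""
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def loss_span_penalty(losses: Sequence[float], window: int = 3) -> float:
    """Claims 2, 9, 16: minimum of the smoothed loss divided by its maximum."""
    smoothed = moving_average(losses, window)
    return min(smoothed) / max(smoothed)  # undefined if the maximum is zero


def convergence_penalty(losses: Sequence[float], threshold: float) -> float:
    """Claims 3, 10, 17: iterations in the selected interval / total iterations.

    Assumption: the interval is the longest contiguous run of iterations
    whose loss values are each below the threshold loss.
    """
    longest = current = 0
    for loss in losses:
        current = current + 1 if loss < threshold else 0
        longest = max(longest, current)
    return longest / len(losses)


def fluctuation_penalty(losses: Sequence[float], window: int = 3) -> float:
    """Claims 4, 11, 18: max minus min of (smoothed loss - recorded loss)."""
    smoothed = moving_average(losses, window)
    residual = [s - l for s, l in zip(smoothed, losses)]
    return max(residual) - min(residual)
```

Each function consumes only the recorded loss values (plus a window or threshold), which is consistent with the characterization above of these limitations as computations on a list of numbers.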
In addition, claims 5, 7, 12, 14, and 19 recite additional elements to be addressed at Step 2A, Prong 2 and at Step 2B as follows: 
Each of claims 5, 12, and 19 recites the function “the data set is … divided into a training sub-set, and a validation sub-set”, which is used in the mathematical step of training the (autoencoder) model. As pointed out previously, the autoencoder is recited at a high level of generality, merely links the judicial exception to a particular technological environment, and does not impose a meaningful limitation on the judicial exception. Likewise, the processors that perform the function of training are recited at a high level of generality, amount to no more than mere instructions to apply the exception using a generic computer, and thereby do not impose a meaningful limit on the judicial exception. In addition, the claimed extra-solution data gathering (forming training and validation sub-sets) is acknowledged to be well-understood, routine, conventional (WURC) activity (see, e.g., the court-recognized WURC examples in MPEP 2106.05(d)(II)(i)).
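For illustration only, a random division of a data set into a training sub-set and a validation sub-set can be sketched in a few lines of Python. The function name, the validation fraction, and the seed are hypothetical choices for this sketch and are not taken from the claims.

```python
import random
from typing import Iterable, List, Tuple


def random_split(data: Iterable, validation_fraction: float = 0.2,
                 seed: int = 0) -> Tuple[List, List]:
    """Shuffle the records, then cut once into (training, validation)."""
    items = list(data)
    random.Random(seed).shuffle(items)  # seeded only to make the sketch repeatable
    n_val = int(len(items) * validation_fraction)
    return items[n_val:], items[:n_val]
```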
In addition, each of claims 7 and 14 recites “processing a data stream from one or more Internet-of-Things (IoT) devices that monitor an environment to selectively detect in anomalous condition within the environment”. Each of the elements “data stream”, “IoT devices”, “monitor an environment”, and “selectively detect in anomalous condition within the environment” is recited at a high level of generality that merely generally links the judicial exception to a particular technological environment (Step 2A, prong two). These elements therefore do not impose a meaningful limitation on the judicial exception and amount to no more than mere instructions to apply the exception using a generic computer (generic computer system, processing resources). In addition, the claimed extra-solution activity of monitoring an IoT device to detect anomalies is also well-understood, routine, conventional activity (see, for example, Meidan et al. (“N-BaIoT-Network-Based Detection of IoT Botnet Attacks Using Deep Autoencoders”, IEEE Pervasive Computing, 2018, pp. 12-22) [Abstract, pp. 14-15, Related Work, Table 1]).
In summary, as shown in the analysis above, claims 1-20 do not provide any additional elements that, when considered individually or as an ordered combination, amount to significantly more than the abstract idea identified. Therefore, as a whole, claims 1-20 do not recite what the courts have identified as “significantly more”. In particular, there is no indication that the combination of elements improves the functioning of a computer or improves another technology when the claims are considered individually or as an ordered combination.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1, 2, 4-9, 11-16, and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Meidan et al. (“N-BaIoT-Network-Based Detection of IoT Botnet Attacks Using Deep Autoencoders”, IEEE Pervasive Computing, 2018, pp. 12-22), hereinafter referred to as Meidan, in view of Dewancker et al. (“A Strategy for Ranking Optimization Methods using Multiple Criteria”, JMLR: Workshop and Conference Proceedings 64, 2016, pp. 11-20), hereinafter referred to as Dewancker, and in further view of Raschka et al. (“Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning”, http://arxiv.org/pdf/1811.12808v2.pdf, arXiv:1811.12808v2 [cs.LG], 3 December 2018, pp. 1-49), hereinafter referred to as Raschka.

In regards to claim 1, Meidan teaches A computer-implemented method for selecting a machine-learning (ML) model for application in anomaly detection, the method being executed by one or more processors and comprising: ([p. 13, Introduction] 
For detecting attacks launched from IoT bots we propose N-BaIoT, a network-based approach for the IoT that uses deep learning techniques to perform anomaly detection. Specifically, we extract statistical features that capture behavioral snapshots of benign IoT traffic, and train a deep autoencoder (one for each device) to learn the IoT device’s normal behaviors. The autoencoders attempt to compress snapshots. When an autoencoder fails to reconstruct a snapshot, it is a strong indication that the observed behavior is anomalous (the IoT device has been compromised and is exhibiting an unknown behavior)., wherein a computer-based machine learning model/autoencoder (Figure 1) performs anomaly detection.) defining an autoencoder architecture comprising a neural network; ([p. 19, Experimental Results and Discussion] For training and optimization, we used Keras. Each autoencoder had an input layer whose dimension is equal to the number of features in the dataset (115). As noted by Ignacio Arnaldo and his colleagues6 and by Li, Ma, and Jiao,7 autoencoders effectively perform dimensionality reduction internally, such that the code layer between the encoder(s) and decoder(s) efficiently compresses the input layer and reflects its essential characteristics. In our experiments, four hidden layers of encoders were set at decreasing sizes of 75 percent, 50 percent, 33 percent, and 25 percent of the input layer’s dimension. The next layers were decoders, with the same sizes as the encoders but with an increasing order (starting from 33 percent). 
Table 3 provides technical details about the training stage with a focus on the dataset properties, the optimized hyperparameters of the autoencoders, and the botnet infections., wherein an architecture of the autoencoder is specified (e.g., number of layers, nodes per layer).) during training of the autoencoder, recording a loss value at each iteration to provide a plurality of loss values, the autoencoder being trained using a data set that is associated with a domain, and a learning rate to provide a trained autoencoder; ([p. 16, Training an Anomaly Detector] For training and optimization, we use two separate datasets that only contain benign data, from which the model learns patterns of normal activity. The first dataset is the training set (DStrn) and is used for training the autoencoder, given input parameters such as the learning rate (η, the size of the gradient descent step) and the number of epochs (complete passes through the entire DStrn). …Once the network was trained, as described in Section 5.4, the learned parameter values were set as initialisation values of a multilayer perceptron and fine-tuned with backpropagation to minimise the mean-squared error between the inputs and outputs of the network, i.e., an autoencoder (AE). When the network was trained, its bottom half (i.e., a DBN) is used to extract feature sets, which are then taken as input to train a one-class SVM in the usual way. ..., wherein the autoencoder is trained with domain-specific training data (IoT packet snapshots with features shown in Table 2) and according to hyperparameter values (including the learning rate), during the course of which the error (reconstruction loss) is monitored (interpreted as occurring each iteration/epoch until training is stopped for any reason).) calculating a penalty score using at least a portion of the plurality of loss values, … ([p. 
16, Training an Anomaly Detector] We optimize each trained model’s parameters and hyperparameters such that when applied to unseen traffic the model maximizes the true positive rate (TPR, detecting attacks once they occur) and minimizes the false positive rate (FPR, wrongly marking benign data as malicious) …The second dataset is the optimization set (DSopt) and is used to optimize these two hyperparameters (η and epochs) iteratively until the mean square error (MSE) between a model’s input (original feature vector) and output (reconstructed feature vector) stops decreasing. Stopping at this point prevents overfitting DStrn, thus promoting better detection results with future data. DSopt is later used to optimize a threshold (tr) that discriminates between benign and malicious observations and, finally, the window size (ws), by which the FPR is minimized.… The hyperparameters of DBN-based networks, learning rate (for pretraining 0.001 − 0.01, for fine tuning 0.1 − 1), number of epochs (for pretraining 5 − 10, for fine tuning 10 − 30), number of hidden units (d n), are set based on the best performance on a validation set. The SVDD parameters, width ν (0 − 1), and σ (1 − ∞), are selected via a grid-search., wherein a penalty score is assigned to the trained autoencoder in the form, at least, of the false positive rate (based upon which the window size hyperparameter is clearly optimized but upon which the optimization of other hyperparameters is indicated through maximizing TPR while minimizing FPR) but wherein the identification/quantification of the point at which the MSE ceases to decrease is, in a more general sense, also a penalty score because it also is an evaluation metric for a given model used in model selection.)  comparing the penalty score P to a threshold penalty score to affect a comparison; and selectively employing the trained autoencoder for anomaly detection within the domain based on the comparison.  ([p. 
16, Training an Anomaly Detector, Table 3] We optimize each trained model’s parameters and hyperparameters such that when applied to unseen traffic the model maximizes the true positive rate (TPR, detecting attacks once they occur) and minimizes the false positive rate (FPR, wrongly marking benign data as malicious) …The second dataset is the optimization set (DSopt) and is used to optimize these two hyperparameters (η and epochs) iteratively until the mean square error (MSE) between a model’s input (original feature vector) and output (reconstructed feature vector) stops decreasing. ….The hyperparameters of DBN-based networks, learning rate (for pretraining 0.001 − 0.01, for fine tuning 0.1 − 1), number of epochs (for pretraining 5 − 10, for fine tuning 10 − 30), number of hidden units (d n), are set based on the best performance on a validation set. The SVDD parameters, width ν (0 − 1), and σ (1 − ∞), are selected via a grid-search., wherein a particular trained autoencoder is selected and applied to perform anomaly detection such that this selection is based on the set of hyperparameters found to provide the optimal performance according to the FPR (or TPR) metric such that the threshold in this selection is the model with the second best FPR (or TPR) metric found over the space of variation of the set of hyperparameters subject to optimization and wherein it is noted that the chosen autoencoder model is also selected based upon the IoT device characteristics – Table 3).)   
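For illustration only, the layer sizing described in the Meidan passage quoted above (p. 19: an input layer of 115 features, four encoder layers at 75, 50, 33, and 25 percent of the input dimension, and decoder layers mirroring the encoders starting from 33 percent) can be sketched as follows. The rounding rule, the function name, and the assumption that the output layer has the same width as the input layer are choices made for this sketch and are not taken from the reference.

```python
from typing import List, Sequence


def autoencoder_layer_sizes(
    n_features: int,
    fractions: Sequence[float] = (0.75, 0.50, 0.33, 0.25),
) -> List[int]:
    """Return layer widths: input, encoders, mirrored decoders, output."""
    # Rounding to the nearest integer is an assumption of this sketch.
    encoder = [max(1, round(n_features * f)) for f in fractions]
    # Decoders reuse the encoder widths in increasing order, skipping the
    # narrowest (25 percent) layer, i.e. starting from the 33 percent layer.
    decoder = encoder[-2::-1]
    # Assumed: the output layer reconstructs the input, so it has equal width.
    return [n_features] + encoder + decoder + [n_features]
```

Calling `autoencoder_layer_sizes(115)` yields nine widths that are symmetric about the narrowest layer, which is the sense in which an architecture is “defined” by a number of hidden layers and a size for each.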
However, Meidan does not explicitly teach the penalty score being based on a loss span penalty PLS, a convergence penalty Pc, and a fluctuation penalty PF; 
In other words, Meidan discloses selection criteria in the hyperparameter optimization process based on TPR or FPR rather than multi-objective criteria including convergence, fluctuation, or loss span penalties.
However, Dewancker, in the analogous environment of performing model selection through hyperparameter optimization, teaches calculating a penalty score using at least a portion of the plurality of loss values, the penalty score being based on a loss span penalty PLS, a convergence penalty Pc, …;  ([Abstract, p. 12, Section 2, p. 13, Section 2.1, p. 16, Section 2.2.1] An important component of a suitably automated machine learning process is the automation of the model selection which often contains some optimal selection of hyperparameters. The hyperparameter optimization process is often conducted with a black-box tool, but, because different tools may perform better in different circumstances, automating the machine learning workflow might involve choosing the appropriate optimization method for a given situation. This paper proposes a mechanism for comparing the performance of multiple optimization methods for multiple performance metrics across a range of optimization problems., First, we use pairwise Mann-Whitney U tests (discussed in the supplemental content) at a chosen α significance on the Best Found results to determine a partial ranking based only on that statistic, • Any tied results from that step are then subject to additional partial ranking using the same test on the Area Under Curve metric,, Many metrics exist for describing the quality of an optimizer, and each application values them differently. In the context of an AutoML problem, the goal of the optimizer is to find the optimal model design or hyperparameters, and the quality of an optimizer might be judged on the proximity of the solution to the optimal design or the speed with which that solution is found (thus facilitating more model experimentation/learning)….To measure the speed of improvement, we supplement the best found metric with the Area Under Curve metric, $\frac{1}{T}\sum_{i=1}^{T}(f_{best}[i] - f_{LB})$. A specified lower bound $f_{LB}$ on the function ensures the AUC is always positive. 
The name AUC reflects the physical interpretation of the metric as an approximate integral of the best seen traces., Our metrics are by no means the only criteria by which optimization algorithms can, or should, be judged. Knowledge of $x_{opt}$ also permits use of the gap metric, $\frac{f(x_1) - f_{best}[T]}{f(x_1) - f(x_{opt})}$ (Huang et al., 2006; Brochu et al., 2010) which is cleanly scaled between 0 and 1. Cumulative regret, $\sum_{i=1}^{T}(f(x_i) - f(x_{opt}))$, penalizes suggestions which do not improve $f_{best}$ (Srinivas et al., 2010; Bull, 2011) and thus may be valuable for an online automated machine learning setting.
wherein a set of optimization criteria are identified for potential use (in combination) for hyperparameter optimization in model selection such that these diverse criteria include a convergence penalty score in the form of the area under curve metric that characterizes the speed of improvement/convergence and the gap metric ($(f(x_1) - f_{best}[T]) / (f(x_1) - f(x_{opt}))$, which is a loss span penalty score since it is based on the span (variation) between the loss function value at the start of training and a loss function after time T of training) and wherein the set of individual penalty/evaluation metrics are combined in a hierarchical multi-criteria ranking process that effectively weights the contribution/importance of each of those metrics for ranking the models relative to one another (the ranking or any resultant statistic to form that ranking is an overall penalty score).)
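For illustration only, the three metrics named in the Dewancker passage quoted above (Area Under Curve, gap, and cumulative regret) can be sketched in Python. The sketch assumes a minimization problem, takes the best-seen trace to be the running minimum of the observed objective values, and uses hypothetical function names; none of this code is taken from the reference.

```python
from typing import List, Sequence


def best_seen_trace(values: Sequence[float]) -> List[float]:
    """Running minimum f_best[i] of the observed values (minimization assumed)."""
    trace, best = [], float("inf")
    for v in values:
        best = min(best, v)
        trace.append(best)
    return trace


def area_under_curve(values: Sequence[float], f_lb: float) -> float:
    """(1/T) * sum over i of (f_best[i] - f_LB)."""
    trace = best_seen_trace(values)
    return sum(b - f_lb for b in trace) / len(trace)


def gap(values: Sequence[float], f_opt: float) -> float:
    """(f(x_1) - f_best[T]) / (f(x_1) - f(x_opt)); undefined if f(x_1) == f(x_opt)."""
    trace = best_seen_trace(values)
    return (values[0] - trace[-1]) / (values[0] - f_opt)


def cumulative_regret(values: Sequence[float], f_opt: float) -> float:
    """Sum over i of (f(x_i) - f(x_opt))."""
    return sum(v - f_opt for v in values)
```

In this sketch the gap compares only the first observed value, the best value found by time T, and the optimum, whereas the AUC and the cumulative regret depend on every observation.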
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Meidan to incorporate the teachings of Dewancker to perform parameter optimization for an autoencoder based on calculating a penalty score using at least a portion of the plurality of loss values, the penalty score being based on a loss span penalty PLS and a convergence penalty Pc. The modification would have been obvious because one of ordinary skill would have been motivated to achieve improved learning model performance in which the model hyperparameters are automatically tuned according to multiple criteria over a hierarchy of optimization metrics based on measures related to the learning/loss curve associated with that model (Dewancker, [Abstract, pp. 16-17, Section 3, p. 17, Section 4]).
However, Meidan and Dewancker do not explicitly teach … a fluctuation penalty PF. Although Dewancker teaches the application of a plurality of evaluation metrics, including the cumulative regret metric, which expresses a difference between a maximum (best) performance loss (f(x_opt)) and a smaller (interpretable as minimum) sub-optimal performance loss, this metric does not provide an indication of a “fluctuation” or variance in the loss curve.
However, Raschka, in the analogous art of learning model selection according to evaluation metrics, teaches a fluctuation penalty PF. ([p. 6, Section 1.2, p. 10, Section 1.7, p. 34, Section 4.2] The variance is simply the statistical variance of the estimator $\hat{\beta}$ and its expected value $E[\hat{\beta}]$, for instance, the squared difference of the : $\text{Variance} = E\left[(\hat{\beta} - E[\hat{\beta}])^2\right]$ (9). The variance is a measure of the variability of a model’s predictions if we repeat the learning process multiple times with small fluctuations in the training set. The more sensitive the model-building process is towards these fluctuations, the higher the variance., Using the holdout method as described in Section 1.5, we computed a point estimate of the generalization performance of a model. Certainly, a confidence interval around this estimate would not only be more informative and desirable in certain applications, but our point estimate could be quite sensitive to the particular training/test split (for instance, suffering from high variance)…. 
As discussed earlier, we compute the prediction accuracy on a dataset S (here: test set) of size n as follows: <equation 10> … Since we are interested in the average number of successes, not its absolute value, we compute the variance of the accuracy estimate as <equation 15> … Under the normal approximation, we can then compute the confidence interval as <equation 17>., There are several different statistical hypothesis testing frameworks that are being used in practice to compare the performance of classification models, including conventional methods such as difference of two proportions (here, the proportions are the estimated generalization accuracies from a test set), for which we can construct 95% confidence intervals based on the concept of the Normal Approximation to the Binomial that was covered in Section 1., wherein a given model (with a respective set of hyperparameter values) is evaluated (for selection) using variance-based metrics such that the variance and corresponding confidence value (a fluctuation) associated with an accuracy (loss) measure is computed.)
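For illustration only, the kind of variance-based evaluation referred to in the Raschka passage quoted above can be sketched in Python. Equations 10, 15, and 17 of the reference are not reproduced in this action, so the sketch uses the standard normal approximation to the binomial, which the quoted passage names; the function names and the default z value of 1.96 (a 95% interval) are assumptions of the sketch.

```python
import math
from typing import Sequence, Tuple


def accuracy(y_true: Sequence, y_pred: Sequence) -> float:
    """Fraction of test examples whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def normal_approx_interval(acc: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Interval acc +/- z * sqrt(acc * (1 - acc) / n) for a test set of size n."""
    std_err = math.sqrt(acc * (1.0 - acc) / n)
    return acc - z * std_err, acc + z * std_err
```

The width of the returned interval shrinks as the test-set size n grows; it summarizes uncertainty in a single accuracy estimate.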
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Meidan and Dewancker to incorporate the teachings of Raschka for the penalty score to include a fluctuation penalty PF. The modification would have been obvious because one of ordinary skill would have been motivated to improve model selection by using training methods and hyperparameter optimization methods that are based on effective statistical evaluation metrics, including ones which characterize the bias-variance tradeoff (Raschka, [Abstract, pp. 16-17, Section 3, pp. 46-47, Section 4.15]).

In regards to claim 2, the rejection of claim 1 is incorporated and Meidan does not further teach wherein the loss span penalty PLS is calculated as a minimum of a smoothed loss divided by a maximum of the smoothed loss, the smoothed loss being determined based on the plurality of loss values. In other words, Meidan discloses selection criteria in the hyperparameter optimization process based on TPR or FPR rather than multi-objective criteria including convergence, fluctuation, or loss span penalties.
However, Dewancker, in the analogous environment of performing model selection through hyperparameter optimization, teaches wherein the loss span penalty PLS is calculated as a minimum of a smoothed loss divided by a maximum of the smoothed loss, the smoothed loss being determined based on the plurality of loss values ([p. 16, Section 2.2.1, Figure 1] Our metrics are by no means the only criteria by which optimization algorithms can, or should, be judged. Best Found measures proximity to the optimal function value $f(x_{opt})$, but not proximity to the optimal vector $x_{opt}$ which could be more insightful in some AutoML-pertinent circumstances. The metrics could account for the probabilistic nature of the problem; for example, the probability of $f_{best}[T]$ being more than 10% from the optimal value (Dolan and Moré, 2002). Knowledge of $x_{opt}$ also permits use of the gap metric, $\frac{f(x_1) - f_{best}[T]}{f(x_1) - f(x_{opt})}$ (Huang et al., 2006; Brochu et al., 2010) which is cleanly scaled between 0 and 1., wherein one of a set of optimization criteria includes the gap metric, which is a loss span penalty score since it is based on the span (variation) between the loss function value at the start of training and a loss function after time T of training, such that the loss function is a loss associated with the accuracy (as suggested in Figure 1) and such that this loss function is smoothed because it is a span relative to a best loss function over a time T (i.e., “smoothed” by considering a best result over the total time T) but also because the loss function metric is based on evaluations performed over independent tests (i.e., the stochastic interpretation of the results is interpreted as being a statistical analysis that forms a “smoothed” result).)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Meidan to incorporate the teachings of Dewancker for the loss span penalty PLS to be calculated as a minimum of a smoothed loss divided by a maximum of the smoothed loss, the smoothed loss being determined based on the plurality of loss values. The modification would have been obvious because one of ordinary skill would have been motivated to achieve improved learning model performance in which the model hyperparameters are automatically tuned according to multiple criteria over a hierarchy of optimization metrics based on measures related to the learning/loss curve associated with that model, and in which one metric quantifies a relative gap between losses (Dewancker, [Abstract, pp. 16-17, Section 3, p. 17, Section 4]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Meidan and Dewancker to incorporate the teachings of Raschka for the same reasons as pointed out for claim 1.

In regards to claim 4, the rejection of claim 1 is incorporated and Meidan and Dewancker do not further teach wherein the fluctuation penalty PF is determined as a difference between a maximum of a loss residual and the minimum of the loss residual, the loss residual being determine as a difference between a smoothed loss and the plurality of loss values. In other words, Meidan discloses selection criteria in the hyperparameter optimization process based on various convergence criteria including the determination of a minimum loss residual (the smallest MSE during training); however, Meidan does not disclose a fluctuation penalty criterion. Although Dewancker teaches the application of a plurality of evaluation metrics, including the cumulative regret metric, which expresses a difference between a maximum (best) performance loss (f(x_opt)) and a smaller (interpretable as minimum) sub-optimal performance loss, this metric does not provide an indication of a “fluctuation” or variance in the loss curve.
However, Raschka, in the analogous art of learning model selection according to evaluation metrics, teaches wherein the fluctuation penalty PF is determined as a difference between a maximum of a loss residual and the minimum of the loss residual, the loss residual being determine as a difference between a smoothed loss and the plurality of loss values. ([p. 6, Section 1.2, p. 10, Section 1.7, p. 34, Section 4.2] The variance is simply the statistical variance of the estimator $\hat{\beta}$ and its expected value $E[\hat{\beta}]$, for instance, the squared difference of the : $\text{Variance} = E\left[(\hat{\beta} - E[\hat{\beta}])^2\right]$ (9). The variance is a measure of the variability of a model’s predictions if we repeat the learning process multiple times with small fluctuations in the training set. The more sensitive the model-building process is towards these fluctuations, the higher the variance., Using the holdout method as described in Section 1.5, we computed a point estimate of the generalization performance of a model. Certainly, a confidence interval around this estimate would not only be more informative and desirable in certain applications, but our point estimate could be quite sensitive to the particular training/test split (for instance, suffering from high variance)…. 
As discussed earlier, we compute the prediction accuracy on a dataset S (here: test set) of size n as follows: <equation 10> … Since we are interested in the average number of successes, not its absolute value, we compute the variance of the accuracy estimate as <equation 15> … Under the normal approximation, we can then compute the confidence interval as <equation 17>., There are several different statistical hypothesis testing frameworks that are being used in practice to compare the performance of classification models, including conventional methods such as difference of two proportions (here, the proportions are the estimated generalization accuracies from a test set), for which we can construct 95% confidence intervals based on the concept of the Normal Approximation to the Binomial that was covered in Section 1., wherein a given model (with a respective set of hyperparameter values) is evaluated (for selection) using variance-based metrics such that the variance and corresponding confidence value associated with an accuracy (loss) measure is computed in general as a difference between a smoothed estimate (mean) and the individual samples from which the mean is computed (i.e., the variance corresponds to the loss residual since it is determined as the difference between a mean/smoothed accuracy/loss and a corresponding set of accuracy/loss values) but wherein that variance/standard deviation is used to form a confidence interval around the mean accuracy value (equation 17) such that the lower and upper limits of that confidence interval correspond to the difference between the maximum of the loss residual and the minimum of the loss residual (and are representative of the fluctuation in the accuracy/loss metric).)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Meidan and Dewancker to incorporate the teachings of Raschka for the fluctuation penalty PF to be determined as a difference between a maximum of a loss residual and the minimum of the loss residual, the loss residual being determined as a difference between a smoothed loss and the plurality of loss values. The modification would have been obvious because one of ordinary skill would have been motivated to improve model selection by using training methods and hyperparameter optimization methods that are based on effective statistical evaluation metrics, including ones which characterize the bias-variance tradeoff. (Raschka, [Abstract, pp. 16-17, Section 3, pp. 46-47, Section 4.15]).
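For illustration only, the recited fluctuation penalty can be sketched as follows. The claim language quoted above does not specify how the smoothed loss is obtained, so the trailing moving average, the function names, and the `window` parameter below are assumptions and are not drawn from the application or from any cited reference.

```python
def moving_average(values, window):
    """Trailing moving average (an assumed smoothing choice, not from the record)."""
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def fluctuation_penalty(losses, window=3):
    """Max minus min of the loss residual (smoothed loss minus recorded loss)."""
    smoothed = moving_average(losses, window)
    residual = [s - l for s, l in zip(smoothed, losses)]
    return max(residual) - min(residual)


# A perfectly flat loss curve has no fluctuation under this sketch.
flat_penalty = fluctuation_penalty([1.0, 1.0, 1.0])
# An oscillating curve yields a positive penalty.
noisy_penalty = fluctuation_penalty([4.0, 2.0, 3.0, 1.0], window=2)
```

Under these assumptions, the residual is computed per iteration, so the penalty grows with the spread of the recorded loss around its smoothed trend rather than with the overall level of the loss.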

In regards to claim 5, the rejection of claim 1 is incorporated and Meidan further teaches wherein, for training of the autoencoder, the data set is … divided into a training sub-set, and a validation sub-set.  ([p. 19, Section Experimental Results and Discussion] Each of the nine sets of benign data we collected in our lab, corresponding to the nine IoT devices, was divided chronologically into three equidimensional sets: DStrn for training the autoencoder, DSopt for parameter optimization, and the benign part of DStst for estimating the FPR…. Then we extracted the features from the malicious data and appended each benign part of DStst (previously mentioned) to the respective malicious part of DStst, to form a single test dataset per IoT device with both benign and malicious instances.,wherein an anomaly detection dataset is partitioned into a training subset (DStrn) and a testing subset (DStst) in which the division is done chronologically.)
However, Meidan and Dewancker do not explicitly teach …randomly…. The division in Meidan is based on time; Dewancker does not discuss test-train data set partition.
However, Raschka, in the analogous art of learning model selection according to evaluation metrics, teaches wherein, for training …, the data set is randomly divided into a training sub-set, and a validation sub-set. ([p. 8, Section 1.4, p. 8, Section 1.5, p. 15, Section 2.3, Figure 4] Thus, a recommended practice is to divide the dataset in a stratified fashion. Here, stratification simply means that we randomly split a dataset such that each class is correctly represented in the resulting subsets (the training and the test set) – in other words, stratification is an approach to maintain the original class proportion in resulting subsets., Step 1. First, we randomly divide our available data into two subsets: a training and a test set. … Typically, we assign 2/3 to the training set and 1/3 of the data to the test set., One way to obtain a more robust performance estimate that is less variant to how we split the data into training and test sets is to repeat the holdout method k times with different random seeds and compute the average performance over these k repetitions: <equation 21>, wherein a data set is randomly partitioned between a validation sub-set and a training sub-set such as through repeated holdout validation or stratification (used, for example, to generate the learning curves of Figure 4).)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Meidan and Dewancker to incorporate the teachings of Raschka for the data set used to train the autoencoder to be randomly divided into a training sub-set, and a validation sub-set. The modification would have been obvious because one of ordinary skill would have been motivated to improve model selection by using training methods, such as random holdout or stratification, and hyperparameter optimization methods that are based on/optimized using effective statistical evaluation metrics. (Raschka, [Abstract, p. 14, Section 2.2, pp. 16-17, Section 3, p. 22, Section 3.3, pp. 46-47, Section 4.15]).
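For illustration only, the distinction drawn above between a chronological division (as in Meidan) and a random division (as in Raschka) can be sketched as follows. The function names, the `train_fraction` parameter, and the fixed `seed` are hypothetical and do not reproduce code from either reference.

```python
import random


def chronological_split(records, train_fraction):
    """Time-ordered split: the earliest records go to the training sub-set."""
    cut = int(len(records) * train_fraction)
    return records[:cut], records[cut:]


def random_split(records, train_fraction, seed=0):
    """Random holdout split: shuffle a copy, then cut (seed is illustrative)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


records = list(range(10))  # stand-in for time-ordered observations
chrono_train, chrono_val = chronological_split(records, 0.5)
rand_train, rand_val = random_split(records, 0.5, seed=42)
```

Repeating `random_split` with different seeds and averaging the resulting performance estimates would correspond to the repeated holdout approach quoted from Raschka.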


In regards to claim 6, the rejection of claim 1 is incorporated and Meidan further teaches wherein defining the auto-encoder architecture at least partially comprises providing a number of hidden layers of the neural network, and a size of each hidden layer.  ([p. 19, Experimental Results and Discussion] In our experiments, four hidden layers of encoders were set at decreasing sizes of 75 percent, 50 percent, 33 percent, and 25 percent of the input layer’s dimension. The next layers were decoders, with the same sizes as the encoders but with an increasing order (starting from 33 percent). Table 3 provides technical details about the training stage with a focus on the dataset properties, the optimized hyperparameters of the autoencoders, and the botnet infections., wherein the architecture of the autoencoder is characterized by a number of layers and a size for each layer.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Meidan to incorporate the teachings of Dewancker and Raschka for the same reasons as pointed out for claim 1.
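For illustration only, the layer sizing quoted from Meidan (encoder layers at 75, 50, 33, and 25 percent of the input dimension, followed by decoder layers of the same sizes in increasing order starting from 33 percent) can be sketched as follows. The function name, the rounding rule, and the example input dimension of 100 are assumptions and are not stated in the reference.

```python
def autoencoder_layer_sizes(input_dim, fractions=(0.75, 0.50, 0.33, 0.25)):
    """Hidden-layer sizes: shrinking encoder, then mirrored decoder.

    Rounding to the nearest integer is an assumed convention.
    """
    encoder = [max(1, round(input_dim * f)) for f in fractions]
    # Decoder mirrors the encoder, skipping the bottleneck layer itself.
    decoder = encoder[-2::-1]
    return encoder + decoder


sizes = autoencoder_layer_sizes(100)  # hypothetical input dimension
```

Under these assumptions, both the number of hidden layers (`len(sizes)`) and the size of each hidden layer follow directly from the input dimension and the fraction list.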

In regards to claim 7, the rejection of claim 1 is incorporated and Meidan further teaches wherein employing the trained autoencoder for anomaly detection comprises processing a data stream from one or more Internet-of-Things (IoT) devices that monitor an environment to selectively detect an anomalous condition within the environment. ([p. 13, Introduction, p. 15, Proposed Detection Method] Efficiency. In the enterprise scenario, it is common to monitor the traffic data of all connected hosts, but the amount of monitored traffic is prohibitively large to store and use for training deep neural networks (DNNs). Our method uses incremental statistics to perform the feature extraction, and the training of the autoencoders can be performed in a semionline manner (train on a batch of observations and then discard)., Our proposed method for detecting IoT botnet attacks relies on deep autoencoders for each device, trained on statistical features extracted from benign traffic data. When applied to new (possibly infected) data of an IoT device, detected anomalies may indicate that the device is compromised. This method consists of four main stages: data collection, feature extraction, training an anomaly detector, and continuous monitoring., wherein the autoencoder-based anomaly detection system processes a stream of data collected from monitored IoT devices.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Meidan to incorporate the teachings of Dewancker and Raschka for the same reasons as pointed out for claim 1.

Claims 3, 10, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Meidan, in view of Dewancker, in view of Raschka, and in further view of Mazzoni et al. (“Active Learning in the Presence of Unlabelable Examples”, Jet Propulsion Laboratory, 2004, pp. 1-12), hereinafter referred to as Mazzoni.

In regards to claim 3, the rejection of claim 1 is incorporated and Meidan, Dewancker, and Raschka do not further teach wherein determining the convergence penalty Pc comprises: selecting an interval of iterations, over which loss values in the plurality of loss are each below a threshold loss; determining a number of iterations in the interval of iterations; and calculating the convergence penalty Pc as the quotient of the number of iterations and a total number of iterations in training of the autoencoder. In other words, although Meidan discloses selection criteria in the hyperparameter optimization process based on a convergence criterion for an autoencoder (MSE no longer decreasing, as previously pointed out), Meidan does not teach a convergence penalty score metric as recited. Dewancker also does not explicitly teach the form of the convergence metric recited in the claim: although Dewancker teaches the determination of a number of iterations associated with a pattern of convergence (abscissa in Figure 1) and the use of this number in the metric/score of convergence (AUC), Dewancker does not teach a quotient of that number and a total number of iterations. Raschka does not disclose a convergence speed evaluation metric.
However, Mazzoni, in the analogous art of evaluating different learning algorithms/models, teaches wherein determining the convergence penalty Pc comprises: selecting an interval of iterations, over which loss values in the plurality of loss are each below a threshold loss; determining a number of iterations in the interval of iterations; and calculating the convergence penalty Pc as the quotient of the number of iterations and a total number of iterations in training of the autoencoder. ([p. 7, Section 4.1, Figure 2] To compare the effectiveness of different active learning algorithms, we use three evaluation methods…. In some cases, however, we are more interested in how quickly the learner reaches a specific level of accuracy (p). We define p-trials(ALG) to be the number of trials required by ALG to reach accuracy p and p-speed to be the factor of improvement with respect to Random, where larger values of p-speed indicate faster convergence: <equation 2>.
The preceding algorithms are useful for comparing the relative efficiency of different active learners., wherein one of a set of criteria for evaluating different learning models (the performance of each of which is represented by a (loss/accuracy) learning curve such as in Figure 2) is the ratio of two numbers of iterations, rho_random/rho_active, shown in equation 2, in which the numerator is a maximum number of iterations (interpreted as being the total number of iterations) associated with a learning model's convergence to rho (i.e., “Random” forms a reference upper bound to the number of iterations required for convergence) and the denominator is the number of iterations required to reach a certain level of performance for a particular learning model paradigm (i.e., “active”), and wherein this ratio is being interpreted as inherently corresponding to a ratio of the number of iterations in which the particular learning model paradigm (in particular, all of the algorithms shown in Table 1 except for the maxmin algorithms) is better than the convergence threshold rho (i.e., the accuracy curves are monotonically improving) and the total number of iterations (this accuracy is interpreted as remaining over the convergence threshold for iterations after the iteration in which it reaches that threshold), as indicated by the following function inherently corresponding to equation 2: 1-(1/rhospeed(active)) = (rho(random)-rho(active))/rho(random); in other words, 1/rhospeed also characterizes a speed of convergence and is a number between 0 and 1 (with any active paradigm performing better than a random paradigm and with 1 meaning poor convergence speed) such that 1 minus 1/rhospeed also characterizes a speed of convergence between 0 and 1 (but now with 1 meaning good convergence speed), such that any of these modifications of equation 2 is inherently obvious without modifying the underlying function of that metric as a representation of convergence speed).)

It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Meidan, Dewancker, and Raschka to incorporate the teachings of Mazzoni for the convergence penalty Pc to be computed by selecting an interval of iterations, over which loss values in the plurality of loss are each below a threshold loss, determining a number of iterations in the interval of iterations, and calculating the convergence penalty Pc as the quotient of the number of iterations and a total number of iterations in training of the autoencoder. The modification would have been obvious because one of ordinary skill would have been motivated to achieve improved classification model performance through model design in which one of a set of model performance evaluation criteria includes a convergence speed determined by a ratio of iterations relative to a performance threshold, in particular for model designs in which at least some samples are unlabeled (Mazzoni, [Abstract, pp. 11-12, Section 5, Table 1, Figure 3]).
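For illustration only, the recited convergence quotient and the trials-based speed factor discussed with respect to Mazzoni can be sketched side by side. The function names are hypothetical, and treating the "interval of iterations" as the final contiguous run of below-threshold losses is an assumption, since the quoted claim language does not fix how the interval is selected.

```python
def convergence_quotient(losses, threshold):
    """Iterations in the trailing below-threshold run, divided by all iterations."""
    count = 0
    for loss in reversed(losses):
        if loss < threshold:
            count += 1
        else:
            break
    return count / len(losses)


def trials_to_reach(accuracies, target):
    """Number of trials needed to first reach the target accuracy, or None."""
    for trial, accuracy in enumerate(accuracies, start=1):
        if accuracy >= target:
            return trial
    return None


def speed_factor(trials_random, trials_alg):
    """Ratio of reference trials to algorithm trials (larger means faster)."""
    return trials_random / trials_alg


quotient = convergence_quotient([0.9, 0.5, 0.2, 0.1, 0.1], threshold=0.3)
speed = speed_factor(trials_random=10, trials_alg=4)
# The rearrangement relied on above: 1 - 1/speed = (random - alg) / random.
rearranged = 1 - 1 / speed
```

With the example values, an algorithm reaching the target in 4 of a reference 10 trials remains at or above the target for the remaining 6 trials if its curve is monotonic, which is the correspondence the rearranged expression is intended to show.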

Claim 10/8 is also rejected because it is just a CRM implementation of the same subject matter of claim 3/1 which can be found in Meidan, Dewancker, Raschka, and Mazzoni.

Claim 17/15 is also rejected because it is just a system implementation of the same subject matter of claim 3/1 which can be found in Meidan, Dewancker, Raschka, and Mazzoni.


Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Cui et al. (“MLtuner: System Support for Automatic Machine Learning Tuning”, CMU-PDL-16-108, Parallel Data Laboratory, Carnegie Mellon University, October, 2016, pp. 1-23) teach tuning of ML model hyperparameters using empirical metrics derived from learning curves such as convergence speed and stability.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to ROBERT LEWIS KULP whose telephone number is (571)272-7983. The examiner can normally be reached M, Th, F 8-5:30; Tu 8-3.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang, can be reached on 571-270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/ROBERT LEWIS KULP/Examiner, Art Unit 2124                                                                                                                                                                                                        
/MIRANDA M HUANG/Supervisory Patent Examiner, Art Unit 2124