DETAILED ACTION
This is the response to Applicant’s amendment regarding application number 15/843,949, filed December 15, 2017.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendments
The amendment filed July 1, 2022 has been entered. Examiner acknowledges receipt of Amendments to Application 15/843,949, which include: Amendments to the Claims, Amendments to the Drawings, and Remarks containing Applicant’s amendments. 
Regarding Applicant’s Remarks and Amendments to the Claims, Examiner acknowledges Applicant has amended Claims 1, 16-17, and 20, with Claims 2-4, 11, 13, and 15 previously cancelled. Examiner acknowledges Applicant has added new Claims 21-22. Claims 1, 5-10, 12, 14, and 16-22 remain pending in the application. 
Regarding Applicant’s Remarks and Amendments to the Claims, Examiner acknowledges Applicant’s corrections have resolved the identified claim objections in Claims 1, 17, and 20, and therefore the respective claim objections previously set forth in the Non-Final Office Action mailed April 8, 2022 are withdrawn. However, Examiner notes that Applicant has introduced a claim objection for new Claim 22, which is further detailed in the section indicated below.
Regarding Applicant’s Remarks and Amendments to the Drawings, Examiner acknowledges Applicant’s latest submission for Figure 2 has resolved the identified drawing objection, and therefore the respective drawing objection previously set forth in the Non-Final Office Action mailed April 8, 2022 is withdrawn. 

Response to Arguments
Examiner acknowledges receipt of Arguments to Application 15/843,949, which include: Remarks containing Applicant’s arguments. 
Regarding Applicant’s Remarks for Claims 1, 5, and 16-20 under 35 U.S.C. 103 as being unpatentable over Williams, Jr. et al., U.S. PGPUB 2015/0254555, published 9/10/2015 [hereafter referred to as Williams] in view of An et al., Variational Autoencoder based Anomaly Detection using Reconstruction Probability, December 27, 2015 [hereafter referred to as An], in further view of Casas et al., UNADA: Unsupervised Network Anomaly Detection Using Sub-space Outliers Ranking, In Networking 2011, Part I, LNCS 6640, 2011 IFIP International Federation for Information Processing [hereafter referred to as Casas]; for Claim 6 under 35 U.S.C. 103 as being unpatentable over Williams in view of An, in further view of Casas as applied to Claim 1; in even further view of Zhou et al., Distributed Anomaly Detection by Model Sharing, 2009 International Conference on Apperceiving Computing and Intelligence Analysis, IEEE 2009 [hereafter referred to as Zhou]; for Claim 7 under 35 U.S.C. 103 as being unpatentable over Williams in view of An, in further view of Casas, in even further view of Zhou as applied to Claim 6; in even further view of Tuor et al., Deep Learning for Unsupervised Insider Threat Detection in Structured Cybersecurity Data Streams, arXiv:1710.00811v1, October 2, 2017 [hereafter referred to as Tuor]; for Claims 8-10, 12, and 14 under 35 U.S.C. 103 as being unpatentable over Williams in view of An, in further view of Casas, in even further view of Zhou, in even further view of Tuor as applied to Claim 7; in even further view of Elovici et al., WO2018/037411, filed 8/23/2017 [hereafter referred to as Elovici], Examiner acknowledges Applicant’s arguments, has considered them, and has found them to be not persuasive.
Examiner notes that the Applicant has amended the claims and added new claims such that it necessitates further examination and re-evaluation of the amended, original and new claims, where those newly introduced limitations and new claims will be discussed in the relevant sections indicated below. The updated claim mappings according to the Applicant’s amended and new claims are provided in the relevant sections indicated below.
Regarding Applicant’s Remarks:
“… Applicant submits that the combination of Williams, An, Casas, Zhou, Tuor, and Elovici[[do]] does not disclose, teach, or fairly suggest the above emphasized claimed features. In rejecting claim 16, the Office Action indicates the following …
< prior art mapping of Claim 16 from Non-Final Office Action mailed April 8, 2022 > 
The DLNN model is not the same as the alleged Fast Learning Model in Williams.
Paragraph [0176] of Williams discusses the differences between the two models, namely the DLNN model and the Fast Learning Model. Therefore, Williams combined with the cited art does not disclose or render obvious "wherein the inference model is a trained, unsupervised machine learning model, implemented as an auto-encoder by a neural network ... wherein the cognitive algorithm is being trained until the trained model is used to replace the inference model" recited in claim 1. Moreover, Williams is intended to have one fast model and a different slow model.
Further, the Office Action recognized that Williams did not substitute the Fast Learning Model for the DLNN model. The Office Action attempts to read use of the combination function 520 in Williams as alleged substitution. However, any overly broad reading of Williams combined with the cited art does not disclose or render obvious "replacing the inference model, as currently used to classify the non-stationary data, with the trained model" recited in claim 1.”
Examiner has considered the above arguments, and has found them to be not persuasive.
Examiner points out that Applicant’s arguments are primarily directed to the following amended limitations that were previously present in dependent Claim 16 (and now integrated into independent Claim 1): “… while collecting the non-stationary data and classifying the collected non-stationary data: training a cognitive algorithm corresponding to said inference model, based on non-stationary data collected from the network, to obtain a trained model … ; replacing the inference model, as currently used to classify the non-stationary data, with the trained model …”, with the usage of the term “replacing” in place of the earlier term “substituting”, and where the inference model is defined and recited in the following limitation from independent Claim 1: “… wherein the inference model is a trained, unsupervised machine learning model, implemented as an auto-encoder by a neural network”. Examiner reminds Applicant that MPEP 2111 requires that during patent examination, the pending claims must be given their broadest reasonable interpretation consistent with the specification, and an Examiner must construe claim terms in the broadest reasonable manner during prosecution as is reasonably allowed in an effort to establish a clear record of what applicant intends to claim. Examiner addresses each of the above recited limitations in the following paragraphs.
Examiner points out that Applicant’s sub-argument that the Williams reference does not disclose “an inference model is a trained, unsupervised machine learning model, implemented as an auto-encoder by a neural network” is not persuasive. As indicated in the Non-Final Office Action mailed April 8, 2022, Williams teaches an auto-encoder defined by a Model Structure component, where the auto-encoder is indicated as a representation of a deep learning neural network (DLNN) model that is further trained using training data, and its resulting model representation is stored as a trained model (Williams [0097]: “Model Structure 516 may be a file or information otherwise provided that describes the structure of each model implemented in the system … may include configurations such as classifiers … auto-encoders, which reduce the dimensionality of data … Model Structure 516 may also include a specification of a combination of the machine learning models described above, together with additional machine learning models that consume the output of DLNN models. For example, configuring an auto-encoder to reduce the dimensionality of input data, followed by a k-Nearest-Neighbor model used to detect anomalies in the reduced dimensionality space … During the Training Process 512, the training data is processed through a training algorithm, and computes the biases, weights, and transfer functions which are stored in Model(s) 518.”). 
Williams further indicates that the stored auto-encoder models used for performing classification predictions are representations of DLNN models, and further indicates that the Training Corpus component providing the input to train the identified models may be based on unsupervised training techniques that require no labels or outputs, hence teaching that the resulting trained auto-encoder is a trained, unsupervised machine learning model (Williams [0092]-[0094]: “… the system may be initialized using Training Process 512, which may take as input Training Corpus 508 and calculates the biases, weights, and transfer functions of the active machine learning model or models 518 with a selected one or more training algorithms … the system may employ fully unsupervised training, in which the Training Corpus 508 contains no labels or output values …”; [0100]: “… once the Model(s) 508 have been stored, … Scoring Process 522, which applies Model(s) 518 to the input data and executes Combination Function 512 to select the correct predicted classification …”; and Figure 6, [0104]: “FIG. 6 shows an overview flow chart of process 600 for classifying data using machine learning … before classification begin the classifier model may be generated by training. … the classifier model may be generated based on a deep learning neural network (DLNN) as described above … the DLNN may be trained using training data appropriate for the current domain being modeled … models generated based on a DLNN may be referred to as deep learning models or deep learning neural network models (DLNN models).”). 
Furthermore, Examiner points out that Applicant’s own claim language serves as Applicant’s definition of the inference model it is trying to claim, as it defines the term “inference model” as a trained, unsupervised machine learning model, implemented as an auto-encoder by a neural network (“the inference model is a trained, unsupervised machine learning model, implemented as an auto-encoder by a neural network”), where the term “inference” merely indicates that the trained model is used to perform predictions, and the Williams reference teaches all elements of this claim language, as shown above. Hence, given the above evidence, the limitation “wherein the inference model is a trained, unsupervised machine learning model, implemented as an auto-encoder by a neural network” from independent Claim 1 is taught in the Williams reference and is within the scope of Applicant’s claimed invention, and as such, Applicant’s sub-argument is not persuasive, and the prior art rejection is maintained.
Regarding Applicant’s sub-argument that the Williams reference does not teach the amended limitation “… while collecting the non-stationary data and classifying the collected non-stationary data: … training a cognitive algorithm corresponding to said inference model, based on non-stationary data collected from the network, to obtain a trained model …”, Examiner also finds this sub-argument to be not persuasive. Under its broadest reasonable interpretation in light of Applicant’s specification paragraph [0061], the phrase “… training a cognitive algorithm corresponding to said inference model …” broadly indicates that the training of the learning algorithm conforms with or follows the training of the inference model, and hence this limitation broadly recites obtaining another trained model from the training of a cognitive algorithm based on non-stationary data collected from the network, where the training of the cognitive algorithm conforms with or follows the training of the inference model. As indicated in the Non-Final Office Action mailed April 8, 2022, Williams teaches the generation and training of a separate Fast Learning Model, where algorithms such as decision trees, random forests, or any specific algorithm based on the characteristics of the data and goals of the system can be used to generate and train the Fast Learning Model, and as such, these machine learning algorithms correspond to cognitive algorithms (Williams [0173]-[0174]: “… the DLNN model is combined with a machine learning model that can be trained quickly to recognize new sets of data, or a Fast Learning Model … A Fast learning Model is a machine learning model which may be less accurate than a DLNN but can be trained more quickly based on the characteristics of the algorithm, or because a subset of training data and recent feedback is presented for training. … Some examples of a Fast Learning Model include, but are not limited to decision trees, and random forests. 
… Those skilled in the art will appreciate that the specific algorithm utilized is chosen according to the characteristics of the data and goals of the system.”). Williams additionally teaches that the Fast Learning Model is based on a new or modified classifier model being retrained on the detected errors from the DLNN model and/or changed input data signals from the network, where both the Fast Learning model and the DLNN model are trained to perform the same type of classification predictions and may be different instances of the same classifier, such that the Fast Learning Model can also be employed to classify the same data as the DLNN model (thus representing a scenario where the training of the Fast Learning Model conforms with or follows the training of the DLNN model). Hence this process of using source network information to train/retrain a Fast Learning Model (using a learning algorithm), and using the trained Fast Learning model to classify the same data as the DLNN model represents a process of training a cognitive algorithm that follows or conforms with the training of the existing inference model to obtain another trained model (Williams [0082]: “… classifier applications … may be arranged to employ one or more trained models to classify the observed network information that occurs on the network …”; Figure 7, [0114]-[0116]: “… a user may determine the source data has changed such that the existing classifiers are not trained to recognize (classify) the provided data sufficiently … a user may desire a classifier specific, more precise classifier to classify the data associated with the new attack … a user may modify and/or tune one or more classifiers and/or create new classifiers based on the errors and/or signals of the model … a user may create a new classifier and associate the appropriate training data with the classifier … a FL model may be a machine learning component that is arranged to train faster than the DLNN model …”; [0118]: “… the FL model may be retrained based on the data and/or network information associated with the classification errors made by the deep learning model …”; and [0120]: “… the provided source data may be classified using the fast learning (FL) model …”). Hence, given the above evidence, this limitation under its broadest reasonable interpretation is taught in the Williams reference and is within the scope of Applicant’s claimed invention. Thus, Applicant’s sub-argument is not persuasive, and the prior art rejection is maintained.
Examiner further points out that Applicant’s sub-argument that the DLNN model and the Fast Learning Model are not the “same” in Williams (as Williams intended to have one fast model and a different slow model) is also not persuasive. Examiner points out that Williams also teaches that the Fast Learning Model and the DLNN model may be different instances of the same classifier (Williams [0122]: “… the FL and DLNN models may include different instances of the same classifier …”), where “different instances of the same classifier” indicates that the two models may be of the same type, such as two auto-encoders. Examiner also points out that the teaching in Williams of one “slower” trained model and one “faster” trained model only indicates that one model has a slower training time compared to the other model, and has no bearing on whether they represent the same type of model or not. In fact, Examiner points out that Applicant’s own specification paragraph [0061] provides the same consistent scope in the Williams reference, where Applicant identifies a slower trained model and a faster trained model, with these two models representing different instances ([0061]: “… one can separate the slower training from the faster score inference, thanks to different instances of the model that run in parallel …”). Hence, given the above evidence, Applicant’s sub-argument is not persuasive, and the prior art rejection is maintained.
Regarding Applicant’s amended limitation “… replacing the inference model, as currently used to classify the non-stationary data, with the trained model …”, where Applicant has amended the limitation to use the term “replacing” instead of “substituting”, Examiner finds that this amendment does not change the scope of the original limitation. Under its broadest reasonable interpretation, the verb “to substitute” as defined in the Merriam-Webster dictionary describes an action that puts or uses in the place of another, while the verb “to replace” as defined in the Merriam-Webster dictionary describes an action that takes the place of or puts something new in the place of. Both terms convey the same meaning of putting or using one element in place of another element, and hence the amended limitation broadly recites an action of putting or using a trained model (trained by a learning algorithm in the preceding limitation) in place of an inference model to perform classification. Examiner also points out that performing an action that puts or uses one model in the place of another does not require a physical replacement of a model to accomplish the goal of the replacement/substitution, where the goal is to utilize the classification result of one model in place of the classification result of another model. The replacement/substitution can be a logical replacement/substitution, where the classification result of one model is chosen over the classification result of another model. Hence, two models can examine the same input data and produce classification scores, where the model that produces the better classification score is selected as the representative classification result for the system, effectively accomplishing the same task as if one model were being physically replaced by another model.
In fact, Applicant’s own specification paragraphs [0061]-[0062] indicate a similar concept, where Applicant broadly recites two models (a faster model and a slower model) receiving the same network traffic input data and computing scores based on the input, and selecting a classification result based on the characteristics of the network traffic input or according to a periodic schedule ([0061]-[0062]: “As one may expect attacks to occur in a sudden manner, one may want anomaly scores to be computed for each incoming data point in near real-time … one can separate the slower training from the faster score inference, thanks to different instances of the model that run in parallel … the cognitive algorithm underlying the inference model may be retrained, while a previously trained model is used to classify S20 non-stationary data … the resulting (trained) model may be substituted S23-S24 to the current inference model, so as to keep on classifying S20 newly collected data based on the substituted model …”). Examiner notes that Applicant remains silent on a specific type of replacement mechanism involved (physical or logical), and hence, a logical replacement that analyzes resulting classification scores from each model can also be used to achieve the same goal of choosing the better classification result from one of the two models. As indicated in the Non-Final Office Action mailed April 8, 2022, Williams teaches a Combination Function component that analyzes the classification confidence scores generated by the DLNN Model and the Fast Learning Model based on the received source network information, and selects the class (and the resulting classification result) representing the highest probability of accuracy.
Hence, this Combination Function component that selects the classification with the highest probability of accuracy (based on confidence scores) is essentially performing a logical replacement of one model over another model based on the received network traffic (Williams [0175]: “Both DLNNs and Fast Learning Models operate as described in FIG. 5, and the multiple output classifications are handled by Combination Function 520 responsible for assigning the best class to the data. Combination Function 520 analyzes the scores predicted by both models in combination with the confidence and performance of each model, finally selecting the class representing the highest probability of accuracy.”; and [0176]: “… Subsequent runtime scoring of the Fast Learning Model may have a higher accuracy and confidence (compared to the DLNN) for data similar to the type that have been submitted through the Fast Learning Model training process. … The Combination Function 520 chooses as output whichever class or classes represent the higher accuracy and confidence.”). 
As an extension of the above logical replacement concept, Williams additionally teaches the scenario where the Fast Learning Model produces classifications with a higher confidence score than the DLNN model, resulting in the classification result produced by the Fast Learning Model being employed instead of the classification result produced from the DLNN model, where this process of employing the classification result produced by the Fast Learning Model (instead of those produced from the DLNN model) corresponds to using the trained model (trained by the learning algorithm) in place of the inference model to perform classification (Williams [0122]-[0123]: “… even though the FL model may be less precise than the DLNN model, since it has been trained on the training data corresponding to the tuning in block 704, it may produce matches/classifications that have a higher confidence score than the DLNN model … At block 712 … the FL model classification result may be employed … since the FL model produced a classification results that has a higher values confidence level that the classification result produced by the DLNN model, the data may be classified based on the FL model rather than the DLNN model.”). As a side note, Williams additionally teaches the opposite scenario, in which the classification result from the DLNN model is employed when its corresponding confidence level is higher than the one associated with the classification result from a Fast Learning Model (Williams [0124]). Hence, given the above evidence, the recited amended limitation under its broadest reasonable interpretation is still taught in the Williams reference and is within the scope of Applicant’s claimed invention. Thus, Applicant’s sub-argument is not persuasive, and the prior art rejection is maintained.
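For illustration only, the confidence-based selection discussed above can be sketched as follows. This is a minimal sketch and is not taken from Williams or from Applicant’s specification: the names `Prediction` and `combine` are hypothetical, and the sketch assumes selection on a single confidence value, whereas Williams [0175] describes the Combination Function as also weighing the performance of each model.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    model_name: str    # hypothetical identifier for the model that produced the result
    label: str         # predicted class
    confidence: float  # confidence score reported with the prediction


def combine(dlnn: Prediction, fast: Prediction) -> Prediction:
    """Return the prediction with the higher confidence; ties go to the DLNN result."""
    return fast if fast.confidence > dlnn.confidence else dlnn


# Both models score the same input; the higher-confidence result is the one employed.
dlnn_pred = Prediction("dlnn", "benign", 0.62)
fl_pred = Prediction("fast_learning", "attack", 0.91)
chosen = combine(dlnn_pred, fl_pred)
```

In this sketch neither model object is removed from the system; only the classification result that is employed changes from one input to the next.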
Regarding Applicant’s Remarks:
“… Therefore, Williams combined with the cited art does not disclose or render obvious "… wherein the cognitive algorithm is being trained until the trained model is used to replace the inference model …" recited in claim 1. …”
Examiner has considered the above argument, and has found it to be not persuasive. Examiner points out that Applicant’s above argument is directed to the newly introduced limitation (“… wherein the cognitive algorithm is being trained until the trained model is used to replace the inference model …”). Examiner finds that this limitation does not change the scope of, or further limit, the existing limitations, since this newly introduced limitation broadly indicates that the trained model (based on the cognitive algorithm) stops or completes training when it is being used in the system to perform classification. A person having ordinary skill in the art would understand that the training of any model stops once a model is stored, and is further used for performing classifications in a system based on real-time input data. It would also be obvious to a person having ordinary skill in the art that the concept of re-training a previously trained model would also indicate that the training for the previously trained model was stopped or completed for an earlier set of data, with the re-training being focused on training the previously trained model on a new set of data. These actions of storing the trained model and re-training a model also broadly indicate that the previously trained model has completed training.
While the Williams reference does not explicitly mention stopping model training, the Williams reference does teach actions such as storing the trained models and re-training models based on new input data, which are actions that can only be performed once the previous training process has been completed, thereby also teaching the stopping or completion of these training processes (Williams [0082]: “… classifier applications … may be arranged to employ one or more trained models to classify the observed network information that occurs on the network … the network information buffered in sensor computers … may be employed as training data and/or test data for re-training the one or more classification models using a machine learning application …”; Figure 5 and [0097]-[0100]: “… During the Training Process 512, the training data is processed through a training algorithm and computes the biases, weights, and transfer functions which are stored in Model(s) 518) … once the Model(s) 518 have been stored, a test of the system’s performance will execute prior to any runtime scoring …”; and [0127]: “… If the training of the machine learning model is complete, the model may be ready to be used for data classification.”). In particular, Williams teaches that the Fast Learning Model (based on a machine learning algorithm) can be incrementally trained or re-trained on augmented data used to train the DLNN, following the same training and model storing process as indicated in FIG.5 (Williams [0174]-[0175]: “A Fast Learning Model is a machine learning model … can be trained more quickly based on the characteristics of the algorithm or because a subset of training data and recent feedback is presented for training. The Fast Learning Model is either incrementally trained based on new data … or retrained entirely on the new data … possibly augmented with a subset of the data used to train the DLNN. … Both DLNNs and Fast Learning Models operate as described in FIG. 5 …”). 
As established earlier in the preceding arguments, the classification scores predicted by both the Fast Learning Model and DLNN model are analyzed, with the model having the better classification score (and resulting highest probability of accuracy) being selected as the classification result to be used in the system going forward, thus also teaching the intended use of the trained model (in which the classification generated by the Fast Learning model is selected in place of the classification generated by the DLNN model). Hence, given the above evidence, the recited amended limitation under its broadest reasonable interpretation is still taught in the Williams reference and is within the scope of Applicant’s claimed invention. Thus, Applicant’s argument is not persuasive, and the prior art rejection is maintained.
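For illustration only, the sequence of classifying with a current model, training a new model on newly collected data, and then using the new model in place of the current one can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Williams or from Applicant’s specification: `MeanThresholdModel` and `Detector` are hypothetical names, the toy model stands in for any trained model, and training runs sequentially here rather than in a parallel instance.

```python
class MeanThresholdModel:
    """Toy stand-in for a trained model: flags values far from the training mean."""

    def __init__(self, samples, tolerance=3.0):
        self.mean = sum(samples) / len(samples)
        self.tolerance = tolerance

    def is_anomalous(self, value):
        return abs(value - self.mean) > self.tolerance


class Detector:
    """Classifies with the current model while buffering data for later retraining."""

    def __init__(self, model):
        self.model = model
        self.buffer = []

    def classify(self, value):
        self.buffer.append(value)  # collected data doubles as future training data
        return self.model.is_anomalous(value)

    def retrain_and_replace(self):
        candidate = MeanThresholdModel(self.buffer)  # training completes here
        self.model = candidate                       # trained model replaces the current one
        self.buffer = []


detector = Detector(MeanThresholdModel([10.0, 11.0, 9.0]))
before = detector.classify(20.0)  # scored by the original model
for value in (19.0, 21.0):
    detector.classify(value)
detector.retrain_and_replace()
after = detector.classify(20.0)   # scored by the replacement model
```

In this sketch the candidate model’s training is complete at the moment it is assigned as the current model, and the earlier model is no longer consulted afterwards.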

Claim Interpretation
Applicant has provided the following definitions in the specification, which will be used as part of the examination:
Non-Markovian, stateful classification: 
According to paragraphs [0046] and [0047] of the specification: “… non-Markovian processes involved herein keep track of prior states of the non-stationary data collected. Moreover, the stateful (also called memoryful) processes involved herein track information about the sender and/or the receiver of the non-stationary data collected 510. This can be achieved by forming data points (e.g., in the form of vectors of n features each), where data points are formed by aggregating data related to data flows from respective sources and for given time periods.” Hence the term “non-Markovian, stateful classification” will be interpreted as any real-time data that has state information and is aggregated from respective sources and for given time periods.
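For illustration only, forming data points by aggregating flow data per source and per time period, as described in the quoted passage, can be sketched as follows. This is a minimal sketch and is not taken from the specification: the function name `aggregate`, the record layout (timestamp, source, byte count), the 60-second window, and the two features (packet count and total bytes) are all assumptions.

```python
from collections import defaultdict


def aggregate(flows, window_seconds=60):
    """Group flow records by (source, time window) into small feature vectors.

    flows: iterable of (timestamp_seconds, source, n_bytes) tuples.
    Returns a dict mapping (source, window_index) to [packet_count, total_bytes].
    """
    points = defaultdict(lambda: [0, 0])
    for timestamp, source, n_bytes in flows:
        key = (source, int(timestamp // window_seconds))
        points[key][0] += 1        # packets seen from this source in this window
        points[key][1] += n_bytes  # bytes seen from this source in this window
    return dict(points)


flows = [
    (0.5, "10.0.0.1", 100),
    (30.0, "10.0.0.1", 250),
    (61.0, "10.0.0.1", 40),
    (10.0, "10.0.0.2", 500),
]
points = aggregate(flows)
```

Each resulting entry is one data point tied to a sender and a time period, so successive windows for the same source retain information about that source’s prior state.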
Unsupervised model: 
According to paragraph [0008], an inference model is “a trained, unsupervised machine learning model … This model can be implemented as an auto-encoder by a neural network. … Still, the unsupervised model may be a multi-layer perceptron model, yet implemented in a form of an auto-encoder by the neural network.” Hence the term “unsupervised model” will be interpreted as “an inference model, implemented as an auto-encoder by a neural network”.
Supervised model: 
According to paragraph [0016], a supervised model “is configured as a nearest-neighbor classifier”. Hence the term “supervised model” will be interpreted as “a nearest-neighbor classifier”.
Cognitive algorithm: 
According to paragraph [0063], the terms “cognitive algorithm”, “cognitive model”, “machine learning model”, or the like are used interchangeably. Hence the term “cognitive algorithm” will be interpreted as an algorithm or machine learning model where applicable.

Claim Objections
Claim 22 is objected to because of the following informality: A typographical error (missing word) in the newly-added claim limitation: “… wherein replacing the inference model … with the trained model occurs based on a data traffic of the non-stationary data in the network”. Appropriate correction is required.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1, 5, 16-20, and 22 are rejected under 35 U.S.C. 103 as being unpatentable over Williams, Jr. et al., U.S. PGPUB 2015/0254555, published 9/10/2015 [hereafter referred to as Williams] in view of An et al., Variational Autoencoder based Anomaly Detection using Reconstruction Probability, published December 27, 2015 [hereafter referred to as An], in further view of Casas et al., UNADA: Unsupervised Network Anomaly Detection Using Sub-space Outliers Ranking, 2011 IFIP International Federation for Information Processing [hereafter referred to as Casas].
Regarding amended Claim 1, 
Williams teaches
(Currently Amended) A computer-implemented method for detecting anomalies in non-stationary data in a network of computing entities, the method comprising: 
collecting non-stationary data in the network, wherein the non-stationary data comprises network packets (Examiner’s note: Williams teaches classifying and detecting anomalies in data using a plurality of network computers. Referring to the logical flow diagram in Williams Figure 5, Williams teaches at step 502 data is collected for classification and incrementally refined for further analysis, where the collected data includes captured/buffered/real-time network information. The capturing/buffering of this real-time network information represents a process for collecting network packets representing non-stationary data in a network (Williams [0026]: “…the data provided for classification may be real-time network information, captured/buffered network information, or the like. Also, in at least one of the various embodiments, a sensor computer may be employed to monitor and buffer some or all of the data, such as, network information in real-time.”; Figure 5 and [0083]-[0084]: “FIGS. 5-9 represent the generalized operation for classifying data using machine learning that may be incrementally refined based on expert input in accordance with at least one of the various embodiments. … processes 500, 600, 700, 800, and 900 described in conjunction with FIGS. 5-9 may be implemented by and/or executed on a single network computer, … and/or executed on a plurality of network computers … and/or executed on one or more virtualized computers, such as, those in a cloud-based environment … FIG. 5 illustrates a logical diagram of process 500 that may be arranged to classify data using machine learning that may be incrementally refined based on expert input … In step 502, data may be collected for submission to the system. The term data is used broadly to describe information requiring analysis.”; and [0126]: “… network information may include … network packet dumps, system performance metrics (e.g., CPU utilization, network connections, … or the like), wire line traces, …”).) …
… while collecting the non-stationary data: 
classifying the collected, non-stationary data according to a non-Markovian, stateful classification (Examiner’s note: Williams teaches that monitored network traffic is captured over defined time intervals and that the network traffic is associated with one or more particular users and/or user groups. This association with one or more particular users and/or user groups represents a form of identifying and capturing the data according to flows based on certain network state information provided in network packet headers, such as source/destination IP addresses and corresponding source/destination ports. Collectively, these teachings correspond to a non-Markovian, stateful classification of data (Williams [0129]-[0130]: “…a sensor computer may be configured to buffer a particular amount/type of network information depending on the type of information the machine learning model may be used to classify. … one sensor computer may be arranged to buffer web server traffic information, while another sensor computer may be employed to monitor network traffic that may be associated with one or more particular users and/or user groups … the sensor computers may group the captured network information into time buckets, such that each window include the network information that was captured over a defined time interval. The duration of the time interval may be defined using configuration. For example, in at least one of the various embodiments, a time interval may be defined to be, 1 second, 10 seconds, 1 minute, 1 hour, 4 hours, 1 day, 1 week, and so on.”).), …
… based on an inference model, wherein the inference model is a trained, unsupervised machine learning model, implemented as an auto-encoder by a neural network (Examiner’s note: Examiner points out that Applicant’s own claim language serves as a definition for the inference model, which is a “trained, unsupervised machine learning model, implemented as an auto-encoder by a neural network”, where the term “inference” merely indicates that the trained model is used to perform predictions. Referring to Williams Figure 5, Williams teaches the network computer/sensor computer system includes a training process using a model structure that defines the machine learning model used for classification and analysis in the system, which can include unsupervised and supervised models as well as including a combination of models. Williams further teaches an auto-encoder defined by a Model Structure component, where the auto-encoder is indicated as a representation of a deep learning neural network (DLNN) model that is further trained using training data, and its resulting model representation is stored as a trained model (Williams Figure 5, elements 512, 516; and [0097]: “Model Structure 516 may be a file or information otherwise provided that describes the structure of each model implemented in the system … may include configurations such as classifiers … auto-encoders, which reduce the dimensionality of data … Model Structure 516 may also include a specification of a combination of the machine learning models described above, together with additional machine learning models that consume the output of DLNN models. 
For example, configuring an auto-encoder to reduce the dimensionality of input data, followed by a k-Nearest-Neighbor model used to detect anomalies in the reduced dimensionality space … During the Training Process 512, the training data is processed through a training algorithm, and computes the biases, weights, and transfer functions which are stored in Model(s) 518.”). Williams further indicates that the stored auto-encoder models used for performing classification predictions are representations of DLNN models, and further indicates that the Training Corpus component providing the input to train the identified models may be based on unsupervised training techniques that require no labels or outputs, hence teaching that the resulting trained auto-encoder is a trained, unsupervised machine learning model (Williams [0092]-[0094]: “… the system may be initialized using Training Process 512, which may take as input Training Corpus 508 and calculates the biases, weights, and transfer functions of the active machine learning model or models 518 with a selected one or more training algorithms … the system may employ fully unsupervised training, in which the Training Corpus 508 contains no labels or output values …”; [0100]: “… once the Model(s) 508 have been stored … Scoring Process 522, which applies Model(s) 518 to the input data and executes Combination Function 512 to select the correct predicted classification …”; and Figure 6, [0104]: “FIG. 6 shows an overview flow chart of process 600 for classifying data using machine learning … before classification begin the classifier model may be generated by training. … the classifier model may be generated based on a deep learning neural network (DLNN) as described above … the DLNN may be trained using training data appropriate for the current domain being modeled … models generated based on a DLNN may be referred to as deep learning models or deep learning neural network models (DLNN models).”).), and 
wherein classifying the collected, non-stationary data comprises:
forming data points from the collected, non-stationary data (Examiner’s note: Williams teaches capturing the monitored network traffic (Williams [0129]-[0130]) over a period of time intervals, where “associating the traffic with one or more particular users and/or user groups” represents (under the broadest reasonable interpretation) identifying and capturing the data according to flows based on certain network state information provided in network packet headers such as source/destination IP addresses and corresponding source/destination ports, all of which collectively represents the definition of a non-Markovian, stateful classification of data. Williams further teaches performing feature extraction by concatenating related data such as the captured network packets from multiple data sources correlated by time intervals (Williams [0026], [0130], [0145]), where the grouping/ordering of data and the feature extraction performed during data ingestion step 504 according to the data characteristics are interpreted as steps for “forming data points from the collected, non-stationary data” (Williams Figure 5, element 504; and [0088]-[0089]: “In step 504, data may be ingested into the system and prepared for processing. Data preparation may include a number of processes that may be required to ensure the system can interpret and handle data from various sources. The configuration of a data ingestion process depends upon the system needs and data characteristics. In at least one of the various embodiments, it may include high-level feature extraction where the output of the process is a collection of numeric values that represent all of the data upon which the system performs a classification decision.”).); and 
for each data point of the formed data points: 
… feeding the auto-encoder with said each data point for the auto-encoder (Examiner’s note: Referring to Williams Figure 5, Williams teaches the network computer/sensor computer system includes a training process using a model structure that defines the machine learning model used for classification and analysis, where a combination of models is specified (i.e., auto-encoder and k-nearest-neighbor model) as the models used for training, and where the training process receives training data (separated from test data) from the data ingestion step 504, corresponding to “feeding the auto-encoder with said each data point for the auto-encoder” (Williams Figure 5, elements 504, 512, 516, 518; [0090], [0092], and [0097]: “During the Training Process 512, the training data is processed through a training algorithm and computes the biases, weights, and transfer functions which are stored in Model(s) 518.”).) … 
… feeding the selected outputs into a supervised, machine learning model, for it to further classify the selected outputs, whereby said anomalies are detected based on outputs from the supervised model (Examiner’s note: Referring to Williams Figure 5, Williams teaches the network computer/sensor computer system includes a training process and a model structure process that defines the machine learning model used for classification and analysis, where the model structure defines the structure of each model implemented in the system, where the system can include unsupervised and supervised models, as well as including a combination of models, where the auto-encoder represents an inference model, and the outputs of the auto-encoder represent the selected outputs resulting from the inference model, and the k-nearest neighbor model represents a supervised machine learning model classifier for detecting anomalies (where under the broadest reasonable interpretation, the detection or prediction of anomalies is considered a form of “further classification”). Given the specified model combination of the auto-encoder, followed by a k-nearest-neighbor model, it logically follows that the outputs from the auto-encoder will be used as inputs into the k-nearest-neighbor-model (Williams Figure 5, element 516; [0092], [0095], and [0097]: “Model Structure 516 may also include a specification of a combination of the machine learning models described above, together with additional machine learning models that consume the output of DLNN models. For example, configuring an auto-encoder to reduce the dimensionality of input data, followed by a k-Nearest-Neighbor model used to detect anomalies in the reduced dimensionality space.”).) …
… detecting anomalies in the classified data (Examiner’s note: Williams teaches the system detecting new entities as anomalies, where the newly detected entities are identified from network traffic captured by a sensor computer and include instances of a web server, database, domain name server, and user applications. A person having ordinary skill in the art would understand that identification of these instances requires the detection of flows through parsing of packet header information such as source/destination IP addresses and corresponding source/destination ports (Williams [0147]-[0149]: “At decision block 910, … if the network information for the detected entity is buffered, control may flow to block 912; otherwise, control may flow to decision block 914. … newly detected entities may initially be marked and/or tagged as new entities. … threshold values may be defined in configuration to indicate the amount of network information that must be captured for a given class and/or entity. … At block 912, in at least one of the various embodiments, anomalies and/or classifications associated with the detected entity may now be included in the report information.” and [0141]: “the detection of an previously unknown/unseen instance of an application, such as, a web server, database, domain name server, user applications (e.g., games, office applications, and so on), file sharing applications, or the like.”).) …
… while collecting the non-stationary data and classifying the collected non-stationary data: … training a cognitive algorithm corresponding to said inference model, based on non-stationary data collected from the network, to obtained a trained model (Examiner’s note: Under its broadest reasonable interpretation in light of Applicant’s specification paragraph [0061], the phrase “… training a cognitive algorithm corresponding to said inference model …” broadly indicates that the training of the learning algorithm conforms with or follows the training of the inference model, and hence this limitation broadly recites obtaining another trained model from the training of a cognitive algorithm based on non-stationary data collected from the network, where the training of the cognitive algorithm conforms with or follows the training of the inference model. Williams teaches the generation and training of a separate Fast Learning Model, where algorithms such as decision trees, random forests, or any specific algorithm based on the characteristics of the data and goals of the system can be used to generate and train the Fast Learning Model, and as such, these machine learning algorithms correspond to cognitive algorithms (Williams [0173]-[0174]: “… the DLNN model is combined with a machine learning model that can be trained quickly to recognize new sets of data, or a Fast Learning Model … A Fast learning Model is a machine learning model which may be less accurate than a DLNN but can be trained more quickly based on the characteristics of the algorithm, or because a subset of training data and recent feedback is presented for training. … Some examples of a Fast Learning Model include, but are not limited to decision trees, and random forests. … Those skilled in the art will appreciate that the specific algorithm utilized is chosen according to the characteristics of the data and goals of the system.”). 
Williams additionally teaches that the Fast Learning Model is based on a new or modified classifier model being retrained on the detected errors from the DLNN model and/or changed input data signals from the network, where both the Fast Learning Model and the DLNN model are trained to perform the same type of classification predictions and may be different instances of the same classifier, such that the Fast Learning Model can also be employed to classify the same data as the DLNN model (thus representing a scenario where the training of the Fast Learning Model conforms with or follows the training of the DLNN model). Hence this process of using source network information to train/retrain a Fast Learning Model (using a learning algorithm), and using the trained Fast Learning Model to classify the same data as the DLNN model represents a process of training a cognitive algorithm that follows or conforms with the training of the existing inference model to obtain another trained model (Williams [0082]: “… classifier applications … may be arranged to employ one or more trained models to classify the observed network information that occurs on the network …”; Figure 7, [0114]-[0116]: “… a user may determine the source data has changed such that the existing classifiers are not trained to recognize (classify) the provided data sufficiently … a user may desire a classifier specific, more precise classifier to classify the data associated with the new attack … a user may modify and/or tune one or more classifiers and/or create new classifiers based on the errors and/or signals of the model … a user may create a new classifier and associate the appropriate training data with the classifier … a FL model may be a machine learning component that is arranged to train faster than the DLNN model …”; [0118]: “… the FL model may be retrained based on the data and/or network information associated with the classification errors made by the deep learning model …”; and [0120]: “… the provided source data may be classified using the fast learning (FL) model …”).) …
… wherein the cognitive algorithm is being trained until the trained model is used to replace the inference model (Examiner’s note: Under its broadest reasonable interpretation, this limitation broadly indicates that the trained model (based on the cognitive algorithm) stops or completes training when it is being used in the system to perform classification, where the phrase “… to replace the inference model” represents an intended use of the trained model. A person having ordinary skill in the art would understand that the training of any model stops once a model is stored, and is further used for performing classifications in a system based on real-time input data. It would also be obvious to a person having ordinary skill in the art that the concept of re-training a previously trained model would also indicate that the training for the previously trained model was stopped or completed for an earlier set of data, with the re-training being focused on training the previously trained model on a new set of data. These actions of storing the trained model and re-training a model also broadly indicate that the previously trained model has completed training. 
Williams teaches actions such as storing the trained models and re-training models based on new input data, where these actions can only be performed once the previous training process has been completed, and hence indicate that a trained model has completed training (Williams [0082]: “… classifier applications … may be arranged to employ one or more trained models to classify the observed network information that occurs on the network … the network information buffered in sensor computers … may be employed as training data and/or test data for re-training the one or more classification models using a machine learning application …”; Figure 5 and [0097]-[0100]: “… During the Training Process 512, the training data is processed through a training algorithm and computes the biases, weights, and transfer functions which are stored in Model(s) 518 … once the Model(s) 518 have been stored, a test of the system’s performance will execute prior to any runtime scoring …”; and [0127]: “… If the training of the machine learning model is complete, the model may be ready to be used for data classification.”). Williams further teaches that the Fast Learning Model (based on a machine learning algorithm) can be incrementally trained or re-trained on augmented data used to train the DLNN, following the same training and model storing process as indicated in FIG. 5, where eventually both the Fast Learning Model and DLNN model are further utilized by the system to generate classification scores. 
The model having the better classification score (and resulting highest probability of accuracy) is selected as the classification result to be used in the system going forward, which corresponds to the intended use of the phrase (“… to replace the inference model”) and is further explained in the subsequent recited limitation (Williams [0174]-[0175]: “A Fast Learning Model is a machine learning model … can be trained more quickly based on the characteristics of the algorithm or because a subset of training data and recent feedback is presented for training. The Fast Learning Model is either incrementally trained based on new data … or retrained entirely on the new data … possibly augmented with a subset of the data used to train the DLNN. … Both DLNNs and Fast Learning Models operate as described in FIG. 5 …”).) …
… replacing the inference model, as currently used to classify the non-stationary data, with the trained model (Examiner’s note: Under its broadest reasonable interpretation in light of Applicant’s specification paragraph [0062], the verb “to replace”, as defined in the Merriam-Webster dictionary, describes an action that takes the place of something or puts something new in its place, and hence this limitation broadly recites putting or using the trained model (trained by a learning algorithm in the preceding limitation) in place of the inference model to perform classification. As indicated earlier, Williams teaches a Combination Function component that analyzes the classification confidence scores generated by the DLNN Model and the Fast Learning Model based on the received source network information, and selects the class representing the highest probability of accuracy, where the generation of these confidence scores and the selection of the class with the highest probability of accuracy indicates that these two models are performing classifications at the same time, and the selection of the classification with the highest probability of accuracy indicates that one of the two models is currently used to perform classification (Williams [0175]-[0176]: “Both DLNNs and Fast Learning Models operate as described in FIG. 5, and the multiple output classifications are handled by Combination Function 520 responsible for assigning the best class to the data. Combination Function 520 analyzes the scores predicted by both models in combination with the confidence and performance of each model, finally selecting the class representing the highest probability of accuracy … Subsequent runtime scoring of the Fast Learning Model may have a higher accuracy and confidence (compared to the DLNN) for data similar to the type that have been submitted through the Fast Learning Model training process. 
… The Combination Function 520 chooses as output whichever class or classes represent the higher accuracy and confidence.”). As indicated earlier, Williams further teaches the concept of the Fast Learning Model producing matches/classifications with higher confidence score than the DLNN model, resulting in the classification result produced by the Fast Learning Model being employed instead of those classification results from the DLNN model, where this process of the Combination Function component employing the classification result produced by the Fast Learning Model instead of those classification results from the DLNN model corresponds to using the trained model (trained by the learning algorithm) in place of the inference model to perform classification (Williams [0122]-[0123]: “… even though the FL model may be less precise than the DLNN model, since it has been trained on the training data corresponding to the tuning in block 704, it may produce matches/classifications that have a higher confidence score than the DLNN model … At block 712 … the FL model classification result may be employed … since the FL model produced a classification results that has a higher values confidence level that the classification result produced by the DLNN model, the data may be classified based on the FL model rather than the DLNN model.”).).  
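For illustration only, the confidence-based selection described in the cited Williams passages ([0122]-[0123], [0175]-[0176]) can be sketched as follows. This is a minimal sketch under stated assumptions and is not Williams' implementation: the names Prediction and combine, the example labels and scores, and the tie-breaking rule are hypothetical, and the sketch compares confidence scores only, whereas Williams describes the Combination Function as also weighing the performance of each model.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    model: str         # hypothetical model tag, e.g. "DLNN" or "FL"
    label: str         # predicted class
    confidence: float  # confidence score, assumed to lie in [0, 1]


def combine(dlnn: Prediction, fl: Prediction) -> Prediction:
    """Return whichever prediction carries the higher confidence.

    Ties keep the DLNN result; that tie-breaking rule is an assumption
    made for this sketch only.
    """
    return fl if fl.confidence > dlnn.confidence else dlnn


# Hypothetical scores: the fast-learning model was retrained on recent
# data and is more confident than the DLNN for this input.
dlnn_out = Prediction("DLNN", "benign", 0.62)
fl_out = Prediction("FL", "new_attack", 0.91)
chosen = combine(dlnn_out, fl_out)
```

In this sketch the result of the fast-learning model is employed in place of the DLNN result whenever its confidence is strictly higher, which mirrors the behavior the cited passages describe at block 712.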
While Williams teaches the encoding phase of an auto-encoder, Williams does not explicitly teach
… for each data point of the formed data points: 
[feeding the auto-encoder] … to reconstruct said each data point according to one or more parameters learned by a cognitive algorithm of the auto-encoder;
scoring a degree of anomaly of said each data point, according to a reconstruction error in reconstructing said each data point, to obtain anomaly scores;
selecting outputs from the classification performed based on the degree of anomaly, wherein the outputs selected have a degree of anomaly above a threshold degree; …
	An teaches
… for each data point of the formed data points: 
[feeding the auto-encoder] … to reconstruct said each data point according to one or more parameters learned by a cognitive algorithm of the auto-encoder (Examiner’s note: Referring to An p.4 Algorithm 2, An teaches 𝛉 and 𝛟 representing auto-encoder parameters that are initialized and trained (learned) by the auto-encoder during classification and determination of reconstruction errors (An p.4 Algorithm 2 Autoencoder based anomaly detection algorithm, and p.4 2nd paragraph: “Autoencoder based anomaly detection is a deviation based anomaly detection method using semi-supervised learning. It uses the reconstruction error as the anomaly score. Data points with high reconstruction are considered to be anomalies. Only data with normal instances are used to train the autoencoder. After training, the autoencoder will reconstruct normal data very well, while failing to do so with anomaly data which the autoencoder has not encountered. Algorithm 2 shows the anomaly detection algorithm using reconstruction errors of autoencoders.”).);
scoring a degree of anomaly of said each data point, according to a reconstruction error in reconstructing said each data point, to obtain anomaly scores (Examiner’s note: An teaches obtaining anomaly scores by calculating the auto-encoder reconstruction error for an input data point (An p.4 Algorithm 2 Autoencoder based anomaly detection algorithm, and p.3 2nd paragraph: “Deviation based anomaly detection is mainly based on spectral anomaly detection, which uses reconstruction errors as anomaly scores. The first step is to reconstruct the data using dimension reduction methods such as principal components analysis or autoencoders. Reconstructing the input using k-most significant principal components and measuring the difference between its original data point and the reconstruction leads to the reconstruction error which can be used as an anomaly score. Data points with high reconstruction error are defined as anomalies.” and An p.4 Algorithm 2: within the for loop of the algorithm (for each input data point), calculate a reconstruction error using: “reconstruction error(i) = ∥x^{(i)} - g_θ(f_ϕ(x^{(i)}))∥”), where the reconstruction error corresponds to an anomaly score, with the calculated distance between the input data point and decoded and encoded phases of the auto-encoder corresponding to the degree of the anomaly.);
selecting outputs from the classification performed based on the degree of anomaly, wherein the outputs selected have a degree of anomaly above a threshold degree (Examiner’s note: Referring to An p.4 Algorithm 2, An teaches within the for loop of the algorithm the calculation of a reconstruction error, and selecting the identified input data point as either being anomalous or not according to the reconstruction error (representing the degree of anomaly) being above a threshold α, thus corresponding to “selecting outputs from the classification performed based on the degree of anomaly, wherein the outputs selected have a degree of anomaly above a threshold” (An p.4 Algorithm 2: within the for loop of the algorithm (for each data point): “if reconstruction error(i) > α then x^{(i)} is an anomaly else x^{(i)} is not an anomaly end if”).); …
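For illustration only, the per-data-point scoring and threshold selection of An p.4 Algorithm 2, as mapped above, can be sketched as follows. This is a minimal sketch under stated assumptions and is not An's implementation: the Euclidean norm, the function names, the threshold value, and the toy encode/decode pair standing in for a trained auto-encoder (f_ϕ and g_θ) are all introduced here for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def reconstruction_error(
    x: Vector,
    encode: Callable[[Vector], Vector],
    decode: Callable[[Vector], Vector],
) -> float:
    """Distance between x and its reconstruction decode(encode(x)).

    The Euclidean norm is assumed; the cited algorithm writes only a norm.
    """
    x_hat = decode(encode(x))
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat)))


def select_anomalies(
    points: Sequence[Vector],
    encode: Callable[[Vector], Vector],
    decode: Callable[[Vector], Vector],
    alpha: float,
) -> List[Tuple[int, float]]:
    """Return (index, score) for every point whose error exceeds alpha."""
    selected = []
    for i, x in enumerate(points):
        score = reconstruction_error(x, encode, decode)
        if score > alpha:  # "if reconstruction error(i) > alpha then anomaly"
            selected.append((i, score))
    return selected


# Toy stand-in for a trained auto-encoder (hypothetical): the encoder
# compresses two coordinates to their mean, the decoder repeats it.
def toy_encode(x: Vector) -> Vector:
    return [(x[0] + x[1]) / 2.0]


def toy_decode(z: Vector) -> Vector:
    return [z[0], z[0]]


points = [(1.0, 1.0), (2.0, 2.1), (0.0, 4.0)]
anomalies = select_anomalies(points, toy_encode, toy_decode, alpha=1.0)
```

Only the third point, which the toy model reconstructs poorly, is selected; the selected outputs are the ones that would be passed on to any downstream classifier.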
	Both Williams and An are analogous art since they both teach using auto-encoders for anomaly classification and detection.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the scoring process of Williams and incorporate the algorithm steps of calculating an auto-encoder reconstruction error of An as a way to generate anomaly scores. The motivation to combine is taught by An, as auto-encoders already reduce the original data to lower dimensional embeddings, which separates anomalies and normal data by taking out noise and other unimportant features. Thus, using lower-dimensional embeddings not only makes it easier to detect anomalies, but also using the reconstruction error as an anomaly score provides a built-in scoring process for identifying more anomalous data, resulting in improved anomaly detection behavior of the system (An p.1 last paragraph – p.2 1st paragraph: “Among many anomaly detection methods, spectral anomaly detection techniques try to find the lower dimensional embeddings of the original data where anomalies and normal data are expected to be separated from each other. After finding those lower dimensional embeddings, they are brought back to the original data space which is called the reconstruction of the original data. By reconstructing the data with the low dimension representations, we expect to obtain the true nature of the data, without uninteresting features and noise. Reconstruction error of a data point, which is the error between the original data point and its low dimensional reconstruction, is used as an anomaly score to detect anomalies. … With the advent of deep learning, autoencoders are also used to perform dimension reduction by stacking up layers to form deep autoencoders. By reducing the number of units in the hidden layer, it is expected that the hidden units will extract features that well represent the data. 
Moreover, by stacking autoencoders we can apply dimension reduction in a hierarchical manner, obtaining more abstract features in higher hidden layers leading to a better reconstruction of the data.”).
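For illustration only, the reconstruction-error scoring described in the An passage quoted above can be sketched as follows. This is a minimal sketch under stated assumptions, not the method of An or Williams: a one-dimensional linear projection (equivalent to PCA) stands in for a trained under-complete auto-encoder, and the function names, the sample data, and the threshold rule are all hypothetical.

```python
import math

def principal_direction(points, iters=100):
    """Fit a 1-D linear 'bottleneck' for 2-D data via power iteration.

    Assumption: a linear projection (PCA) stands in for a trained
    under-complete auto-encoder; this is not a neural network.
    """
    n = len(points)
    mean = (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)
    centered = [(p[0] - mean[0], p[1] - mean[1]) for p in points]
    cxx = sum(x * x for x, _ in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    cyy = sum(y * y for _, y in centered) / n
    w = (1.0, 0.0)
    for _ in range(iters):
        v = (cxx * w[0] + cxy * w[1], cxy * w[0] + cyy * w[1])
        norm = math.hypot(v[0], v[1])
        w = (v[0] / norm, v[1] / norm)
    return mean, w

def reconstruction_error(point, mean, w):
    """Encode to a 1-D code, decode back to 2-D, return the squared error."""
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    z = dx * w[0] + dy * w[1]                        # encode
    rx, ry = mean[0] + z * w[0], mean[1] + z * w[1]  # decode
    return (point[0] - rx) ** 2 + (point[1] - ry) ** 2

# Hypothetical "normal" data lying close to the line y = x.
normal = [(float(i), float(i) + 0.1 * (-1) ** i) for i in range(10)]
mean, w = principal_direction(normal)

# Hypothetical threshold: twice the largest error seen on normal data.
threshold = 2 * max(reconstruction_error(p, mean, w) for p in normal)

def is_anomaly(point):
    return reconstruction_error(point, mean, w) > threshold
```

A point that the low-dimensional code cannot reproduce well, such as (5.0, 0.0) here, receives a large reconstruction error and is flagged, while a point near the learned structure is not.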
While Williams in view of An teaches capturing network information during time intervals (where this capturing process performs collection of network packets representing non-stationary data in the network), Williams in view of An does not explicitly teach
… wherein collecting the non-stationary data comprises: for each network traffic source, creating a data point comprising a vector of features computed over an aggregation of network packets from each network traffic source having timestamps within consecutive, non-overlapping time intervals of a pre-defined length; … 
Casas teaches
… wherein collecting the non-stationary data comprises: for each network traffic source, creating a data point comprising a vector of features computed over an aggregation of network packets from each network traffic source having timestamps within consecutive, non-overlapping time intervals of a pre-defined length (Examiner’s note: Casas teaches capturing network traffic at a packet level, where the network traffic is captured in contiguous and consecutive time slots of fixed length ∆T (where the contiguous and consecutive time slots of a fixed length represent consecutive, non-overlapping time intervals of a pre-defined length), and aggregated in IP flows, where these aggregated IP flows are according to different flow-resolution levels based on aggregation keys using source or destination IP addresses. A person having ordinary skill in the art would understand that capturing network traffic at a packet level and identifying them as being received in contiguous and consecutive time slots of fixed length requires the use of timestamps to identify the time receipt of each packet. Casas further teaches these IP flows can be aggregated to identify 1-to-N anomalies (which involve many IP flows from the same source IPsrc towards different destinations), N-to-1 anomalies (which involve IP flows from different sources towards a single destination IPdst), 1-to-1 anomalies, and N-N anomalies (using multiple N-to-1 or 1-to-N instances), with each IP flow $y_i \in Y$ described by a set of m traffic attributes/features, and with $x_i \in \mathbb{R}^m$ representing a vector of traffic features describing flow $y_i$ (with $x_i$ representing a data point) (Casas pp.41-42 Figure 1 and 4th paragraph: “UNADA runs in three consecutive steps, analyzing packets captured in contiguous time slots of fixed length … captured packets are first aggregated into multi-resolution traffic flows …”; p.43 Section 3. Multi-resolution Flow Aggregation and Change-Detection 1st paragraph: “UNADA performs unsupervised anomaly detection on single-link packet-level traffic, captured in consecutive time slots of fixed length ∆T and aggregated in IP flows (standard 5-tuples). IP flows are additionally aggregated at different flow-resolution levels, using 9 different aggregation keys … include … source Network Prefixes ($l_{2,3,4}$: IPsrc /8, /16, /24), destination Network Prefixes ($l_{5,6,7}$: IPdst /8, /16, /24), source IPs ($l_8$: IPsrc), and destination IPs ($l_9$: IPdst).”; and pp.43-44 Section 4. Unsupervised Anomaly Detection through Clustering 1st paragraph: “… IP flows are analyzed at two different resolutions, either IPsrc or IPdst aggregation key. … 1-to-N anomalies involve many IP flows from the same source towards different destinations … N-to-1 anomalies involve IP flows from different sources towards a single destination … let $Y = \{y_1, \ldots, y_n\}$ be the set of n aggregated-flows (at IPsrc or IPdst) in the flagged slot. Each flow $y_i \in Y$ is described by a set of m traffic attributes or features, … Let $x_i \in \mathbb{R}^m$ be the vector of traffic features describing flow $y_i$, and $X = \{x_1, \ldots, x_n\} \in \mathbb{R}^{n \times m}$ the complete matrix of features …”).); … 
Both Williams in view of An and Casas are analogous art since they both teach anomaly detection using machine learning techniques.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the captured network information taught in Williams in view of An and perform the IP flow aggregation steps taught in Casas as a way to further classify the selection of data points. The motivation to combine is taught in Casas: the resulting matrix of features (formed by the aggregated IP flows and vectors of traffic features) and its associated robust clustering algorithm can be applied to any monitoring system without any kind of calibration or customization, thus making a system that uses this method simpler to configure and use. Casas also teaches another advantage, which is that the applied robust clustering algorithm (utilizing a DBSCAN clustering algorithm) is ideal for clustering data points in lower dimensions, which leads to improved anomaly classification and detection performance in the system (Casas p.43 2nd paragraph: “UNADA presents several advantages with respect to current state of the art. First and most important, it works in a complete unsupervised fashion, which means that it can be directly plugged-in to any monitoring system and start to work from scratch, without any kind of calibration … Secondly, it uses a robust density-based clustering technique to avoid general clustering problems such as sensitivity to initialization, specification of number of clusters, detection of particular cluster shapes, or structure-masking by irrelevant features. Thirdly, it performs clustering in very low dimensional spaces, avoiding sparsity problems when working with high-dimensional data….”; and p.45 1st paragraph: “Using small values for k provides several advantages: firstly, doing clustering in low-dimensional spaces is more efficient and faster than clustering in bigger dimensions. 
Secondly, density-based clustering algorithms such as DBSCAN provide better results in low-dimensional spaces [7], because high-dimensional spaces are usually sparse, making it difficult to distinguish between high and low density regions.”).
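For illustration only, the claimed collection step (one feature vector per traffic source per consecutive, non-overlapping time interval of pre-defined length) can be sketched as below. This is a minimal sketch under stated assumptions and is not taken from Casas or Williams: the packet tuple layout, the function name aggregate, the three features chosen, and the sample addresses are all hypothetical.

```python
from collections import defaultdict

def aggregate(packets, slot_len):
    """Build one feature vector per (source IP, time slot).

    packets: iterable of (timestamp_seconds, src_ip, dst_ip, n_bytes)
             (hypothetical layout chosen for this sketch).
    slot_len: length in seconds of each consecutive, non-overlapping slot.
    """
    groups = defaultdict(list)
    for ts, src, dst, size in packets:
        slot = int(ts // slot_len)   # index of the fixed-length slot
        groups[(src, slot)].append((dst, size))

    points = {}
    for key, pkts in groups.items():
        n_pkts = len(pkts)                       # packets sent in the slot
        n_bytes = sum(size for _, size in pkts)  # bytes sent in the slot
        n_dsts = len({dst for dst, _ in pkts})   # distinct destinations
        points[key] = [n_pkts, n_bytes, n_dsts]
    return points

# Hypothetical capture using documentation-range addresses.
packets = [
    (0.5, "192.0.2.1", "192.0.2.2", 100),
    (3.0, "192.0.2.1", "192.0.2.3", 200),
    (5.2, "192.0.2.1", "192.0.2.2", 50),
    (1.0, "192.0.2.9", "192.0.2.2", 60),
]
points = aggregate(packets, slot_len=5.0)
```

Because the slot index is computed from each packet's timestamp by integer division, every packet falls into exactly one slot, so the intervals are consecutive and non-overlapping by construction.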
Regarding previously presented Claim 5, 
Williams in view of An, in further view of Casas teaches
(Previously Presented) The computer-implemented method according to claim 1, 
wherein: the unsupervised model is implemented as an under-complete auto-encoder by the neural network (Examiner’s note: An teaches an auto-encoder that has a reduced number of units in the hidden layer, which is by definition an ‘under-complete’ auto-encoder (An p.2 2nd paragraph: “With the advent of deep learning, autoencoders are also used to perform dimension reduction by stacking up layers to form deep autoencoders. By reducing the number of units in the hidden layer, it is expected that the hidden units will extract features that well represent the data.”).), and 
wherein classifying the collected data further comprises, performing a dimension reduction, based on said each data point (Examiner’s note: An teaches an auto-encoder that finds lower dimensional embeddings for the original input data, which is a form of dimension reduction (An p.1 last paragraph – p.2 1st paragraph: “Among many anomaly detection methods, spectral anomaly detection techniques try to find the lower dimensional embeddings of the original data where anomalies and normal data are expected to be separated from each other. After finding those lower dimensional embeddings, they are brought back to the original data space which is called the reconstruction of the original data. By reconstructing the data with the low dimension representations, we expect to obtain the true nature of the data, without uninteresting features and noise. Reconstruction error of a data point, which is the error between the original data point and its low dimensional reconstruction, is used as an anomaly score to detect anomalies. …”).).  
Regarding amended Claim 16, 
Williams in view of An, in further view of Casas teaches
(Original) The computer-implemented method according to claim 1, wherein the method further comprising: further classifying non-stationary data collected according to a non-Markovian, stateful classification, based on the substituted model, so as to be able to detect new anomalies in further classified data (Examiner’s note: Williams teaches that after using the classification result for the Fast Learning model, the flow in Williams Figure 5 proceeds with using the Fast Learning model (e.g., the substituted model) to detect new anomalies until the DLNN model has been re-trained to increase its accuracy and performance (Williams [0122]-[0123], [0127]; [0176]: “Whenever a user, such as a Domain Expert, adjusts the predicted output of the DLNN, that data element may be submitted to the training process of the Fast Learning Model, quickly modifying and improving future output of the Fast Learning Model. Subsequent runtime scoring of the Fast Learning Model may have a higher accuracy and confidence (compared to the DLNN) for data similar to the type that have been submitted through Fast Learning Model training process. Conversely, the DLNN may have lower accuracy and confidence for the same data, but a high degree of accuracy and confidence for data that has not been submitted to the Fast Learning Model. The Combination Function 520 chooses as output whichever class or classes represent the higher accuracy and confidence.”, and [0177]-[0178]).).  
Regarding amended Claim 17, 

Claim 17 recites a computerized system adapted to interact with a network of computing entities for detecting anomalies in non-stationary data, wherein the system is configured to perform claim limitations that are similar in scope to corresponding claim limitations in Claim 1, and hence is rejected under similar rationale and motivations provided by Williams, An, and Casas as indicated in Claim 1. In addition, Williams teaches a network computer with sensor computer functionalities, where the network computer contains a processor and associated memory storing computer readable instructions and program modules such as a classifier application and machine learning engine, and where this network computer represents a computerized system (Williams Figure 3, elements 300, 302, 326; [0066], [0067], [0070]-[0072]).
Regarding previously presented Claim 18, 
Williams in view of An, in further view of Casas teaches
(Previously Presented) The computerized system according to claim 17, wherein: 
the system comprises a memory storing both an inference model, which is a trained, unsupervised machine learning model, and a nearest-neighbor classifier model, which is a supervised machine learning model (Williams Figure 5, element 516; Examiner’s note: Referring to Williams Figure 5, Williams teaches the network/sensor computer system includes a training process and a model structure process that defines the machine learning model used for classification and analysis, where the model structure defines the structure of each model implemented in the system, where the system can include unsupervised and supervised models, as well as including a combination of models, with the auto-encoder representing the inference model, and the outputs of the auto-encoder representing the selected outputs resulting from the inference model, and the k-nearest neighbor model detecting anomalies representing a supervised machine learning model classifier (where the detection or prediction of anomalies is considered a form of classification) (Williams [0092], [0095], and [0097]: “Model Structure 516 may also include a specification of a combination of the machine learning models described above, together with additional machine learning models that consume the output of DLNN models. For example, configuring an auto-encoder to reduce the dimensionality of input data, followed by a k-Nearest-Neighbor model used to detect anomalies in the reduced dimensionality space.”).), and 
wherein the system is further configured to: select outputs from data as classified with said inference model and feed the selected outputs into the supervised, machine learning model (Examiner’s note: Referring to Williams Figure 5, Williams teaches the network computer/sensor computer system includes a training process using a model structure that defines the machine learning model used for classification and analysis in the system, which can include unsupervised and supervised models as well as including a combination of models, where the auto-encoder represents an inference model performing the initial data classification, the outputs of the auto-encoder represent the selected outputs resulting from the inference model (“selecting outputs from the classification performed thanks to the inference model”), and the k-nearest neighbor model represents a supervised machine learning model classifier for detecting anomalies. Given the specified model combination of the auto-encoder, followed by a k-nearest-neighbor model, it logically follows that the outputs from the auto-encoder will be selected to be used as inputs into the k-nearest-neighbor-model (Williams Figure 5, elements 512, 516; [0092], [0095], and [0097]: “Model Structure 516 may also include a specification of a combination of the machine learning models described above, together with additional machine learning models that consume the output of DLNN models. For example, configuring an auto-encoder to reduce the dimensionality of input data, followed by a k-Nearest-Neighbor model used to detect anomalies in the reduced dimensionality space.”).) …  
… so as to detect said anomalies based on outputs from the supervised model (Examiner’s note: Casas teaches applying the DBSCAN clustering algorithm (which is a form of a nearest-neighbors algorithm) by performing a query on a subset of selected data points Xi of lower dimension (Casas p.45 1st paragraph) provided into the DBSCAN algorithm (where the data points Xi were selected through a set of constraints for a particular set of k features out of possible m attributes), and getting a set of clusters Pi (each of which represents a class label for those data points within each cluster) and an associated set of q(i) outliers (“said anomalies are detected based on outputs from the supervised model”), where the DBSCAN clustering algorithm represents a “supervised, machine learning model” as it is performing further classification of the data points into clusters (Casas pp.44 last paragraph – p.45 1st paragraph (Section 4.1 Clustering Ensemble and Sub-space Clustering): “Each of the N sub-spaces Xi ⊂ X is obtained by selecting k features from the complete set of m attributes. … Each partition Pi is obtained by applying DBSCAN [13] to sub-space Xi. DBSCAN is a powerful clustering algorithm that discovers clusters of arbitrary shapes and sizes [7], relying on a density-based notion of clusters: clusters are high-density regions of the space, separated by low-density areas. This algorithm perfectly fits our unsupervised traffic analysis, because it is not necessary to specify a-priori difficult to set parameters such as the number of clusters to identify. Results provided by applying DBSCAN to sub-space Xi are twofold: a set of p(i) clusters $\{C_1^i, C_2^i, \ldots, C_{p(i)}^i\}$ and a set of q(i) outliers $\{o_1^i, o_2^i, \ldots, o_{q(i)}^i\}$.”).).
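For illustration only, the twofold DBSCAN result quoted from Casas above (a set of clusters and a set of outliers) can be sketched with a small density-based clustering routine. This is a minimal sketch of the general DBSCAN technique under stated assumptions, not the implementation of Casas: the function name, the sample points, and the eps and min_pts values are hypothetical.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id >= 0, or -1 for an outlier."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # Indices of all points within distance eps of point i (itself included).
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # provisional; may become a border point
            continue
        cluster += 1                   # point i is a core point: new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            reach = neighbors(j)
            if len(reach) >= min_pts:  # j is also a core point: keep expanding
                queue.extend(reach)
    return labels

# Hypothetical low-dimensional data: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
outliers = [p for p, lab in zip(points, labels) if lab == -1]
```

The number of clusters is not supplied as a parameter; it emerges from the density of the data, and any point not reachable from a dense region is returned as an outlier.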
Regarding previously presented Claim 19, 
Williams in view of An, in further view of Casas teaches
(Previously Presented) The computerized system according to claim 18, wherein: 
the system further comprises a validation expert system configured to couple to the supervised model (Examiner’s note: Williams teaches a domain expert 530 receiving information from scoring process 522 (producing anomaly scores) from outputs from the model(s) 518 (implemented as a combination of unsupervised auto-encoder model followed by the supervised nearest-neighbor model) and (optionally) combination function 520; this connection flow 518, (520), 522, and 530 represents a coupling between the supervised model and the domain expert (representing a validation expert system) (Williams Figure 5, elements 518, 520, 522, 530; and [0100]-[0102]: “… once the Model(s) 518 have been stored, a test of the system's performance will execute prior to any runtime scoring. Both testing and runtime scoring utilize Scoring Process 522, which applies Model(s) 518 to the input data and executes Combination Function 520 to select the correct predicted classification, when appropriate. … Scoring Process 522 assigns a score to incoming data, ranking said data as a member of a class (or label), or as an anomalous data point. Runtime scoring delivers new data to the Scoring Process and makes those results available to the Domain Expert Analysis component 530.”).), 
so as for the validation expert system to take as input a sample of outputs from the supervised model (Examiner’s note: Williams teaches a domain expert 530 receiving information from scoring process 522 (producing anomaly scores) from outputs from the model(s) 518 (implemented as a combination of unsupervised auto-encoder model followed by the supervised nearest-neighbor model) and (optionally) combination function 520; this connection flow 518, (520), 522, and 530 represents a coupling between the supervised model and the domain expert (representing a validation expert system) (Williams Figure 5, elements 518, 520, 522, 530; and [0100]-[0102]: “… once the Model(s) 518 have been stored, a test of the system's performance will execute prior to any runtime scoring. Both testing and runtime scoring utilize Scoring Process 522, which applies Model(s) 518 to the input data and executes Combination Function 520 to select the correct predicted classification, when appropriate. … Scoring Process 522 assigns a score to incoming data, ranking said data as a member of a class (or label), or as an anomalous data point. Runtime scoring delivers new data to the Scoring Process and makes those results available to the Domain Expert Analysis component 530.”).) and 
the supervised model to take as input a fraction of outputs obtained from the validation expert system (Examiner’s note: Casas teaches applying the DBSCAN clustering algorithm (which is a form of a nearest-neighbors algorithm) by performing a query on a subset of selected data points Xi of lower dimension (Casas p.45 1st paragraph) provided into the DBSCAN algorithm (where the data points Xi were selected through a set of constraints for a particular set of k features out of possible m attributes), and getting a set of clusters Pi (each of which represents a class label for those data points within each cluster) and an associated set of q(i) outliers (“said anomalies are detected based on outputs from the supervised model”), where the DBSCAN clustering algorithm represents a “supervised, machine learning model” as it is performing further classification of the data points into clusters (Casas pp.44 last paragraph – p.45 1st paragraph (Section 4.1 Clustering Ensemble and Sub-space Clustering): “Each of the N sub-spaces Xi ⊂ X is obtained by selecting k features from the complete set of m attributes. … Each partition Pi is obtained by applying DBSCAN [13] to sub-space Xi. DBSCAN is a powerful clustering algorithm that discovers clusters of arbitrary shapes and sizes [7], relying on a density-based notion of clusters: clusters are high-density regions of the space, separated by low-density areas. This algorithm perfectly fits our unsupervised traffic analysis, because it is not necessary to specify a-priori difficult to set parameters such as the number of clusters to identify. Results provided by applying DBSCAN to sub-space Xi are twofold: a set of p(i) clusters {                        
                            
                                
                                    C
                                
                                
                                    1
                                
                                
                                    i
                                
                            
                        
                    ,                         
                            
                                
                                    C
                                
                                
                                    2
                                
                                
                                    i
                                
                            
                        
                    , ..,                         
                            
                                
                                    C
                                
                                
                                    p
                                    (
                                    i
                                    )
                                
                                
                                    i
                                
                            
                        
} and a set of q(i) outliers {$o_1^i$, $o_2^i$, .., $o_{q(i)}^i$}.”). Casas further teaches in Casas p.45 last paragraph – p.46 1st paragraph (Section 4.2 Ranking Outliers Using Evidence Accumulation) and Casas p.46 Algorithm 1 that $\delta_i$ is defined as “the maximum neighborhood distance of a sample to identify dense regions”, which represents the set of points that were already classified by the supervised model and located within each existing cluster $C_{max}^i$, and is set in Algorithm 1 line 5 to “a fraction of the average distance between flows in sub-space Xi (we take a fraction 1/10), which is estimated from 10% of the flows, randomly selected.”; this is interpreted as the feed-back of the fraction of inputs obtained from the validation expert system to be used as inputs to the supervised model.).  
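For illustration only, the neighborhood-distance setting quoted from Casas Algorithm 1 line 5 might be sketched as follows. The function name `estimate_delta`, its parameters, and the use of Euclidean distance are assumptions made for this sketch and are not taken from Casas.

```python
import itertools
import math
import random


def estimate_delta(flows, fraction=0.1, sample_ratio=0.1, seed=0):
    """Return `fraction` of the mean pairwise distance in a random sample.

    `flows` is a list of equal-length numeric tuples (hypothetical layout).
    Euclidean distance is an assumption; the quoted passage does not fix one.
    """
    if len(flows) < 2:
        raise ValueError("need at least two flows to compute a distance")
    rng = random.Random(seed)
    # Sample roughly `sample_ratio` of the flows, but never fewer than two.
    sample_size = min(len(flows), max(2, round(len(flows) * sample_ratio)))
    sample = rng.sample(flows, sample_size)
    pairs = list(itertools.combinations(sample, 2))
    mean_distance = sum(math.dist(a, b) for a, b in pairs) / len(pairs)
    return fraction * mean_distance
```

With `fraction=0.1` and `sample_ratio=0.1`, the sketch mirrors the "fraction 1/10" and "10% of the flows" values quoted above.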
Regarding amended Claim 20, 

Claim 20 recites a computer program product for detecting anomalies in non-stationary data in a network of computing entities, where the computer program product comprises a computer readable storage medium having program instructions embodied therewith, the program instructions executable by one or more processors to cause the computer program product to perform claim limitations that are similar in scope to corresponding claim limitations in Claim 1, and hence is rejected under similar rationale and motivations provided by Williams, An, and Casas as indicated in Claim 1. In addition, Williams teaches a network computer with sensor computer functionalities, where the network computer contains a processor and associated memory storing computer readable instructions and program modules such as a classifier application and machine learning engine, with the associated memory representing a computer program product comprising a computer readable storage medium (Williams Figure 3, elements 300, 302, 326; [0066], [0067], [0070]-[0072]).
Regarding new Claim 22, 
Williams in view of An, in further view of Casas teaches
(New) The computer-implemented method according to claim 1, wherein replacing the inference model, as currently used to classify the non-stationary data, with the trained model occurs based on a data traffic of the non-stationary data in the network (Examiner’s note: Under its broadest reasonable interpretation, this limitation broadly recites physically or logically changing or updating one model with another model based on received changes in network data traffic. As indicated earlier, Williams teaches collecting network packets (representing non-stationary data in a network) for classification, and training and storing models based on the collected input data. Williams further teaches these models include the Fast Learning Model and the DLNN models trained in the system, following the same training and model storing process as indicated in FIG.5 (Williams [0026]; Figure 5, [0082] and [0083]-[0084]; [0097]-[0100]; [0126], [0130]; and [0174]-[0175]). As indicated earlier, Williams teaches the Fast Learning Model producing matches/classifications with a higher confidence score than the DLNN model and employing the classification result produced by the Fast Learning Model, where this action corresponds to using the trained model (i.e., the Fast Learning Model trained by the learning algorithm) in place of the inference model to perform classification (Williams [0122]-[0123]). 
Williams additionally teaches that the rationale for employing the classification result from the Fast Learning Model is to allow the system to monitor and classify real-time network behavior (such as an unknown malicious attack), where a model that is tuned to this new additional training data produces scores/ranks that match closer to the current real-time data, hence corresponding to the scenario where the classification result from the trained model is used in place of the classification result of the inference model based on the data traffic of the non-stationary data (Williams [0114]-[0115]: “… if a system is arranged to monitor and classify real-time network behavior, a previously unknown malicious attack may be launched .. the network behavior associated with the attack may be flagged correctly as an anomaly but a user may desire a classifier specific, more precise classifier to classify the data associated with the new attack … a user may modify and/or tune one or more classifiers and/or create new classifiers based on the errors and/or signals of the model … tuning a classifier may include associating additional training data with classifier .. a user may create a new classifier and associate the appropriate training data with the classifier … a fast learning (FL) model may be trained using the tuned and/or new classifier …”; [0120]: “… the provided source data may be classified using the fast learning (FL) model … each classification result generated by the fast learning model may be associated with a value … that scores/ranks the how close the data matches the classifier …”).).
Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over 

Williams, Jr. et al., U.S. PGPUB 2015/0254555, published 9/10/2015 [hereafter referred as Williams] in view of An et al., Variational Autoencoder based Anomaly Detection using Reconstruction Probability, published December 27, 2015 [hereafter referred as An], in further view of Casas et al., UNADA: Unsupervised Network Anomaly Detection Using Sub-space Outliers Ranking, 2011 IFIP International Federation for Information Processing [hereafter referred as Casas] as applied to Claim 1; in even further view of Zhou et al., Distributed Anomaly Detection by Model Sharing, IEEE 2009 [hereafter referred as Zhou].
Regarding previously presented Claim 6, 
Williams in view of An, in further view of Casas as applied to Claim 1 teaches
(Previously Presented) The computer-implemented method according to claim 1.
However, Williams in view of An, in further view of Casas does not teach 
… wherein classifying the collected data further comprises: sorting the data points according to their corresponding anomaly scores.
Zhou teaches
… wherein classifying the collected data further comprises: sorting the data points according to their corresponding anomaly scores (Examiner’s note: Zhou teaches using an auto-encoder neural network as an anomaly detection model. Zhou further teaches anomaly detection models Mj from different sites (“network entities/computers”) computing anomaly scores ASi for each data record (“data points”) (Zhou p.298.col.1 Section 3.1 General framework for distributed anomaly detection, 1st – 2nd paragraph), and performing a combining method for the scores ASi, where one of the combining methods is an averaging method that sorts anomaly scores (Zhou p.297 col.2 Section 2 Methodology, 1st paragraph: “In this paper, we propose a novel framework for anomaly detection from distributed data sources (or sites)…”; p.299 col.1 last paragraph; p.300 Table 1; p.299 col.1 3rd paragraph (Section 3.2 Description of combining methods): “Average anomaly score. This method takes anomaly score vectors AS(j), j = 1, …, n from all the anomaly detection models Mj that are built at distributed sites and then computes an average anomaly score vector ASF. … Alternatively, we can sort anomaly score vector ASF and thus rank all new test data records from being most anomalous to less anomalous. The higher value of anomaly score means the higher probability that the new test data record is anomalous one.”).).
	Both Williams in view of An, in further view of Casas and Zhou are analogous art since they both teach using auto-encoders for anomaly classification and detection in a plurality of network entities, and calculating anomaly scores for the classified outputs.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the scoring process of Williams in view of An, in further view of Casas and incorporate the sorting step of Zhou as a way to create a sorted list of classified outputs according to their anomaly scores. The motivation to combine is taught by Zhou, as a way to aggregate outputs with associated anomaly scores collected from different network computers. Identifying those outputs/data points that are more anomalous through sorting will aid the system to identify and flag candidate data points that contain anomalous state information for further analysis and alerting, resulting in overall improvement of the anomaly detection performance in the system (Zhou p.298 col.2 1st paragraph: “Our objective is to achieve the best possible detection performance that is comparable to the performance of anomaly detection model applied when all data sets are merged together.” and Zhou p.298 col.2 Section 3.2 Description of combining methods, 1st paragraph: “The major goal of combining local anomaly detection models built at distributed sites is to improve the quality, robustness and prediction performance of the ensemble of the models.”).
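For illustration only, the "average anomaly score" combining method quoted from Zhou, followed by the sorting alternative, might be sketched as below. The function name `average_and_rank` and the list-of-lists layout (one score vector per site model Mj, one entry per test data record) are assumptions of this sketch, not details drawn from Zhou.

```python
def average_and_rank(score_vectors):
    """Average per-site anomaly score vectors, then rank records.

    `score_vectors[j][i]` is assumed to hold the score that site model j
    assigned to test record i. Returns the averaged vector and the record
    indices ordered from most anomalous to least anomalous.
    """
    if not score_vectors:
        raise ValueError("at least one score vector is required")
    n_records = len(score_vectors[0])
    if any(len(vector) != n_records for vector in score_vectors):
        raise ValueError("all score vectors must cover the same records")
    # Element-wise mean across sites gives the combined vector ASF.
    averaged = [sum(column) / len(score_vectors) for column in zip(*score_vectors)]
    # Higher score means more anomalous, so sort in descending order.
    ranking = sorted(range(n_records), key=lambda i: averaged[i], reverse=True)
    return averaged, ranking
```

The returned ranking corresponds to the quoted statement that records are ranked "from being most anomalous to less anomalous".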
Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over 
Williams, Jr. et al., U.S. PGPUB 2015/0254555, published 9/10/2015 [hereafter referred as Williams] in view of An et al., Variational Autoencoder based Anomaly Detection using Reconstruction Probability, published December 27, 2015 [hereafter referred as An], in further view of Casas et al., UNADA: Unsupervised Network Anomaly Detection Using Sub-space Outliers Ranking, 2011 IFIP International Federation for Information Processing [hereafter referred as Casas], in even further view of Zhou et al., Distributed Anomaly Detection by Model Sharing, IEEE 2009 [hereafter referred as Zhou] as applied to Claim 6; in even further view of Tuor et al., Deep Learning for Unsupervised Insider Threat Detection in Structured Cybersecurity Data Streams, October 2, 2017 [hereafter referred as Tuor].
Regarding previously presented Claim 7, 
Williams in view of An, in further view of Casas, in even further view of Zhou as applied to Claim 6 teaches
(Previously Presented) The computer-implemented method according to claim 6.
However, Williams in view of An, in further view of Casas, in even further view of Zhou does not teach 
… wherein classifying the collected data further comprises: normalizing the anomaly scores to obtain normalized anomaly scores.  
Tuor teaches
… wherein classifying the collected data further comprises: normalizing the anomaly scores to obtain normalized anomaly scores (Examiner’s note: Tuor teaches using an auto-encoder to perform anomaly detection (Tuor p.4 col.1 last paragraph – p.4 col.2 1st paragraph) and computing a weighted moving average estimate of the mean and variance for anomaly scores and standardizing each score, where the standardization of scores is interpreted as a form of normalization (Tuor p.4 col.2 Detecting Insider Threat, 1st-2nd paragraphs: “We assume the following conditions: our model produces anomaly scores … Because our model is trained in an online fashion, the anomaly scores start out quite large (when the model knows nothing about normal behavior) and trend lower over time (as normal behavior patterns are learned). To place the anomaly score for user u at time t in the proper context, we compute an exponentially weighted moving average estimate of the mean and variance of these anomaly scores and standardize each score as it arrives.”).).  
Both Williams in view of An, in further view of Casas, in even further view of Zhou and Tuor are analogous art since they both teach using auto-encoders for anomaly detection and calculating anomaly scores for the classified outputs.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the scoring process (producing sorted anomaly scores associated with the classified outputs) of Williams in view of An, in further view of Casas, in even further view of Zhou and incorporate the normalization step of Tuor as a way to normalize the sorted anomaly scores for the classified outputs. The motivation to combine is taught by Tuor, as a way to standardize the anomaly scores from data received in an online, real-time system (i.e., non-stationary data) through calculation of a weighted moving average of mean and variance. Giving the anomaly scores the appropriate context and scale relative to other surrounding data occurring within the same time interval will make it easier to perform comparisons and analysis against recent occurring data, thus improving the anomaly detection reliability and accuracy of the system (Tuor p.4 col.2 Detecting Insider Threat, 1st-2nd paragraphs: “We assume the following conditions: our model produces anomaly scores, … Because our model is trained in an online fashion, the anomaly scores start out quite large (when the model knows nothing about normal behavior) and trend lower over time (as normal behavior patterns are learned). To place the anomaly score for user u at time t in the proper context, we compute an exponentially weighted moving average estimate of the mean and variance of these anomaly scores and standardize each score as it arrives.”).
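For illustration only, standardizing each anomaly score against an exponentially weighted moving estimate of mean and variance, as described in the quoted Tuor passage, might be sketched as below. The class name `EwmaStandardizer`, the smoothing factor `alpha`, the particular update recurrences, and the choice to update the estimates before standardizing are all assumptions; the quoted passage does not specify them.

```python
import math


class EwmaStandardizer:
    """Standardize a stream of scores with running EWMA mean and variance."""

    def __init__(self, alpha=0.1, eps=1e-8):
        self.alpha = alpha  # assumed smoothing factor, 0 < alpha <= 1
        self.eps = eps      # guards against division by zero
        self.mean = None
        self.var = 0.0

    def update(self, score):
        """Fold `score` into the estimates and return its standardized value."""
        if self.mean is None:
            # The first score only initializes the running mean.
            self.mean = score
            return 0.0
        diff = score - self.mean
        self.mean += self.alpha * diff
        self.var = (1.0 - self.alpha) * (self.var + self.alpha * diff * diff)
        return (score - self.mean) / math.sqrt(self.var + self.eps)
```

Because older scores decay geometrically, early large scores from an untrained model have diminishing influence on the standardization of later scores, which is consistent with the online setting described in the quotation.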
Claims 8-10, 12, and 14 are rejected under 35 U.S.C. 103 as being unpatentable over 
Williams, Jr. et al., U.S. PGPUB 2015/0254555, published 9/10/2015 [hereafter referred as Williams] in view of An et al., Variational Autoencoder based Anomaly Detection using Reconstruction Probability, published December 27, 2015 [hereafter referred as An], in further view of Casas et al., UNADA: Unsupervised Network Anomaly Detection Using Sub-space Outliers Ranking, 2011 IFIP International Federation for Information Processing [hereafter referred as Casas], in even further view of Zhou et al., Distributed Anomaly Detection by Model Sharing, IEEE 2009 [hereafter referred as Zhou], in even further view of Tuor et al., Deep Learning for Unsupervised Insider Threat Detection in Structured Cybersecurity Data Streams, October 2, 2017 [hereafter referred as Tuor] as applied to Claim 7; in even further view of Elovici et al., WO2018/037411, filed 8/23/2017 [hereafter referred as Elovici].
Regarding original Claim 8, 
Williams in view of An, in further view of Casas, in even further view of Zhou, in even further view of Tuor as applied to Claim 7 teaches
(Original) The computer-implemented method according to claim 7.
 However, Williams in view of An, in further view of Casas, in even further view of Zhou, in even further view of Tuor does not teach
… wherein classifying the collected data further comprises: thresholding the normalized anomaly scores to obtain a selection of anomaly scores and a corresponding selection of data points.  
Elovici teaches
… wherein classifying the collected data further comprises: thresholding the normalized anomaly scores to obtain a selection of anomaly scores and a corresponding selection of data points (Examiner’s note: Elovici teaches performing comparison of normalized anomaly scores against thresholds to determine whether a sequence ‘p’ (“data point”) is considered anomalous (Elovici [0042]), where data sequences ‘p’ consist of temporal series of multi-valued events (Elovici Summary [0002]), thus generating a selection of data points (Elovici Figure 5, elements 245, 255, 265; and [0050]: “If at step 245, the gain is greater than the test threshold, typically zero, then "p" may have sufficient affinity to be considered "normal". This is determined by a calculation at a step 255. An anomaly score, AS, for "p" is calculated, as described above, and the score is compared with a threshold. Typically the score is normalized within a range of 0 to 1. If the AS if greater than the threshold, then "p" is considered anomalous. … At a step 265, following either step 255 or 260, the result of the anomaly test is output to a system that will apply the test, typically a classification system in one of the domains described above (e.g., machine testing, computer operations, behavior analysis, etc.).”).).  
Both Williams in view of An, in further view of Casas, in even further view of Zhou, in even further view of Tuor and Elovici are analogous art since they both teach anomaly detection of data sequences and calculating anomaly scores for the classified outputs.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the scoring process (producing sorted, normalized scores associated with the classified outputs) of Williams in view of An, in further view of Casas, in even further view of Zhou, in even further view of Tuor and incorporate the thresholding step of Elovici as a way to obtain a final selection of anomaly scores with associated classified outputs. The motivation to combine is taught by Elovici, as providing thresholds allows a user the ability to specify additional constraints on a set of data sequences (“data points”) in order to focus on a set of detected patterns that may be deemed as being more anomalous than others, thereby improving the anomaly detection reliability and accuracy of the system (Elovici Summary [0004]: “… generating the second support level may further include determining a second set of patterns in the second set of interaction sequences, patterns of each set may satisfy one or more pre-defined constraints, and the first and second support levels are indicative of the incidence of the first and second sets of patterns in the respective first and second sets of interactive sequences. … In some embodiments, the interaction sequences are temporally ordered, and the one or more pre-defined constraints include a sustainability constraint that a pattern shall appear as a common motif in interaction sequences generated within a predefined period of time. Alternatively or additionally, the one or more pre-defined constraints may include a frequency constraint that a pattern shall appear as a common motif within a minimum number of interaction sequences. 
Alternatively or additionally, the one or more pre-defined constraints include a recognition constraint that an aggregate affinity measure of the pattern shall exceed a pre-defined threshold, include the aggregate affinity measure is an aggregation of all of the affinities represented by the pattern.”).
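For illustration only, comparing normalized anomaly scores against a threshold to obtain a selection of scores and corresponding data points might be sketched as below. The function name `select_anomalous` and the default threshold of 0.5 are hypothetical; Elovici is cited above only for normalizing the score within a range of 0 to 1 and comparing it with a threshold.

```python
def select_anomalous(points, normalized_scores, threshold=0.5):
    """Return (point, score) pairs whose normalized score exceeds `threshold`.

    Scores are assumed to be already normalized to the range 0 to 1.
    """
    if len(points) != len(normalized_scores):
        raise ValueError("each data point needs exactly one score")
    return [
        (point, score)
        for point, score in zip(points, normalized_scores)
        if score > threshold
    ]
```

The strict comparison follows the quoted wording that a sequence is considered anomalous when its score is greater than the threshold.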
Regarding original Claim 9, 
Williams in view of An, in further view of Casas, in even further view of Zhou, in even further view of Tuor, in even further view of Elovici teaches
(Original) The computer-implemented method according to claim 8, wherein classifying the collected non-stationary data further comprises: 
… feeding the selection of data points into a supervised, machine learning model, for it to further classify the selection of data points (Examiner’s note: As indicated earlier, Williams teaches the network computer/sensor computer system (Williams Figure 3, elements 300, 302, 326; [0070]-[0072]) includes a training process using a model structure that defines the machine learning model used for classification and analysis in the system, which can include unsupervised and supervised models as well as including a combination of models, where the auto-encoder represents an inference model performing the initial data classification, the outputs of the auto-encoder represent the selected outputs resulting from the inference model (“selecting outputs from the classification performed thanks to the inference model”), and the k-nearest neighbor model represents a supervised machine learning model classifier for detecting anomalies (where under the broadest reasonable interpretation, the detection of anomalies is considered a form of “further classification”). Given the specified model combination of the auto-encoder, followed by a k-nearest-neighbor model, it logically follows that the outputs from the auto-encoder will be selected to be used as inputs into the k-nearest-neighbor-model (Williams Figure 5, elements 512, 516; [0092], [0095], and [0097]: “Model Structure 516 may also include a specification of a combination of the machine learning models described above, together with additional machine learning models that consume the output of DLNN models. For example, configuring an auto-encoder to reduce the dimensionality of input data, followed by a k-Nearest-Neighbor model used to detect anomalies in the reduced dimensionality space.”).) …  
… whereby said anomalies are detected based on outputs from the supervised model (Examiner’s note: Casas teaches applying the DBSCAN clustering algorithm (which is a form of a nearest-neighbors algorithm) by performing a query on a subset of selected data points Xi of lower dimension (Casas p.45 1st paragraph) provided into the DBSCAN algorithm (where the data points Xi were selected through a set of constraints for a particular set of k features out of possible m attributes), and getting a set of clusters Pi (each of which represents a class label for those data points within each cluster) and an associated set of q(i) outliers (“said anomalies are detected based on outputs from the supervised model”), where the DBSCAN clustering algorithm represents a “supervised, machine learning model” as it is performing further classification of the data points into clusters (Casas pp.44 last paragraph – p.45 1st paragraph (Section 4.1 Clustering Ensemble and Sub-space Clustering): “Each of the N sub-spaces Xi ⊂ X is obtained by selecting k features from the complete set of m attributes. … Each partition Pi is obtained by applying DBSCAN [13] to sub-space Xi. DBSCAN is a powerful clustering algorithm that discovers clusters of arbitrary shapes and sizes [7], relying on a density-based notion of clusters: clusters are high-density regions of the space, separated by low-density areas. This algorithm perfectly fits our unsupervised traffic analysis, because it is not necessary to specify a-priori difficult to set parameters such as the number of clusters to identify. Results provided by applying DBSCAN to sub-space Xi are twofold: a set of p(i) clusters {$C_1^i$, $C_2^i$, .., $C_{p(i)}^i$} and a set of q(i) outliers {$o_1^i$, $o_2^i$, .., $o_{q(i)}^i$}.”).).
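For illustration only, detecting anomalies with a k-nearest-neighbor model in a reduced-dimensionality space, as in the Williams [0097] example cited for Claim 9, might be sketched as below. The function name `knn_anomaly_scores`, the use of the distance to the k-th nearest neighbor as the score, and Euclidean distance are assumptions of this sketch. The input points are assumed to have already been reduced by an auto-encoder, which is not modeled here.

```python
import math


def knn_anomaly_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbor.

    `points` is a list of equal-length numeric tuples, assumed to be the
    reduced-dimensionality outputs of an upstream auto-encoder.
    """
    if len(points) <= k:
        raise ValueError("need more than k points")
    scores = []
    for i, point in enumerate(points):
        # Distances to every other point, nearest first.
        distances = sorted(
            math.dist(point, other) for j, other in enumerate(points) if j != i
        )
        scores.append(distances[k - 1])
    return scores
```

Points in dense regions receive small scores, while isolated points receive large ones, so the largest scores indicate candidate anomalies.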
Regarding previously presented Claim 10, 
Williams in view of An, in further view of Casas, in even further view of Zhou, in even further view of Tuor, in even further view of Elovici teaches
(Previously Presented) The computer-implemented method according to claim 9, wherein: 
the supervised model is configured as a nearest-neighbor classifier (Examiner’s note: Casas teaches applying the DBSCAN clustering algorithm (which is a form of a nearest-neighbors algorithm) by performing a query on a subset of selected data points Xi of lower dimension (Casas p.45 1st paragraph) provided into the DBSCAN algorithm (where the data points Xi were selected through a set of constraints for a particular set of k features out of possible m attributes), and getting a set of clusters Pi (each of which represents a class label for those data points within each cluster) and an associated set of q(i) outliers (“said anomalies are detected based on outputs from the supervised model”), where the DBSCAN clustering algorithm represents a “supervised, machine learning model” as it is performing further classification of the data points into clusters (Casas pp.44 last paragraph – p.45 1st paragraph (Section 4.1 Clustering Ensemble and Sub-space Clustering): “Each of the N sub-spaces Xi ⊂ X is obtained by selecting k features from the complete set of m attributes. … Each partition Pi is obtained by applying DBSCAN [13] to sub-space Xi. DBSCAN is a powerful clustering algorithm that discovers clusters of arbitrary shapes and sizes [7], relying on a density-based notion of clusters: clusters are high-density regions of the space, separated by low-density areas. This algorithm perfectly fits our unsupervised traffic analysis, because it is not necessary to specify a-priori difficult to set parameters such as the number of clusters to identify. Results provided by applying DBSCAN to sub-space Xi are twofold: a set of p(i) clusters {$C_1^i$, $C_2^i$, .., $C_{p(i)}^i$} and a set of q(i) outliers {$o_1^i$, $o_2^i$, .., $o_{q(i)}^i$}.”).), and 
wherein further classifying the selection of data points comprises: querying, for each data point of said selection of data points fed into the supervised model, nearest-neighbors of said each data point in the selection of data points, wherein the nearest-neighbor is based on a computed distance of said each data point (Examiner’s note: Casas p.46 Algorithm 1 teaches running the DBSCAN clustering algorithm (representing a nearest-neighbor algorithm) to search through the selection of data points (Casas p.46 Algorithm 1, lines 4-10), where this search process represents a form of querying the selection of data points. Casas teaches that this search process classifies the selection of data points by constructing a dissimilarity vector to accumulate the distance between different outliers found in each sub-space, and the centroid of the corresponding subspace to identify flows that are far from the normal operation traffic (where this distance calculation represents a process for identifying nearest-neighbors based on a computed distance of each data point). Casas teaches that the distances are computed and added to the dissimilarity vector D, and weighting factors are applied to this computed distance (Casas p.46 Algorithm 1 line 9) to produce a ranking (Algorithm 1 line 11) based on the data points included within dissimilarity vector D, using the distances compared against the existing clusters $C_{max}^{i}$, where clusters $C_{max}^{i}$ represent class groupings with the identified nearest-neighbors (Casas pp.45-46 Section 4.2 Ranking Outliers Using Evidence Accumulation: “… instead of producing a similarity measure between the n different aggregated flows described in X, EA4RO constructs a dissimilarity vector D ∈ $\mathbb{R}^{n}$ in which it accumulates the distance between the different outliers $o_{j}^{i}$ found in each sub-space i = 1, ..,N and the centroid of the corresponding sub-space-biggest-cluster $C_{max}^{i}$. The idea is to clearly highlight those flows that are far from the normal-operation traffic at each of the different sub-spaces, statistically represented by $C_{max}^{i}$. Algorithm 1 presents a pseudo-code for EA4RO … The weighting factor wi is used as an outlier-boosting parameter, as it gives more relevance to those outliers that are “less probable”: wi takes bigger values when the size $n_{max_i}$ of cluster $C_{max}^{i}$ is closer to the total number of flows n. we compute a Mahalanobis distance $d_{M}$ between outliers and the centroid of the biggest cluster. The Mahalanobis distance takes into account the correlation between samples, dividing the standard Euclidean distance by the variance of the samples. This permits to boost the degree of abnormality of an outlier when the variance of the samples is smaller … In the last part of EA4RO, flows are ranked according to the dissimilarity obtained in D, and the anomaly detection threshold Th is set.”).).  
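For illustration only, the evidence-accumulation ranking described in the Casas passage quoted above can be sketched as follows. This is a minimal sketch under stated assumptions and is not a reproduction of Casas Algorithm 1: the function name `accumulate_dissimilarity`, the use of label -1 to mark an outlier (a convention borrowed from common DBSCAN implementations), the weight formula `n / (n - n_max + eps)`, and the per-dimension variance normalization (a diagonal simplification of the Mahalanobis distance) are all assumptions chosen only to match the qualitative behavior stated in the quotation.

```python
from collections import Counter
from math import sqrt


def accumulate_dissimilarity(subspaces, labels_per_subspace, eps=1e-9):
    """Accumulate a dissimilarity score D(j) for each of n flows.

    subspaces: list of N sub-spaces, each a list of n points (tuples of floats).
    labels_per_subspace: list of N label lists; -1 marks an outlier (assumed convention).
    """
    n = len(subspaces[0])
    D = [0.0] * n
    for points, labels in zip(subspaces, labels_per_subspace):
        clustered = [lab for lab in labels if lab != -1]
        if not clustered:
            continue  # no cluster in this sub-space, so nothing to compare against
        # Biggest cluster of the sub-space and its size n_max.
        biggest, n_max = Counter(clustered).most_common(1)[0]
        members = [p for p, lab in zip(points, labels) if lab == biggest]
        dims = len(members[0])
        centroid = [sum(p[k] for p in members) / n_max for k in range(dims)]
        # Per-dimension variance of the biggest cluster (diagonal simplification).
        var = [sum((p[k] - centroid[k]) ** 2 for p in members) / n_max + eps
               for k in range(dims)]
        # Assumed weight: grows as n_max approaches the total number of flows n.
        w = n / (n - n_max + eps)
        for j, (p, lab) in enumerate(zip(points, labels)):
            if lab == -1:
                d = sqrt(sum((p[k] - centroid[k]) ** 2 / var[k] for k in range(dims)))
                D[j] += w * d
    return D


# One sub-space, four flows; the last flow is marked as an outlier.
D = accumulate_dissimilarity([[(0.0,), (1.0,), (2.0,), (10.0,)]], [[0, 0, 0, -1]])
ranking = sorted(range(len(D)), key=lambda j: D[j], reverse=True)
```

In this sketch only flows labeled as outliers accumulate a non-zero score, so sorting D in descending order places them at the top of the ranking.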
Regarding previously presented Claim 12, 
Williams in view of An, in further view of Casas, in even further view of Zhou, in even further view of Tuor, in even further view of Elovici teaches
(Previously Presented) The computer-implemented method according to claim 10, wherein detecting anomalies further comprises: triggering an anomaly alert based on a rating associated with said each data point (Examiner’s note: Casas teaches applying a threshold on $D_{rank}$ to detect attacks and generating anomaly alerts, where this $D_{rank}$ represents the ranking based on the data points included within dissimilarity vector D (Casas p.46 Algorithm 1 lines 9, 11), and this threshold on $D_{rank}$ represents triggering an anomaly alert based on a rating (Casas p.46 2nd paragraph (Section 4.2 Ranking Outliers Using Evidence Accumulation): “… flows are ranked according to the dissimilarity obtained in D, and the anomaly detection threshold Th is set. The computation of Th is simply achieved by finding the value for which the slope of the sorted dissimilarity values in Drank presents a major change. In the evaluation section we explain how to perform this computation with an example of real traffic analysis. Anomaly detection is finally done as a binary thresholding operation on D: if D(i) > Th, UNADA flags an anomaly in flow yi.”; and p.48 Section 5.2 Detecting Attacks in MAWI Traffic, 1st paragraph, and p.49 Figure 2.a: “Setting the detection threshold according to the previously discussed approach results in Th1 . Indeed, if we focus on the shape of the ranked dissimilarity in figure 2.(a), we can clearly appreciate a major change in the slope after the 5th ranked flow. Note however that both attacks can be easily detected and isolated from the anomalous but yet legitimate traffic without false alarms, using for example the threshold Th2 on D.”).).  
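For illustration only, the binary thresholding operation described in the Casas passage quoted above (flagging flow i when D(i) > Th) can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the rule used to locate the "major change" in slope (the largest drop between consecutive ranked values, with Th placed at the midpoint of that drop) is an assumed heuristic, since the quoted passage does not specify the exact computation.

```python
def slope_change_threshold(D):
    """Return a threshold Th placed at the largest drop in the sorted dissimilarities."""
    ranked = sorted(D, reverse=True)  # plays the role of D_rank
    if len(ranked) < 2:
        raise ValueError("need at least two dissimilarity values")
    # Drop between each pair of consecutive ranked values.
    drops = [ranked[k] - ranked[k + 1] for k in range(len(ranked) - 1)]
    k = max(range(len(drops)), key=drops.__getitem__)
    # Assumed choice: midpoint between the two values on either side of the drop.
    return (ranked[k] + ranked[k + 1]) / 2


def flag_anomalies(D, th):
    """Binary thresholding: indices i with D(i) > Th are flagged as anomalies."""
    return [i for i, d in enumerate(D) if d > th]


D = [0.1, 9.0, 0.2, 8.5, 0.0]
th = slope_change_threshold(D)
alerts = flag_anomalies(D, th)  # flows 1 and 3 exceed the threshold
```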
Regarding original Claim 14, 
Williams in view of An, in further view of Casas, in even further view of Zhou, in even further view of Tuor, in even further view of Elovici teaches
(Original) The computer-implemented method according to claim 10, wherein: the supervised model is coupled to a validation expert system (Examiner’s note: Williams teaches a domain expert 530 receiving information from scoring process 522 (producing anomaly scores) from outputs from the model(s) 518 (implemented as a combination of an unsupervised auto-encoder model followed by the supervised nearest-neighbor model) and (optionally) combination function 520; this connection flow of elements 518, (520), 522, and 530 represents a coupling between the supervised model and the domain expert (representing a validation expert system) (Williams Figure 5, elements 518, 520, 522, 530; and [0100]-[0102]: “… once the Model(s) 518 have been stored, a test of the system's performance will execute prior to any runtime scoring. Both testing and runtime scoring utilize Scoring Process 522, which applies Model(s) 518 to the input data and executes Combination Function 520 to select the correct predicted classification, when appropriate. … Scoring Process 522 assigns a score to incoming data, ranking said data as a member of a class (or label), or as an anomalous data point. Runtime scoring delivers new data to the Scoring Process and makes those results available to the Domain Expert Analysis component 530.”).), and 
wherein the method further comprises: feeding the validation expert system with a sample of outputs from the supervised model, said outputs comprising data points as further classified by the supervised model, for the validation expert system to validate anomaly ratings associated to data points corresponding to said sample (Examiner’s note: Williams teaches a domain expert decision 536 within the domain expert 530 receiving scores from scoring process 522 (“anomaly ratings associated to data points corresponding to said sample”) and the associated anomalous access pattern records (representing said selection of data points), where this process of providing generated scores (produced by data provided from the models 518) to a domain expert represents “feeding the validation expert system with a sample of outputs from the supervised model, said outputs comprising data points as further classified by the supervised model”. Williams further teaches the domain expert is used to make a decision whether to retrain on new data or maintain the existing data for recordkeeping, where the process of making that decision based on the received scores/ratings is a form of validating the anomaly ratings to see whether the data points indicate a valid threat that needs to be further handled and alerted (Williams [0218]-[0221]: “… anomaly detectors may be trained such that the length of the sequence of access pattern records may vary and multiple time windows may be used to analyze the data. … if Model(s) 518 are trained with the Training Corpus 508, new data may be ingested upon its availability from the file, database, and application servers and delivered to the Scoring Process 522. 
If the sequence of recent access records may be classified as similar to a known pattern of authorized usage, or as an anomaly, then the access record may be called to a security Domain Experts attention using User Interface 532 and Alerts 534, or processed automatically for further action with Decision 536 or via an external method. … In at least one of the various embodiments, if there may be a pending security investigation or anomalous access pattern detected for a given user, group, or content area, the old data may be maintained and the Model(s) 518 not retrained until it is certain that the new data does not represent unauthorized usage or an anomalous pattern of behavior. Feedback gathered during Domain Expert Decision 536 may also considered if deciding when it may be appropriate to retrain on new access record data.”).).  
Claim 21 is rejected under 35 U.S.C. 103 as being unpatentable over 
Williams, Jr. et al., U.S. PGPUB 2015/0254555, published 9/10/2015 [hereafter referred as Williams] in view of An et al., Variational Autoencoder based Anomaly Detection using Reconstruction Probability, published December 27, 2015 [hereafter referred as An], in further view of Casas et al., UNADA: Unsupervised Network Anomaly Detection Using Sub-space Outliers Ranking, 2011 IFIP International Federation for Information Processing [hereafter referred as Casas] as applied to Claim 1; in even further view of Iyer et al., U.S. PGPUB 2007/0220034, published 9/20/2007 [hereafter referred as Iyer].
Regarding new Claim 21, 
Williams in view of An, in further view of Casas as applied to Claim 1 teaches 
(New) The computer-implemented method according to claim 1.
While Williams in view of An, in further view of Casas teaches retraining models on a periodic basis (Williams [0136]), Williams in view of An, in further view of Casas does not explicitly teach
… wherein replacing the inference model, as currently used to classify the non-stationary data, with the trained model occurs based on a periodic interval.
Iyer teaches
… wherein replacing the inference model, as currently used to classify the non-stationary data, with the trained model occurs based on a periodic interval (Examiner’s note: Under its broadest reasonable interpretation, this limitation broadly recites physically or logically changing or updating one model with another model based on a periodic interval. Iyer teaches updating an existing model based on scheduled times for updating a model, where the updates to a model are in response to receiving evolving source and/or training data (corresponding to non-stationary data). Iyer further teaches the scheduling is based on a sliding window mechanism that defines the interval of a regular update schedule. Iyer further teaches a system containing one or more models that includes a learning and reasoning component that learns system behaviors and communicates with the automatic adjustment and event detection components to perform scheduling adjustments and trigger the selection of the appropriate model for implementation upon receipt of the scheduled event. Combining the teachings of the Williams reference involving choosing the classification result between a fast learning model and a deep learning neural network model, this process of updating a model according to a regular periodic schedule corresponds to a replacement of a model with another trained model based on a periodic interval (Iyer [0030]: “The disclosed innovation allows data mining system to automatically maintain up-to-date mining models in realtime with respect to evolving source and/or training data …”; [0035]-[0036]: “At 200, a data mining model is developed and trained on a dataset. At 202, an event is detected which triggers an automatic (and realtime) update process for updating the existing model. 
At 204, the model is updated … The event detection component 302 can detect predetermined events such as scheduled times for updating …”; [0038]-[0039]: “… the window of time can be three months in duration … The window can be adjusted further based on additional criteria such as how often the data changes or how much the data changes over a given time period … the sliding window update process implements model updating on a regular basis regardless of whether the model needs updating at all …”; and Figure 10, [0044], [0047]-[0049], [0053]: “… consider a trained mining model that is applied against data extracted in a 5-month wide sliding window, which is being moved every two weeks. Based on a qualitative description parameter that is a measure of how well the model describes the data, or a prediction parameter that provides some measure of how well the trained model predicts data patterns or behavior, the LR component learn and reason to make adjustments to sliding window parameters accordingly … if the description measure falls below a predetermined level,  the LR component can control the automatic adjustment component to reduce the window width to four months in an attempt to improve the measure …”; and [0056]: “… the LR component 1002 can learn that a first model performs better over another model even though the underperforming model is a most recently trained version. Accordingly, the first model can be retrained until a better model has been created tested and trained for implementation.”).).
Both Williams in view of An, in further view of Casas and Iyer are analogous art since they both teach selection of machine learning models based on receipt of non-stationary data to identify and predict patterns and behaviors in the data.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the Combination Function component that selects the classification result between the Fast Learning Model and the DLNN model taught in Williams in view of An, in further view of Casas and incorporate the model update and automatic adjustment functionality taught in Iyer as a way to periodically refresh the executing models and improve overall model performance. The motivation to combine is taught in Iyer, since updating models on a periodic basis allows the models to reflect the latest changes in the behavior and patterns in the data, thus improving the model performance and keeping the model up-to-date over time (Iyer [0004]: “Mining models are trained to ensure viability over the changing patterns in data. However, such mining models can quickly become outdated if not periodically updated to reflect changes in the behavior of the entities being modeled.”; and [0030]: “The disclosed innovation allow data mining systems to automatically maintain up-to-date mining models in realtime with respect to evolving source and/or training data.”).
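For illustration only, the periodic replacement discussed above (a model retrained on a sliding window of recent data and swapped in on a regular schedule, as in the Iyer example of a 5-month window moved every two weeks) can be sketched as follows. This is a minimal sketch under stated assumptions: the class name `PeriodicModelSwapper`, its parameters, and the caller-supplied `train_fn` are hypothetical and do not appear in Iyer, Williams, or the claims, and the 150-day window is only an approximation of five months.

```python
from datetime import datetime, timedelta


class PeriodicModelSwapper:
    """Replace the active model at fixed intervals with one trained on a sliding window."""

    def __init__(self, train_fn, initial_model, start,
                 window=timedelta(days=150), step=timedelta(days=14)):
        self.train_fn = train_fn        # hypothetical: maps a list of data to a model
        self.model = initial_model      # model currently used for inference
        self.window = window            # width of the sliding window of training data
        self.step = step                # periodic interval between replacements
        self.next_update = start + step
        self.records = []               # (timestamp, datum) pairs

    def observe(self, timestamp, datum):
        """Store a datum; return True if the active model was replaced."""
        self.records.append((timestamp, datum))
        if timestamp < self.next_update:
            return False
        # Keep only data inside the sliding window, then retrain and swap.
        cutoff = timestamp - self.window
        self.records = [(t, d) for t, d in self.records if t >= cutoff]
        self.model = self.train_fn([d for _, d in self.records])
        # Advance the schedule past the current timestamp.
        while self.next_update <= timestamp:
            self.next_update += self.step
        return True


start = datetime(2022, 1, 1)
swapper = PeriodicModelSwapper(lambda data: sum(data) / len(data), 0.0, start)
swapper.observe(start + timedelta(days=1), 10.0)    # before the first interval elapses
swapper.observe(start + timedelta(days=14), 20.0)   # interval elapsed, model replaced
```

Here the "model" is simply the mean of the windowed data, which keeps the sketch self-contained; any training routine could be passed as `train_fn`.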

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to WILLIAM WAI YIN KWAN whose telephone number is 303-297-4332. The examiner can normally be reached Monday-Friday 8:00am - 4:30pm PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li B Zhen can be reached on 571-272-3768. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/WILLIAM WAI YIN KWAN/Examiner, Art Unit 2121
/Li B. Zhen/Supervisory Patent Examiner, Art Unit 2121