Detailed Action
Remarks
This Office Action is responsive to Applicant’s Amendment filed on March 25, 2021, in which Claim 1 is amended and Claims 3-5 and 8-20 are cancelled. Claims 21-32 are newly added. Claims 1, 2, 6, 7 and 21-32 are currently pending.
Specification
Applicant’s amendment to the specification is acknowledged.
Response to Arguments
The 35 U.S.C. 102 rejections of Claims 1, 8 and 15 are hereby withdrawn, as necessitated by Applicant’s amendments and remarks made in response to the rejections.
The 35 U.S.C. 103 rejections of Claims 2-4, 9-11 and 16-18 are hereby withdrawn, as necessitated by Applicant’s amendments and remarks made in response to the rejections.
Allowable Subject Matter
After a thorough search and examination of the present application, and in light of the following:
the prior art made of record;
Applicant’s Amendment made on 3/25/2021.
Claims 1-2, 6-7 and 27-32 are allowed. Claims 25-26 are objected to and may be allowable if their limitations are brought into the independent claim.
REASON FOR ALLOWANCE
The following is a statement of reason for the indication of allowable subject matter:
In the Examiner’s Office Action mailed on 12/28/2020, claims 1-4, 8-11 and 15-18 were rejected and claims 5-7, 12-14 and 19-20 were objected to.

As for Claim 1: The following limitations of the claim may be allowable, and Claims 2, 6 and 7 depend on Claim 1.
“calculating an accumulated layer activation for each layer of the set of layers for the selected model; and calculating a learning rate for each layer of the set of layers, wherein the learning rate is inversely proportional to the corresponding accumulated layer activation rate”
Explanation of novelty: The total calculated activation strength of a layer, i.e., the sum of the activations (neuron outputs) in the layer, has an inverse relationship with the calculated learning rate of that layer.
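As an illustration of the inverse relationship described above, the following minimal Python sketch computes an accumulated activation per layer as the sum of that layer's neuron outputs and sets each layer's learning rate to a constant divided by that sum. The function names, the constant `base`, the `eps` guard, and the sample activations are assumptions made for illustration only; they are not taken from the claims or the specification.

```python
def accumulated_layer_activation(activations):
    """Sum of the neuron outputs (activations) of one layer."""
    return sum(activations)


def layer_learning_rates(layer_activations, base=0.01, eps=1e-12):
    """Return one learning rate per layer, inversely proportional to the
    layer's accumulated activation: rate = base / accumulated.

    `base` (the proportionality constant) and `eps` (a guard against
    division by zero) are illustrative assumptions.
    """
    rates = []
    for activations in layer_activations:
        accumulated = accumulated_layer_activation(activations)
        rates.append(base / max(accumulated, eps))
    return rates


# Hypothetical activations for a three-layer model.
layers = [[0.5, 1.5], [1.0, 3.0], [4.0, 4.0]]
rates = layer_learning_rates(layers)
```

Under these assumptions, a layer whose accumulated activation is twice as large receives half the learning rate.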
Existing prior art: The learning rate is usually chosen, rather than calculated, as an arbitrary value between, for example, 0.01 and 0.0001, either by a human or automatically by software. The value must be neither too high nor too low if the loss is to be minimized; selecting it is essentially trial and error, and it is usually updated throughout training. Some documents found in the search do calculate the learning rate, such as US-20140229476-A1 [0041] and US-6452870-B1 (see Extended DBD on page 40), but not in accordance with each layer’s total activation strength.
No prior art was found concerning accumulated layer activation, which under the broadest reasonable interpretation (BRI) can be the sum of the activations/neuron outputs in a layer or the activation strength of a layer; either phrase is applicable.
Examiner identified one NPL (“How to train your Neural Network” by Pratik Bhavsar) that discusses how a learning rate that is too high can “kill” about 40% of the network (neurons 
As for Claim 25: This Claim is allowed based on the same reason as stated above for Claim 1.
As for Claim 26:
“The method according to claim 21, wherein the activation score is dependent on a corresponding activation divided by an average of maximum activations of each layer of the chosen model.”
Explanation of novelty: Each neuron’s activation strength value in the layer is added up and the mean is found; then each neuron’s activation strength value is divided by this mean to get the activation score for each neuron.
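The computation described in this explanation can be sketched in Python as follows. The sketch follows the Examiner's paraphrase (each neuron's activation divided by the mean activation of its layer) and not necessarily the exact claim wording, which recites an average of maximum activations; the function name and the sample values are hypothetical.

```python
def activation_scores(layer_activations):
    """Divide each neuron's activation by the mean activation of the layer."""
    if not layer_activations:
        return []
    mean = sum(layer_activations) / len(layer_activations)
    if mean == 0:
        # Assumption: when the mean is zero, return zero scores
        # rather than dividing by zero.
        return [0.0 for _ in layer_activations]
    return [a / mean for a in layer_activations]


# Hypothetical activations of a three-neuron layer; the mean is 2.0.
scores = activation_scores([1.0, 2.0, 3.0])
```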
As for Claim 27: This Claim is allowed based on the same reason as stated above for Claim 26.
Existing prior art: Versions of “activation score” were identified by examiner during search (such as KR-20150036176-A and US-20060242147-A1) but none of them were calculated in the particular way described in the claim and some versions are not related to neurons.
The cited prior arts do not teach or fairly suggest the above limitations in combination with other limitations in the independent claims respectively.
After a search and a thorough examination of the present application in light of the prior art, Claims 1-2, 6-7 and 27-32 are allowed.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 21 is rejected under 35 U.S.C. 103 as being unpatentable over Raschka (“Model evaluation, model selection, and algorithm selection in machine learning”), hereinafter “Raschka”, in view of Shankar et al. (“Refining Architectures of Deep Convolutional Neural Networks”), hereinafter “Shankar”.
Regarding Claim 21, Raschka teaches a processor-implemented method for model selection for training a new dataset, the method comprising: (Raschka teaches on page 3 under [Performance Estimation]: “Running a learning algorithm over a training dataset with different hyperparameter settings will result in different models. Since we are typically interested in selecting the best-performing model from this set, we need to find a way to estimate their respective performances in order to rank them against each other.” Raschka teaches on page 7 under [Resubstitution Validation and the Holdout Method]: “In other words, we can’t tell whether the model simply memorized the training data or not, or whether it generalizes well to new, unseen data. (On a side note, we can estimate this so called optimism bias as the difference between the training accuracy and the test accuracy.)”)
choosing a model from a set of models to be evaluated for training the new dataset; (Raschka teaches on page 3 under [Performance Estimation]: “Running a learning algorithm over a training dataset with different hyperparameter settings will result in different models. Since we are typically interested in selecting the best-performing model from this set, we need to find a way to estimate their respective performances in order to rank them against each other.” Raschka teaches on page 7 under [Resubstitution Validation and the Holdout Method]: “In other words, we can’t tell whether the model simply memorized the training data or not, or whether it generalizes well to new, unseen data. (On a side note, we can estimate this so called optimism bias as the difference between the training accuracy and the test accuracy.)”)
selecting a sample input from a subset of the new dataset; (Raschka teaches on page 10 under [Holdout]: “In the first step, we randomly divide our available data into two subsets: a training and a test set… Here, the test set shall represent new, unseen data to our learning algorithm; it’s important that we only touch the test set once to make sure we don’t introduce any bias when we estimate the generalization accuracy.”)
calculating a model activation score for each of the sample inputs in the chosen model; (Raschka teaches on page 4 under [0-1 loss and prediction accuracy]: “In the following article, we will focus on the prediction accuracy, which is defined as the number of all correct predictions divided by the number of samples. We compute the prediction accuracy as the number of correct predictions divided by the number of samples n. Our objective is to learn a model h that has a good generalization performance.” One correct prediction score per sample for the model h can be referred to as a model activation score for each sample.)
calculating an accumulated model activation score for the chosen model, depending on the model activation score of each of the sample inputs in the chosen model; (Raschka teaches on page 4 under [0-1 loss and prediction accuracy]: “In the following article, we will focus on the prediction accuracy, which is defined as the number of all correct predictions divided by the number of samples. We compute the prediction accuracy as the number of correct predictions divided by the number of samples n. Our objective is to learn a model h that has a good generalization performance.” Raschka teaches on page 3 under [Performance Estimation]: “We want to identify the machine learning algorithm that is best-suited for the problem at hand; thus, we want to compare different algorithms, selecting the best-performing one as well as the best performing model from the algorithm’s hypothesis space.” The prediction accuracy can be treated as the AMAS since it is a scoring value based on an accumulation of the prediction values of all the samples for the model h. This suggests the same process as explained above is repeated for each model in order to compare them.)
calculating an accumulated model activation score for each model from the set of models to be evaluated for training the new dataset; (Raschka teaches on page 4 under [0-1 loss and prediction accuracy]: “In the following article, we will focus on the prediction accuracy, which is defined as the number of all correct predictions divided by the number of samples. We compute the prediction accuracy as the number of correct predictions divided by the number of samples n. Our objective is to learn a model h that has a good generalization performance.” Raschka teaches on page 3 under [Performance Estimation]: “We want to identify the machine learning algorithm that is best-suited for the problem at hand; thus, we want to compare different algorithms, selecting the best-performing one as well as the best performing model from the algorithm’s hypothesis space.” The prediction accuracy can be treated as the AMAS since it is a scoring value based on an accumulation of the prediction values of all the samples for the model h. This suggests the same process as explained above is repeated for each model in order to compare them.)
and selecting the model for training the new dataset with the highest accumulated model activation score. (Raschka teaches on page 3 under [Performance Estimation]: “We want to identify the machine learning algorithm that is best-suited for the problem at hand; thus, we want to compare different algorithms, selecting the best-performing one as well as the best performing model from the algorithm’s hypothesis space.” Raschka teaches on page 10 under [Holdout]: “In the first step, we randomly divide our available data into two subsets: a training and a test set. … Here, the test set shall represent new, unseen data to our learning algorithm; it’s important that we only touch the test set once to make sure we don’t introduce any bias when we estimate the generalization accuracy.” This suggests the model with the highest correct prediction score is chosen as the best model after testing on new unseen data.)
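The mapping above reads the per-sample correct-prediction indicator as a model activation score and the prediction accuracy as the accumulated score. A minimal Python sketch of that reading is given below; the candidate models are hypothetical stand-ins represented as plain functions, and the held-out samples are made up for illustration.

```python
def prediction_accuracy(model, samples):
    """Number of correct predictions divided by the number of samples n."""
    correct = sum(1 for x, label in samples if model(x) == label)
    return correct / len(samples)


def select_best_model(models, samples):
    """Return the name of the model with the highest prediction accuracy,
    together with the accuracy of every candidate."""
    scores = {name: prediction_accuracy(m, samples) for name, m in models.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical held-out samples: (input, true label) pairs.
held_out = [(1, 1), (2, 0), (3, 1), (4, 0)]

# Hypothetical candidate models, stand-ins for trained classifiers.
candidates = {
    "always_one": lambda x: 1,
    "odd_is_one": lambda x: x % 2,
}

best_name, accuracies = select_best_model(candidates, held_out)
```

Here the accuracy is computed once per candidate on the same held-out samples, and the candidate with the highest value is selected.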
Raschka teaches all of the elements of the current invention as stated above except that it does not teach choosing a model from a set of models to be evaluated for training the new dataset, wherein the set of models is selected from a group consisting of: AlexNet, Google® GoogLeNet, and VGGNet.
Shankar teaches choosing a model from a set of models to be evaluated for training the new dataset, wherein the set of models is selected from a group consisting of: AlexNet, Google® GoogLeNet, and VGGNet; (Shankar teaches under [Introduction]: “A user has a new sizeable image dataset, which he wants to train with a CNN. He would typically try out famous CNN architectures like AlexNet, GoogleNet, VGG-11, VGG-16, VGG-19 and then select the one which gives maximum accuracy.”)
At the time of filing, it would have been obvious to a person of ordinary skill in the art to combine the teachings of Raschka, who compares multiple models with different settings to pick the best one, with the teachings of Shankar, who “modifies the architecture” of existing models such as AlexNet, GoogleNet and VGGNet to “enhance the accuracy while reducing the model size” (see page 3 of Shankar under [Low-rank and sparsification]) and then selects the model with maximum accuracy.
Claim(s) 22 is rejected under 35 U.S.C. 103 as being unpatentable over Raschka in view of Shankar and further in view of Liu et al. (“Towards Better Analysis of Deep Convolutional Neural Networks”), hereinafter “Liu”.
Regarding Claim 22, the rejection of Claim 21 is incorporated.
	Raschka teaches the method according to claim 21, wherein calculating the model activation score for the each of the sample inputs in the chosen model further comprises: (Raschka teaches on page 4 under [0-1 loss and prediction accuracy]: “In the following article, we will focus on the prediction accuracy, which is defined as the number of all correct predictions divided by the number of samples. We compute the prediction accuracy as the number of correct predictions divided by the number of samples n. Our objective is to learn a model h that has a good generalization performance.” One correct prediction score per sample for the model h.)
Raschka and Shankar teach all of the elements of the current invention as stated above except they do not teach calculating an activation score for the sample input in an activation in a layer of a set of layers of the chosen model; calculating a set of layer activation scores for the sample input, dependent on a sum of the activation scores for each activation of the layer of the set of layers of the chosen model; ranking the set of layer activation scores from highest to lowest; and summing a subset of the set of layer activation scores.
calculating an activation score for the sample input in an activation in a layer of a set of layers of the chosen model; (Liu teaches under [Abstract]: “We formulate a deep CNN as a directed acyclic graph.” A CNN is the model chosen here. Liu teaches on page 5 under [6.2.1 Learned Features as Rectangle Packing]: “We also compute the activations of each neuron on a large set of image patches (e.g., sampled from the training set) and sort the patches in decreasing order according to their activations.” The images are the input sample. The values of the activations computed for each neuron can be referred to as activation scores and the neuron itself can be referred to as an activation as implied in the following passage on page 5 under [6.2.1 Learned Features as Rectangle Packing]: “To help experts better understand the role of each neuron, we select the top-5 patches with the highest activation scores to represent the learned feature of that neuron.” Liu teaches on page 3 under [Architecture.]: “a CNN is typically composed of multiple alternating convolutional and pooling layers, followed by one or several fully connected layers”. This suggests that the CNN contains a set of layers. Liu teaches on page 1 under [1 Introduction]: “First, a CNN may consist of tens or hundreds of layers (depth), thousands of neurons (width) in each layer, as well as millions of connections between neurons.” This suggests that all the neurons, or activations, are located in the layers.)
calculating a set of layer activation scores for the sample input, dependent on a sum of the activation scores for each activation of the layer of the set of layers of the chosen model; (Liu teaches under [Abstract]: “We formulate a deep CNN as a directed acyclic graph.” A CNN is the model chosen here. Liu teaches on page 5 under [6.2.1 Learned Features as Rectangle Packing]: “We also compute the activations of each neuron on a large set of image patches (e.g., sampled from the training set) and sort the patches in decreasing order according to their activations.” The images are the input sample. The values of the activations computed for each neuron can be referred to as activation scores and the neuron itself can be referred to as an activation as implied in the following passage on page 5 under [6.2.1 Learned Features as Rectangle Packing]: “To help experts better understand the role of each neuron, we select the top-5 patches with the highest activation scores to represent the learned feature of that neuron.” Liu teaches on page 3 under [Architecture.]: “a CNN is typically composed of multiple alternating convolutional and pooling layers, followed by one or several fully connected layers”. This suggests that the CNN contains a set of layers. Liu teaches on page 1 under [1 Introduction]: “First, a CNN may consist of tens or hundreds of layers (depth), thousands of neurons (width) in each layer, as well as millions of connections between neurons.” This suggests that all the neurons, or activations, are located in the layers. Liu teaches on page 3 under [Convolution]: “The convolution operation is illustrated in Fig. 3(a), where the value of the green pixel in the output is the weighted sum of the pixels in the green region of the input.” This suggests the activation scores are a result of the weighted sum of the input image pixels. Thus a set of activation scores can be collected for all the neurons in a layer.)
ranking the set of layer activation scores from highest to lowest; (Liu teaches on page 5 under [Computing learned features of neurons.]: “We also compute the activations of each neuron on a large set of image patches (e.g., sampled from the training set) and sort the patches in decreasing order according to their activations.” The values of the activations computed for each neuron can be referred to as activation scores and the neuron itself can be referred to as an activation as implied in the following passage on page 5 under [6.2.1 Learned Features as Rectangle Packing]: “To help experts better understand the role of each neuron, we select the top-5 patches with the highest activation scores to represent the learned feature of that neuron.”)
and summing a subset of the set of layer activation scores. (The values of the activations computed for each neuron can be referred to as activation scores and the neuron itself can be referred to as an activation as implied in the following passage on page 5 under [6.2.1 Learned Features as Rectangle Packing]: “To help experts better understand the role of each neuron, we select the top-5 patches with the highest activation scores to represent the learned feature of that neuron.” Liu teaches under [DAG Formulation] on page 4: “Then we cluster the neurons in each layer, which aims to group neurons with similar roles together. We assume that neurons with similar activations have similar roles.” The cluster of neurons with similar activations in a layer can be referred to as a subset of the set of layer activation scores. Liu teaches on page 5 under [Matrix Reordering]: “The basic idea of our algorithm is to maximize the sum of the similarities between adjacent neurons in the matrix. It aims to place neurons with similar activations close to each other, and thus can reveal the cluster pattern in the neuron cluster.” This suggests the summing of the activation scores of the cluster.)
At the time of filing, it would have been obvious to a person of ordinary skill in the art to modify the method of Claim 21 to include the teachings of Liu who evaluates a set of CNN models to better refine them (See page 9 under [Conclusion]).
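For reference, the four steps recited in Claim 22 (per-activation scores, per-layer sums, ranking, and summing a subset) can be sketched in Python as below. This is an illustrative reading only: the choice of the top `k` layers as the subset, the function name, and the sample values are assumptions, and the sketch does not represent the method of Liu or of the application.

```python
def model_activation_score(layer_activation_scores, k=2):
    """Compute a model activation score for one sample input.

    layer_activation_scores: one list per layer, holding the activation
    score of each activation (neuron) in that layer for the sample input.
    k: number of top-ranked layers to sum (an illustrative assumption).
    """
    # Layer activation score: sum of the activation scores in the layer.
    layer_scores = [sum(scores) for scores in layer_activation_scores]
    # Rank the layer activation scores from highest to lowest.
    ranked = sorted(layer_scores, reverse=True)
    # Sum a subset of the ranked scores (here, the top k).
    return sum(ranked[:k]), ranked


# Hypothetical activation scores for one sample input in a three-layer model.
sample_scores = [[1.0, 2.0], [0.5, 0.5], [3.0, 1.0]]
score, ranked = model_activation_score(sample_scores, k=2)
```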
Claim(s) 23-24 are rejected under 35 U.S.C. 103 as being unpatentable over Raschka in view of Shankar and further in view of Kaastra et al. (“Designing a neural network for forecasting financial and economic time series”), hereinafter “Kaastra”.
Regarding Claim 23, the rejection of Claim 21 is incorporated.
Raschka and Shankar teach all of the elements of the current invention as stated above except they do not teach the method according to claim 21, further comprising: calculating a learning rate for the selected model.
	Kaastra teaches the method according to claim 21, further comprising: calculating a learning rate for the selected model. (Kaastra teaches on page 217 under [1. Introduction]: “First, the architecture of a backpropagation (BP) neural network is briefly discussed. The BP network is used to illustrate the design steps since it is capable of solving a wide variety of problems and it is the most common type of neural network in time series forecasting.” The BP is the selected model.)
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Raschka and Shankar with the teachings of Kaastra, whose objective was “to provide a practical, non-technical introduction to designing a neural network” while focusing on 8 important points such as: variable selection, data collection, data processing, training, testing and validation sets, neural network paradigms, evaluation, neural network training and implementation (see [Summary] on page 234).
Regarding Claim 24, the rejection of Claim 23 is incorporated.
Raschka and Shankar teach all of the elements of the current invention as stated above except they do not teach the method according to claim 23, wherein calculating a learning rate for the selected model further comprises: calculating a set of learning rates corresponding to the set of layers of the selected model.
Kaastra teaches the method according to claim 23, wherein calculating a learning rate for the selected model further comprises: (In table 1 on page 216 Kaastra teaches the parameters of designing a BP network which includes the learning rate per layer. Kaastra teaches on page 217 under [1. Introduction]: “First, the architecture of a backpropagation (BP) neural network is briefly discussed. The BP network is used to illustrate the design steps since it is capable of solving a wide variety of problems and it is the most common type of neural network in time series forecasting.” The BP is the selected model.)
calculating a set of learning rates corresponding to the set of layers of the selected model. (In table 1 on page 216 Kaastra teaches the parameters of designing a BP network which includes the learning rate per layer. Kaastra teaches on page 217 under [1. Introduction]: “First, the architecture of a backpropagation (BP) neural network is briefly discussed. The BP network is used to illustrate the design steps since it is capable of solving a wide variety of problems and it is the most common type of neural network in time series forecasting.” The BP is the selected model.)
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Raschka and Shankar with the teachings of Kaastra, whose objective was “to provide a practical, non-technical introduction to designing a neural network” while focusing on 8 important points such as: variable selection, data collection, data processing, training, testing and validation sets, neural network paradigms, evaluation, neural network training and implementation (see [Summary] on page 234).
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Patent Applications: (US-20150294246-A1) by Guven Kaya, and (EP-3101599-A2) by Ethington James
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to FRANCOIS A NDIAYE whose telephone number is (571)272-9952.  The examiner can normally be reached on M-F 7:30AM-5PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang can be reached on (571) 270-7092.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/FRANCOIS A NDIAYE/Examiner, Art Unit 2124                                                                                                                                                                                                        
/MIRANDA M HUANG/Supervisory Patent Examiner, Art Unit 2124