DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claims 1-30 are presented for examination.

Response to Amendment
Applicant’s amendment has obviated the rejections of the claims under 35 USC § 112(b).  Therefore, those rejections are withdrawn.

Claim Rejections - 35 USC § 103
The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.
Claims 1, 7, 9, 17, 21, and 29 are rejected under 35 U.S.C. 103 as being unpatentable over Papernot et al., “Practical Black-Box Attacks against Machine Learning” (“Papernot”) in view of Li et al., “Urban Flood Mapping with an Active Self-Learning Convolutional Neural Network Based on TerraSAR-X Intensity and Interferometric Coherence,” in 152 ISPRS J. Photogrammetry and Remote Sensing 178-91 (2019) (“Li”) and further in view of Fukuda et al. (US 20200034703) (“Fukuda”).
Regarding claim 1, Papernot discloses “[a] method comprising:
training a new neural network to mimic a target neural network without access to the target neural network or its original training dataset (machine learning models, e.g., DNNs, are vulnerable to adversarial examples – Papernot, abstract (note that DNNs are run on computers with processors and memories/non-transitory computer-readable media); see also p. 506, second paragraph of right-hand column (“We assume the adversary (a) has no information about the structure or parameters of the [target] DNN, and (b) does not have access to any large training dataset.”)) by:
probing the target neural network and the new neural network with input data to generate corresponding data output by one or more layers of the respective target neural network and new neural network (a substitute model F [new neural network] approximating an oracle O [target neural network] is trained by collecting a very small set S0 of inputs [input data] representative of the input domain and for each subsequent iteration ρ the oracle is queried [probed] for the labels [output] in that iteration’s training data set Sρ, and the substitute model is trained using the substitute training set  Sρ [to generate outputs from the substitute] – Papernot, sec. 4.1, first paragraph and five-step algorithm including three bullet points); 
detecting input data (adversary collects a very small set of inputs S0 representative of the input domain – Papernot, sec. 4.1, part (1) of five-step algorithm)…; 
generating a … probe training dataset comprising the input data (adversary collects a very small set of inputs S0 representative of the input domain and generates an initial substitute training set therewith, then inputs each sample in the initial substitute training set to the oracle to query the oracle for the labels output thereby – Papernot, sec. 4.1, parts (1)-(3) of five-step algorithm)…; 
training the new neural network to minimize differences between corresponding data output by the new neural network and the target neural network using the … probe training dataset (by querying the oracle, the adversary labels each sample in the initial substitute training set, then trains the architecture using the substitute training set in conjunction with classical training techniques; the labeling is repeated several times to increase the substitute DNN’s accuracy and the similarity of its decision boundaries with the oracle [so that the substitute/new network thereby minimizes differences between its outputs and those of the target/oracle network] – Papernot, sec. 4.1, five steps of algorithm and paragraph after the algorithm description)…; and 
iteratively training the new neural network using an updated … probe training dataset dynamically adjusted … as the new neural network changes during iterative training (after labeling the substitute training set with the help of the oracle and training the adversary with the substitute training set, the adversary applies an augmentation technique on the initial substitute training set to produce a larger [updated] substitute training set with more synthetic training points; the adversary then iteratively trains more accurate substitute DNNs by repeating the labeling, training, and augmentation steps at up to ρmax timesteps – Papernot, sec. 4.1, three bullet points).”
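For illustration only, the substitute-training procedure of Papernot sec. 4.1, as characterized above, may be sketched as follows. The sketch is not taken from Papernot or from the application. The black-box oracle, the linear softmax substitute standing in for a DNN, the sign-of-gradient augmentation step, and the names oracle, train_substitute, augment, LAMBDA, and RHO_MAX are all assumptions chosen to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)
LAMBDA = 0.1   # augmentation step size (assumed value)
RHO_MAX = 3    # number of substitute training iterations (assumed value)

def oracle(x):
    """Hypothetical black-box target: only its output labels are observable."""
    return (x[:, 0] + x[:, 1] > 0).astype(int)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_substitute(x, y, epochs=200, lr=0.5):
    """Fit a linear softmax classifier by gradient descent (stand-in for a DNN)."""
    w = np.zeros((x.shape[1], 2))
    onehot = np.eye(2)[y]
    for _ in range(epochs):
        p = softmax(x @ w)
        w -= lr * x.T @ (p - onehot) / len(x)
    return w

def augment(x, y, w):
    """Add one synthetic point per sample, stepped along the sign of the
    gradient of the substitute's output for the oracle-assigned label."""
    p = softmax(x @ w)
    # d p_y / d x for a linear softmax: p_y * (w[:, y] - sum_k p_k * w[:, k])
    grad = p[np.arange(len(x)), y][:, None] * (w[:, y].T - p @ w.T)
    return np.vstack([x, x + LAMBDA * np.sign(grad)])

s = rng.normal(size=(20, 2))            # small initial input set S0
for rho in range(RHO_MAX):
    labels = oracle(s)                  # label the current set by querying the oracle
    w = train_substitute(s, labels)     # train the substitute on the labeled set
    if rho < RHO_MAX - 1:
        s = augment(s, labels, w)       # produce the larger set for the next iteration

# Fraction of the final training set on which substitute and oracle agree.
agreement = np.mean(softmax(s @ w).argmax(axis=1) == oracle(s))
```

In this sketch the training set doubles at each augmentation, so the substitute is retrained on a different, larger set at each iteration while only the oracle's output labels are ever consulted.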
Papernot appears not to disclose explicitly the further limitations of the claim.  However, Li discloses “detecting input data that generate [a] maximum difference between corresponding data output by the target neural network and the new neural network (unlabeled sample set is predicted by both student and teacher models, and informative samples are selected by an uncertainty criterion according to the disagreement between the student and teacher models – Li, sec. 2.2.2, first paragraph; the informative samples are the union of the samples for which the predicted results with augmentation differ between the student and teacher models and the samples for which the predicted results without augmentation differ between the student and teacher models [so the generated set produces maximally different results between student and teacher in that it includes all and only those samples that produce divergent results] – id. at p. 182, paragraph labeled (1)); 
generating a divergent probe training dataset comprising the input data that generate the maximum difference in the corresponding data output by the target neural network (unlabeled sample set is predicted by both student and teacher models, and informative samples are selected by an uncertainty criterion according to the disagreement between the student and teacher models – Li, sec. 2.2.2, first paragraph; the informative samples are the union of the samples for which the predicted results with augmentation differ between the student and teacher models and the samples for which the predicted results without augmentation differ between the student and teacher models [so the generated set produces maximally different results between student and teacher in that it includes all and only those samples that produce divergent results] – id. at p. 182, paragraph labeled (1));
training the new neural network … using the divergent probe training dataset detected to generate the maximum difference in the corresponding output data between the new and target neural networks (informative samples are pseudo-labeled using a multi-scale spatial constraint, and consistency regularization is introduced to mitigate noise in the updated samples; the training data set is updated with the pseudo-labeled informative samples [divergent probe dataset] and fed back into the teacher and student networks to output new predictions – Li, parts (2)-(3) on p. 182 and Fig. 3); and
iteratively training the new neural network using a[] dataset … dynamically adjusted to reflect each iteration’s maximum difference training dataset (see Li Fig. 3 and note that the pseudo-labeled informative samples are added to the training dataset at each iteration and the CNN is trained on the new training dataset at each iteration, so the process is iterative; p. 182, paragraph labeled (1) shows that the informative samples that are pseudo-labeled and used to train the CNN iteratively are selected based on the disagreement between the student and teacher models for input data both with and without augmentation [so that at each iteration, the dataset is dynamically adjusted to reflect the new training dataset comprising the informative samples]) ….”
Papernot and Li both relate to teacher-student neural network models and are analogous.  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Papernot to generate a divergent probe dataset consisting of those data points for which the two neural networks disagree and to train the new network on those points, as disclosed by Li, and an ordinary artisan could reasonably expect to have done so successfully.  Doing so would increase the information gain per training data point by focusing on those training samples that are likely to have the greatest discriminative value.  See Li, p. 182, paragraph labeled (1) (informative samples are selected based on disagreement between student and teacher models).
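For illustration only, selecting the data points on which two models disagree, as discussed above, may be sketched as follows. The function select_divergent and the two threshold classifiers are hypothetical stand-ins and are not taken from Li, Papernot, or the application.

```python
import numpy as np

def select_divergent(inputs, target_predict, new_predict):
    """Return the inputs on which the two models' predicted labels differ,
    together with the target model's labels for those inputs."""
    target_labels = target_predict(inputs)
    new_labels = new_predict(inputs)
    mask = target_labels != new_labels
    return inputs[mask], target_labels[mask]

# Hypothetical one-dimensional models that differ only in decision threshold.
target_model = lambda x: (x > 0.0).astype(int)
new_model = lambda x: (x > 0.5).astype(int)

candidates = np.linspace(-1.0, 1.0, 9)
probes, probe_labels = select_divergent(candidates, target_model, new_model)
```

Here only the candidates lying between the two thresholds are retained, so further training of the new model would concentrate on the region where its outputs diverge from those of the target.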
Neither Papernot nor Li appears to disclose explicitly the further limitations of the claim.  However, Fukuda discloses that “the trained new neural network has a fewer number of layers and a smaller file size than the target neural network (significant computational resources (e.g., calculation power, memory, etc.) are needed for implementing accurate neural networks such as teacher neural networks; a student neural network may be trained to have similar characteristics as the teacher neural network without requiring the same amount of computational resources [i.e., it may require a smaller memory, or have a smaller file size] – Fukuda, paragraph 62; student neural network may have a smaller number of nodes and/or layers than the plurality of teacher neural networks – id. at paragraph 54).”
Fukuda and the instant application both relate to the use of one neural network to train another and are analogous.  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Papernot and Li to give the new network fewer layers and have it take up less memory space than the target network, as disclosed by Fukuda, and an ordinary artisan could reasonably expect to have done so successfully.  Doing so would reduce the number of computational resources the new network would have to use relative to the target network.  See Fukuda, paragraph 62.

Claim 21 is a system claim corresponding to method claim 1 and is rejected for the same reasons as given in the rejection of that claim.  Similarly, claim 29 is a non-transitory computer-readable medium claim corresponding to method claim 1 and is rejected for the same reasons as given in the rejection of that claim.

Regarding claim 7, Papernot, as modified by Li and Fukuda, discloses “updating the divergent probe training dataset after every predetermined number of training iterations (Papernot sec. 4.1 indicates that the substitute training set is augmented to produce a larger substitute training set after every epoch).”

Regarding claim 9, Papernot, as modified by Li and Fukuda, discloses that “the difference between corresponding data output by the target and new neural networks is detected for an individual or combination of output layers or one or more hidden layers of the target and new neural networks (since automatic pseudo-labeling can bring label noise to updated training samples, consistency regularization is introduced to mitigate the associated adverse effect; consistency regularization is defined as the distance [difference] between the softmax outputs [at an output layer] of the student [new] and teacher [target] models with given input x – Li, p. 182, paragraph labeled (3) and Fig. 2, box labeled “consistency regularization”).”  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Papernot to detect the difference between the target and the new models at the output layers of the networks, as disclosed by Li, and an ordinary artisan could reasonably expect to have done so successfully.  Doing so would provide a measure of the difference between the new and target models’ output that can be exploited to determine the progress to completion of the models’ training.  See Li, p. 182, paragraph labeled (3).
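For illustration only, a distance between the softmax outputs of a student model and a teacher model for the same inputs may be computed as sketched below. The use of a mean squared distance and the name consistency_distance are assumptions; the passage cited above specifies only that a distance between the softmax outputs is used.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def consistency_distance(student_logits, teacher_logits):
    """Mean squared distance between the two models' softmax outputs (assumed metric)."""
    diff = softmax(student_logits) - softmax(teacher_logits)
    return float(np.mean(np.sum(diff ** 2, axis=1)))

# Hypothetical output-layer logits for a single input x.
same = consistency_distance(np.array([[2.0, 0.0]]), np.array([[2.0, 0.0]]))
different = consistency_distance(np.array([[2.0, 0.0]]), np.array([[0.0, 2.0]]))
```

The value is zero when the two output layers agree exactly and grows as they diverge.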

Regarding claim 17, Papernot, as modified by Li and Fukuda, discloses “training the new neural network over multiple epochs with a different … probe training dataset in each of the multiple epochs (adversary network is trained over ρmax epochs wherein, with each iteration of the epoch, the training dataset Sρ is augmented to a training dataset Sρ+1 that contains more synthetic training points – Papernot, sec. 4.1, five-step training procedure).”

Claims 2 and 24 are rejected under 35 U.S.C. 103 as being unpatentable over Papernot in view of Li and Fukuda and further in view of Yang et al., “A Novel Emotion Recognition Approach Based on Ensemble Learning and Rough Set Theory,” in 9th IEEE Int’l Conf. Cognitive Informatics 46-52 (2010) (“Yang”).
Regarding claim 2, neither Papernot, Fukuda, nor Li appears to disclose explicitly the further limitations of the claim.  However, Yang discloses “generating the divergent probe training dataset using an additional neural network trained to output training data, that when input into the new and target neural networks, result in respective outputs that have maximal or above threshold differences therebetween (ensemble feature selection algorithms have the goal of finding feature subsets that will maximize disagreements among base classifiers – Yang, sec. 2.1, first paragraph; selective ensemble feature selection method is proposed that generates candidate base classifiers, and a pair of base classifiers that have the most diversity among classifiers clustered is chosen; the final prediction is based on majority voting of the selected classifiers [so the classifiers are constructed and chosen based on diversity in the training set so as to maximize disagreement among classifiers in the ensemble at testing time] – id. at sec. 3.3).”
Papernot, Li, Fukuda, and Yang all relate to the training of multiple neural networks and are analogous.  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Papernot, Fukuda, and Li to employ neural networks to generate data that will maximize diversity among classifiers, as disclosed by Yang, and an ordinary artisan could reasonably expect to have done so successfully.  Doing so would strengthen the resulting networks by ensuring that they have been trained on information-packed data.  See Yang, sec. 3.2.1.

Claim 24 is a system claim corresponding to method claim 2 and is rejected for the same reasons as given in the rejection of that claim.

Claims 3 and 25 are rejected under 35 U.S.C. 103 as being unpatentable over Papernot in view of Li and Fukuda and further in view of Lu et al., “Exploiting Multiple Classifier Types with Active Learning,” in Proc. 11th Ann. Conf. Genetic and Evolutionary Computation 1905-06 (2009) (“Lu”).
Regarding claim 3, neither Papernot, Fukuda, nor Li appears to disclose explicitly the further limitations of the claim.  However, Lu discloses “generating the divergent probe training dataset using an evolutionary model that evolves to generate outputs that increase or maximize the output differences between the new and target neural networks (evolutionary algorithm is used to optimize a set of classifiers, and another evolutionary algorithm is used to evolve desired data points that maximize disagreement among classifiers – Lu, p. 1905, end of first paragraph on right-hand column).”
Papernot, Li, Fukuda, and Lu all relate to classifier training and are analogous.  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Papernot, Fukuda, and Li to generate a divergent probe dataset using evolutionary algorithms that maximize the output differences among classifiers, as disclosed by Lu, and an ordinary artisan could reasonably expect to have done so successfully.  Doing so would allow the training set to be chosen such that it is informative enough for learning, but small enough to keep the labeling cost manageable.  See Lu, sec. 1, first paragraph.

Claim 25 is a system claim corresponding to method claim 3 and is rejected for the same reasons as given in the rejection of that claim.

Claims 4-5 and 26 are rejected under 35 U.S.C. 103 as being unpatentable over Papernot in view of Li and Fukuda and further in view of Melville et al., “Constructing Diverse Classifier Ensembles using Artificial Training Examples,” in Proc. IJCAI 505-10 (2003) (“Melville”).
Regarding claim 4, neither Papernot, Fukuda, nor Li appears to disclose explicitly the further limitations of the claim.  However, Melville discloses “generating the divergent probe training dataset by testing random seed probes and extrapolating the divergent probe training dataset based on resulting behavior of the target and new neural networks (artificial training data are generated such that their labels differ maximally from the current ensemble’s predictions; the artificial training data are generated by randomly picking data points [seed probes] from an approximation of the training-data distribution, and the artificially generated examples are labeled based on the current ensemble; given an example, the class membership probabilities predicted by the ensemble are found, and labels are selected [extrapolated] such that the probability of selection is inversely proportional to the current ensemble’s predictions [behavior of the target and new networks] – Melville, sec. 3).”
Papernot, Li, Fukuda, and Melville all relate to the training of multiple classifiers and are analogous.  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Papernot, Fukuda, and Li to generate a divergent probe dataset by creating random seed probes and extrapolating the dataset based on the resulting behavior of multiple networks, as disclosed by Melville, and an ordinary artisan could reasonably expect to have done so successfully.  Doing so would increase the robustness of the resulting classifiers by training them with training examples on which they are not in agreement.  See Melville, sec. 3.
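For illustration only, the generation of artificial examples and the selection of labels with probability inversely proportional to the current ensemble's predictions, as characterized above, may be sketched as follows. The function names, the Gaussian sampling of numeric attributes only, and the clipping constant are assumptions and are not taken from Melville.

```python
import numpy as np

def artificial_examples(train_x, n, rng):
    """Draw n points from per-attribute Gaussians fitted to the training data."""
    return rng.normal(train_x.mean(axis=0), train_x.std(axis=0),
                      size=(n, train_x.shape[1]))

def diversity_labels(ensemble_probs, rng):
    """Pick each label with probability inversely proportional to the
    ensemble's predicted class-membership probability."""
    inverse = 1.0 / np.clip(ensemble_probs, 1e-6, None)  # clip avoids division by zero
    p = inverse / inverse.sum(axis=1, keepdims=True)
    return np.array([rng.choice(len(row), p=row) for row in p])

rng = np.random.default_rng(0)
train_x = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]])
seeds = artificial_examples(train_x, 5, rng)

# Hypothetical ensemble that is certain every seed belongs to class 0.
ensemble_probs = np.tile([1.0, 0.0], (5, 1))
labels = diversity_labels(ensemble_probs, rng)
```

Because the hypothetical ensemble assigns nearly all probability to class 0, the selected labels fall on class 1, i.e., on the label the ensemble considers least likely.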

Claim 26 is a system claim corresponding to method claim 4 and is rejected for the same reasons as given in the rejection of that claim.

Regarding claim 5, Papernot, as modified by Li, Fukuda, and Melville, discloses that “a plurality of the random seed probes comprise a plurality of respective data types or distributions that are different from each other in an input space; and selecting the data type or distribution for the divergent probe training dataset associated with maximum or above threshold differences between corresponding data output by the target neural network and the new neural network in the output space (artificial training data are generated such that their labels differ maximally from the current ensemble’s predictions; the artificial training data are generated by randomly picking data points [seed probes] from an approximation of the training-data distribution, and the artificially generated examples are labeled based on the current ensemble; for a numeric attribute [data type 1], the mean and standard deviation are computed from the training set and values are generated from the Gaussian distribution [distribution 1] defined by these; for a nominal attribute [data type 2], the probability of occurrence of each distinct value in its domain is computed and values are generated based on this distribution [distribution 2]; given an example, the class membership probabilities predicted by the ensemble are found, and labels are selected such that the probability of selection is inversely proportional to the current ensemble’s predictions, and a new classifier [new network] is trained on the diversity data [so that it will be maximally likely to produce a different result from the original ensemble] – Melville, sec. 3).”  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Papernot, Fukuda, and Li to select seed probes from multiple distributions and use data points from a distribution with a maximum difference between outputs from multiple neural networks, as disclosed by Melville, and an ordinary artisan could reasonably expect to have done so successfully.  Doing so would increase the robustness of the resulting classifiers by training them with training examples on which they are not in agreement.  See Melville, sec. 3.

Claims 6 and 8 are rejected under 35 U.S.C. 103 as being unpatentable over Papernot in view of Li and Fukuda and further in view of Seung et al., “Query by Committee,” in Proc. 5th Ann. Workshop on Computational Learning Theory 287-94 (1992) (“Seung”).
Regarding claim 6, neither Papernot, Fukuda, nor Li appears to disclose explicitly the further limitations of the claim.  However, Seung discloses “generating the divergent probe training dataset using statistics or heuristics-based methods (in the paradigm of incremental query learning, a training algorithm produces a set of weights satisfying a training set, and a query algorithm is used to select a next example; the only training algorithm considered is the zero temperature Gibbs algorithm, which enables the use of techniques from statistical mechanics; the query by committee algorithm then selects an input classified as positive by half the committee and negative by the other half; by maximizing disagreement among the committee, the information gain of the query can be made high – Seung, second and third full paragraphs on right-hand column of p. 287).”
Papernot, Li, Fukuda, and Seung all relate to machine learning and are analogous.  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Papernot, Fukuda, and Li to generate a divergent probe dataset using statistics, as disclosed by Seung, and an ordinary artisan could reasonably expect to have done so successfully.  Doing so would increase the information gain from each data point analyzed relative to a procedure in which the training data are selected randomly.  See Seung, abstract.
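For illustration only, selecting the input on which a committee is most evenly split, as characterized above, may be sketched as follows. The threshold classifiers forming the committee and the name query_by_committee are hypothetical; the sketch does not implement the zero temperature Gibbs training algorithm discussed in Seung.

```python
import numpy as np

def query_by_committee(candidates, committee):
    """Return the candidate whose positive-vote fraction is closest to one half."""
    votes = np.stack([member(candidates) for member in committee])
    positive_fraction = votes.mean(axis=0)
    split_score = -np.abs(positive_fraction - 0.5)  # 0 is a perfect half/half split
    return candidates[np.argmax(split_score)]

# Hypothetical committee of four one-dimensional threshold classifiers.
thresholds = [-0.5, -0.1, 0.1, 0.5]
committee = [lambda x, t=t: (x > t).astype(int) for t in thresholds]

candidates = np.linspace(-1.0, 1.0, 9)
query = query_by_committee(candidates, committee)
```

In this example the candidate at 0.0 is classified as positive by two committee members and as negative by the other two, so it is the one selected.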

Regarding claim 8, Papernot, as modified by Li, Fukuda, and Melville, discloses “updating the divergent probe training dataset upon detecting the output differences of the new and target networks converge for a previous version of the divergent probe training dataset (new classifier is trained on diversity data chosen so as maximally to differ from the current ensemble’s predictions; if adding this classifier to the current ensemble does not increase the classifier training error [i.e., if the new classifier and the original ensemble/target converge on the “correct” label for each training example in the current dataset], then it is added to the current ensemble; the process of generating new artificial training examples and adding them to the training set [updating] repeats itself until the desired committee size is reached or the maximum number of iterations is exceeded – Melville, sec. 3, first paragraph).”  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Papernot, Fukuda, and Li to update the divergent probe dataset when a new neural network and existing neural networks converge, as disclosed by Melville, and an ordinary artisan could reasonably expect to have done so successfully.  Doing so would increase the robustness of the resulting classifiers by training them with training examples on which they are not in agreement.  See Melville, sec. 3.

Claims 10-11 and 27 are rejected under 35 U.S.C. 103 as being unpatentable over Papernot in view of Li and Fukuda and further in view of Bowers et al. (US 20160300156) (“Bowers”).
Regarding claim 10, neither Papernot, Fukuda, nor Li appears to disclose explicitly the further limitations of the claim.  However, Bowers discloses “adding new data to the … probe training dataset to incorporate new knowledge into the new neural network that is not present in the target neural network (model tracker service can make a copy of a latent model in response to the latent model being updated into production; production copy can be a verbatim copy of the production model; application operator may edit configurations of the production copy by, for example, adding or removing one or more training datasets or sources of the training datasets, one or more features of interest, one or more parameters of a model training algorithm, etc. – Bowers, paragraph 12; see also Fig. 4 (showing that a production model can be derived from a test model by adding or removing features and datasets)).” 
Papernot, Li, Fukuda, and Bowers all relate to the training of multiple neural networks and are analogous.  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Papernot, Fukuda, and Li to add data to the new network that were not present in the target network, as disclosed by Bowers, and an ordinary artisan could reasonably expect to have done so successfully.  Doing so would expand the capacity of the resulting network and increase its usefulness for end users.  See Bowers, paragraph 12.

Claim 27 is a system claim corresponding to method claim 10 and is rejected for the same reasons as given in the rejection of that claim.  (Note that because claim 27 is stated in the disjunctive, a system that adds new data to the divergent probe training dataset to incorporate new knowledge in the new neural network not present in the target neural network suffices to teach the claim.)

Regarding claim 11, neither Papernot, Fukuda, nor Li appears to disclose explicitly the further limitations of the claim.  However, Bowers discloses “defining data to be omitted from the … probe training dataset to eliminate a category or class from the new neural network that is present in the target neural network (model tracker service can make a copy of a latent model in response to the latent model being updated into production; production copy can be a verbatim copy of the production model; application operator may edit configurations of the production copy by, for example, adding or removing one or more training datasets or sources of the training datasets, one or more features of interest, one or more parameters of a model training algorithm, etc. – Bowers, paragraph 12; see also Fig. 4 (showing that a production model can be derived from a test model by adding or removing features and datasets)).”  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Papernot, Fukuda, and Li to omit data from the new neural network that were present in the target network, as disclosed by Bowers, and an ordinary artisan could reasonably expect to have done so successfully.  Doing so would increase the compactness of the model and allow it to run with less processing power and take up less memory.  See Bowers, paragraph 12.

Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Papernot in view of Fukuda and Li and further in view of Zhou et al., “Private Deep Learning with Teacher Ensembles,” in arXiv preprint arXiv:1906.02303 (2019) (“Zhou”).
Regarding claim 12, neither Papernot, Fukuda, nor Li appears to disclose explicitly the further limitations of the claim.  However, Zhou discloses “removing a correlation from the new neural network linking an input to an output, without accessing at least one of the input or output, by adding to the divergent probe training dataset a plurality of random correlations to the output or input, respectively, to weaken or eliminate the correlation between the input and output (to satisfy a privacy-preserving need, small random perturbations [random correlations] are added to the teacher model to build a private student model; result is that no adversary can recover the original sensitive information from the teacher model even though he has full access to the student model [i.e., the correlations linking input to output that reveal those data are removed in the student model]; knowledge distilled from teacher model is perturbed to satisfy the standard of differential privacy – Zhou, p. 2, five paragraphs before Sec. 2; see also Fig. 1 (showing that perturbations are added to the softened cross entropy loss of the teacher models calculated after the prediction output)).”
Papernot, Li, Fukuda, and Zhou all relate to the training of multiple neural networks and are analogous.  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Papernot, Fukuda, and Li to perturb the input/output correlations of the new neural network to weaken them, as disclosed by Zhou, and an ordinary artisan could reasonably expect to have done so successfully.  Doing so would ensure that private and sensitive data are not disclosed during the training of the new network.  See Zhou, sec. 1, first paragraph.

Claim 13 is rejected under 35 U.S.C. 103 as being unpatentable over Papernot in view of Li and Fukuda and further in view of Heaton et al. (US 20180000385) (“Heaton”).
Regarding claim 13, neither Papernot, Fukuda, nor Li appears to disclose explicitly the further limitations of the claim.  However, Heaton discloses “after training the divergent probe training dataset, re-training the new neural network using the … probe training dataset to mimic re-training the target neural network (remote computer system may label sensor data from a wearable device and append a training set with these labeled sensor data, then retrain a compressed [new] and complete [target] model on this training set; the remote computer system can update the complete model based on labeled training data received from deployed wearable devices based on feedback provided by a user through the wearable device and similarly update the compressed network [so the training of the compressed network mimics that of the complete network] – Heaton, paragraph 71; model may be a neural network – id. at paragraph 15).”
Papernot, Li, Fukuda, and Heaton all relate to the training of multiple neural networks and are analogous.  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Papernot, Fukuda, and Li to retrain the new network to mimic the training of the target network, as disclosed by Heaton, and an ordinary artisan could reasonably expect to have done so successfully.  Doing so would give the new neural network predictive ability that is similar to the target network.  See Heaton, paragraph 71.

Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Papernot in view of Fukuda, Li, and Heaton and further in view of Darvish Rouhani et al. (US 20190197406) (“Darvish Rouhani”).
Regarding claim 14, Papernot, as modified by Li, Fukuda, Heaton, and Darvish Rouhani, discloses “sparsifying the new neural network to mimic the target neural network to generate a sparse new neural network (original [target] DNN may have neurons pruned as a function of neural entropies to create a sparse [new] DNN [that mimics the original] – Darvish Rouhani, paragraph 65).”
Papernot, Li, Heaton, Fukuda, and Darvish Rouhani all relate to neural network training and are analogous.  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Papernot, Li, Fukuda, and Heaton to sparsify the new neural network, as disclosed by Darvish Rouhani, and an ordinary artisan could reasonably expect to have done so successfully.  Doing so would reduce the computational burden of the neural network.  See Darvish Rouhani, paragraph 1.

Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Papernot in view of Li, Fukuda, and Heaton and further in view of Susskind et al. (US 20180157992) (“Susskind”).
Regarding claim 15, neither Papernot, Li, Fukuda, nor Heaton appears to disclose explicitly the further limitations of the claim.  However, Susskind discloses “evolving the new neural network by applying evolutionary algorithms to mimic the target neural network (model training unit configures a model such as a student model to emulate (mimic) the behavior of another model, such as the teacher model, based on a comparison of matrices using differentiable functions optimized using search methods such as evolutionary search algorithms – Susskind, paragraph 185).”
Papernot, Li, Heaton, Fukuda, and Susskind all relate to the training of multiple neural networks and are analogous.  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Papernot, Li, Fukuda, and Heaton to apply evolutionary algorithms to mimic the target neural network, as disclosed by Susskind, and an ordinary artisan could reasonably expect to have done so successfully.  Doing so would allow the new network to behave similarly to the target network while ensuring that it can run in environments with limited computational resources.  See Susskind, paragraphs 3, 185. 

Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over Papernot in view of Li and Fukuda and further in view of Ferguson et al. (US 20030130899) (“Ferguson”).
Regarding claim 16, neither Papernot, Fukuda, nor Li appears to disclose explicitly the further limitations of the claim.  However, Ferguson discloses “generating or re-training the new neural network after all copies of the original training dataset are deleted at the training device (second training set may be generated by removing at least a subset of parameter values of the first training set and adding new parameter values from the training data; the process may be repeated, successively updating the training set to generate new training sets by removing old data and adding new data and training each non-linear model with each training set – Ferguson, paragraph 38).”
Papernot, Li, Fukuda, and Ferguson all relate to neural network training and are analogous.  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Papernot, Fukuda, and Li to re-train the new network after the original training dataset is deleted, as disclosed by Ferguson, and an ordinary artisan could reasonably expect to have done so successfully.  Doing so would allow the network to remain up-to-date even in the absence of the original training data with which it or another network from which it derives was trained.  See Ferguson, paragraph 38.

Claims 18, 20, 28, and 30 are rejected under 35 U.S.C. 103 as being unpatentable over Papernot in view of Fukuda and Li and further in view of Li et al. (US 20160078339) (“Li 2”).
Regarding claim 18, neither Papernot, Fukuda, nor Li appears to disclose explicitly the further limitations of the claim.  However, Li 2 discloses “setting the structure of the new neural network to have a number of neurons, synapses, or layers, to be less than that of the target neural network (student DNN has fewer nodes [neurons] in each of its layers than the teacher DNN – Li 2, paragraph 42).”  
Papernot, Li, Fukuda, and Li 2 all relate to the training of multiple neural networks and are analogous.  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Papernot, Fukuda, and Li to develop a less complicated architecture for the new network than for the target network, as disclosed by Li 2, and an ordinary artisan could reasonably expect to have done so successfully.  Doing so would allow the network to be run on devices with fewer computational resources, such as smart phones, wearable devices, or entertainment systems.  See Li 2, paragraph 2.

Regarding claim 20, neither Papernot, Fukuda, nor Li appears to disclose explicitly the further limitations of the claim.  However, Li 2 discloses “after training the new neural network, executing the new neural network in a run-time phase by inputting new data into the new neural network and generating corresponding data output by the new neural network (using an embodiment of a student DNN model, a client device and student DNN model process inputted data to determine computer-usable information; for example, camera-derived information may be processed to determine shapes, features, objects, or other elements in the image or video – Li 2, paragraph 24).”  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Papernot, Fukuda, and Li to use the trained neural network to generate output, as disclosed by Li 2, and an ordinary artisan could reasonably expect to have done so successfully.  Doing so would put the trained new network to practical use in performing predictive tasks.  See Li 2, paragraph 24.

Claim 28 is a system claim corresponding to method claim 20 and is rejected for the same reasons as given in the rejection of that claim.  Similarly, claim 30 is a non-transitory computer-readable medium claim corresponding to method claim 20 and is rejected for the same reasons as given in the rejection of that claim.

Claim 19 is rejected under 35 U.S.C. 103 as being unpatentable over Papernot in view of Li and Fukuda and further in view of Courville et al. (US 20170308324) (“Courville”).
Regarding claim 19, neither Papernot, Fukuda, nor Li appears to disclose explicitly the further limitations of the claim.  However, Courville discloses “training the new neural network layer-by-layer in a plurality of sequential stages, each stage training a respective sequential layer of the new neural network (in a feedforward phase of neural network training, each layer of a neural network architecture performs a computation on a mini-batch producing a new intermediate data point set which is then fed into the next layer; for instance, the feedforward stage of a convolution layer receives an input data set and generates a first intermediate data set which is then fed into the feedforward stage of the ReLU layer; however, since the backpropagation stage of a layer does not happen immediately after its feedforward stage, the intermediate data set must be stored or otherwise retained so that it is available for the corresponding backpropagation stage – Courville, paragraph 29; see also Figs. 1-2).”
Papernot, Li, Fukuda, and Courville all relate to neural network training and are analogous.  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Papernot, Fukuda, and Li to train the network layer-by-layer, as disclosed by Courville, and an ordinary artisan could reasonably expect to have done so successfully.  Doing so would ensure that errors created by faulty training of one layer of the network do not propagate through the remainder of the network.  See Courville, paragraph 29.

Claims 22-23 are rejected under 35 U.S.C. 103 as being unpatentable over Papernot in view of Li and Fukuda and further in view of Bendre et al. (US 20180322417) (“Bendre”).
Regarding claim 22, neither Papernot, Fukuda, nor Li appears to disclose explicitly the further limitations of the claim.  However, Bendre discloses “one or more memories configured to store the … probe training dataset (temporary data storage may be configured to store training data temporarily; once the training device obtains training data, a trainer device may store those data in the temporary data storage; once the trainer device determines that the ML trainer process completed the serving of the corresponding ML training request, the trainer service may delete the training data from the temporary data storage – Bendre, paragraph 132).”
Papernot, Li, Fukuda, and Bendre all relate to classifier training and are analogous.  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Papernot, Fukuda, and Li to store the training dataset in memory, as disclosed by Bendre, and an ordinary artisan could reasonably expect to have done so successfully.  Doing so would give the system a place to hold the data while the network is waiting to be trained.  See Bendre, paragraph 132.

Regarding claim 23, neither Papernot, Fukuda, nor Li appears to disclose explicitly the further limitations of the claim.  However, Bendre discloses that “the one or more memories are temporary memories configured to store samples of the … probe training dataset on-the-fly and delete the samples on-the-fly after the samples are used to train the new neural network (temporary data storage may be configured to store training data temporarily; once the training device obtains training data, a trainer device may store those data in the temporary data storage; once the trainer device determines that the ML trainer process completed the serving of the corresponding ML training request, the trainer service may delete the training data from the temporary data storage – Bendre, paragraph 132).”  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Papernot, Fukuda, and Li to store the training dataset only until it is used to train the network and then delete it, as disclosed by Bendre, and an ordinary artisan could reasonably expect to have done so successfully.  Doing so would secure the training data against unauthorized access.  See Bendre, paragraph 132.

Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/process/file/efs/guidance/eTD-info-I.jsp.
Claims 1, 10-23, and 27-29 are provisionally rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1-8, 10, 12, 15-20, and 29-30 of copending Application No. 16/910,744 (“reference application”) in view of Papernot and further in view of Li and Fukuda.  A comparison chart of the claims follows, followed by an analysis.
Instant Application
Reference Application
1. A method comprising:
training a new neural network to mimic a target neural network without access to the target neural network or its original training dataset by: 
probing the target neural network and the new neural network with input data to generate corresponding data output by one or more layers of the respective target neural network and new neural network; 
detecting input data that generate maximum difference between corresponding data output by the target neural network and the new neural network; 
generating a divergent probe training dataset comprising the input data that generate the maximum difference and the corresponding data output by the target neural network; 
training the new neural network to minimize differences between corresponding data output by the new neural network and the target neural network using the divergent probe training dataset detected to generate the maximum difference in the corresponding output data between the new and target neural networks; and 
iteratively training the new neural network using an updated divergent probe training dataset dynamically adjusted to reflect each iteration’s maximum difference training dataset as the new neural network changes during iterative training, wherein the trained new neural network has a fewer number of layers and a smaller file size than the target neural network.
12. The method of claim 1, comprising removing a correlation from the new neural network linking an input to an output, without accessing at least one of the input or output, by adding to the divergent probe training dataset a plurality of random correlations to the output or input, respectively, to weaken or eliminate the correlation between the input and output.
1. A method to mimic a pre-trained target model at a device without access to the pre-trained target model or its original training dataset, the method comprising, at the device: 
sending a set of random or semi-random input data to a remote device to randomly probe the pre-trained target model remotely by inputting the set of random or semi-random input data into the pre-trained target model; 
receiving from the remote device a set of corresponding output data generated by applying the pre-trained target model to the set of random or semi-random input data; 
generating a random probe training dataset comprising the set of random or semi-random input data and corresponding output data generated by randomly probing the pre-trained target model; 
training a new model with the random probe training dataset so that the new model generates substantially the same corresponding output data in response to said input data to mimic the pre-trained target model; and 
removing a correlation in the new model based on training data linking an input to an output, without accessing at least one of the input or output, by adding to the random probe training dataset a plurality of random correlations to the output or input, respectively, to weaken or eliminate the correlation between the input and output.  

10. The method of claim 1, comprising adding new data to the divergent probe training dataset to incorporate new knowledge into the new neural network that is not present in the target neural network.
2. The method of claim 1 comprising adding new data to the random probe training dataset to incorporate new knowledge not present in the pre-trained target model.
11. The method of claim 1, comprising defining data to be omitted from the divergent probe training dataset to eliminate a category or class from the new neural network that is present in the target neural network.
3. The method of claim 1 comprising defining data to be omitted from the random probe training dataset to eliminate a category or class present in the pre-trained target model.
13. The method of claim 1, comprising, after training the divergent probe training dataset, re-training the new neural network using the divergent probe training dataset to mimic re-training the target neural network.
4. The method of claim 1 comprising re-training the new model using the random probe training dataset to mimic re-training the target pre-trained model.
14. The method of claim 13, comprising sparsifying the new neural network to mimic the target neural network to generate a sparse new neural network.
5. The method of claim 4 comprising sparsifying the new model to mimic the pre-trained target model to generate a sparse new model.
15. The method of claim 13, comprising evolving the new neural network by applying evolutionary algorithms to mimic the target neural network.
6. The method of claim 4 comprising evolving the new model by applying evolutionary algorithms to mimic the pre-trained target model.
16. The method of claim 1, comprising generating or re-training the new neural network after all copies of the original training dataset are deleted at the training device.
7. The method of claim 1 comprising generating or re-training the new model after all copies of the original training dataset are deleted at the remote device.
17. The method of claim 1, comprising training the new neural network over multiple epochs with a different divergent probe training dataset in each of the multiple epochs.
8. The method of claim 1 comprising training the new model over multiple epochs with a different random probe training dataset in each of the multiple epochs.
18. The method of claim 1, comprising setting the structure of the new neural network to have a number of neurons, synapses, or layers, to be less than that of the target neural network.
10. The method of claim 1, wherein the models are neural networks, and comprising setting the new model to have a number of neurons, synapses, or layers, to be less than that of the pre-trained target model.
19. The method of claim 1, comprising training the new neural network layer-by-layer in a plurality of sequential stages, each stage training a respective sequential layer of the new neural network.
12. The method of claim 1, wherein the models are neural networks each comprising a plurality of layers, and comprising training the new model layer-by-layer in a plurality of sequential stages, each stage training a respective sequential layer of the new model neural network.
20. The method of claim 1, comprising, after training the new neural network, executing the new neural network in a run-time phase by inputting new data into the new neural network and generating corresponding data output by the new neural network.
15. The method of claim 1 comprising, after training the new model, executing the new model in a run-time phase by inputting new data into the new model and generating corresponding data output by the new model.
21. A system comprising:
one or more processors configured to train a new neural network to mimic a target neural network without access to the target neural network or its original training dataset by: 
probing the target neural network and the new neural network with input data to generate corresponding data output by one or more layers of the respective target neural network and new neural network;
detecting input data that generate maximum difference between corresponding data output by the target neural network and the new neural network; 
generating a divergent probe training dataset comprising the input data that generate the maximum difference and the corresponding data output by the target neural network; 
training the new neural network to minimize differences between corresponding data output by the new neural network and the target neural network using the divergent probe training dataset detected to generate the maximum difference in the corresponding output data between the new and target neural networks; and - 35 -Attorney Docket No. P-597100-US 
iteratively training the new neural network using an updated divergent probe training dataset dynamically adjusted to reflect each iteration’s maximum difference training dataset as the new neural network changes during iterative training, wherein the trained new neural network has a fewer number of layers and a smaller file size than the target neural network.
16. A system for performing machine learning to generate a new model to mimic a pre-trained target model without obtaining the pre-trained target model or its original training dataset, the system comprising: 
one or more processors configured to: - 27 -Attorney Docket No. P-578711-US1 
send a set of random or semi-random input data to a remote device to randomly probe the pre-trained target model remotely by inputting the set of random or semi-random input data into the pre-trained target model, 
receive from the remote device a set of corresponding output data generated by applying the pre-trained target model to the set of random or semi-random input data, 
generate a random probe training dataset comprising the set of random or semi-random input data and corresponding output data generated by randomly probing the pre-trained target model, 
train a new model with the random probe training dataset so that the new model generates substantially the same corresponding output data in response to said input data to mimic the pre-trained target model, and 
remove a correlation in the new model based on training data linking an input to an output, without accessing at least one of the input or output, by adding to the random probe training dataset a plurality of random correlations to the output or input, respectively, to weaken or eliminate the correlation between the input and output.
22. The system of claim 21, comprising one or more memories configured to store the divergent probe training dataset.
17. The system of claim 16 comprising one or more memories to store one or more samples of the random probe training dataset.
23. The system of claim 22, wherein the one or more memories are temporary memories configured to store samples of the divergent probe training dataset on-the-fly and delete the samples on-the-fly after the samples are used to train the new neural network.
18. The system of claim 17, wherein the one or more memories are temporary memories that store samples of the random probe training dataset on-the-fly and delete the samples on-the-fly after the samples are used to train the new model.
27. The system of claim 21, wherein the one or more processors are configured to add new data to, or define data to be omitted from, the divergent probe training dataset to incorporate new knowledge into the new neural network that is not present in, or eliminate pre-existing knowledge from the new neural network that is present in, the target neural network.
19. The system of claim 16, wherein the one or more processors are configured to add new data to the random probe training dataset to incorporate new knowledge not present in the pre-trained target model.
27. The system of claim 21, wherein the one or more processors are configured to add new data to, or define data to be omitted from, the divergent probe training dataset to incorporate new knowledge into the new neural network that is not present in, or eliminate pre-existing knowledge from the new neural network that is present in, the target neural network.
20. The system of claim 16, wherein the one or more processors are configured to define data to be omitted from the random probe training dataset to eliminate a category or class present in the pre-trained target model.
28. The system of claim 21, wherein the one or more processors are configured to, after training the new neural network, execute the new neural network in a run-time phase by inputting new data into the new neural network and generating corresponding data output by the new neural network.
29. The system of claim 16, wherein the one or more processors are configured to, after training the new model, execute the new model in a run-time phase by inputting new data into the new model and generating corresponding data output by the new model.  

29. A non-transitory computer-readable medium comprising instructions which, when implemented in one or more processors in a computing device, cause the one or more processors to:
train a new neural network to mimic a target neural network without access to the target neural network or its original training dataset by: 
probing the target neural network and the new neural network with input data to generate corresponding data output by one or more layers of the respective target neural network and new neural network; 
detecting input data that generate maximum difference between corresponding data output by the target neural network and the new neural network; 
generating a divergent probe training dataset comprising the input data that generate the maximum difference and the corresponding data output by the target neural network; 
training the new neural network to minimize differences between corresponding data output by the new neural network and the target neural network using the divergent probe training dataset detected to generate the maximum difference in the corresponding output data between the new and target neural networks; and 
iteratively training the new neural network using an updated divergent probe training dataset dynamically adjusted to reflect each iteration’s maximum difference training data as the new neural network changes during iterative training, wherein the trained new neural network has a fewer number of layers and a smaller file size than the target neural network.
30. A non-transitory computer-readable medium comprising instructions which, when implemented in one or more processors in a computing device, cause the one or more processors to mimic a pre-trained target model at a device without access to the pre-trained target model or its original training dataset by: 
sending a set of random or semi-random input data to a remote device to randomly probe the pre-trained target model remotely by inputting the set of random or semi-random input data into the pre-trained target model; 
receiving from the remote device a set of corresponding output data generated by applying the pre-trained target model to the set of random or semi-random input data; 
generating a random probe training dataset comprising the set of random or semi-random input data and corresponding output data generated by randomly probing the pre-trained target model; 
training a new model with the random probe training dataset so that the new model generates substantially the same corresponding output data in response to said input data to mimic the pre-trained target model; and 
removing a correlation in the new model based on training data linking an input to an output, without accessing at least one of the input or output, by adding to the random probe training dataset a plurality of random correlations to the output or input, respectively, to weaken or eliminate the correlation between the input and output.


	The independent claims of the instant application differ from those of the reference application in that the instant application recites the following limitations, taught by Li:  “detecting input data that generate [a] maximum difference between corresponding data output by the target neural network and the new neural network (unlabeled sample set is predicted by both student and teacher models, and informative samples are selected by an uncertainty criterion according to the disagreement between the student and teacher models – Li, sec. 2.2.2, first paragraph; the informative samples are the union of the samples for which the predicted results with augmentation differ between the student and teacher samples and the samples for which the predicted results without augmentation differ between the student and teacher samples [so the generated set produces maximally different results between student and teacher in that it includes all and only those samples that produce divergent results] – id. at p. 182, paragraph labeled (1)); 
generating a divergent probe training dataset comprising the input data that generate the maximum difference and the corresponding data output by the target neural network (unlabeled sample set is predicted by both student and teacher models, and informative samples are selected by an uncertainty criterion according to the disagreement between the student and teacher models – Li, sec. 2.2.2, first paragraph; the informative samples are the union of the samples for which the predicted results with augmentation differ between the student and teacher samples and the samples for which the predicted results without augmentation differ between the student and teacher samples [so the generated set produces maximally different results between student and teacher in that it includes all and only those samples that produce divergent results] – id. at p. 182, paragraph labeled (1)); [and]
training the new neural network … using the divergent probe training dataset detected to generate the maximum difference in the corresponding output data between the new and target neural networks (informative samples are pseudo-labeled using a multi-scale spatial constraint, and consistency regularization is introduced to mitigate noise in the updated samples; the training data set is updated with the pseudo-labeled informative samples [divergent probe dataset] and fed back into the teacher and student networks to output new predictions – Li, parts (2)-(3) on p. 182 and Fig. 3); and
iteratively training the new neural network using an updated divergent probe training dataset dynamically adjusted to reflect each iteration’s maximum difference training dataset as the new neural network changes during iterative training (see Li Fig. 3 and note that the pseudo-labeled informative samples are added to the training dataset at each iteration and the CNN is trained on the new training dataset at each iteration, so the process is iterative; p. 182, paragraph labeled (1) shows that the informative samples that are pseudo-labeled and used to train the CNN iteratively are selected based on the disagreement between the student and teacher models for input data both with and without augmentation [so that at each iteration, the dataset is dynamically adjusted to reflect the new training dataset comprising the informative samples])….”
The reference application and Li both relate to teacher-student neural network models and are analogous.  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the reference application to generate a divergent probe dataset consisting of those data points for which the two neural networks disagree and to train the new network on those points, as disclosed by Li, and an ordinary artisan could reasonably expect to have done so successfully.  Doing so would increase the information gain per training data point by focusing on those training samples that are likely to have the greatest discriminative value.  See Li, p. 182, paragraph labeled (1) (informative samples are selected based on disagreement between student and teacher models).
Further, the independent claims of the instant application differ from those of the reference application in that the instant application recites the following limitations, taught by Papernot: “training the new neural network to minimize differences between corresponding data output by the new neural network and the target neural network (by querying the oracle, the adversary labels each sample in the initial substitute training set, then trains the architecture using the substitute training set in conjunction with classical training techniques; the labeling is repeated several times to increase the substitute DNN’s accuracy and the similarity of its decision boundaries with the oracle [so that the substitute/new network thereby minimizes differences between its outputs and those of the target/oracle network] – Papernot, sec. 4.1, five steps of algorithm and paragraph after the algorithm description) …; [and]
iteratively training the new neural network using an updated … probe training dataset dynamically adjusted … as the new neural network changes during iterative training (after labeling the substitute training set with the help of the oracle and training the adversary with the substitute training set, the adversary applies an augmentation technique on the initial substitute training set to produce a larger [updated] substitute training set with more synthetic training points; the adversary then iteratively trains more accurate substitute DNNs by repeating the labeling, training, and augmentation steps at up to ρmax timesteps – Papernot, sec. 4.1, three bullet points).”
	The reference application, Li, and Papernot all relate to the training of multiple neural networks and are analogous.  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of the reference application and Li to continue to train the network using training data that adjust as the network adjusts during training, and an ordinary artisan could reasonably expect to have done so successfully.  Doing so would allow the network to become more accurate over time.  See Papernot, paragraph below step (5) in sec. 4.1.
Neither the reference application, Papernot, nor Li appears to disclose explicitly the further limitations of the claim.  However, Fukuda discloses that “the trained new neural network has a fewer number of layers and a smaller file size than the target neural network (significant computational resources (e.g., calculation power, memory, etc.) are needed for implementing accurate neural networks such as teacher neural networks; a student neural network may be trained to have similar characteristics as the teacher neural network without requiring the same amount of computational resources [i.e., it may require a smaller memory, or have a smaller file size] – Fukuda, paragraph 62; student neural network may have a smaller number of nodes and/or layers than the plurality of teacher neural networks – id. at paragraph 54).”
Fukuda and the instant application both relate to the use of one neural network to train another and are analogous.  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of the reference application, Papernot, and Li to give the new network fewer layers and have it take up less memory space than the target network, as disclosed by Fukuda, and an ordinary artisan could reasonably expect to have done so successfully.  Doing so would reduce the amount of computational resources the new network would have to use relative to the target network.  See Fukuda, paragraph 62.
	Instant claim 12 is substantially repeated in independent reference claim 1 except that the “new model” of the reference claim is recited as a “neural network” in the instant claim, such difference being taught by Papernot as above.
	Except insofar as the instant claims recite a “divergent probe dataset” rather than a “random probe dataset” and a “neural network” rather than a “model”, such differences being taught by Li and Papernot, respectively, instant claims 10-11, 13-20, 22-23, and 28 are substantially identical to reference claims 2-8, 10, 12, 15, 17-18, and 29, respectively.
	Instant claim 27 is stated in the alternative: “the one or more processors are configured to add new data to, or define data to be omitted from, the divergent probe training dataset to incorporate new knowledge into the new neural network that is not present in, or eliminate pre-existing knowledge from the new neural network that is present in, the target neural network.”  (Emphasis added.)  However, as can be seen above, reference claim 19 discloses the “add” alternative and reference claim 20 discloses the “omit” alternative.  Because instant claim 27 is written in the disjunctive, each of reference claims 19 and 20 individually reads on instant claim 27.
This is a provisional nonstatutory double patenting rejection.

Response to Arguments
Applicant's arguments filed July 9, 2021 (“Remarks”) have been fully considered but they are, except insofar as rendered moot by the introduction of a new ground of rejection, not persuasive.
Applicant first argues that the Papernot/Li combination does not teach iteratively training the new network using an updated divergent probe dataset dynamically adjusted to reflect each iteration’s maximum difference training dataset.  According to Applicant, the predicted results with and without augmentation do not guarantee maximally different results, and the union of the two sets of samples on which the student and teacher models disagree, with and without augmentation, does not guarantee maximum differences between the corresponding data output by the two models.  Remarks at 10-11.  However, the new network of the claim appears to do the opposite of what Applicant seems to be arguing it does.  The new network, according to a plain reading of the claim language, must minimize differences between data output by the new network and the target network, not maximize them.  Thus, to the extent that Applicant is arguing that the use of the divergent probe dataset of Li does not guarantee the maximization of differences in output between the new network and the target network, such argument is unconvincing because the claim requires the opposite.  What the claim does require is that the divergent probe dataset comprise those data points that generate a maximum difference between the new and target networks.  That is, the new network is trained predominantly on those samples that have a maximum difference in output between the new network and the target network so that the overall number of differences between the outputs of the new and target networks can be minimized.  But Papernot already discloses the minimization of differences between the new model and the target model, and Li discloses generating a dataset that maximizes differences between the two models.
Applicant’s argument that the informative sample set does not comprise those samples for which the difference between student and teacher models is maximal rests on a misunderstanding of the broadest reasonable interpretation of the term “maximum”.  The claim does not require that each data point of the divergent probe dataset produce a maximal magnitude of difference between the new model’s output and the target model’s output.  The claim only requires that the divergent probe training dataset comprise the input data that “generate [a] maximum difference between corresponding data output by the target neural network and the new neural network”.  That is, the data set as a whole must maximize the difference in outputs, not each individual data point therein.  The informative samples of Li are selected based upon disagreement between the student and teacher models with and without augmentation.  That is, the informative samples are precisely those in which the output of the teacher model and the output of the student model differ.  No samples that produce common outputs are included.  Thus, insofar as the informative sample set contains all and only those samples that produce different outputs between the two models, the dataset thereby maximizes those differences relative to, say, a dataset containing only some of the data points that produce divergent outputs or a dataset that includes some samples that produce common outputs.
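For illustration only, the dataset-level reading of “maximum difference” set out above can be sketched in code.  The sketch below is not drawn from Li, Papernot, or the instant application: the names `select_divergent` and `disagreement_rate` and the toy `teacher` and `student` threshold classifiers are hypothetical, and the augmentation and pseudo-labeling steps of Li are omitted.  It assumes only that each model maps an input to a discrete label.

```python
from typing import Callable, Iterable, List

Label = int


def select_divergent(
    samples: Iterable[int],
    teacher: Callable[[int], Label],
    student: Callable[[int], Label],
) -> List[int]:
    """Return all and only the samples on which the two models disagree."""
    return [x for x in samples if teacher(x) != student(x)]


def disagreement_rate(
    subset: Iterable[int],
    teacher: Callable[[int], Label],
    student: Callable[[int], Label],
) -> float:
    """Fraction of samples in `subset` that produce differing outputs."""
    items = list(subset)
    if not items:
        return 0.0
    return sum(teacher(x) != student(x) for x in items) / len(items)


# Toy stand-ins (assumptions): two threshold classifiers over integers.
def teacher(x: int) -> Label:
    return int(x >= 5)


def student(x: int) -> Label:
    return int(x >= 3)


pool = list(range(10))
divergent = select_divergent(pool, teacher, student)

# Every sample in the divergent set produces differing outputs, whereas
# the full pool mixes in samples on which the two models agree.
rate_divergent = disagreement_rate(divergent, teacher, student)
rate_pool = disagreement_rate(pool, teacher, student)

# A set holding only some of the divergent samples captures fewer disagreements.
partial = divergent[:1]
```

In this toy setting, the divergent set has a higher disagreement rate than the full pool and contains more disagreeing samples than `partial`, which is the sense in which the dataset as a whole, rather than each data point, maximizes the difference in outputs.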
The further argument that neither Papernot nor Li discloses explicitly that the new network contains fewer layers and is of a smaller file size than the target network, Remarks at 11, is moot in light of the addition of Fukuda to the rejection.
Finally, Applicant requests that the provisional double patenting rejection be held in abeyance until allowable subject matter is found.  Remarks at 12.  However, note that MPEP § 804(I)(B)(1) requires, as a reply to a provisional nonstatutory double patenting rejection, either a showing that the claims subject to the rejection are patentably distinct from the reference claims or the filing of a terminal disclaimer.  That section further indicates that only objections and requirements as to form may be held in abeyance until allowable subject matter is found.  Examiner has updated the rejection to incorporate the new claim language and would request that Applicant reply to it in one of the two permitted manners noted above.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to RYAN C VAUGHN whose telephone number is (571)272-4849.  The examiner can normally be reached on M-R 7:50a-5:50p ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kamran Afshar, can be reached at 571-272-7796.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/R.C.V./             Examiner, Art Unit 2125

/KAMRAN AFSHAR/             Supervisory Patent Examiner, Art Unit 2125                                                                                                                                                                                           

1 Note here that Li discloses that the probe dataset is “divergent” and that an ordinary artisan would be motivated to modify Papernot to make the dataset “divergent” for the same reason enunciated in the rejection of claim 1.  This reasoning applies, mutatis mutandis, to every claim that mentions “divergent” datasets for which the reference cited does not explicitly disclose datasets whose predicted labels diverge for two or more neural networks.  For clarity, in all such claims except for this one the word “divergent” is removed and replaced with an ellipsis.