Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Status of Claims
Claims 1-20 are pending in the present application. Claims 1, 3-5, 11, and 13-15 are newly amended.

Response to Arguments
Applicant's arguments filed 3/24/2022 have been fully considered but they are not persuasive. 
	Interview Summary (page 8):
	The examiner notes that they agreed that the previous rejections under §102 appeared to be overcome by the proposed amendments, but would like to clarify that they noted persisting issues with the combination of Dasgupta and Shen and encouraged Applicant to further clarify the selection of training samples in view of the importance scores.

Arguments regarding Claim Objections (page 8):
	In view of amendments, the claim objections are withdrawn.

	Arguments regarding rejections under §112 (page 9):
	In view of amendments, the rejections are withdrawn.



	Arguments regarding rejections under §102 and §103 (pages 9-11):
Per the Applicant’s argument that “Dasgupta only mentions scoring with respect to tracking model performance as performance relates to metrics such as precision, recall, and F1 scores, etc.” and that “Dasgupta does not mention in any way a performance score based on mutual information between a vector of model parameters of the machine learning model and the training examples.” (page 11):
	The examiner thanks the Applicant for the response, but respectfully disagrees with Applicant’s interpretation of the reference and offers the following clarification. Dasgupta discloses the selection of a batch of training samples from unlabeled training data based on an informativeness (i.e., importance) score assigned to the samples. However, Dasgupta does not explicitly disclose doing so based on mutual information between the vector of model parameters and the training data. Shen discloses selection of the most uncertain samples for training (i.e., as the most important), and the determination of that uncertainty based on the distribution of model parameters in relation to the training data set (i.e., based on mutual information between a vector of model parameters and the training examples) (see at least p. 289 section 2 and p. 290 section 3.5). When Shen is applied to Dasgupta, the resulting system would incorporate both Shen’s importance-based selection and Dasgupta’s scoring, and would both select a batch of training samples for training based on importance scores assigned to samples and generate importance scores based on the mutual information between the model parameters and the training samples. Thus, the combination of Dasgupta and Shen discloses each and every element of claim 1. Accordingly, the rejections are upheld.


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1-4, 6, 9, 11-14, 16, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over US 10719301 B1 to Dasgupta et al (hereinafter, Dasgupta), in view of “K-COVERS FOR ACTIVE LEARNING IN IMAGE CLASSIFICATION” to Shen et al (hereinafter, Shen).

As per claim 1, Dasgupta teaches A system for training a machine learning model using a batch based active learning approach, the system comprising: an information source; and an electronic processor, the electronic processor configured to (Column 60, lines 52-62, “FIG. 30 is a block diagram illustrating an example computer system that can be used to one or more portions of an MDE that allows users to develop models through iterative model experiments, according to some embodiments. Computer system 3000 may include or be configured to access one or more nonvolatile computer-accessible media. In the illustrated embodiment, computer system 3000 includes one or more processors 3010 coupled to a system memory 3020 via an input/output (I/O) interface 3030. Computer system 3000 further includes a network interface 3040 coupled to I/O interface 3030.”)
(i) receive a machine learning model to be trained, an unlabeled training data set, a labeled training data set, and an identifier of the information source (Column 10, lines 36-48, “In some embodiments, the model experiment interface 144 may allow the user to specify a variety of model experiment parameters, and then launch a model experiment. For example, an experiment definition user interface may allow a user to select a model for the experiment, which may be a model that was the result of a previous experiment, stored in the model repository 164. The experiment definition interface may also allow the user to select one or more data sets to use for the experiment. In some embodiments, the experiment definition interface may allow the user to specify one or more validation runs of the model, using a validation data set that is separate from the training or testing data sets.”
Column 27, lines 15-27, “At operation 1120, a training data set of media data for a machine learning media model is annotated according to the user input. In some embodiments, the media data management interface may contain user control elements that allow a user to import, export, and label image sets managed by the MDE. In some embodiments, the annotations may be performed by one or more components described in the data preparation layer 230 of FIG. 2. In some embodiments, the media data management interface may allow a user to create, a training data set and a test data set from the media data to perform model experiments on ML media models. In some embodiments, other data sets may also be created, for example, one or more validation data sets”
Column 34, lines 7-26, “In some embodiments, the user 1320 may examine the training images 1352 and the labels selected by the classifier, and correct 1356 training images that were incorrectly classified by the classifier. In some embodiments, the user may interact with the training interface 1350 using user controls to correct the classifier-provided labels of individual images. The user corrected annotations are then used to update or train 1354 the classifier. In some embodiments, all of the training images selected for the training interface 1350 may be moved from the unlabeled images set 1316 to the labeled images set 1314, as the user annotates the training images. In some embodiments, the moving may be accomplished by updating an indication or designation of a training image in the unlabeled set to indicate that the image is now labeled. Depending on the embodiment, the move from the unlabeled set to the labeled set may be performed either just before or after the user actually performs the annotation. In some embodiments, the training interface 1350 may be used multiple times to train the classifier before moving on to the next step of the process.”);
(ii) select a batch of training examples from the unlabeled training data set, based on importance scores of training examples from the unlabeled training data set, [wherein the importance scores are based on mutual information between a vector of model parameters of the machine learning model and the training examples] (Column 32, lines 3-15, “In some embodiments, the annotation system employs active learning techniques to interactively select the most informative samples to be annotated by a human. In some embodiments, the selection is done from a large corpus of unlabeled samples. Initially the active learner is seeded with data points that are chosen by identifying the centroid of unique clusters in the unlabeled pool of data. With the seed, the learner builds a classifier which is then executed over all the unlabeled examples. In some embodiments, samples that are difficult to classify are selected for labeling. Once human(s) annotate new samples, the classifier may be retrained with the new data which are the most confusing samples to the classifier's current state.” Column 33, lines 12-24, “In some embodiments, the annotation process involves an active learning procedure where labels for samples are iteratively acquired from the user 1320 which are used to train the classifier 1380. In each iteration, a set of the training samples may be selected, which are then presented to the user for annotation. In some embodiments, the training samples may be selected based on a confidence metric of the classifier's annotations. In some embodiments, the samples may be selected based an informative metric, selecting the most informative samples to train the classifier. As the iterations progress, the classifier becomes better, and can ultimately be used to predict on the rest of the media samples via an extrapolation operation 1370.” Examiner Note: The examiner sees Dasgupta’s scoring of a sample based on an informative metric as equivalent to assigning a sample an importance score.);
(iii) send, to the information source, a request for, for each training example included in the batch, a label for the training example (Column 32, lines 3-15, “In some embodiments, the annotation system employs active learning techniques to interactively select the most informative samples to be annotated by a human. In some embodiments, the selection is done from a large corpus of unlabeled samples. Initially the active learner is seeded with data points that are chosen by identifying the centroid of unique clusters in the unlabeled pool of data. With the seed, the learner builds a classifier which is then executed over all the unlabeled examples. In some embodiments, samples that are difficult to classify are selected for labeling. Once human(s) annotate new samples, the classifier may be retrained with the new data which are the most confusing samples to the classifier's current state.”);
(iv) for each training example included in the batch receive a label, associate the training example with the label, and add the training example to the labeled training data set (Column 32, lines 3-15, “In some embodiments, the annotation system employs active learning techniques to interactively select the most informative samples to be annotated by a human. In some embodiments, the selection is done from a large corpus of unlabeled samples. Initially the active learner is seeded with data points that are chosen by identifying the centroid of unique clusters in the unlabeled pool of data. With the seed, the learner builds a classifier which is then executed over all the unlabeled examples. In some embodiments, samples that are difficult to classify are selected for labeling. Once human(s) annotate new samples, the classifier may be retrained with the new data which are the most confusing samples to the classifier's current state.”); and
(v) train the machine learning model using the labeled training data included in the batch (Column 32, lines 3-15, “In some embodiments, the annotation system employs active learning techniques to interactively select the most informative samples to be annotated by a human. In some embodiments, the selection is done from a large corpus of unlabeled samples. Initially the active learner is seeded with data points that are chosen by identifying the centroid of unique clusters in the unlabeled pool of data. With the seed, the learner builds a classifier which is then executed over all the unlabeled examples. In some embodiments, samples that are difficult to classify are selected for labeling. Once human(s) annotate new samples, the classifier may be retrained with the new data which are the most confusing samples to the classifier's current state.”).

Dasgupta does not explicitly teach (ii) select a batch of training examples from the unlabeled training data set, based on importance scores of training examples from the unlabeled training data set, wherein the importance scores are based on mutual information between a vector of model parameters of the machine learning model and the training examples.

Shen teaches (ii) select a batch of training examples from the unlabeled training data set, based on importance scores of training examples from the unlabeled training data set, wherein the importance scores are based on mutual information between a vector of model parameters of the machine learning model and the training examples (p. 289, 2, “The core idea in [7] is to estimate the posterior distribution of model parameters by the intrinsic randomness of the neural networks, e.g. randomized dropout.” p. 290, 3.5, “Four different uncertainty functions are used in our study: entropy [21], variation ratio [7] and Monte-Carlo dropout (MCdropout) [7]. It is also worth mentioning that our algorithm is agnostic to the choice of the uncertainty function” p. 290, 3.5, “Monte-Carlo Dropout [7] is a variational bayesian approximation of the uncertainty function p˜(y = c|x, D), which interprets the random distribution driven by the randomized dropout [22] to be the approximation of the posterior distribution of the parameters p˜(θ|D)” Examiner Note: Shen discloses determining uncertainty (e.g., informativeness) based on the approximation (i.e., mutual information) of the training data distribution to the model parameters (which one of ordinary skill in the art would recognize may be formatted as a vector). Dasgupta discloses the selection of a batch of training samples from an unlabeled training data set based on an importance score (see at least Column 33, lines 12-24, cited above). When Shen is applied to Dasgupta, the resulting system would select a batch of training samples for training based on importance scores assigned to samples, and would generate importance scores based on the mutual information between the model parameters and the training samples.).

Dasgupta and Shen are analogous art because they are both directed to Active Learning systems. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Dasgupta’s active learning system with Shen’s uncertainty-based sample scoring. The combination would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention because he/she would have been motivated to increase the performance of the learning model, which can be accomplished by selecting the best samples for training (Shen, p. 288, Introduction, “In each round, we use a data acquisition function to determine which subset of unlabeled samples to get labeled. The goal of the data acquisition function is to maximize the performance of the deep learning models trained the chosen labeled samples.”).
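
For illustration only, the following sketch shows one estimator commonly used in the active learning literature for a mutual-information-based importance score when parameter uncertainty is approximated by repeated stochastic forward passes (for example, Monte-Carlo dropout): the entropy of the averaged prediction minus the average entropy of the individual predictions. The choice of this estimator is an assumption made for the example. It is not quoted from the application, Dasgupta, or Shen, and the names `entropy` and `mutual_information_score` are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mutual_information_score(sampled_probs):
    """Estimate the mutual information between the predicted label of one
    example and the model parameters.

    sampled_probs holds one class-probability list per sampled parameter
    vector (assumption: e.g. one list per stochastic dropout pass).
    """
    n_samples = len(sampled_probs)
    n_classes = len(sampled_probs[0])
    # Average the predictions over the sampled parameter vectors.
    mean_probs = [
        sum(sample[c] for sample in sampled_probs) / n_samples
        for c in range(n_classes)
    ]
    # Average entropy of the individual predictions.
    expected_entropy = sum(entropy(s) for s in sampled_probs) / n_samples
    return entropy(mean_probs) - expected_entropy


# Hypothetical predictions for two unlabeled examples, three passes each.
agree = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]]
disagree = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]
scores = {
    "agree": mutual_information_score(agree),
    "disagree": mutual_information_score(disagree),
}
```

Under this estimator the score is close to zero when every sampled parameter vector yields the same prediction and grows as the sampled predictions disagree, so the second example would be ranked as more important.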

As per claim 2, the combination of Dasgupta and Shen thus far teaches The system according to claim 1.
Dasgupta teaches wherein the electronic processor is configured to repeat acts (ii)-(v) until training of the machine learning model is complete (Column 35, lines 11-23, “However, if the classifier is not yet performing sufficiently well, the user may indicate 1368 that the process should continue for more training. In some embodiments, the annotation system may go back to the training step, and generate the training interface 1350 once again, to allow the user to train the classifier with more images from another training set 1342 selected from the sample set 1310. The process thus repeats with repeated trainings and evaluations of the classifier, until the classifier is performing sufficiently well to label all of the images in the image set 1310. By using the active learning classifier, the annotation time for media data sets are vastly reduced.”).

As per claim 3, the combination of Dasgupta and Shen thus far teaches The system according to claim 2.
Dasgupta teaches wherein the training of the machine learning model is complete when at least one condition is met from the group comprising: the machine learning model achieves a desired success rate, the machine learning model achieves a desired failure rate, the electronic processor has sent at least a predetermined number of requests for labels to the information source, and at least a predetermined amount of processing power is used to query the information source (Column 34, line 63 – Column 35, line 4, “As shown, in some embodiments, the validation interface 1360 may also allow the user to indicate whether the classifier 1380 should be allowed to proceed to the extrapolation process 1370. The extrapolation process may be initiated, for example, because the accuracy level of the classifier in predicting user annotations have reached a certain threshold level. When the accuracy level of the classifier is satisfactory, the user may indicate 1369 that the extrapolation process may proceed.”
Column 35, lines 11-23, “However, if the classifier is not yet performing sufficiently well, the user may indicate 1368 that the process should continue for more training. In some embodiments, the annotation system may go back to the training step, and generate the training interface 1350 once again, to allow the user to train the classifier with more images from another training set 1342 selected from the sample set 1310. The process thus repeats with repeated trainings and evaluations of the classifier, until the classifier is performing sufficiently well to label all of the images in the image set 1310. By using the active learning classifier, the annotation time for media data sets are vastly reduced.”).
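
For illustration only, the sketch below shows how acts (ii)-(v) might be repeated until a completion condition is met. It models two of the recited conditions, a desired success rate and a predetermined number of label requests; the processing-power condition is not modeled. All names (`active_learning_loop`, `score`, `oracle`, `train`, `evaluate`) and the toy one-dimensional model are hypothetical and are not taken from the application or from Dasgupta.

```python
def active_learning_loop(unlabeled, labeled, score, oracle, train, evaluate,
                         batch_size=2, target_accuracy=0.9, max_requests=10):
    """Repeat select / request labels / add to labeled set / train."""
    requests_sent = 0
    model = train(labeled)
    while unlabeled:
        if evaluate(model) >= target_accuracy:   # desired success rate met
            break
        if requests_sent >= max_requests:        # label-request budget spent
            break
        # (ii) select the highest-scoring unlabeled examples as the batch
        ranked = sorted(unlabeled, key=lambda x: score(model, x), reverse=True)
        batch = ranked[:batch_size]
        for example in batch:
            label = oracle(example)              # (iii) request a label
            requests_sent += 1
            unlabeled.remove(example)
            labeled.append((example, label))     # (iv) add to labeled set
        model = train(labeled)                   # (v) retrain
    return model, requests_sent


# Toy setup (assumption): the model is a threshold on a number line.
def train(labeled):
    highest_negative = max(x for x, y in labeled if y == 0)
    lowest_positive = min(x for x, y in labeled if y == 1)
    return (highest_negative + lowest_positive) / 2


def score(model, x):
    return -abs(x - model)   # closer to the threshold = more important


def oracle(x):
    return int(x >= 7)       # stands in for the information source


validation = [(x, int(x >= 7)) for x in range(10)]


def evaluate(model):
    return sum(int(x >= model) == y for x, y in validation) / len(validation)


model, requests = active_learning_loop(
    unlabeled=[1, 2, 3, 4, 5, 6, 7, 8], labeled=[(0, 0), (9, 1)],
    score=score, oracle=oracle, train=train, evaluate=evaluate,
    batch_size=2, target_accuracy=1.0, max_requests=6)
```

In this toy run the loop stops because the success-rate condition is met before the request budget is exhausted.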

As per claim 4, Dasgupta teaches The system according to claim 1, wherein the electronic processor is configured to select a batch of training examples from the unlabeled training data set by (vi) assigning an importance score to a training example for each training example included in the unlabeled training data set (Column 33, lines 12-24 “In some embodiments, the annotation process involves an active learning procedure where labels for samples are iteratively acquired from the user 1320 which are used to train the classifier 1380. In each iteration, a set of the training samples may be selected, which are then presented to the user for annotation. In some embodiments, the training samples may be selected based on a confidence metric of the classifier's annotations. In some embodiments, the samples may be selected based an informative metric, selecting the most informative samples to train the classifier. As the iterations progress, the classifier becomes better, and can ultimately be used to predict on the rest of the media samples via an extrapolation operation 1370.” Examiner Note: The examiner sees Dasgupta’s scoring of a sample based on an informative metric as equivalent to assigning a sample an importance score.); and
(vii) clustering the training examples (Column 36, lines 1-12, “In some embodiments, the annotation system may cluster the images in the image set using a clustering technique. This clustering may be used to roughly determine different clusters of images with similar features in the image set. Thus, when data sets (e.g. seed image, unlabeled images, test images, etc.) are created from the image set for the annotation system, these data sets will each have a diversified sample of the image set. Moreover, the data sets may be generated so that their proportion of images from a particular feature cluster are approximately the same. This matching of the feature composition across the data sets reduces the risk of bias within any data set.”).
Dasgupta discloses clustering, selection of cluster centroids, and selection of training samples based on an importance score, but does not explicitly disclose (viii) determining a centroid for each cluster based on importance scores of training examples included in the cluster associated with the centroid; (ix) for each centroid, selecting one or more training examples associated with the centroid to include in the batch.

Shen teaches (viii) determining a centroid for each cluster based on importance scores of training examples included in the cluster associated with the centroid (Page 289, Section 3. “Our method tries to find the data samples with the largest uncertainty in each cluster which is determined by our proposed K-Covers clustering approach.”
Page 290, Section 3.3. “Therefore, one sample in each cluster is enough to the represent the whole cluster. Similar to [12] and [14] which also combine the cluster information and the uncertainty information in their data acquisition function, we choose the sample with the largest uncertainty in each cluster to get labeled.”
Examiner Note: Dasgupta discloses scoring a sample based on its informativeness (i.e., importance) as above, as well as the selection of cluster centroids as in Column 32, lines 3-15 and Column 33, lines 46-51. Shen discloses selection of a most important sample to be an exemplar of a training set. When Shen is applied to Dasgupta, the resulting system would determine the centroid of clusters based on the importance scores of the samples in the cluster.);
(ix) for each centroid, selecting one or more training examples associated with the centroid to include in the batch (Page 289, Section 3. “Our method tries to find the data samples with the largest uncertainty in each cluster which is determined by our proposed K-Covers clustering approach.”
Page 290, Section 3.3. “Therefore, one sample in each cluster is enough to the represent the whole cluster. Similar to [12] and [14] which also combine the cluster information and the uncertainty information in their data acquisition function, we choose the sample with the largest uncertainty in each cluster to get labeled.” Examiner Note: Dasgupta discloses creating a batch of training samples for active learning from an unlabeled data set through clustering, assigning an importance score, and selecting cluster centroids as training samples in the batch of training examples (i.e., the seed images) (see especially Column 32, lines 3-15 and Column 33, lines 46-51 as cited below). Dasgupta does not explicitly utilize the importance score in the determining of the cluster centroids. When Shen is applied to Dasgupta, the resulting system would use the importance score in the determining of the cluster centroids.
Column 32, lines 3-15, “In some embodiments, the annotation system employs active learning techniques to interactively select the most informative samples to be annotated by a human. In some embodiments, the selection is done from a large corpus of unlabeled samples. Initially the active learner is seeded with data points that are chosen by identifying the centroid of unique clusters in the unlabeled pool of data. With the seed, the learner builds a classifier which is then executed over all the unlabeled examples.”
Column 33, lines 46-51 “…In some embodiments, these feature vectors are then used to obtain a set of diversified examples from the image set as the seed images. For example, a clustering technique may be used in some embodiments. In some embodiments, techniques such as k-medoids centroids are used to choose the seed images.”).

Dasgupta and Shen are analogous art because they are both directed to Active Learning systems. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Dasgupta’s active learning system with Shen’s centroid determination. The combination would have been obvious to one of ordinary skill in the art before the filing date of the claimed invention because he/she would have been motivated to increase the performance of the learning model, which can be accomplished by selecting the best samples for training (Shen, p. 288, Introduction, “In each round, we use a data acquisition function to determine which subset of unlabeled samples to get labeled. The goal of the data acquisition function is to maximize the performance of the deep learning models trained the chosen labeled samples.”).
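
For illustration only, the following sketch shows per-cluster selection of the highest-scoring samples, assuming that cluster assignments and importance scores have already been computed by some other means. It is not Shen's K-Covers algorithm or Dasgupta's k-medoids seeding; the function name `select_per_cluster` and the sample values are hypothetical.

```python
def select_per_cluster(cluster_ids, scores, per_cluster=1):
    """Return indices of the highest-scoring samples in each cluster.

    cluster_ids[i] is the cluster of sample i and scores[i] is its
    importance score (both assumed to be precomputed).
    """
    by_cluster = {}
    for index, cluster in enumerate(cluster_ids):
        by_cluster.setdefault(cluster, []).append(index)

    batch = []
    for cluster in sorted(by_cluster):
        # Rank the members of this cluster from most to least important.
        members = sorted(by_cluster[cluster],
                         key=lambda i: scores[i], reverse=True)
        batch.extend(members[:per_cluster])
    return batch


# Hypothetical data: six samples in three clusters.
cluster_ids = [0, 0, 1, 1, 1, 2]
scores = [0.2, 0.7, 0.1, 0.9, 0.4, 0.3]
batch = select_per_cluster(cluster_ids, scores)
```

Taking one sample per cluster keeps the batch diverse, while ranking within each cluster by score keeps it informative.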

As per claim 6, the combination of Dasgupta and Shen thus far teaches The system according to claim 4.
Dasgupta teaches wherein the importance score of a training example is indicative of reduction in uncertainty of the machine learning model when the machine learning model is trained using the training example (Column 33, lines 12-24, “In some embodiments, the annotation process involves an active learning procedure where labels for samples are iteratively acquired from the user 1320 which are used to train the classifier 1380. In each iteration, a set of the training samples may be selected, which are then presented to the user for annotation. In some embodiments, the training samples may be selected based on a confidence metric of the classifier's annotations. In some embodiments, the samples may be selected based an informative metric, selecting the most informative samples to train the classifier. As the iterations progress, the classifier becomes better, and can ultimately be used to predict on the rest of the media samples via an extrapolation operation 1370.”).

As per claim 9, the combination of Dasgupta and Shen thus far teaches The system according to claim 1.
Dasgupta teaches wherein the electronic processor is further configured to when training of the machine learning model is complete, input an image or an audio recording to the machine learning model for classification into one of a plurality of classes, and based on the classification of the image perform one selected from the group comprising control an action of a vehicle, allow access to an electronic device, and output an alert to a user (Column 18, lines 55-65, “As shown, in some embodiments, the production model 512 may be operating in a production environment, for example, a live web service or web site, and making machine-learned decisions based on production input media samples 505. For example, a production model may be a model that is actually deployed on self-driving vehicles that is being used to make decisions about road images. In some embodiments, the production model 512 may be configured to make the same prediction tasks as the MUD 522, which may be a next version of the production model being developed.”).

Claim 11 is a method claim corresponding to system claim 1. Claim 11 is rejected for the same reasons as claim 1.

Claim 12 is a method claim corresponding to system claim 2. Claim 12 is rejected for the same reasons as claim 2. 

Claim 13 is a method claim corresponding to system claim 3. Claim 13 is rejected for the same reasons as claim 3.

Claim 14 is a method claim corresponding to system claim 4. Claim 14 is rejected for the same reasons as claim 4.

Claim 16 is a method claim corresponding to system claim 6. Claim 16 is rejected for the same reasons as claim 6.

Claim 19 is a method claim corresponding to system claim 9. Claim 19 is rejected for the same reasons as claim 9.

Claims 5 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Dasgupta in view of Shen, further in view of “An Improved Fuzzy c-Means Clustering Algorithm Based on Shadowed Sets and PSO” to Zhang and Shen (hereinafter, Zhang).

As per claim 5, the combination of Dasgupta and Shen thus far teaches The system according to claim 4.
Dasgupta teaches wherein the electronic processor is configured to repeat act (ix) until a desired number of training examples are included in the batch (Column 9, lines 29-39, “In some embodiments, once a set of media samples are labelled, the interface 142 may allow users to divide the media samples into data sets for model development processes. For example, the interface 142 may allow users to specify how to create one or more training sets, validation sets, or test sets of media samples for a given model development projection. In some embodiments, the creation of data sets may be performed in a largely automated fashion, based on certain user-specified parameters, such as the size of the data sets, proportions of classes in each set, etc.”
Column 36, lines 1-12, “In some embodiments, the annotation system may cluster the images in the image set using a clustering technique. This clustering may be used to roughly determine different clusters of images with similar features in the image set. Thus, when data sets (e.g. seed image, unlabeled images, test images, etc.) are created from the image set for the annotation system, these data sets will each have a diversified sample of the image set. Moreover, the data sets may be generated so that their proportion of images from a particular feature cluster are approximately the same. This matching of the feature composition across the data sets reduces the risk of bias within any data set.”
Examiner Note: Dasgupta discloses the selection of a data set size in Column 9, lines 29-39 and that the seed images are a data set in Column 36, lines 1-12. Thus, Dasgupta would select seed images until the desired number of seed training examples set in Column 9, lines 29-39 is reached.).

The combination of Dasgupta and Shen does not explicitly teach each time act (ix) is repeated modify a predetermined threshold for selecting one or more training examples to include in a batch.

Zhang teaches each time act (ix) is repeated modify a predetermined threshold for selecting one or more training examples to include in a batch (Page 4, Section 2.4, “For a fuzzy set with discrete membership function, the balance equation is modified as [Eq. 11]... In order to find the best 𝛼𝑗, it should satisfy the following optimal problem: [Eq. 12] where 𝑢𝑖𝑗 ∈ [0, 1] is the membership of 𝑥𝑖 in a cluster with prototype 𝛽𝑗; 𝑢𝑗max and 𝑢𝑗min denote the highest and lowest membership values to the 𝑗th cluster; and 𝛼𝑗 is the threshold of the 𝑗th cluster. The range of feasible values of threshold 𝛼𝑗 is [𝑢𝑗min , (𝑢𝑗min + 𝑢𝑗max )/2] [19].”
Page 4, Section 3, “Thus, an optimal threshold 𝛼𝑗 (𝑗 = 1, 2, . . . 𝐶) for each column should be found to create a harder partition by (12). The amount of data which are assigned membership value equal to 1 is identified as the cardinality of corresponding cluster. According to 𝛼𝑗, the cardinality of the 𝑗th column is [Eq. 14] Here, the threshold is not subjectively user-defined but it is established on the balance of uncertainty and can be adjusted automatically in the clustering process”
Examiner Note: Dasgupta teaches the selection of training samples based on the system’s confidence in classifying the selected samples and a confidence threshold (see especially Column 41, lines 35-45 and Column 43, lines 50-57, cited below), but does not explicitly teach adjusting a threshold used for selection of training images. Zhang teaches modifying membership thresholds during iterations of a cluster classification algorithm. When Zhang is applied to Dasgupta, the resulting system would select a sample based on that sample meeting a confidence threshold that is adjusted during the learning process.
Column 41, lines 35-45, “In some embodiments, the displayed media samples (e.g. images) may be selected as the most informative samples for training or testing the classification model. Depending on the embodiment, different sampling strategies may be used. In some embodiments, a sampling may be performed using a confidence metric associated with the classifier model's annotation decisions, so that samples associated with lower confidence metrics are selected as training samples. In some embodiments, an entropy measure may be used to perform the selection, so that a diverse set of samples in terms of feature sets are selected.”
Column 43, lines 50-57, “For example, only a selection of the samples with the highest confidence metrics (or those meeting a confidence threshold) may be selected for export. In some embodiments, any samples that were annotated by the user is selected for the export. By using the confidence threshold, the annotation ensures that the exported images are correctly labeled to a high degree of probability.”).
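For illustration only, the threshold adjustment described in the Zhang passages can be sketched as a grid search over the feasible range [𝑢𝑗min, (𝑢𝑗min + 𝑢𝑗max)/2]. The balance objective below (reduced memberships plus elevated memberships minus the size of the shadow region) is an assumption modeled on the usual shadowed-set formulation, since Eq. 12 is not reproduced in this action; the function name and the `steps` parameter are hypothetical.

```python
def shadowed_threshold(memberships, steps=100):
    """Grid-search a threshold alpha for one cluster's membership column.

    Assumed objective (not quoted from Zhang): |reduced + elevated - shadow|.
    """
    u_min, u_max = min(memberships), max(memberships)
    upper = (u_min + u_max) / 2  # upper end of the feasible range
    best_alpha, best_gap = u_min, float("inf")
    for k in range(steps + 1):
        alpha = u_min + (upper - u_min) * k / steps
        # Memberships at or below alpha are reduced to 0.
        reduced = sum(u for u in memberships if u <= alpha)
        # Memberships at or above u_max - alpha are elevated to 1.
        elevated = sum(u_max - u for u in memberships if u >= u_max - alpha)
        # Memberships strictly between the two cut points form the shadow.
        shadow = sum(1 for u in memberships if alpha < u < u_max - alpha)
        gap = abs(reduced + elevated - shadow)
        if gap < best_gap:
            best_alpha, best_gap = alpha, gap
    return best_alpha
```

Because the threshold is recomputed from the current membership values, calling the function once per clustering iteration yields a threshold that changes as the memberships change, which is the behavior the Examiner Note relies on.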

Dasgupta, Shen, and Zhang are analogous art because they are directed to machine learning. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Dasgupta’s active learning system with Shen’s centroid determination and Zhang’s threshold modification. The combination would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention because he/she would have been motivated to reduce the computational demand of the system, which can be accomplished by modifying the threshold for inclusion (Zhang, p. 4, 2.4, “The main merits of shadowed sets involve the optimization mechanism for choosing separate threshold and the reduction of the burden of plain numeric computations”).

Claim 15 is a method claim corresponding to system claim 5 and is rejected for the same reasons.

Claims 7 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Dasgupta in view of Shen, further in view of “Pairwise Data Clustering by Deterministic Annealing” to Hofmann and Buhmann (hereinafter, Hofmann).

As per claim 7, the combination of Dasgupta and Shen thus far teaches The system according to claim 4. 

Dasgupta teaches wherein the electronic processor is configured to determine an uncertainty of the machine learning model (Column 33, line 57 - Column 34, line 6, “Next, as shown, the annotation system may display a set of training images 1352 from the labeled images set 1314 from the sample set 1310, as shown. The training images may be displayed via a training interface 1350, as shown. In some embodiments, the training images displayed in the training interface 1350 may be displayed with labels selected by the classifier 1380. In some embodiments, the training images that are displayed represents a set of the most confusing or informative samples for the classifier. For example, the degree of confusion of individual training images may be indicated via a confusion metric, or an uncertainty metric obtained from the classifier. In some embodiments, the confusion or uncertainty metric may be determined based on a class match probability computed by the classifier. In some embodiments, the uncertainty metric may be determined based on a degree of disagreement among of a number of different classifier models.”).
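As a minimal sketch of the cited passage, an uncertainty metric based on a class match probability could be computed as one minus the highest class probability, with the most uncertain samples ranked first. The least-confidence formula and the function names are assumptions chosen for illustration; Dasgupta does not specify a particular formula.

```python
def least_confidence(class_probs):
    """Uncertainty as 1 minus the top class probability (assumed metric)."""
    return 1.0 - max(class_probs)


def most_uncertain(samples, k):
    """Return the names of the k samples the classifier is least sure about.

    `samples` maps a sample name to its list of class probabilities.
    """
    ranked = sorted(samples, key=lambda name: least_confidence(samples[name]),
                    reverse=True)
    return ranked[:k]
```

For example, given class probabilities [0.9, 0.1], [0.55, 0.45], and [0.7, 0.3] for three samples, the second sample ranks as the most uncertain and the first as the least.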
Dasgupta does not explicitly teach use a Gibbs distribution with a temperature coefficient that is the inverse of the uncertainty of the machine learning model to determine a probability that a training example is associated with a centroid; and when the probability that the training example is associated with the centroid is greater than a predetermined threshold include the training example in the batch.

Hofmann teaches use a Gibbs distribution with a temperature coefficient that is the inverse of the uncertainty of the machine learning model to determine a probability that a training example is associated with a centroid (Page 4, section 4, “Following the strategy of stochastic optimization as discussed in Section 3.1 for central clustering, we estimate the expectation values for the assignment of data to clusters at a specified uncertainty level parametrized by the computational temperature T. Assignments M of data to clusters are randomly drawn from the set of admissible configurations (7) according to the Gibbs distribution [Eq. 18] where ℋ are the costs for a pairwise clustering solution (16)”
Page 6, section 4.2, “The true expected assignments are given by [Eq. 27] being defined in (24). The fraction exp(−ε_iν / T) / Σ_μ exp(−ε_iμ / T) in (27) implements a partition of unity. The system of the N x K equations (27) is computationally intractable since we have to carry out the averaging of the partition of unity over an exponential number of assignment configurations. The smoothness of a transition from one cell of the partition to a neighboring cell is controlled by the inverse temperature 1/T.”
Examiner Note: Dasgupta teaches uncertainty of an ML model in classifying a sample, selection of the most uncertain samples for use as training examples, and selection of centroids as training samples, but does not explicitly disclose using a Gibbs distribution or temperature coefficient in that selection. Hofmann teaches the use of a Gibbs distribution where the uncertainty of the sample is characterized by a temperature coefficient, and assignment of a sample to a cluster is based on an inverse temperature (and thus also uncertainty) coefficient. When Hofmann is applied to Dasgupta, the resulting system would use a Gibbs distribution with a temperature coefficient that is the inverse of the uncertainty of the machine learning model in estimating the probability that a sample belongs to a given cluster.); and 
when the probability that the training example is associated with the centroid is greater than a predetermined threshold include the training example in the batch (Page 4, section 4, “Following the strategy of stochastic optimization as discussed in Section 3.1 for central clustering, we estimate the expectation values for the assignment of data to clusters at a specified uncertainty level parametrized by the computational temperature T. Assignments M of data to clusters are randomly drawn from the set of admissible configurations (7) according to the Gibbs distribution [Eq. 18] where ℋ are the costs for a pairwise clustering solution (16)”
Page 6, section 4.2, “The true expected assignments are given by [Eq. 27] being defined in (24). The fraction exp(−ε_iν / T) / Σ_μ exp(−ε_iμ / T) in (27) implements a partition of unity. The system of the N x K equations (27) is computationally intractable since we have to carry out the averaging of the partition of unity over an exponential number of assignment configurations. The smoothness of a transition from one cell of the partition to a neighboring cell is controlled by the inverse temperature 1/T.”
Examiner Note: Dasgupta teaches selection of a training example for inclusion in a training data set based on that training example being associated with a centroid, a confidence threshold in the probability that the training example belongs to a class, and determining a confidence that an example belongs to a certain class (see especially Column 32, lines 3-15, as cited above), but does not specify determining the probability as above. Hofmann teaches determining the probability as above. When Hofmann is applied to Dasgupta, the resulting system would include a training sample in a training dataset based on that example meeting a probability (i.e., confidence) threshold.).
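For illustration only, the partition-of-unity fraction quoted from Hofmann can be sketched as a temperature-scaled normalization over per-cluster costs, followed by the threshold test recited in the claim. The function names are hypothetical, the per-cluster costs are taken as given inputs rather than derived from Hofmann's Eq. 24, and how a model uncertainty would be mapped to the temperature T is left open here.

```python
import math


def gibbs_probabilities(costs, temperature):
    """Return exp(-c/T) / sum(exp(-c'/T)) for each per-cluster cost c."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [-c / temperature for c in costs]
    shift = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - shift) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def include_in_batch(costs, cluster, temperature, threshold):
    """True when the assignment probability for `cluster` exceeds threshold."""
    return gibbs_probabilities(costs, temperature)[cluster] > threshold
```

At a low temperature the probabilities concentrate on the lowest-cost cluster, while at a high temperature they approach a uniform distribution, so the same example can pass the threshold at one temperature and fail it at another.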

Dasgupta, Shen, and Hofmann are analogous art because they are directed to machine learning. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Dasgupta’s active learning system with Shen’s centroid determination and Hofmann’s Gibbs-based clustering. The combination would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention because he/she would have been motivated to increase the performance of the system, which can be accomplished by clustering via entropy (Hofmann, page 10, section 7, “Benchmark clustering experiments support our claim that deterministic annealing yields substantially better results than conventional clustering concepts based on gradient descent minimization. The outlined strategy for analyzing stochastic algorithms for pairwise clustering should be considered as a general program for deriving robust optimization algorithms which are based on the maximum entropy principle”).

Claim 17 is a method claim corresponding to system claim 7. Claim 17 is rejected for the same reasons as claim 7.

Claims 8 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Dasgupta in view of Shen, further in view of US 20080154848 A1 to Haslam et al. (hereinafter, Haslam).

As per claim 8, the combination of Dasgupta and Shen thus far teaches The system according to claim 4.

Dasgupta does not explicitly teach wherein an importance score of a centroid of a cluster is a median of each of the training examples included in the cluster.

Haslam teaches wherein an importance score of a centroid of a cluster is a median of each of the training examples included in the cluster ([0068] “Lastly, report 1100 includes an overall score 1117. The overall score 1117 may be calculated as an average (e.g. mean, mode, median) of the other similarity scores, or it may be calculated as the best similarity score, or it may be computed using appropriate weights for each similarity score. This is particularly so if some similarity scores have more probative value than others. Embodiments for calculation of UPC similarity, IPC similarity, logical scores, and centroid scores are discussed below.”
[0093] “To measure this phenomenon, the centroid score takes a set of keywords to search, and determines not only if the keywords are in a reference, but also the distance the keywords cluster around a center point. The center point represents the average position of a cluster of keywords, and is called a centroid. The average distance of the keywords from the centroid is calculated. Then this average distance is optionally normalized to some scale, for example from 0 to 1, where 1 indicates high correlation and 0 indicates no correlation. This normalized score is the centroid score. The details of calculating and interpreting the centroid score are described below.”
Examiner Note: Dasgupta discloses K-means clustering, K-medoid clustering, selection of cluster centers for inclusion in a training set, and selection of samples for inclusion in a training set based on an importance score, but does not disclose equating the importance score of a centroid with the median of the examples in the cluster. Haslam discloses scoring the centroid of a cluster as the average of the samples in the cluster, and further teaches that the average can be a median. When Haslam is applied to Dasgupta, the resulting system would score the importance of a centroid using the median of the samples in the cluster.).
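As a minimal sketch of the combination described in the Examiner Note, a centroid's importance score could be taken as the median of the importance scores of the examples in its cluster. The function name and the assumption that each example already carries a numeric importance score are hypothetical.

```python
from statistics import median


def centroid_importance(example_scores):
    """Score a centroid as the median of its cluster's example scores."""
    if not example_scores:
        raise ValueError("cluster has no examples")
    return median(example_scores)
```

Unlike the mean, the median is not pulled toward a single outlying example, so one unusually high score in a cluster does not by itself raise the centroid's score.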

Dasgupta, Shen, and Haslam are analogous art because they are directed to data processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Dasgupta’s active learning system with Shen’s centroid determination and Haslam’s centroid scoring. The combination would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention because he/she would have been motivated to increase the efficiency of the system, which can be accomplished through more accurate centroid scoring (Haslam [0002], “Techniques for content search, analysis and comparison are described herein. In one embodiment, these methods may be used to enhance the efficiency and quality of prior art search, analysis of patents, and comparison of patents with reference content. However, the techniques may be used for any other type of content search, comparison and analysis.”).

Claim 18 is a method claim corresponding to system claim 8. Claim 18 is rejected for the same reasons as claim 8.

Claims 10 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Dasgupta in view of Shen, further in view of US 20190354810 A1 to Samel et al. (hereinafter, Samel).

As per claim 10, the combination of Dasgupta and Shen thus far teaches The system of claim 1. 

The combination of Dasgupta and Shen does not explicitly teach wherein the electronic processor is further configured to append a denoising layer onto the machine learning model, wherein the denoising layer determines noise associated with a training example and the machine learning model is trained using output from the denoising layer.

Samel teaches wherein the electronic processor is further configured to append a denoising layer onto the machine learning model, wherein the denoising layer determines noise associated with a training example and the machine learning model is trained using output from the denoising layer ([0032] “More specifically, denoising engine 214 generates groupings 214 of features 210 and labels 232 in the training data by clustering the training data by internal representations 212. For example, denoising engine 214 may use k-means clustering, spectral clustering, balanced iterative reducing and clustering using hierarchies (BIRCH), and/or another type of clustering technique to generate groupings 214 of the training data by values of internal representations 212. Because internal representations 212 are used by machine learning model 208 to discriminate between different labels 232 based on the corresponding features 210, clustering of the training data by internal representations 212 allows denoising engine 214 to identify groupings 214 of features 210 that produce different labels 232, even when significant noise and/or inconsistency is present in the original labels 232.”
[0033] “Prior to generating groupings 214, denoising engine 204 optionally reduces a dimensionality of internal representations 212 by which the training data is clustered. For example, denoising engine 204 may use principal components analysis (PCA), linear discriminant analysis (LDA), matrix factorization, autoencoding, and/or another dimensionality reduction technique to reduce the complexity of internal representations 212 prior to clustering the training data by internal representations 212.”
[0034] “After groupings 214 are generated, denoising engine 204 generates updated labels 216 for training data in each grouping based on the occurrences of label values 218 of original labels 232 in the grouping. For example, denoising engine 204 may select an updated label as the most frequently occurring label value in a given cluster of training data. Denoising engine 204 then replaces label values 218 in the cluster with the updated label.”).
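For illustration only, the relabeling step described in Samel paragraph [0034] can be sketched as replacing each label with the most frequent label in its cluster. The function name is hypothetical, and the cluster assignments are assumed to have been computed beforehand (for example, by k-means over the model's internal representations).

```python
from collections import Counter, defaultdict


def relabel_by_cluster(cluster_ids, labels):
    """Replace each label with the majority label of its cluster."""
    by_cluster = defaultdict(list)
    for cluster_id, label in zip(cluster_ids, labels):
        by_cluster[cluster_id].append(label)
    # Most frequently occurring label value per cluster.
    majority = {
        cluster_id: Counter(cluster_labels).most_common(1)[0][0]
        for cluster_id, cluster_labels in by_cluster.items()
    }
    return [majority[cluster_id] for cluster_id in cluster_ids]
```

In a cluster holding two "cat" labels and one "dog" label, the single "dog" label is treated as noise and rewritten to "cat" before training.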

Dasgupta, Shen, and Samel are analogous art because they are all directed to processing training data. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Dasgupta’s active learning system and Shen’s data analysis with Samel’s data denoising. The combination would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention because he/she would have been motivated to increase the performance of the machine learning model, which can be accomplished by denoising the input data (Samel [0009], “At least one advantage and technological improvement of the disclosed techniques is a reduction in noise, inconsistency, and/or inaccuracy in labels used to train machine learning models, which provide additional improvements in the training and performance of the machine learning models. Consequently, the disclosed techniques provide technological improvements in the training, execution, and performance of machine learning models and/or the execution and performance of applications, tools, and/or computer systems for performing cleaning and/or denoising of data.”).
Claim 20 is a method claim corresponding to system claim 10. Claim 20 is rejected for the same reasons as claim 10.



Conclusion
	The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. “Active Learning Literature Survey” to Burr Settles.
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action.  
Any inquiry concerning this communication or earlier communications from the examiner should be directed to PAUL G SMITH whose telephone number is (571)272-9730. The examiner can normally be reached M-F 9:30-18:00 EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ann Lo, can be reached at (571)272-9767. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

Respectfully Submitted,



/P.G.S./Examiner, Art Unit 2126
/NICHOLAS KLICOS/Primary Examiner, Art Unit 2145