DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
Acknowledgement is made of Applicant’s claim amendments on 09/27/2021. The claim amendments are entered. Presently, claims 1, 4-5, 7, and 9-10 remain pending. Claims 1, 4-5, 7, and 9-10 have been amended.
Response to Arguments
Applicant’s arguments with respect to claim(s) 1, 9, and 10 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Regarding the 35 U.S.C. 101 rejection of claims 1, 9, and 10, Applicant has sufficiently amended the claims to overcome the rejection. Accordingly, the 35 U.S.C. 101 rejections are withdrawn.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 4, 7, 9, and 10 are rejected under 35 U.S.C. 103 as being unpatentable over Xiao et al. (US-20060112028-A1) in view of Wu (US-20080255844-A1), Laws et al. (“Stopping Criteria for Active Learning of Named Entity Recognition”), and Pennacchiotti et al. (US 20110022550 A1).
Regarding Claim 1,
Xiao teaches a learning apparatus comprising: 
at least one memory storing instructions (para [0003] Von Neumann type computers include a memory and a processor.); 
and at least one processor configured to access the at least one memory and execute the instructions (para [0003] In operation, instructions and data are read from the memory and executed by the processor.) to: 
update a dictionary used by a classifier (para [0123] In step 722 the average of the derivatives of the objective function that are computed in step block 720 are processed with an optimization algorithm in order to calculate new values of the weights. Examiner note: Examiner interprets a dictionary as neural network weights. para [0150] As mentioned above in classification problems it is appropriate to apply the sigmoid function at the output nodes. (Alternatively, other threshold functions are used in lieu of the sigmoid function.) Aside from the special case in which what is desired is a yes or no answer as to whether a particular input belongs to a particular class); 
compare a loss calculated by using the updated dictionary and a loss calculated by using the dictionary before updating with respect to all the samples with labeling before adding a new sample with labeling (Fig. 7; para [0130] The stopping condition preferably requires that the difference between the value of the objective function evaluated with the new weights and the value of the objective function calculated with the old weights is less than a predetermined small number. And para [0133] OBJ^NEW and OBJ^OLD are the values of the objective function e.g., Equation Five for the current and preceding values of the weights. And para [0135]); 
output the updated dictionary to the classifier, the updated dictionary improving data classification by the classifier (para [0173] in which, λ is a user chosen parameter that determines the relative priority of the sub-objective of minimizing the differences between actual and expected values, and the sub-objective of minimizing the number of weights of significant value. Lambda is preferably chosen in the range of 0.01 to 0.1, and is more preferably approximately equal to 0.05. Too high a value of lambda can lead to reduction of the complexity of the neural network at the expense of its prediction or classification performance, whereas too low of a value can lead to a network that is excessively complex and in some cases prone to over training.).
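For illustration only, the stopping condition quoted from Xiao para [0130] can be sketched as follows. The names `objective_converged` and `train`, the tolerance, the toy quadratic objective, and the use of an absolute difference are assumptions made for this sketch and are not taken from Xiao.

```python
def objective_converged(obj_new: float, obj_old: float, tol: float = 1e-6) -> bool:
    """True when the objective changed by less than `tol` (absolute difference is assumed)."""
    return abs(obj_new - obj_old) < tol


def objective(w: float) -> float:
    # Hypothetical toy objective standing in for the objective function of the reference.
    return (w - 3.0) ** 2


def train(w: float = 0.0, lr: float = 0.1, max_steps: int = 10_000) -> tuple[float, int]:
    """Gradient descent that compares the new and old objective values after each update."""
    obj_old = objective(w)
    for step in range(1, max_steps + 1):
        grad = 2.0 * (w - 3.0)   # derivative of the toy objective
        w = w - lr * grad        # new value of the weight
        obj_new = objective(w)
        if objective_converged(obj_new, obj_old):
            return w, step       # stopping condition met
        obj_old = obj_new
    return w, max_steps
```

Calling `train()` returns the final weight and the number of updates performed before the difference between successive objective values fell below the tolerance.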
Xiao does not explicitly disclose
select a sample that is likely to be discriminated as a class not being a correct-answer class based on values calculated for each class by a discriminant function, from one or more samples not assigned with a label, as a sample being a target of labeling; 
acquire, when a label is assigned to the sample being a target of labeling, the samples with labeling including the sample being a target of labeling;
update a dictionary used by a classifier, by using the acquired samples with labeling; 
calculate, by using the updated dictionary and the one or more samples, a ratio to a number of the samples with labeling as a loss with respect to all the samples with labeling; 
update the dictionary by using the samples with labeling added with the new sample with labeling when the loss calculated by using the dictionary before updating is less than the loss calculated by using the updated dictionary by using the samples with labeling added with the new sample with labeling; 
terminate labeling work of a correct answer class when the loss calculated by using the updated dictionary decreases in inverse proportion to a number of the samples with labeling, and 
However, Wu teaches
calculate, by using the updated dictionary and one or more samples with labeling being samples assigned with labels (para [0026] It should be noted, in the normal scenario, the L is averaged over the entire training corpus, where N is the number of training samples.), a ratio to a number of the samples with labeling as a loss with respect to all the samples with labeling (para [0023] The overall criteria to minimize empirical error rate can then be defined over the entire training corpus as the objective function, 
$$L = \frac{1}{N} \sum_{i=1}^{N} l(X_i, W_c, W_r).$$
where the loss function is,
$$l(X_i; W_c, W_r) = \frac{1}{1 + \exp\left(-S(X_i, W_c) + S(X_i, W_r)\right)} \quad \text{and} \quad S(X, W) = p(W)\, p(X \mid W)$$
and where $W_c$ is the correct word transcription based on user correction activity and/or pre-labeled training corpus); 
It would have been obvious to a person having ordinary skill in the art before the effective filing date to combine Xiao et al.’s method of training a neural network with Wu’s method of training a neural network.
Doing so would allow for reducing empirical error (Abs. Architecture for minimizing an empirical error rate by discriminative adaptation of a statistical language model in a dictation and/or dialog application.).
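For illustration only, the averaged loss quoted from Wu para [0023] can be sketched as follows. The sketch follows the formula as quoted; the function names and the representation of each sample as a pair of precomputed scores (S for the correct transcription, S for the rival transcription) are assumptions made for this example.

```python
import math


def sample_loss(score_correct: float, score_rival: float) -> float:
    """Per-sample loss l = 1 / (1 + exp(-S(X, W_c) + S(X, W_r))), as quoted."""
    return 1.0 / (1.0 + math.exp(-score_correct + score_rival))


def average_loss(scores: list[tuple[float, float]]) -> float:
    """L = (1/N) * sum of the per-sample losses over all N labeled samples."""
    if not scores:
        raise ValueError("at least one labeled sample is required")
    return sum(sample_loss(sc, sr) for sc, sr in scores) / len(scores)
```

Because the sum is divided by N, the value is a ratio to the number of labeled samples rather than a raw total.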
Laws teaches
update a dictionary used by a classifier, by using the acquired samples with labeling (pg. 466, section 2; For a given measure Mi,X, we select in each iteration the unlabeled example(s) in the pool that have the smallest value for Mi,X (corresponding to the maximum uncertainty). And pg. 466, section 2.1; We start with a seed set of ten consecutive tokens randomly selected from the training pool and label it.); 
update the dictionary by using the samples with labeling added with the new sample with labeling (pg. 467, col. 1; We then label these tokens and add them to the labeled training set. The classifiers are retrained with the new training set and the AL loop repeats.) when the loss calculated by using the dictionary before updating is less than the loss calculated by using the updated dictionary by using the samples with labeling added with the new sample with labeling (pg. 467; Furthermore we find that after the baseline performance is reached the increase in performance quickly levels off to a point where using more training data does not yield performance improvements anymore. In fact, our experiments show that there is a peak in performance reached at about 12% of the training data and performance decreases again after this point (see Figure 1). The peak is more prominent if the pool is large. On a pool of 30,000 tokens, peak performance is about 2.5% F-Score better than the baseline; on a 6000 token pool, the difference is only about 1.7%. Therefore, once the peak is reached, the AL process should stop, even if the annotation budget is not yet used up. The f1 score reads on “loss”. As shown in Figure 1, as the f1 score increases (i.e., old f1 score is less than the new f1 score) training is continued until a peak is hit, then declines (“decreases”) after. The f1 score represents the accuracy/performance of the algorithm.);

[Image: media_image1.png (greyscale, 460 x 468)]

terminate labeling work of a correct answer class when the loss calculated by using the updated dictionary decreases in inverse proportion to a number of the samples with labeling (pg. 467; In fact, our experiments show that there is a peak in performance reached at about 12% of the training data and performance decreases again after this point (see Figure 1)), (pg. 467; Therefore, once the peak is reached, the AL process should stop, even if the annotation budget is not yet used up.), and 
It would have been obvious to a person having ordinary skill in the art before the effective filing date to combine the method of stopping training of Xiao et al. with the method of determining when to stop training of Laws et al.
Doing so would reduce computational costs (pg. 472; This might lead to an approach to reduce the computational cost of AL).
Pennacchiotti teaches
select a sample that is likely to be discriminated as a class not being a correct-answer class (para [0053] Given a target class c, T(c) denotes its training data, and respectively P(c) and N(c) the positive and negative subsets of the training. Unlabeled data is denoted as U(c), the set of instances collected by the aggregator that must be decoded by the ranker's learning algorithm. P(c) being the correct-answer class and N(c) not being a correct-answer class.) based on values calculated for each class by a discriminant function (para [0070] In order to rely upon quality (1), the system preferably selects only instances that have been extracted by a trusted KE of C, i.e. the confidence of them being positive is very high. To enforce (2), the system selects instances that have never been extracted by any KE of c. More formally, we define N(c) as follows: 
$$N(c) = \bigcup_{c_i \in C} P(c_i) \setminus U(c) \qquad (3)$$
U(c) denotes the samples not assigned a label.), from one or more samples not assigned with a label, as a sample being a target of labeling (para [0065] Near-class negatives: Near class negatives N(c) are selected from the population U(C) of the set of classes C which are semantically similar to c.);
acquire, when a label is assigned to the sample being a target of labeling, the samples with labeling including the sample being a target of labeling (para [0071] The main advantage of this method is that it acquires negatives that are semantic near-misses of the target class, thus allowing the learning algorithm to focus on these borderline cases.);

update a dictionary used by a classifier, by using the acquired samples with labeling (para [0049] In certain embodiments, the training set is also built based upon sources of negative instances and sources of positive instances (in contrast to extracted instances). As mentioned above, the ranker's model is trained on either a manually annotated random sample of entities taken from Aggregator 116, or automatically trained (auto learning) using the features generated by the feature generators. The weights (dictionary) are updated through training using the positive and negative acquired training samples.);
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the method of training a classifier of Xiao with the method of selecting training samples of Pennacchiotti.
Doing so would allow for a faster decoding time and improved accuracy. (para [0087] The above described embodiments have several advantages. They compete with systems incorporating state-of-the-art machine learning techniques, such as support vector machines, but have much smaller resulting models and faster decoding time. They also therefore improve the accuracy of search results or advertisements provided to a user. Embodiments outperform prior state of the art systems by up to 22% in mean average precision.)
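For illustration only, the near-class negative selection quoted from Pennacchiotti paras [0065] and [0070] can be sketched with set operations. The sketch reads Equation (3) as a set union over the similar classes followed by a set difference, which is an assumed reading; the function name and the example data are hypothetical.

```python
def near_class_negatives(
    similar_class_positives: dict[str, set[str]],
    unlabeled_for_target: set[str],
) -> set[str]:
    """Pool the positives P(c_i) of the similar classes, then drop anything in U(c)."""
    pooled: set[str] = set()
    for positives in similar_class_positives.values():
        pooled |= positives                   # assumed union over the similar classes c_i
    return pooled - unlabeled_for_target      # remove instances already collected for c


# Hypothetical data: two classes assumed to be semantically similar to the target class.
positives = {"athletes": {"a", "b"}, "musicians": {"b", "c"}}
collected_for_target = {"c", "d"}
negatives = near_class_negatives(positives, collected_for_target)
```

Under this reading, an instance qualifies as a negative only if it is a trusted positive of a similar class and has not been collected for the target class.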
Regarding Claim 4,
Xiao et al., Wu, Laws et al., and Pennacchiotti teach the learning apparatus according to claim 1. 
Laws et al. (Stopping Criteria for Active Learning of Named Entity Recognition) teaches
pg. 471; We achieve this with a moving median approach. At each step, we compute the median of w2 = {an−k, . . . , an} (the last n values) and of w1 = {an−k−1, . . . , an−1} (the previous last n values). Each value ai is the performance at iteration i (for the performance gradient) or the uncertainty of the instance selected in iteration i (for the uncertainty gradient). We then estimate the gradient using the medians of the two windows: g = (median(w2) − median(w1))/1 (4) For the performance estimate, which is less noisy, we can also use the arithmetic mean instead of the median. In this case, we simply replace “median” with “mean” in Equation 4.), determine not to update the dictionary (pg. 471; We stop the AL process when (i) the current certainty or estimated performance is a new maximum and (ii) the newly calculated gradient g is positive and (iii) g falls below a predefined level .).
It would have been obvious to a person having ordinary skill in the art before the effective filing date to combine the method of stopping training of Xiao et al. with the method of determining when to stop training of Laws et al.
Doing so would reduce computational costs (pg. 472; This might lead to an approach to reduce the computational cost of AL).
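For illustration only, the moving-median gradient and the three-part stopping test quoted from Laws pg. 471 can be sketched as follows. The function names, the window parameter `k`, the threshold argument `level`, and the example performance values are assumptions made for this sketch.

```python
from statistics import median


def window_gradient(values: list[float], k: int) -> float:
    """g = median(w2) - median(w1) for the two overlapping windows in the quoted passage."""
    if len(values) < k + 2:
        raise ValueError("need at least k + 2 values")
    w2 = values[-(k + 1):]      # a_{n-k} ... a_n
    w1 = values[-(k + 2):-1]    # a_{n-k-1} ... a_{n-1}
    return median(w2) - median(w1)


def al_should_stop(values: list[float], k: int, level: float) -> bool:
    """Stop when the newest value is a new maximum and 0 < g < level."""
    if len(values) < k + 2:
        return False            # not enough history to estimate the gradient
    is_new_max = values[-1] > max(values[:-1])
    g = window_gradient(values, k)
    return is_new_max and 0.0 < g < level


# Hypothetical performance estimates, one per active-learning iteration.
performance = [0.50, 0.60, 0.68, 0.70, 0.705, 0.708]
```

With these hypothetical values and `k = 2`, the gradient is small and positive at the last iteration, so `al_should_stop(performance, 2, 0.01)` signals a stop, whereas the same test on the first four values does not.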
Regarding claim 7,
to the classifier when the dictionary is determined not to update (para [0135] After process 700 has finished or after process 800 (described below) has been completed if the latter is used, the final values of the weights are used to construct a neural network.).
Regarding Claim 9,
Xiao teaches a learning method comprising: First named inventor: Atsushi SatoPage 4 Serial no. 15/536,783 Filed 06/16/2017 
updating a dictionary (para [0123] In step 722 the average of the derivatives of the objective function that are computed in step block 720 are processed with an optimization algorithm in order to calculate new values of the weights. Examiner note: Examiner interprets a dictionary as neural network weights) used by a classifier (para [0150] As mentioned above in classification problems it is appropriate to apply the sigmoid function at the output nodes. (Alternatively, other threshold functions are used in lieu of the sigmoid function.) Aside from the special case in which what is desired is a yes or no answer as to whether a particular input belongs to a particular class); 
comparing a loss calculated by using the updated dictionary and a loss calculated by using the dictionary before updating with respect to all the samples with labeling before adding the new sample with labeling (Fig. 7; para [0130] The stopping condition preferably requires that the difference between the value of the objective function evaluated with the new weights and the value of the objective function calculated with the old weights is less than a predetermined small number. And para [0133] OBJ^NEW and OBJ^OLD are the values of the objective function e.g., Equation Five for the current and preceding values of the weights. And para [0135]); 
outputting the updated dictionary to the classifier, the updated dictionary improving data classification by the classifier (para [0173] in which, λ is a user chosen parameter that determines the relative priority of the sub-objective of minimizing the differences between actual and expected values, and the sub-objective of minimizing the number of weights of significant value. Lambda is preferably chosen in the range of 0.01 to 0.1, and is more preferably approximately equal to 0.05. Too high a value of lambda can lead to reduction of the complexity of the neural network at the expense of its prediction or classification performance, whereas too low of a value can lead to a network that is excessively complex and in some cases prone to over training.).
Xiao does not explicitly disclose
selecting a sample that is likely to be discriminated as a class not being a correct-answer class based on values calculated for each class by a discriminant function, from one or more samples not assigned with a label, as a sample being a target of labeling; 
acquiring, when a label is assigned to the sample being a target of labeling, the samples with labeling including the sample being a target of labeling;
updating a dictionary used by a classifier by using the acquired with labeling;

updating the dictionary by using the samples with labeling added with the new sample with labeling when the loss calculated by using the dictionary before updating is less than the loss calculated by using the updated dictionary by using the samples with labeling added with the new sample with labeling; 
terminating labeling work of a correct answer class with the loss calculated by using the updated dictionary decreases in inverse proportion to a number of the samples with labeling; and 
However, Wu teaches
calculating, by using the updated dictionary and the one or more samples (para [0026] It should be noted, in the normal scenario, the L is averaged over the entire training corpus, where N is the number of training samples.), a ratio to a number of the samples with labeling as a loss with respect to all the samples with labeling (para [0023] The overall criteria to minimize empirical error rate can then be defined over the entire training corpus as the objective function, 
$$L = \frac{1}{N} \sum_{i=1}^{N} l(X_i, W_c, W_r).$$
where the loss function is,
$$l(X_i; W_c, W_r) = \frac{1}{1 + \exp\left(-S(X_i, W_c) + S(X_i, W_r)\right)} \quad \text{and} \quad S(X, W) = p(W)\, p(X \mid W)$$
and where $W_c$ is the correct word transcription based on user correction activity and/or pre-labeled training corpus); 
It would have been obvious to a person having ordinary skill in the art before the effective filing date to combine Xiao et al.’s method of training a neural network with Wu’s method of training a neural network.
Doing so would allow for reducing empirical error (Abs. Architecture for minimizing an empirical error rate by discriminative adaptation of a statistical language model in a dictation and/or dialog application.).
Laws teaches
updating a dictionary used by a classifier by using the acquired with labeling, (pg. 466, section 2; For a given measure Mi,X, we select in each iteration the unlabeled example(s) in the pool that have the smallest value for Mi,X (corresponding to the maximum uncertainty). And pg. 466, section 2.1; We start with a seed set of ten consecutive tokens randomly selected from the training pool and label it.); 
updating the dictionary by using the samples with labeling added with the new sample with labeling (pg. 467, col. 1; We then label these tokens and add them to the labeled training set. The classifiers are retrained with the new training set and the AL loop repeats.) when the loss calculated by using the dictionary before updating is less than the loss calculated by using the updated dictionary by using the samples with labeling added with the new sample with labeling (pg. 467; Furthermore we find that after the baseline performance is reached the increase in performance quickly levels off to a point where using more training data does not yield performance improvements anymore. In fact, our experiments show that there is a peak in performance reached at about 12% of the training data and performance decreases again after this point (see Figure 1). The peak is more prominent if the pool is large. On a pool of 30,000 tokens, peak performance is about 2.5% F-Score better than the baseline; on a 6000 token pool, the difference is only about 1.7%. Therefore, once the peak is reached, the AL process should stop, even if the annotation budget is not yet used up. The f1 score reads on “loss”. As shown in figure 1, as the f1 score increases (ie. old f1 score is less than the new f1 score) training is continued until a peak is hit, then declines (“decreases”) after. The f1 score represents the accuracy/performance of the algorithm.); 
terminating labeling work of a correct answer class with the loss calculated by using the updated dictionary decreases in inverse proportion to a number of the samples with labeling (pg. 467; In fact, our experiments show that there is a peak in performance reached at about 12% of the training data and performance decreases again after this point (see Figure 1)), (pg. 467; Therefore, once the peak is reached, the AL process should stop, even if the annotation budget is not yet used up.); and 
It would have been obvious to a person having ordinary skill in the art before the effective filing date to combine the method of stopping training of Xiao et al. with the method of determining when to stop training of Laws et al.
Doing so would reduce computational costs (pg. 472; This might lead to an approach to reduce the computational cost of AL).
Pennacchiotti teaches
selecting a sample that is likely to be discriminated as a class not being a correct-answer class (para [0053] Given a target class c, T(c) denotes its training data, and respectively P(c) and N(c) the positive and negative subsets of the training. Unlabeled data is denoted as U(c), the set of instances collected by the aggregator that must be decoded by the ranker's learning algorithm. P(c) being the correct-answer class and N(c) not being a correct-answer class.) based on values calculated for each class by a discriminant function (para [0070] In order to rely upon quality (1), the system preferably selects only instances that have been extracted by a trusted KE of C, i.e. the confidence of them being positive is very high. To enforce (2), the system selects instances that have never been extracted by any KE of c. More formally, we define N(c) as follows: 
$$N(c) = \bigcup_{c_i \in C} P(c_i) \setminus U(c) \qquad (3)$$
U(c) denotes the samples not assigned a label.), from one or more samples not assigned with a label, as a sample being a target of labeling (para [0065] Near-class negatives: Near class negatives N(c) are selected from the population U(C) of the set of classes C which are semantically similar to c.); 
acquiring, when a label is assigned to the sample being a target of labeling, the samples with labeling including the sample being a target of labeling (para [0071] The main advantage of this method is that it acquires negatives that are semantic near-misses of the target class, thus allowing the learning algorithm to focus on these borderline cases.);
updating a dictionary used by a classifier by using the acquired with labeling (para [0049] In certain embodiments, the training set is also built based upon sources of negative instances and sources of positive instances (in contrast to extracted instances). As mentioned above, the ranker's model is trained on either a manually annotated random sample of entities taken from Aggregator 116, or automatically trained (auto learning) using the features generated by the feature generators. The weights (dictionary) are updated through training using the positive and negative acquired training samples.);
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the method of training a classifier of Xiao with the method of selecting training samples of Pennacchiotti.
Doing so would allow for a faster decoding time and improved accuracy. (para [0087] The above described embodiments have several advantages. They compete with systems incorporating state-of-the-art machine learning techniques, such as support vector machines, but have much smaller resulting models and faster decoding time. They also therefore improve the accuracy of search results or advertisements provided to a user. Embodiments outperform prior state of the art systems by up to 22% in mean average precision.)
Regarding Claim 10,
Xiao teaches a computer-readable non-transitory recording medium storing a program causing a computer to (para [0190] The processes depicted in FIGS. 7,13 are preferably embodied in the form of one or more programs that can be stored on a computer-readable medium which can be used to load the programs into a computer for execution.) perform: 
processing of updating a dictionary (para [0123] In step 722 the average of the derivatives of the objective function that are computed in step block 720 are processed with an optimization algorithm in order to calculate new values of the weights. Examiner note: Examiner interprets a dictionary as neural network weights) used by a classifier (para [0150] As mentioned above in classification problems it is appropriate to apply the sigmoid function at the output nodes. (Alternatively, other threshold functions are used in lieu of the sigmoid function.) Aside from the special case in which what is desired is a yes or no answer as to whether a particular input belongs to a particular class); 
processing of comparing a loss calculated by using the updated dictionary and a loss calculated by using the dictionary before updating with respect to all the samples with labeling before adding the new sample with labeling (Fig. 7; para [0130] The stopping condition preferably requires that the difference between the value of the objective function evaluated with the new weights and the value of the objective function calculated with the old weights is less than a predetermined small number. And para [0133] OBJ^NEW and OBJ^OLD are the values of the objective function e.g., Equation Five for the current and preceding values of the weights. And para [0135]); 
processing of outputting the updated dictionary to the classifier, the updated dictionary improving data classification by the classifier (para [0173] in which, λ is a user chosen parameter that determines the relative priority of the sub-objective of minimizing the differences between actual and expected values, and the sub-objective of minimizing the number of weights of significant value. Lambda is preferably chosen in the range of 0.01 to 0.1, and is more preferably approximately equal to 0.05. Too high a value of lambda can lead to reduction of the complexity of the neural network at the expense of its prediction or classification performance, whereas too low of a value can lead to a network that is excessively complex and in some cases prone to over training.).
Xiao does not explicitly disclose
processing of selecting a sample that is likely to be discriminated as a class not being a correct-answer class based on values calculated for each class by a discriminant function, from one or more samples not assigned with a label, as a sample being a target of labeling;
processing of acquiring, when a label is assigned to the sample being a target of labeling, the samples with labeling including the sample being a target of labeling;
processing of updating a dictionary used by a classifier by using the acquired samples with labeling; 
processing of calculating, by using the updated dictionary and one or more samples with labeling, a ratio to a number of the samples with labeling as a loss with respect to all the samples with labeling;
processing of updating the dictionary by using the samples with labeling added with the new sample with labeling when the loss calculated by using the dictionary before updating is less than the loss calculated by using the updated dictionary by using the samples with labeling added with the new sample with labeling;

However, Wu teaches
processing of calculating, by using the updated dictionary and one or more samples with labeling (para [0026] It should be noted, in the normal scenario, the L is averaged over the entire training corpus, where N is the number of training samples.), a ratio to a number of the samples with labeling as a loss with respect to all the samples with labeling (para [0023] The overall criteria to minimize empirical error rate can then be defined over the entire training corpus as the objective function, 
$$L = \frac{1}{N} \sum_{i=1}^{N} l(X_i, W_c, W_r).$$
where the loss function is,
$$l(X_i; W_c, W_r) = \frac{1}{1 + \exp\left(-S(X_i, W_c) + S(X_i, W_r)\right)} \quad \text{and} \quad S(X, W) = p(W)\, p(X \mid W)$$
and where $W_c$ is the correct word transcription based on user correction activity and/or pre-labeled training corpus);
It would have been obvious to a person having ordinary skill in the art before the effective filing date to combine Xiao et al.’s method of training a neural network with Wu’s method of training a neural network.
Doing so would allow for reducing empirical error (Abs. Architecture for minimizing an empirical error rate by discriminative adaptation of a statistical language model in a dictation and/or dialog application.).
Laws teaches
processing of updating a dictionary used by a classifier by using the acquired samples with labeling (pg. 466, section 2; For a given measure Mi,X, we select in each iteration the unlabeled example(s) in the pool that have the smallest value for Mi,X (corresponding to the maximum uncertainty). And pg. 466, section 2.1; We start with a seed set of ten consecutive tokens randomly selected from the training pool and label it.); 
processing of updating the dictionary by using the samples with labeling added with the new sample with labeling (pg. 467, col. 1; We then label these tokens and add them to the labeled training set. The classifiers are retrained with the new training set and the AL loop repeats.)  when the loss calculated by using the dictionary before updating is less than the loss calculated by using the updated dictionary by using the samples with labeling added with the new sample with labeling (pg. 467; Furthermore we find that after the baseline performance is reached the increase in performance quickly levels off to a point where using more training data does not yield performance improvements anymore. In fact, our experiments show that there is a peak in performance reached at about 12% of the training data and performance decreases again after this point (see Figure 1). The peak is more prominent if the pool is large. On a pool of 30,000 tokens, peak performance is about 2.5% F-Score better than the baseline; on a 6000 token pool, the difference is only about 1.7%. Therefore, once the peak is reached, the AL process should stop, even if the annotation budget is not yet used up. The f1 score reads on “loss”. As shown in figure 1, as the f1 score increases (ie. old f1 score is less than the new f1 score) training is continued until a peak is hit, then declines (“decreases”) after. The f1 score represents the accuracy/performance of the algorithm.);
processing of terminating labeling work of a correct answer class when the loss calculated by using the updated dictionary decreases in inverse proportion to a number of the samples with labeling (pg. 467; In fact, our experiments show that there is a peak in performance reached at about 12% of the training data and performance decreases again after this point (see Figure 1)), (pg. 467; Therefore, once the peak is reached, the AL process should stop, even if the annotation budget is not yet used up.); and 
It would have been obvious to a person having ordinary skill in the art before the effective filing date to combine the method of stopping training of Xiao et al. with the method of determining when to stop training of Laws et al.
Doing so would reduce computational costs (pg. 472; This might lead to an approach to reduce the computational cost of AL).
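For illustration only, the stopping behavior cited from Laws et al. (labeling continues while performance rises and stops once the peak has been passed) can be sketched as follows. This is a minimal sketch under stated assumptions, not the criterion published by Laws et al. nor the claimed method: the function name peak_stop_index, the patience parameter, and the sample scores are hypothetical.

```python
from typing import Optional, Sequence


def peak_stop_index(f1_scores: Sequence[float], patience: int = 1) -> Optional[int]:
    """Return the iteration index at which labeling would stop, or None.

    Hypothetical helper: labeling continues while the score keeps improving
    and stops after `patience` consecutive iterations without improvement,
    i.e. once the peak has been passed.
    """
    best = float("-inf")
    non_improving = 0
    for i, score in enumerate(f1_scores):
        if score > best:
            # Performance still rising: keep labeling and retraining.
            best = score
            non_improving = 0
        else:
            # Performance levelled off or declined after the peak.
            non_improving += 1
            if non_improving >= patience:
                return i
    return None


# Made-up per-iteration scores that rise to a peak and then decline.
scores = [0.60, 0.70, 0.78, 0.80, 0.79, 0.77]
stop_at = peak_stop_index(scores)
```

With the made-up scores above, the peak is at index 3 and the sketch reports a stop at index 4, the first iteration after the peak.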
Pennacchiotti teaches
processing of selecting a sample that is likely to be discriminated as a class not being a correct-answer class (para [0053] Given a target class c, T(c) denotes its training data, and respectively P(c) and N(c) the positive and negative subsets of the training. Unlabeled data is denoted as U(c), the set of instances collected by the aggregator that must be decoded by the ranker's learning algorithm. P(c) being the correct-answer class and N(c) not being a correct-answer class.) based on values calculated for each class by a discriminant function (para [0070] In order to rely upon quality (1), the system preferably selects only instances that have been extracted by a trusted KE of C, i.e. the confidence of them being positive is very high. To enforce (2), the system selects instances that have never been extracted by any KE of c. More formally, we define N(c) as follows: 
$N(c) = \bigcup_{c_i \in C} P(c_i) \setminus U(c)$ (3) U(c) denotes the samples not assigned a label.), from one or more samples not assigned with a label, as a sample being a target of labeling (para [0065] Near-class negatives: Near class negatives N(c) are selected from the population U(C) of the set of classes C which are semantically similar to c.);
First named inventor: Atsushi Sato Page 5 Serial no. 15/536,783 Filed 06/16/2017
 processing of acquiring, when a label is assigned to the sample being a target of labeling, the samples with labeling including the sample being a target of labeling (para [0071] The main advantage of this method is that it acquires negatives that are semantic near-misses of the target class, thus allowing the learning algorithm to focus on these borderline cases.);
processing of updating a dictionary used by a classifier by using the acquired samples with labeling (para [0049] In certain embodiments, the training set is also built based upon sources of negative instances and sources of positive instances (in contrast to extracted instances). As mentioned above, the ranker's model is trained on either a manually annotated random sample of entities taken from Aggregator 116, or automatically trained (auto learning) using the features generated by the feature generators. The weights (dictionary) are updated through training using the positive and negative acquired training samples.); 
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the method of training a classifier of Xiao with the method of selecting training samples of Pennacchiotti.
Doing so would allow for a faster decoding time and improved accuracy. (para [0087] The above described embodiments have several advantages. They compete with systems incorporating state-of-the-art machine learning techniques, such as support vector machines, but have much smaller resulting models and faster decoding time. They also therefore improve the accuracy of search results or advertisements provided to a user. Embodiments outperform prior state of the art systems by up to 22% in mean average precision.)
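For illustration only, the near-class negative selection cited from Pennacchiotti (para [0065] and [0070]) can be sketched with set operations. The sketch assumes equation (3) is read as the union of the positive sets P(c_i) of the semantically similar classes, minus the unlabeled instances U(c) of the target class; the function name, argument names, and sample data are hypothetical.

```python
from typing import Dict, Iterable, Set


def near_class_negatives(
    positives_by_class: Dict[str, Set[str]],
    similar_classes: Iterable[str],
    unlabeled_for_target: Set[str],
) -> Set[str]:
    """Assumed reading of equation (3): N(c) = (union of P(c_i)) minus U(c)."""
    pooled: Set[str] = set()
    for c_i in similar_classes:
        # Trusted positives of each semantically similar class c_i.
        pooled |= positives_by_class.get(c_i, set())
    # Drop anything that was extracted as a candidate for the target class c.
    return pooled - unlabeled_for_target


# Made-up data: target class "musicians", similar classes "actors" and "directors".
positives = {"actors": {"a", "b", "x"}, "directors": {"b", "c"}}
negatives = near_class_negatives(positives, ["actors", "directors"], {"x", "c"})
```

In this made-up example the instances "x" and "c" are excluded because they also appear among the unlabeled candidates of the target class, leaving "a" and "b" as near-class negatives.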


Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Xiao et al. (US-20060112028-A1) in view of Wu (US-20080255844-A1), Laws et al. (“Stopping Criteria for Active Learning of Named Entity Recognition”), Nguyen et al. (“Stopping criteria for ensemble of evolutionary artificial neural networks”), and Pennacchiotti et al. (US 20110022550 A1).
Regarding Claim 5, 
Xiao et al., Wu, Laws et al., and Pennacchiotti teach the learning apparatus according to claim 1.
	Xiao et al., Wu, and Laws et al. do not explicitly disclose

However, Nguyen et al. teaches
wherein the at least one processor is further configured to execute the instructions to: calculate a correlation function between a ratio of a number of the samples with labeling to a first number of samples being smaller than the number of the samples with labeling by a predetermined number, and a ratio of a loss when a number of the samples with labeling is the first number of samples to a loss with respect to all the samples with labeling (pg. 103; $F_p(m)$ is called the penalty function of network m and pattern p. This represents the correlation between the networks. $F_p(m) = (\hat{Y}_p(m) - F_p) \sum_{l \neq m} (\hat{Y}_p(l) - F_p)$ (11)), and, when the correlation function is greater than a predetermined threshold value, determine not to update the dictionary (pg. 104; The second criterion is to choose the ensemble corresponding to the minimum validation error.).
It would have been obvious to a person having ordinary skill in the art before the effective filing date to combine the stopping criterion of Laws et al. with the stopping criterion of Nguyen et al.
Doing so would allow training to be stopped before overfitting occurs (pg. 101; Thus, it is also desirable to stop the training phase in the right moment before overfitting happens.).
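For illustration only, the penalty function cited from Nguyen et al. (pg. 103, equation (11)) can be sketched numerically. The sketch assumes the equation reads as the deviation of network m from the ensemble output for pattern p, multiplied by the summed deviations of the other networks, and it assumes the ensemble output is the simple mean of the network outputs unless one is supplied; the function name ncl_penalty and its arguments are hypothetical.

```python
from typing import Optional, Sequence


def ncl_penalty(
    outputs: Sequence[float], m: int, ensemble_output: Optional[float] = None
) -> float:
    """Penalty of network m for one pattern, under the assumed reading of (11).

    outputs[l] is the output of network l for the pattern. If no ensemble
    output is supplied, the simple mean of the outputs is used (assumption).
    """
    if ensemble_output is None:
        ensemble_output = sum(outputs) / len(outputs)
    # Deviation of network m from the ensemble output.
    own_deviation = outputs[m] - ensemble_output
    # Summed deviations of every other network l != m.
    other_deviations = sum(
        outputs[l] - ensemble_output for l in range(len(outputs)) if l != m
    )
    return own_deviation * other_deviations


# Made-up outputs of three networks for one pattern.
penalty = ncl_penalty([1.0, 2.0, 3.0], m=0)
```

When the ensemble output is the mean, the deviations sum to zero, so the penalty reduces to the negative square of network m's own deviation; for the made-up outputs above it evaluates to -1.0.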
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Yang et al., “Unbiased Active Learning” (US 20100217732 A1). This art discloses a method for labeling unlabeled instances using a loss and stop condition.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to HENRY K NGUYEN whose telephone number is (571)272-0217. The examiner can normally be reached Mon - Fri 7:00am-4:30pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li B Zhen, can be reached on (571)272-3768. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/H.N./Examiner, Art Unit 2121                                                                                                                                                                                                        

/Li B. Zhen/Supervisory Patent Examiner, Art Unit 2121