DETAILED ACTION
Notice of Pre-AIA or AIA Status
1.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination
2.	A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 07 January 2022 has been entered [hereinafter Response], where:
Claims 1 and 11 have been amended.
Claim 20 has been previously cancelled.
Claims 1-19 and 21 are pending.
Claims 1-19 and 21 are rejected.
Claim Rejections - 35 U.S.C. § 103
3.	The following is a quotation of 35 U.S.C. § 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
4.	The factual inquiries for determining obviousness under 35 U.S.C. § 103 are summarized as follows:
1. 	Determining the scope and contents of the prior art.
2. 	Ascertaining the differences between the prior art and the claims at issue.
3. 	Resolving the level of ordinary skill in the pertinent art.
4. 	Considering objective evidence present in the application indicating obviousness or nonobviousness.
5.	This application currently names joint inventors. In considering patentability of the claims, the Examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the Examiner to consider the applicability of 35 U.S.C. § 102(b)(2)(C) for any potential 35 U.S.C. § 102(a)(2) prior art against the later invention.
6.	Claims 1-5, 7, 10-15 and 17 are rejected under 35 U.S.C. § 103 as being unpatentable over US Patent 10255300 to Jhingran et al. [hereinafter Jhingran] in view of Andrei Alexandrescu, "Scalable Graph-Based Learning Applied to Human Language Technology," University of Washington (2009) [Thesis] [hereinafter Alexandrescu].
Regarding claim 1, Jhingran teaches [a] method comprising performing, by a server computer (Jhingran 2:58-59 teaches a server):
determining a dataset collected over a time period (Jhingran 2:34-37 teaches a collected set of event data (that is, determining a dataset collected over a time period));
determining a plurality of feature and label pairs associated with the dataset (Jhingran 12:65 to 13:10 & FIG. 11 teaches feature and label pairs associated with the dataset, where the feature [(spending)] for which attribute data is extracted (associated with the dataset) . . . and . . . attributes (that is, labels) have been established that include “$10,” “$20,” “$30,” and “$40.” (that is, determining a plurality of feature and label pairs associated with the dataset));
for each of the plurality of feature and label pairs:
determining whether a feature and label pair is significant (Jhingran 17:14-22 teaches, with respect to extracting profile feature attribute data from event data [(see Jhingran 12:65 to 13:10 & FIG. 11)], a pattern type of . . . events meets the threshold number of occurrences and is therefore significant enough to be used as attribute data for a profiling technique (determining whether the feature and label pair is significant)), . . . ; 
* * *
and initiating an action using the set of predicted labels (Jhingran 18:60-64 teaches a highest ranked users may be sent relevant information (initiating an action using the set of predicted labels)) that were predicted for the entity with the updated training model (Jhingran 4:46-51 teaches each profiling technique is configured to generate (that is, the updated training model) using the input attribute data. For example, a prediction target that a certain profiling technique is configured to generate is whether a user (that is, for the entity) is likely to purchase a high-end vehicle given the input attribute data of the user's income range (that is, that were predicted for the entity with the updated training model)).
Though Jhingran teaches the feature of thresholds and statistical significance relating to metrics associated with an entity, Jhingran does not explicitly teach -
for each of the plurality of feature and label pairs:
determining whether a feature and label pair is significant, wherein significant is when a probability that the feature and label pair will occur is greater than a probability that a label will occur; 
upon determining that the feature and label pair is significant, adding the feature and label pair to a training model;
after determining that the feature and label pair is significant, updating the training model with the feature and label pair that is determined to be significant;
predicting a set of labels for the entity with the updated training model; and
* * *
But Alexandrescu teaches -
for each of the plurality of feature and label pairs:
determining whether a feature and label pair is significant, wherein significant is when a probability that the feature and label pair will occur is greater than a probability that a label will occur (Alexandrescu at p. 83, “4.4.1 Architecture of Contemporary Phrase-Based SMT Systems: Training,” first partial paragraph, teaches [t]he basic approach aims at computing parameters that maximize p(y|x), which, after applying the Bayes rule, becomes:

[Equation image media_image1.png: arg max p(y|x) rewritten by the Bayes rule in terms of the two factors p(x|y) and p(y)]

Of the two factors, p(y) (that is, a probability that a label will occur) is computed by using a language model on the target language side . . . . The translation model p(x|y) (that is, a probability that the feature and label pair will occur) is the more difficult subsystem to train; a variety of training methods are being used . . . . Additional models may be used in the rescoring process, and the weights of the log-linear model associated with them are trained on the training set; Alexandrescu at p. 65, “4.2.1 Learning With Only Positive Examples,” first full paragraph, teaches the train[ing] set contains examples and counter-examples, i.e. "good" training pairs ((x, y)) + (that is, significant) and "bad" training pairs ((x, y)) -. In such situations, a common approach is to assign each positive training sample a constant high score s+, and each negative training sample a constant low score s -; Alexandrescu at p. 4, “1.1 What is Human Language Technology?,” first full paragraph, teaches [t]raditional statistical learning methods use a supervised approach, meaning that a model's parameters are adjusted (trained) by using labeled data, i.e., data for which both inputs (also known as features [(that is “x”)]) and correct outputs (often referred to as labels [(that is “y”)]) are known);
upon determining that the feature and label pair is significant, adding the feature and label pair to a training model; after determining that the feature and label pair is significant, updating the training model with the feature and label pair that is determined to be significant (Alexandrescu at p. 65, “4.2.1 Learning Only With Positive Examples,” first full paragraph, teaches a common approach is to assign each positive training sample a constant high score s+, and each negative training sample a constant low score s -. Then regression learns a real-valued function with range [s-, s+]. A given test sample will be "pulled" towards the positive or negative vertices as dictated by the graph structure. The actual constants s- and s+ dictate the highest and lowest score received by any test sample—in label propagation, all learned scores will fall in between these limits by the maximum principle of harmonic functions [1]. Aside from the obvious requirement s- < s+, (that is, after determining that the feature and label pair is significant) there are no other restrictions with regard to choosing these values; we are only interested in their ordering; Alexandrescu at p. 65, “4.2.1 Learning Only With Positive Examples,” first full paragraph, teaches [m]any structured learning problems, however, only define a training set containing only positive examples, that is, correct pairs (x_i, y_i) ∀ i ∈ {1, . . . , t}. Moreover, all training pairs are equally realizable, desirable, or "good" (that is, updating the training model with the feature and label pair that is determined to be significant); Alexandrescu at p. 111, “5.2 Stochastic Label Propagation,” first paragraph, teaches a new algorithm for label propagation. Instead of using matrix algebra to update all elements of iv in one epoch, Algorithm 2 updates exactly one randomly-chosen element at a time (that is, updating the trained model));
predicting a set of labels for an entity with the updated training model (Alexandrescu at p. 4, “1.1 What is Human Language Technology?,” first full paragraph, teaches [a]fter the model has been trained, it is able to predict correct labels when presented with formerly-unseen features, as long as there exists correlation between features and labels and the correlation is the same for both training and test data (that is, predicting a set of labels for an entity using the updated training model)); and
* * *
Jhingran and Alexandrescu are from the same or similar field of endeavor. Jhingran teaches automatically extracting attribute data from the collected event data. Alexandrescu teaches using graph-based semi-supervised learning techniques for natural language processing, and addresses issues of distance measure learning, scalability, and structured inputs and outputs. Thus, it would have been obvious to one of ordinary skill in the art as of the effective filing date of Applicant’s invention to modify Jhingran pertaining to categorization prediction relative to an attribute with the feature-label training set partitioning of Alexandrescu.
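The motivation for doing so is to implement a graph reduction technique that reduces the labeled sample size to one vertex per distinct label without affecting learning results. (Alexandrescu, at p. 173, “Conclusions”).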
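For reference only, the quantities cited above from Alexandrescu at p. 83 are related by the Bayes rule as set out below. The symbols x (feature) and y (label) follow the mapping used in this Office action. The last line is a standard probability identity included for illustration; it is not quoted from either reference and assumes p(x) > 0.

```latex
\begin{align*}
p(y \mid x) &= \frac{p(x \mid y)\, p(y)}{p(x)}, \\
\arg\max_{y}\, p(y \mid x) &= \arg\max_{y}\, p(x \mid y)\, p(y), \\
p(y \mid x) > p(y) &\iff p(x, y) > p(x)\, p(y).
\end{align*}
```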
Regarding claim 2, the combination of Jhingran and Alexandrescu teaches all of the limitations of claim 1, as described above.
Jhingran teaches wherein determining the plurality of feature and label pairs associated with the dataset comprises:
determining a feature dataset based on a first subset of the dataset (Jhingran, 3:15-20 teaches bins of quantized values associated with one or more features (feature dataset) . . . to represent a characteristic of the input information (based on a first subset of the dataset)); 
determining a label dataset based on a second subset of the dataset (Jhingran 12:65 to 13:10 & FIG. 11 teaches feature and label pairs associated with the dataset, where the feature [(spending)] for which attribute data is extracted (associated with the dataset) . . . and . . . attributes (that is, labels) have been established (determining a label dataset based on a second subset of the dataset) that include “$10,” “$20,” “$30,” and “$40”), wherein the first subset corresponds to data collected at an earlier time within the time period than that of the data corresponding to the second subset (Jhingran 5:34-37 teaches [e]vent data comprising user actions with temporal aspects (e.g., timestamps and/or chronological sequences) associated with various users is collected from a variety of sources (the first subset corresponds to data collected at an earlier time within the time period than that of the data corresponding to the second subset)); and
determining the plurality of feature and label pairs that co-occur based on the feature dataset and the label dataset (Jhingran 15:43-44 teaches [i]nstances of the same sequence (co-occur) of events (the feature dataset and the label dataset) that are performed by various users may be grouped together).
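For illustration only, the steps recited in claim 2 can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `cooccurring_pairs`, the event tuple layout, and the single split date are hypothetical and are taken from neither Jhingran nor Alexandrescu.

```python
from datetime import date

def cooccurring_pairs(events, split_date):
    """Pair earlier events (features) with later events (labels) per entity.

    `events` is a list of (entity, event_name, event_date) tuples. Events
    dated before `split_date` form the feature dataset; the remaining events
    form the label dataset. All names here are illustrative assumptions.
    """
    features, labels = {}, {}
    for entity, name, when in events:
        target = features if when < split_date else labels
        target.setdefault(entity, set()).add(name)

    pairs = set()
    for entity, entity_features in features.items():
        # A pair co-occurs when the same entity has both the feature and the label.
        for label in labels.get(entity, ()):
            for feature in entity_features:
                pairs.add((feature, label))
    return pairs

events = [
    ("u1", "test_drive", date(2021, 1, 5)),
    ("u1", "purchase", date(2021, 3, 1)),
    ("u2", "test_drive", date(2021, 1, 9)),
    ("u2", "browse", date(2021, 1, 20)),
]
pairs = cooccurring_pairs(events, date(2021, 2, 1))
```

In this toy data only entity u1 has events on both sides of the split date, so only u1 contributes a feature and label pair.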
Regarding claim 3, the combination of Jhingran and Alexandrescu teaches all of the limitations of claim 2, as described above.
Alexandrescu teaches -
wherein determining that the feature and label pair, including a feature and the label, is significant further comprises:
determining a conditional probability of the label occurring in the dataset given the feature has occurred in the dataset (Alexandrescu at p. 83, “4.4.1 Architecture of Contemporary Phrase-Based SMT Systems: Training,” first partial paragraph, teaches [t]he basic approach aims at computing parameters that maximize p(y|x), which, after applying the Bayes rule, becomes:

[Equation image media_image1.png: arg max p(y|x) rewritten by the Bayes rule in terms of the two factors p(x|y) and p(y)]

[arg max p(y|x) is determining a conditional probability of the label occurring in the dataset given the feature has occurred in the dataset]; Of the two factors, p(y) (that is, a probability that a label will occur) is computed by using a language model on the target language side . . . . The translation model p(x|y) (that is, a probability that the feature and label pair will occur) is the more difficult subsystem to train; a variety of training methods are being used . . . . Additional models may be used in the rescoring process, and the weights of the log-linear model associated with them are trained on the training set; Alexandrescu at p. 4, “1.1 What is Human Language Technology?,” first full paragraph, teaches [t]raditional statistical learning methods use a supervised approach, meaning that a model's parameters are adjusted (trained) by using labeled data, i.e., data for which both inputs (also known as features [(that is “x”)]) and correct outputs (often referred to as labels [(that is “y”)]) are known)); and
determining that the conditional probability is greater than the probability of the label occurring in the dataset (Alexandrescu at p. 61, “4.1 Structured Inputs and Outputs,” first full paragraph, teaches [a] possibility is to forego analytic definition for p(y|x) and instead focus on regressing a real-valued scoring function s (that is, the “scoring function” facilitates determining that the conditional probability). Such a scoring function accepts a pair of input (that is, feature) and output data (that is, label) and computes a real-valued score; Alexandrescu at p. 65, “4.2.1 Learning With Only Positive Examples,” first full paragraph, teaches the train[ing] set contains examples and counter-examples, i.e. "good" training pairs ((x, y)) + (that is, significant) and "bad" training pairs ((x, y)) -. In such situations, a common approach is to assign each positive training sample a constant high score s+, and each negative training sample a constant low score s - (that is, via the scores determining that the conditional probability is greater than the probability of the label occurring in the dataset); Alexandrescu at p. 4, “1.1 What is Human Language Technology?,” first full paragraph, teaches [t]raditional statistical learning methods use a supervised approach, meaning that a model's parameters are adjusted (trained) by using labeled data, i.e., data for which both inputs (also known as features [(that is “x”)]) and correct outputs (often referred to as labels [(that is “y”)]) are known)).
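For illustration only, the comparison recited in claim 3 (the conditional probability of the label given the feature exceeding the probability of the label) can be sketched with simple frequency counts. The function name `significant_pairs`, the input layout, and the use of empirical frequencies as probability estimates are assumptions made for this sketch and are not drawn from either reference.

```python
from collections import Counter

def significant_pairs(records):
    """Return (feature, label) pairs whose P(label | feature) exceeds P(label).

    `records` is a list of (feature, label) observations. Probabilities are
    estimated as empirical frequencies, which is an illustrative assumption.
    """
    total = len(records)
    feature_counts = Counter(feature for feature, _ in records)
    label_counts = Counter(label for _, label in records)
    pair_counts = Counter(records)

    result = {}
    for (feature, label), pair_count in pair_counts.items():
        conditional = pair_count / feature_counts[feature]  # P(label | feature)
        marginal = label_counts[label] / total              # P(label)
        if conditional > marginal:
            result[(feature, label)] = conditional
    return result

observations = [
    ("visited_dealer", "bought_car"),
    ("visited_dealer", "bought_car"),
    ("visited_dealer", "no_purchase"),
    ("no_visit", "no_purchase"),
    ("no_visit", "no_purchase"),
    ("no_visit", "bought_car"),
]
model = significant_pairs(observations)
```

In this toy data each label occurs with probability 1/2, so only the pairs whose conditional probability is 2/3 are retained.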
Regarding claim 4, the combination of Jhingran and Alexandrescu teaches all of the limitations of claim 3, as described above. 
Jhingran teaches wherein determining the set of predicted labels for the entity comprises:
determining a set of features associated with the entity (Jhingran 2:31-33 teaches event data from one or more sources (entity) is collected. . . . [A] feature associated with a profiling technique is received (determining a set of features associated with the entity));
for each feature in the set of features:
determining a feature and label pair in the training model that includes the feature; and determining the label in the feature and label pair (Jhingran 12:65 to 13:10 & FIG. 11 teaches feature and label pairs associated with the dataset, where the feature [(spending)] for which attribute data is extracted (associated with the dataset) . . . and . . . attributes (that is, labels) have been established that include “$10,” “$20,” “$30,” and “$40.”).
Regarding claim 5, the combination of Jhingran and Alexandrescu teaches all of the limitations of claim 4, as described above.
Jhingran teaches wherein determining the set of predicted labels for the entity further comprises:
for each feature in the set of features:
determining multiple feature and label pairs in the training model that include the feature (Jhingran 12:65 to 13:10 & FIG. 11 teaches feature and label pairs associated with the dataset, where the feature [(spending)] for which attribute data is extracted (associated with the dataset) . . . and . . . attributes (that is, labels) have been established that include “$10,” “$20,” “$30,” and “$40.”); and
determining, from the multiple feature and label pairs, the feature and label pair that is associated with a largest conditional probability of the label occurring in the dataset given the feature has occurred in the dataset (Jhingran 4:44-47 teaches a prediction (that is, conditional probability) target (that is, a label) that a certain profiling technique is configured to generate is whether a user is likely to purchase a high-end vehicle (feature and label pair that is associated with the largest conditional probability of the label occurring in the dataset) given the input attribute data of the user's income range (determining, from the multiple feature and label pairs, the feature and label pair that is associated with the largest conditional probability of the label occurring in the dataset given the feature has occurred in the dataset)).
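For illustration only, the selection recited in claim 5 (choosing, among multiple feature and label pairs that include a feature, the pair with the largest conditional probability) can be sketched as follows. The function name `predict_labels` and the dictionary layout of the model are hypothetical and are not drawn from either reference.

```python
def predict_labels(entity_features, model):
    """For each feature of an entity, pick the label with the largest P(label | feature).

    `model` maps (feature, label) -> conditional probability. This structure
    is an illustrative assumption, not the data model of either reference.
    """
    predictions = {}
    for feature in entity_features:
        # Gather every label paired with this feature in the model.
        candidates = {label: p for (f, label), p in model.items() if f == feature}
        if candidates:
            predictions[feature] = max(candidates, key=candidates.get)
    return predictions

toy_model = {
    ("high_income", "luxury_vehicle"): 0.6,
    ("high_income", "economy_vehicle"): 0.3,
    ("frequent_traveler", "airline_offer"): 0.7,
}
predicted = predict_labels(["high_income", "frequent_traveler", "unknown"], toy_model)
```

Features that have no pair in the model, such as "unknown" above, yield no predicted label.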
Regarding claim 7, the combination of Jhingran and Alexandrescu teaches all of the limitations of claim 3, as described above.
Jhingran teaches - 
bucketizing data associated with the entity to determine bucketized values (Jhingran 3:34-38 teaches extracting attribute data from the collected event data includes determining an appropriate number of attribute bins (that is, buckets) to establish for a feature (bucketizing data associated with the entity to determine bucketized values));
generating combined bucketized values including the bucketized values (Jhingran 3:56-59 teaches a user record for a user's feature value is created in the extracted attribute data such that multiple (e.g., neighboring or adjacent) attribute bins are indicated as being associated with the user (generating combined bucketized values including the bucketized values)); and
storing the combined bucketized values (Jhingran 5:22-24 teaches [a]ttribute data storage 208 (storing) is configured to store the attribute data extracted by attribute data extraction engine 206 (storing the combined bucketized values)).
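For illustration only, the bucketizing, combining, and storing steps recited in claim 7 can be sketched as follows. The bin edges echo the “$10,” “$20,” “$30,” and “$40” attributes cited from Jhingran FIG. 11; the function names, the neighbor-inclusion rule, and the in-memory dictionary used as storage are assumptions made for this sketch.

```python
from bisect import bisect_right

def bucketize(value, edges):
    """Return the index of the bucket that contains `value`.

    `edges` are ascending boundaries; bucket 0 holds values below edges[0],
    and bucket len(edges) holds values at or above the last edge.
    """
    return bisect_right(edges, value)

def combined_buckets(value, edges):
    """Return the primary bucket plus its adjacent buckets (assumed rule)."""
    index = bucketize(value, edges)
    last = len(edges)
    return [i for i in (index - 1, index, index + 1) if 0 <= i <= last]

edges = [10, 20, 30, 40]
store = {}  # stands in for an attribute data store
store["user_1"] = combined_buckets(25, edges)
```

A value of 25 falls in bucket 2 (from 20 up to, but not including, 30), so the stored combined value lists buckets 1, 2, and 3.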
Regarding claim 10, the combination of Jhingran and Alexandrescu teaches all of the limitations of claim 1, as described above.
Jhingran teaches wherein the action comprises sending one or more of a recommendation, an alert, or an offer to the entity (Jhingran 18:60-64 teaches a highest ranked users may be sent relevant information (e.g., promotional offers associated with ABC brand vehicles) (wherein the action comprises sending one or more of a recommendation, an alert, or an offer to the entity)).
Regarding claim 11, Jhingran teaches [a] server computer (Jhingran 2:58-59 teaches a server) comprising:
a processor (Jhingran 2:9-11 teaches a processor); and
a computer readable medium coupled to the processor, the computer readable medium comprising code executable to perform a method (Jhingran 1:63-66 teaches a computer readable storage medium; and/or a processor such as a comprising:
determining a dataset collected over a time period (Jhingran 2:34-37 teaches a collected set of event data (determining a dataset collected over a time period));
determining a plurality of feature and label pairs associated with the dataset (Jhingran 12:65 to 13:10 & FIG. 11 teaches feature and label pairs associated with the dataset, where the feature [(spending)] for which attribute data is extracted (associated with the dataset) . . . and . . . attributes (that is, labels) have been established that include “$10,” “$20,” “$30,” and “$40.” (that is, determining a plurality of feature and label pairs associated with the dataset));
for each of the plurality of feature and label pairs:
determining whether a feature and label pair is significant (Jhingran 17:14-22 teaches, with respect to extracting profile feature attribute data from event data [(see Jhingran 12:65 to 13:10 & FIG. 11)], a pattern type of . . . events meets the threshold number of occurrences and is therefore significant enough to be used as attribute data for a profiling technique (determining whether the feature and label pair is significant)), . . . ;
* * *
and initiating an action using the set of predicted labels (Jhingran 18:60-64 teaches a highest ranked users may be sent relevant information (initiating an action based on the set of predicted labels)) that were predicted for the entity using the updated training model.
Though Jhingran teaches the feature of thresholds and statistical significance relating to metrics associated with an entity, Jhingran does not explicitly teach -
* * *
for each of the plurality of feature and label pairs:
* * *
determining whether a feature and label pair is significant, wherein significant is when a probability that the feature and label pair will occur is greater than a probability that a label will occur; 
upon determining that the feature and label pair is significant, adding the feature and label pair to a training model;
after determining that the feature and label pair is significant, updating the training model with the feature and label pair that is determined to be significant;
predicting a set of labels for an entity using the training model; and
* * *
But Alexandrescu teaches -
for each of the plurality of feature and label pairs:
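determining whether a feature and label pair is significant, wherein significant is when a probability that the feature and label pair will occur is greater than a probability that a label will occur (Alexandrescu at p. 83, “4.4.1 Architecture of Contemporary Phrase-Based SMT Systems: Training,” first partial paragraph, teaches [t]he basic approach aims at computing parameters that maximize p(y|x), which, after applying the Bayes rule, becomes:

[Equation image media_image1.png: arg max p(y|x) rewritten by the Bayes rule in terms of the two factors p(x|y) and p(y)]
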
Of the two factors, p(y) (that is, a probability that a label will occur) is computed by using a language model on the target language side . . . . The translation model p(x|y) (that is, a probability that the feature and label pair will occur) is the more difficult subsystem to train; a variety of training methods are being used . . . . Additional models may be used in the rescoring process, and the weights of the log-linear model associated with them are trained on the training set; Alexandrescu at p. 65, “4.2.1 Learning With Only Positive Examples,” first full paragraph, teaches the train[ing] set contains examples and counter-examples, i.e. "good" training pairs ((x, y)) + (that is, significant) and "bad" training pairs ((x, y)) -. In such situations, a common approach is to assign each positive training sample a constant high score s+, and each negative training sample a constant low score s -; Alexandrescu at p. 4, “1.1 What is Human Language Technology?,” first full paragraph, teaches [t]raditional statistical learning methods use a supervised approach, meaning that a model's parameters are adjusted (trained) by using labeled data, i.e., data for which both inputs (also known as features [(that is “x”)]) and correct outputs (often referred to as labels [(that is “y”)]) are known); 
upon determining that the feature and label pair is significant, adding the feature and label pair to a training model; after determining that the feature and label pair is significant, updating the training model with the feature and label pair that is determined to be significant (Alexandrescu at p. 65, “4.2.1 Learning Only With Positive Examples,” first full paragraph, teaches a common approach is to assign each positive training sample a constant high score s+, and each negative training sample a constant low score s -. Then regression learns a real-valued function with range [s-, s+]. A given test sample will be "pulled" towards the positive or negative vertices as dictated by the graph structure. The actual constants s- and s+ dictate the highest and lowest score received by any test sample—in label propagation, all learned scores will fall in between these limits by the maximum principle of harmonic functions [1]. Aside from the obvious requirement s- < s+, (that is, after determining that the feature and label pair is significant) there are no other restrictions with regard to choosing these values; we are only interested in their ordering; Alexandrescu at p. 65, “4.2.1 Learning Only With Positive Examples,” first full paragraph, teaches [m]any structured learning problems, however, only define a training set containing only positive examples, that is, correct pairs (x_i, y_i) ∀ i ∈ {1, . . . , t}. Moreover, all training pairs are equally realizable, desirable, or "good" (that is, updating the training model with the feature and label pair that is determined to be significant); Alexandrescu at p. 111, “5.2 Stochastic Label Propagation,” first paragraph, teaches a new algorithm for label propagation. Instead of using matrix algebra to update all elements of iv in one epoch, Algorithm 2 updates exactly one randomly-chosen element at a time (that is, updating the trained model));
predicting a set of labels for an entity using the updated training model (Alexandrescu at p. 4, “1.1 What is Human Language Technology?,” first full paragraph, teaches [a]fter the model has been trained, it is able to predict correct labels when presented with formerly-unseen features, as long as there exists correlation between features and labels and the correlation is the same for both training and test data (that is, predicting a set of labels for an entity using the updated training model));
* * *
Jhingran and Alexandrescu are from the same or similar field of endeavor. Jhingran teaches automatically extracting attribute data from the collected event data. Alexandrescu teaches using graph-based semi-supervised learning techniques for natural language processing, and addresses issues of distance measure learning, scalability, and structured inputs and outputs. Thus, it would have been obvious to one of ordinary skill in the art as of the effective filing date of Applicant’s invention to modify Jhingran pertaining to categorization prediction relative to an attribute with the feature-label training set partitioning of Alexandrescu.
The motivation for doing so is to implement a graph reduction technique that reduces the labeled sample size to one vertex per distinct label without affecting learning results. (Alexandrescu, at p. 173, “Conclusions”). 
Regarding claim 12, the combination of Jhingran and Alexandrescu teaches all of the limitations of claim 11, as described above.
Jhingran teaches wherein the step of determining the plurality of feature and label pairs associated with the dataset in the method comprises:
determining a feature dataset based on a first subset of the dataset (Jhingran, 3:15-20 teaches bins of quantized values associated with one or more features (feature dataset) . . . to represent a characteristic of the input information (based on a first subset of the dataset));
determining a label dataset based on a second subset of the dataset (Jhingran 12:65 to 13:10 & FIG. 11 teaches feature and label pairs associated with the dataset, where the feature [(spending)] for which attribute data is extracted (associated with the dataset) . . . and . . . attributes (that is, labels) have been established (determining a label dataset based on a second subset of the dataset) that include “$10,” “$20,” “$30,” and “$40”), wherein the first subset corresponds to data collected at an earlier time within the time period than that of the data corresponding to the second subset (Jhingran 5:34-37 teaches [e]vent data comprising user actions with temporal aspects (e.g., timestamps and/or chronological sequences) associated with various users is collected from a variety of sources (the first subset corresponds to data collected at an earlier time within the time period than that of the data corresponding to the second subset)); and
determining the plurality of feature and label pairs that co-occur based on the feature dataset and the label dataset (Jhingran 15:43-44 teaches [i]nstances of the same sequence (co-occur) of events (the feature dataset and the label dataset) that are performed by various users may be grouped together).
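For illustration only, the time-based split recited in claim 12 can be sketched as follows. The record layout, the `split_time` parameter, and the function name are hypothetical and are not taken from the Specification or from the cited references; the sketch assumes that a pair "co-occurs" when a feature from the earlier subset and a label from the later subset belong to the same entity.

```python
from itertools import product


def cooccurring_pairs(records, split_time):
    """Return (feature, label) pairs that co-occur for at least one entity.

    records: iterable of (entity, timestamp, value) tuples (assumed layout).
    Values observed before split_time form the feature dataset; values
    observed at or after split_time form the label dataset.
    """
    features, labels = {}, {}
    for entity, timestamp, value in records:
        target = features if timestamp < split_time else labels
        target.setdefault(entity, set()).add(value)

    pairs = set()
    # Only entities present in both subsets can contribute a pair.
    for entity in features.keys() & labels.keys():
        pairs.update(product(features[entity], labels[entity]))
    return pairs


records = [
    ("acct1", 1, "coffee_shop"),
    ("acct1", 5, "restaurant"),
    ("acct2", 2, "coffee_shop"),
    ("acct2", 6, "grocery"),
    ("acct3", 3, "bookstore"),  # no later data, so no pair is formed
]
pairs = cooccurring_pairs(records, split_time=4)
```
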
Regarding claim 13, the combination of Jhingran and Alexandrescu teaches all of the limitations of claim 12, as described above.
Alexandrescu teaches -

determining a conditional probability of the label occurring in the dataset given the feature has occurred in the dataset (Alexandrescu at p. 83, “4.4.1 Architecture of Contemporary Phrase-Based SMT Systems: Training,” first partial paragraph, teaches [t]he basic approach aims at computing parameters that maximize p(y|x), which, after applying the Bayes rule, becomes:

[Equation image not reproduced: Bayes' rule restates arg max p(y|x) in terms of the two factors p(x|y) and p(y)]

[arg max p(y|x) is determining a conditional probability of the label occurring in the dataset given the feature has occurred in the dataset]; Of the two factors, p(y) (that is, a probability that a label will occur) is computed by using a language model on the target language side . . . . The translation model p(x|y) (that is, a probability that the feature and label pair will occur) is the more difficult subsystem to train; a variety of training methods are being used . . . . Additional models may be used in the rescoring process, and the weights of the log-linear model associated with them are trained on the training set; Alexandrescu at p. 4, “1.1 What is Human Language Technology?,” first full paragraph, teaches [t]raditional statistical learning methods use a supervised approach, meaning that a model's parameters are adjusted (trained) by using labeled data, i.e., data for which both inputs (also known as features [(that is “x”)]) and correct outputs (often referred to as labels [(that is “y”)]) are known)); and
determining that the conditional probability is greater than the probability of the label occurring in the dataset (Alexandrescu at p. 61, “4.1 Structured Inputs p(y|x) and instead focus on regressing a real-valued scoring function s (that is, the “scoring function” facilitates determining that the conditional probability). Such a scoring function accepts a pair of input (that is, feature) and output data (that is, label) and computes a real-valued score; Alexandrescu at p. 65, “4.2.1 Learning With Only Positive Examples,” first full paragraph, teaches the train[ing] set contains examples and counter-examples, i.e. "good" training pairs ((x, y)) + (that is, significant) and "bad" training pairs ((x, y)) -. In such situations, a common approach is to assign each positive training sample a constant high score s+, and each negative training sample a constant low score s - (that is, via the scores determining that the conditional probability is greater than the probability of the label occurring in the dataset); Alexandrescu at p. 4, “1.1 What is Human Language Technology?,” first full paragraph, teaches [t]raditional statistical learning methods use a supervised approach, meaning that a model's parameters are adjusted (trained) by using labeled data, i.e., data for which both inputs (also known as features [(that is “x”)]) and correct outputs (often referred to as labels [(that is “y”)]) are known).
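For illustration only, the comparison recited in claim 13 (the conditional probability of the label given the feature exceeding the probability of the label alone) can be sketched as below. The function name and the sample observations are hypothetical, and the simple count-based estimates are an assumption made for the sketch; they are not drawn from the Specification, Jhingran, or Alexandrescu.

```python
from collections import Counter


def significant_pairs(observations):
    """Keep (feature, label) pairs where p(label | feature) > p(label).

    observations: list of (features, labels) set pairs, one per entity
    (assumed layout). Probabilities are estimated by simple counting.
    """
    n = len(observations)
    feature_count, label_count, pair_count = Counter(), Counter(), Counter()
    for features, labels in observations:
        feature_count.update(features)
        label_count.update(labels)
        pair_count.update((f, l) for f in features for l in labels)

    result = {}
    for (feature, label), count in pair_count.items():
        p_label_given_feature = count / feature_count[feature]
        p_label = label_count[label] / n
        if p_label_given_feature > p_label:
            result[(feature, label)] = p_label_given_feature
    return result


observations = [
    ({"coffee_shop"}, {"restaurant"}),
    ({"coffee_shop"}, {"restaurant"}),
    ({"bookstore"}, {"grocery"}),
    ({"bookstore"}, {"restaurant"}),
]
significant = significant_pairs(observations)
```

In this sample, the pair (bookstore, restaurant) is dropped because its estimated conditional probability of 0.5 does not exceed the 0.75 estimated probability of the label "restaurant" alone.
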
Regarding claim 14, the combination of Jhingran and Alexandrescu teaches all of the limitations of claim 13, as described above.
Jhingran teaches wherein the step of determining the set of predicted labels for the entity in the method comprises:
determining a set of features associated with the entity (Jhingran 2:31-33 teaches event data from one or more sources (entity) is collected. . . . [A] feature determining a set of features associated with the entity));
for each feature in the set of features:
determining a feature and label pair in the training model that includes the feature; and determining the label in the feature and label pair (Jhingran 12:65 to 13:10 & FIG. 11 teaches feature and label pairs associated with the dataset, where the feature [(spending)] for which attribute data is extracted (associated with the dataset) . . . and . . . attributes (that is, labels) have been established that include “$10,” “$20,” “$30,” and “$40.”).
Regarding claim 15, the combination of Jhingran and Alexandrescu teaches all of the limitations of claim 14, as described above.
Jhingran teaches wherein the step of determining the set of predicted labels for the entity in the method further comprises:
for each feature in the set of features:
determining multiple feature and label pairs in the training model that include the feature (Jhingran 12:65 to 13:10 & FIG. 11 teaches feature and label pairs associated with the dataset, where the feature [(spending)] for which attribute data is extracted (associated with the dataset) . . . and . . . attributes (that is, labels) have been established that include “$10,” “$20,” “$30,” and “$40.”); and
determining, from the multiple feature and label pairs, the feature and label pair that is associated with the largest conditional probability of the label occurring in the dataset given the feature has occurred in the dataset (Jhingran 4:44-47 teaches a prediction (that is, conditional probability) target (that is, a label) that a certain profiling technique is configured to generate is whether a user is likely to purchase a high-end vehicle (feature and label pair that is associated with the largest conditional probability of the label occurring in the dataset) given the input attribute data of the user's income range (determining, from the multiple feature and label pairs, the feature and label pair that is associated with the largest conditional probability of the label occurring in the dataset given the feature has occurred in the dataset)).
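For illustration only, the selection recited in claims 14 and 15 (for each feature of the entity, choosing the label of the pair with the largest conditional probability) can be sketched as below. The `model` dictionary, its probabilities, and the function name are hypothetical and are not taken from the record.

```python
def predict_labels(entity_features, model):
    """Return one predicted label per entity feature found in the model.

    model: dict mapping (feature, label) -> conditional probability of the
    label given the feature (assumed representation of the training model).
    """
    predicted = set()
    for feature in entity_features:
        # Collect every pair in the model that includes this feature.
        candidates = {label: p for (f, label), p in model.items() if f == feature}
        if candidates:
            # Keep the label whose pair has the largest conditional probability.
            predicted.add(max(candidates, key=candidates.get))
    return predicted


model = {
    ("coffee_shop", "restaurant"): 0.8,
    ("coffee_shop", "grocery"): 0.6,
    ("bookstore", "grocery"): 0.5,
}
predicted = predict_labels({"coffee_shop", "bookstore", "gym"}, model)
```
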
Regarding claim 17, the combination of Jhingran and Alexandrescu teaches all of the limitations of claim 13, as described above.
Jhingran teaches wherein the method further comprises:
bucketizing data associated with the entity to determine bucketized values (Jhingran 3:34-38 teaches extracting attribute data from the collected event data includes determining an appropriate number of attribute bins (that is, buckets) to establish for a feature (bucketizing data associated with the entity to determine bucketized values));
generating combined bucketized values including the bucketized values (Jhingran 3:56-59 teaches a user record for a user's feature value is created in the extracted attribute data such that multiple (e.g., neighboring or adjacent) attribute bins are indicated as being associated with the user (generating combined bucketized values including the bucketized values)); and
storing the combined bucketized values (Jhingran 5:22-24 teaches [a]ttribute data storage 208 (storing) is configured to store the attribute data extracted by attribute data extraction engine 206 (storing the combined bucketized values)).
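For illustration only, the bucketizing, combining, and storing steps recited in claim 17 can be sketched as below. The bin edges echo the “$10,” “$20,” “$30,” and “$40” attributes quoted from Jhingran above; the monthly spending values, the tuple used to combine the bucketized values, and the in-memory `store` dictionary are assumptions made for the sketch only.

```python
from bisect import bisect_right


def bucketize(value, edges):
    """Return the index of the bucket that value falls into.

    edges must be sorted ascending; bucket 0 holds values below edges[0].
    """
    return bisect_right(edges, value)


edges = [10, 20, 30, 40]
spend = {"jan": 12.5, "feb": 37.0, "mar": 8.0}  # hypothetical entity data

# Bucketize each value, then combine the results into a single tuple
# ordered by key so the stored form is deterministic.
bucketized = {month: bucketize(amount, edges) for month, amount in spend.items()}
combined = tuple(bucketized[month] for month in sorted(bucketized))

store = {"acct1": combined}  # stand-in for persistent storage
```
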
7.	Claims 6 and 16 are rejected under 35 U.S.C. § 103 as being unpatentable over US Patent 10255300 to Jhingran et al. [hereinafter Jhingran] in view of Andrei Alexandrescu, "Scalable Graph-Based Learning Applied to Human Language Technology," University of Washington (2009) [Thesis] [hereinafter Alexandrescu] and US Published Application 20130006991 to Nagano et al. [hereinafter Nagano].
Regarding claims 6 and 16, the combination of Jhingran and Alexandrescu teaches all of the limitations of claims 1 and 11, respectively, as described above.
Jhingran teaches wherein the dataset is a first dataset, further comprising:
determining a second dataset corresponding to data collected after the time period (Jhingran 5:4-8 teaches attribute data extraction engine 206 is configured to extract sequences of predetermined numbers of events from the event data. A sequence of events comprises a series of events that was performed by a user in a temporal/chronological order (that is, being a temporal order is determining a second dataset corresponding to data collected after the time period));
Though Jhingran and Alexandrescu teach the features of feature and attribute extraction from event data, the combination of Jhingran and Alexandrescu, however, does not explicitly teach -
* * *
and evaluating the training model by determining a number of labels from the set of predicted labels occurring in the second dataset associated with the entity.
But Nagano teaches -
* * *
and evaluating the training model by determining a number of labels from the set of predicted labels occurring in the second dataset associated with the entity (Nagano ¶ 0028 teaches a hierarchical structure, which is a result of hierarchical clustering, is evaluated with the use of label information (that is, evaluating the training model) indicating a pair specified by a user as one having the highest similarity among three contents of the triplet, and the weight of each feature is updated on the basis of a result of the evaluation. Therefore, it is possible to effectively and accurately learn the weight of each physical feature so that the degree of subjective similarity can be reflected on a clustering result. Furthermore, by using the learned weight of each feature, it is possible to perform clustering so that degrees of subjective similarity among pieces of emotional content that a person (that is, an entity) feels can be reflected (that is, evaluating the training model by determining a number of labels from the set of predicted labels occurring in the second dataset associated with the entity); [Examiner notes that the Specification recites “The entity may be an individual, an object, an organization, etc. (Specification ¶ 0097)]).
Jhingran, Alexandrescu, and Nagano are from the same or similar field of endeavor. Jhingran teaches automatically extracting attribute data from the collected event data. Alexandrescu teaches using graph-based semi-supervised learning techniques for natural language processing, and addresses issues of distance measure learning, scalability, and structured inputs and outputs. Nagano teaches clustering that reflects the degrees of subjective similarity among pieces of emotional content that a person feels. Thus, it would have been obvious to one of ordinary skill in the art as of the effective filing date of Applicant’s invention to modify the combination of Jhingran and Alexandrescu pertaining to categorization prediction relative to an attribute with the subjective similarity of feelings of Nagano.
The motivation for doing so is to make it possible to learn the weights of physical features for content expressed as a combination of physical features so as to be capable of determining how a person feels from the content. (Nagano ¶ 0019).
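For illustration only, the evaluation step recited in claims 6 and 16 (determining a number of labels from the set of predicted labels that occur in the second dataset) can be sketched as below. The function name and sample values are hypothetical, and counting distinct matching labels is one assumed reading of "a number of labels"; the sketch is not drawn from the Specification or from Nagano.

```python
def count_predicted_hits(predicted_labels, second_dataset_labels):
    """Count distinct predicted labels that appear in the later dataset."""
    return len(set(predicted_labels) & set(second_dataset_labels))


predicted_labels = {"restaurant", "grocery"}
second_dataset_labels = ["grocery", "fuel", "grocery"]  # collected after the time period
hits = count_predicted_hits(predicted_labels, second_dataset_labels)
```
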
8.	Claims 8, 9, 18, and 19 are rejected under 35 U.S.C. § 103 as being unpatentable over US Patent 10255300 to Jhingran et al. [hereinafter Jhingran] in view of Andrei Alexandrescu, "Scalable Graph-Based Learning Applied to Human Language Technology," University of Washington (2009) [Thesis] [hereinafter Alexandrescu] and US Published Application 20160063389 to Fuchs et al. [hereinafter Fuchs].
Regarding claims 8 and 18, the combination of Jhingran and Alexandrescu teaches all of the limitations of claims 7 and 17, respectively, as described above.
Though Jhingran and Alexandrescu teach the features of feature and attribute extraction from event data, the combination of Jhingran and Alexandrescu, however, does not explicitly teach -
wherein bucketizing data associated with the entity comprises:
for each feature and label pair determined to be significant:
bucketizing the conditional probability of the label occurring in the dataset given the feature has occurred in the dataset.
Fuchs teaches -
wherein bucketizing data associated with the entity comprises:
for each feature and label pair determined to be significant:
bucketizing the conditional probability of the label occurring in the dataset given the feature has occurred in the dataset (Fuchs ¶ 0007 teaches systems and methods for partitioning sets of features for a Bayesian classifier (that is, bucketizing the conditional probability of the label), finding partitions that make the classification process faster and more accurate, while discovering and taking into account feature dependence among certain sets of features in the data set).
Jhingran, Alexandrescu, and Fuchs are from the same or similar field of endeavor. Jhingran teaches automatically extracting attribute data from the collected event data. Alexandrescu teaches using graph-based semi-supervised learning techniques for natural language processing, and addresses issues of distance measure learning, scalability, and structured inputs and outputs. Fuchs teaches a structure learning system and method that uses only the training set to find good partitions. Thus, it would have been obvious to one of ordinary skill in the art as of the effective filing date of Applicant’s invention to modify the combination of Jhingran and Alexandrescu pertaining to categorization prediction relative to an attribute with the structure learning system of Fuchs.
Fuchs, Abstract).
Regarding claims 9 and 19, the combination of Jhingran, Alexandrescu and Fuchs teaches all of the limitations of claims 8 and 18, respectively, as described above.
Jhingran teaches -
receiving a query for information associated with the entity (Jhingran 6:66-7:2 teaches if the feature queried the number of times that a user had purchased an item from a certain online store in the last month, then each attribute bin can be an integer number of items or a range of number of items (receiving a query for information associated with the entity));
obtaining the combined bucketized values (Jhingran 6:57-59 teaches [e]xamples of an attribute bin comprise . . . an aggregate value (obtaining the combined bucketized values) . . . ); and
determining the information based on the combined bucketized values (Jhingran 7:8-10 teaches determined that a plurality of events associated with a user in the set of event data corresponds to a first attribute corresponding to the feature (determining the information based on the combined bucketized values)).
9.	Claim 21 is rejected under 35 U.S.C. § 103 as being unpatentable over US Patent 10255300 to Jhingran et al. [hereinafter Jhingran] in view of Andrei Alexandrescu, "Scalable Graph-Based Learning Applied to Human Language Technology," University of Washington (2009) [Thesis] [hereinafter Alexandrescu] and US Published Application 20170091270 to Guo et al. [hereinafter Guo].
Regarding claim 21, the combination of Jhingran and Alexandrescu teaches all of the limitations of claim 1, as described above.
Though Jhingran and Alexandrescu teach the feature of partitioning training data based on probabilities, the combination of Jhingran and Alexandrescu does not explicitly teach -
wherein each of the plurality of feature and label pairs are associated with resource provider information, and wherein the dataset is a transaction dataset.
But Guo teaches -
wherein each of the plurality of feature and label pairs are associated with resource provider information, and wherein the dataset is a transaction dataset (Guo ¶ 0070 teaches a key field is one that is deemed to be necessary for an organization record to be useful for whatever purpose is attached to the organization record, such as by a governance application. In another example embodiment, a key field is one that is necessary in order for the clustering/fusing component 326 to accurately act on the record (i.e., be able to determine whether or not the record overlaps sufficiently with another record to fuse the records). In an example embodiment, there are six key fields for organization records: name, address, phone number, organization website (also known as Uniform Resource Locator (URL)), description, and logo) (that is, the plurality of feature and label pairs are associated with resource provider information) and wherein the dataset is a transaction data set (Guo ¶ 0045 teaches upon detecting a particular interaction (that is, a transaction), the a transaction data set), in a member activity and behavior database 222; Guo ¶ 0157 teaches description for an organization can be automatically deduced by fetching the company web page(s) using the company URL. An issue exists, however, in determining which text on the company web page(s) correspond to a company description. A machine learning algorithm can be used to train a description model in a similar fashion to how the classifier model 914 is trained, as described above. Sample descriptions from company websites may be data mined to construct features for classifying descriptions. Each candidate description can be labeled as “good” or “bad” and the training data passed to a machine learning algorithm (that is, sample descriptions are the dataset is a transaction data set);
[Examiner notes that the specification recites dataset 700 may be a transaction dataset that comprises data field values as well as metadata (e.g., data field names, etc.) for transactions conducted over the time period spanning over time 0 to time T2 (Specification ¶ 0075); also, though not claimed, Guo teaches a time period spanning over time 0 to time T2 (Guo ¶ 0117 teaches [a] threshold may be set, such as 500, indicating the minimum number of occurrences in the training set until the URL is deemed “frequently occurring.”)]).
Jhingran, Alexandrescu and Guo are from the same or similar field of endeavor. Jhingran teaches automatically extracting attribute data from the collected event data. Alexandrescu teaches using graph-based semi-supervised learning techniques for natural language processing, and addresses issues of distance measure learning, scalability, and structured inputs and outputs. Guo teaches label prediction. Thus, it would have been obvious to one of ordinary skill in the art as of the effective filing date of Applicant’s invention to modify the combination of Jhingran and Alexandrescu pertaining to categorization prediction relative to an attribute with the predictive correlation model changes with the resource provider information classification predictions of Guo.
The motivation for doing so is so that extracted features can be passed to a machine learning algorithm to train an organization name confidence model to provide a confidence score that an organization name for the member profile is accurate. (Guo ¶ 0087).
Response to Arguments
10.	Examiner has fully considered Applicant’s arguments, and responds below, accordingly.
11.	Examiner notes that the base claims recite “predicting a set of labels for an entity.” (see claim 1, line 12). The Specification recites further detail regarding “for an entity,” in which 
the server computer may determine a set of features associated with the entity. The server computer may first determine (e.g., obtain) a dataset associated with the entity. The dataset may not be the same as the dataset determined at step 601. For example, the dataset determined at step 601 may include data associated with multiple entities (e.g., multiple accounts, multiple vehicles, etc.), while the dataset determined at step 1001 may be associated with the single entity (e.g., an account, a vehicle, etc.). Further, the dataset determined at step 601 and the dataset determined at step 1001 may be associated with different time periods. However, typically, the dataset determined at step 1001 may span the same length of time as that of subset 702 of FIG. 7.
(Specification ¶ 0099). The claims may be further clarified in view of such features.
12.	Applicant argues the prior art of Fuchs does not teach the claim limitation “‘wherein significant is when a probability that the feature and label pair will occur is greater than a probability that a label will occur.’” (Response at p. 9).
Examiner cites to the features of Alexandrescu as teaching this feature. With regard to Fuchs, Examiner agrees to the extent that Fuchs does not explicitly teach a “significant” status of a feature-label pair. Fuchs, however, does teach the feature of data partition (or segmentation) on a basis of class (that is, label) derived through feature entropy, as taught by Fuchs ¶ 0058 & Fig. 3. In view thereof, Fuchs is from the same or similar field of endeavor of Jhingran and Alexandrescu, as set out in detail in the rejections above. 
13.	With regard to Claim 21, Applicant argues Guo does not teach the limitations of Applicant’s claims in that “there is no teaching or suggestion that each of the plurality of feature and label pairs associated with resource provider information, and wherein the dataset is a transaction dataset.” (Response at p. 12).
Examiner respectfully disagrees because Guo teaches the features of Applicant’s invention. Applicant appears to argue that the ordinary meaning of “resource provider information” and “transaction dataset” does not cover the teachings of Guo. Guo teaches these limitations, however, as set out in detail in the rejections above.
Also, the Specification recites “a ‘feature and label pair’ may include a feature indicating a resource provider name (e.g., merchant name) and a label indicating a resource provider category (e.g., merchant category code). An exemplary analysis can provide a prediction of a future occurrence of the merchant category code based on the occurrence of the resource provider name in a dataset.” (Specification ¶ 0033). Reading the broadest reasonable interpretation (BRI) of a “resource provider name” to cover the “organization records” of Guo is not inconsistent with Applicant’s Specification.
Conclusion
14.	The prior art made of record and not relied upon is considered pertinent to Applicant's disclosure.
(Nguyen et al., “A Bayesian Nonparametric Approach for Multi-label Classification,” JMLR (2016)) teaches a Bayesian nonparametric (BNP) framework for multi-label classification that can automatically learn and exploit the unknown number of multi-label correlations.
(US Published Application 20150178596 to Bengio et al.) teaches a relation score may be calculated for a first label option and a second label option corresponding to a second object in an image. The relation score may be based on a frequency, probability, or observance corresponding to the co-occurrence of text associated with the first option and the second option in a text corpus such as the World Wide Web.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, Applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the Examiner by telephone are unsuccessful, the Examiner’s supervisor, KAKALI CHAKI, can be reached at 571-272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/K.L.S./
Examiner, Art Unit 2122
/ERIC NILSSON/Primary Examiner, Art Unit 2122