DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 2021-12-06 has been entered.  The status of claims is as follows:
Claims 1-20 remain pending in the application.
Claims 1, 8, and 15 are amended.
Response to Arguments
Applicant’s arguments with respect to rejections under 35 USC 103 have been considered but are not persuasive.  Applicant argues that the prior art recited in the Final Rejection mailed 2021-07-30 (Li, Friedman, and McDermott) and in the Advisory Action mailed 2021-11-15 (Mont-Reynaud, Breck, and Yoo) does not teach the amended limitation of label values following a sequential order.  Examiner respectfully disagrees.  McDermott discloses label values indicated by “j” and “k” on Pg. 294 (“For a classification problem of M categories, we consider a matrix A of inter-category symbolic distances djk between categories j and k.”).  Furthermore, Mont-Reynaud discloses label values in a sequential order, related to education
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Li (“A gentle introduction to gradient boosting”) in view of Friedman (“Greedy Function Approximation: A Gradient Boosting Machine”), McDermott et al. (“Prototype-based MCE/GPD training for word spotting and connected word recognition”; hereinafter “McDermott”), Mont-Reynaud et al. (US 9,564,123 B1; hereinafter “Mont-Reynaud”), Breck et al. (US 2015/0302755 A1; hereinafter “Breck”), Yoo et al. (US 2015/0006259 A1; hereinafter “Yoo”), and Carter et al. (US 2015/0006422 A1; hereinafter “Carter”).
As per Claim 1, Li teaches a method for training a classification model, the method comprising (Li, Page 25, discloses training a model:  “We are improving the predictions of training data”.  Li, Page 48, discloses applying the technique for a classification model: “Gradient Boosting for Classification” in the title, and “Multi-class classification” in the slide text.)
(Li, Page 69, discloses a training parameter:  “a matrix of parameters to optimize”.  Li, Page 49, discloses a true classification, showing for a chosen example “Label = G”.  Li implies the use of a computer device, which comprises memory and a processor, since Li supplies a hyperlink to an online location for the data set, and also on Page 2 Li supplies a Github link to a code implementation.)
performing, by the device, classification training on an initial classification model by using the training parameter, to obtain a predicted classification (Li, Page 25, discloses training a model to obtain a prediction:  “We are improving the predictions of training data”.  Li, Page 48, discloses applying the technique for a classification model: “Gradient Boosting for Classification” in the title, and “Multi-class classification” in the slide text.  Li, Page 48, discloses obtaining a predicted classification:  “Recognize the given hand written capital letter”.  Li, Page 69, discloses a training parameter:  “a matrix of parameters to optimize”.  Li, Page 70, indicates an initial model, disclosing “Start with initial models”).
determining, by the device, a residual between the true classification and the predicted classification according to a gradient loss function of the initial classification model (Li, Page 27, discloses residual:  “yi – F(xi) are called residuals”.  Li, Page 13, indicates that yi is the true value and F(xi) is the predicted value:  “You are given (x1; y1); (x2; y2); …; (xn; yn), and the task is to fit a model F(x)”.  Li, Page 68, discloses “Give any differentiable loss function L”.  Li, Page 70, discloses that differentiating this loss function L produces a gradient:  “calculate negative gradients for class A”, and discloses an equation that shows this gradient is produced by differentiating the loss function.  Li, Page 71, discloses that this negative gradient is equivalent to the true value minus the predicted value: “(YA(xi) – PA(xi))”, and this is known as a residual.)
and modifying, by the device, the initial classification model according to the residual to obtain a final classification model (Li, Page 71, discloses “calculate negative gradients”, which are residuals.  Li, Page 71, also discloses modifying the initial classification model according to the residual:  “fit a regression tree hA to the negative gradients” and “FA := FA + rhoA*hA”.  Li, Page 71, also discloses a final classification model, as they state “iterate until converge”.  Upon convergence, the updates stop and one has the final classification model.)
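For illustration only, the boosting loop that the cited Li slides describe (compute negative gradients Y − P, fit a regression tree to them, update F := F + rho*h, iterate) can be sketched as below. This is a minimal sketch under stated assumptions, not code from Li or from the application: the one-feature toy data, the depth-1 tree (`fit_stump`), the step size `rho = 0.5`, and all function names are hypothetical.

```python
import numpy as np

def softmax(F):
    """Row-wise softmax: turns per-class scores F into probabilities P."""
    z = F - F.max(axis=1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_stump(x, r):
    """Fit a depth-1 regression tree on a single feature x to targets r.

    Hypothetical stand-in for the regression tree h_A; returns a function h(q).
    """
    best = None
    for t in np.unique(x)[:-1]:  # candidate thresholds (both sides non-empty)
        left, right = r[x <= t], r[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    if best is None:  # all x identical: fall back to predicting the mean
        m = r.mean()
        return lambda q: np.full(len(q), m)
    _, t, lo, hi = best
    return lambda q: np.where(q <= t, lo, hi)

def boosting_round(F, Y, x, rho=0.5):
    """One iteration: residual = Y - P, fit h_k per class, F_k := F_k + rho*h_k."""
    residual = Y - softmax(F)  # negative gradient of the loss w.r.t. F
    F_new = F.copy()
    for k in range(F.shape[1]):
        h = fit_stump(x, residual[:, k])
        F_new[:, k] += rho * h(x)
    return F_new, residual

def cross_entropy(F, Y):
    """Total multi-class negative log-likelihood of the true classes."""
    return float(-(Y * np.log(softmax(F))).sum())

# Toy data (assumed): one feature, three classes.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0, 0, 1, 1, 2, 2])
Y = np.eye(3)[y]        # one-hot true classification
F = np.zeros((6, 3))    # initial model: all scores zero
losses = [cross_entropy(F, Y)]
for step in range(20):
    F, residual = boosting_round(F, Y, x)
    losses.append(cross_entropy(F, Y))
```

In this sketch the training loss recorded in `losses` falls as the rounds proceed, and the model after the last round plays the role of the final classification model.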
However, Li does not explicitly teach the gradient loss function being based on a prediction function.
Friedman teaches the gradient loss function being based on a prediction function (First note that Li, Page 55, discloses under “Loss Function for each data point”:  “Step 3 Calculate the difference between the true probability distribution and the predicted probability distribution.  Here we use KL divergence.”
Examiner notes that KL divergence is defined as:

[Equation image: media_image1.png]

Which is equivalent to:

[Equation image: media_image2.png]

Friedman, Pg 10-11 Section 4.6 “Multi class logistic regression and classification”, teaches:
“Here we develop a gradient descent boosting algorithm for the K-class problem.  The loss function is”

[Equation image: media_image3.png]

When considering the K-class problem (such as Li’s character recognition, which is a 26-class problem), the true probability distribution is a vector of several 0’s and one 1.  Therefore, when applied to the Wikipedia DKL equation, one of ordinary skill in the art can see that the second term of the equation in this case is always zero, because either p(x) itself is 0 or p(x) is 1, which makes log p(x) equal to 0.  So, the product p(x)*log p(x) is always 0.  Then one arrives at Friedman’s loss function, which is the first term of the KL equation.  Therefore, Friedman is also using KL divergence in the loss function for the K-class problem.  It is therefore established that Li (“Here we use KL divergence”) is simply using Friedman’s method (and Li cites Friedman on Page 80).
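The reduction described above can be written out explicitly. The following is a sketch using the standard definition of KL divergence, with p the true distribution and q the predicted distribution; the symbols p, q, and the index k of the true class are notational assumptions, since the equation images are not reproduced here.

```latex
\begin{align*}
D_{KL}(P \,\|\, Q) &= \sum_{x} p(x) \log \frac{p(x)}{q(x)} \\
                   &= -\sum_{x} p(x) \log q(x) \;+\; \sum_{x} p(x) \log p(x)
\end{align*}
% One-hot true distribution: p(k) = 1 for the true class k and p(x) = 0 otherwise,
% so every product p(x) log p(x) vanishes (with the convention 0 log 0 = 0):
\begin{equation*}
D_{KL}(P \,\|\, Q) = -\sum_{x} p(x) \log q(x) = -\log q(k)
\end{equation*}
```

Under these assumptions the surviving first term is the multi-class negative log-likelihood, which is the form the paragraph above attributes to Friedman's loss function.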
Friedman continues:  “where yk = 1 (class = k) ∈ {0, 1}, and pk(x) = Pr (yk = 1 | x). Following FHT00, we use the symmetric multiple logistic transform”

[Equation image: media_image4.png]

“or equivalently”

[Equation image: media_image5.png]

“Substituting (30) into (28) and taking first derivatives one has”

[Equation image: media_image6.png]
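The differentiation step quoted above can be sketched as follows. This assumes the loss is the multi-class negative log-likelihood and that pk(x) is the multiple logistic (softmax) function of the scores Fk(x); it is the standard textbook derivation written in the notation quoted from Friedman, not a transcription of the equation images.

```latex
\begin{align*}
L &= -\sum_{k=1}^{K} y_k \log p_k(x), \qquad
p_k(x) = \frac{\exp(F_k(x))}{\sum_{l=1}^{K} \exp(F_l(x))} \\
L &= -\sum_{k=1}^{K} y_k F_k(x) + \log \sum_{l=1}^{K} \exp(F_l(x))
   \qquad \Bigl(\text{using } \textstyle\sum_{k} y_k = 1\Bigr) \\
-\frac{\partial L}{\partial F_k(x)} &= y_k - p_k(x)
\end{align*}
```

Under these assumptions the negative gradient is the true indicator minus the predicted probability, which has the same form as the residual (YA(xi) – PA(xi)) cited from Li, Page 71.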

Here, the equations correspond to the Instant Specification, where Fk(x) is called a “prediction function”.  One can see from Friedman that the prediction function F is actually a rearrangement of the loss function as seen below:

[Equation image: media_image7.png]

[Equation image: media_image8.png]

Thus, Friedman discloses the gradient loss function being based on a prediction function.
Li and Friedman are analogous art because they are in the field of endeavor of machine learning.
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the gradient boosting of Li, with the function of Friedman. The modification would have been obvious because one of ordinary skill in the art would be motivated to improve accuracy of the classification model even when using suboptimal data (Friedman, Pg 1 Abstract:  “Gradient boosting of regression trees produces competitive, highly robust, interpretable procedures for both regression and classification, especially appropriate for mining less than clean data.”)
However, the combination of Li and Friedman does not teach the gradient loss function being based on a modification prediction function which is modified from a prediction function according to a distance factor, the distance factor representing a distance between a first category and a second category, the first category being a category to which the predicted classification belongs, and the second category being a category to which the true classification belongs.
McDermott teaches the gradient loss function [being based on a modification prediction function] which is modified [from a prediction function] according to a distance factor, the distance factor representing a distance between a first category and a second category, the first category being a category to which the predicted classification belongs, and the second category being a category to which the true classification belongs  (McDermott, Pg 291 Abstract Lines 7-10, discloses:  “Furthermore, we define a new MCE/GPD loss function that can incorporate word spotting errors and other measures of symbolic distance between correct and incorrect categories.”  Here, McDermott discloses that the loss function comprises a distance between a predicted category and the true category.  McDermott, Pg 291 Intro Lines 5-7, further discloses a classification problem in which gradient descent is used:  “GPD allows us to perform gradient descent on a classification loss measure that closely reflects the misclassification rate”.  McDermott, Pg 294 Top of left column, discloses more details on possible distance metrics:  “For a classification problem of M categories, we consider a matrix A of inter-category symbolic distances djk between categories j and k. These distances could be many kinds of distance, such as distances between syntactic parse trees”.  McDermott, Pg 294 3rd column, also discloses:  “Multiplying this expression, for each incorrect category a, by the inter-category symbolic distance djk, and summing over all incorrect categories thus gives an aggregate, weighted symbolic distance between the correct category k and all other categories. Multiplying this aggregate distance by the usual loss L() gives the new loss L2().”  Here, McDermott discloses scaling the loss function to create a new loss function, by multiplying by a distance between the correct category and incorrect categories.)
Thus, McDermott discloses the gradient loss function which is modified according to a distance factor.  
Recall that Friedman teaches the gradient loss function being based on a prediction function, and that the prediction function F is a rearrangement of the Loss function L:

[Equation image: media_image7.png]

[Equation image: media_image8.png]

Thus, modifying a prediction function according to a distance factor would also be modifying a loss function according to a distance factor.  Thus, the combination of Li, Friedman, and McDermott teaches the gradient loss function being based on a modification prediction function which is modified from a prediction function according to a distance factor, the distance factor representing a distance between a first category and a second category, the first category being a category to which the predicted classification belongs, and the second category being a category to which the true classification belongs.
McDermott further teaches the distance factor is based on the difference between a label value of the first category and a label value of the second category (McDermott, pg 294 Top of left column, discloses more details on possible distance metrics:  “For a classification problem of M categories, we consider a matrix A of inter-category symbolic distances djk between categories j and k.”  Here, McDermott discloses the distance factor (“inter-category symbolic distances”) is based on a difference between label value of the first category (“j”) and a label value of the second category (“k”)).
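As a hedged illustration of a distance factor based on the difference between label values, the sketch below builds a distance matrix d_jk = |j − k| and multiplies a usual loss by an aggregate, probability-weighted distance to the correct category. It is not McDermott's exact formulation: the absolute-difference distance, the use of predicted probabilities as weights, the negative log-likelihood as the "usual loss", and all names are assumptions made for this example.

```python
import numpy as np

def label_distance_matrix(n_classes):
    """D[j, k] = |j - k|: distance between label values j and k (assumed form)."""
    labels = np.arange(n_classes)
    return np.abs(labels[:, None] - labels[None, :])

def distance_weighted_loss(p, k, D):
    """Scale a usual loss by the aggregate distance from the true category k.

    p : predicted probabilities over categories
    k : label value of the true category
    D : inter-category distance matrix
    """
    usual_loss = -np.log(p[k])               # negative log-likelihood (assumed)
    aggregate = float(np.dot(p, D[:, k]))    # D[k, k] = 0, so only errors count
    return aggregate * usual_loss

D = label_distance_matrix(4)
true_label = 2
# Both predictions give the true category probability 0.3, so the usual loss
# is identical; they differ only in where the remaining mass is placed.
near_miss = np.array([0.05, 0.05, 0.30, 0.60])  # mass on adjacent label 3
far_miss = np.array([0.60, 0.05, 0.30, 0.05])   # mass on distant label 0
near_loss = distance_weighted_loss(near_miss, true_label, D)
far_loss = distance_weighted_loss(far_miss, true_label, D)
```

With these toy numbers the far miss is penalized more heavily than the near miss, which is the behavior a label-difference distance factor is meant to produce.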
McDermott and the combination of Li and Friedman are analogous art because they are in the field of endeavor of machine learning.
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the gradient boosting of Li and Friedman, with the distance-based loss function of McDermott. The modification would have been obvious because one of ordinary skill in the art would be motivated to accelerate training by using more informative loss values (McDermott, Pg 293 Sec 3.2 Para 2:  “In MCE/GPD so far, whenever a category is misrecognized as another category, the (ideal) loss is considered to be 1. In a grammar-constrained task, where the categories are taken to be the strings allowed by the grammar, any difference, however slight, between the correct string and incorrect strings will be treated as a loss of 1; a very large difference between correct and incorrect strings will be counted in the same way, as a loss of 1. Thus, the above approach to word spotting in continuous speech is optimizing correct string recognition. It may be desirable to consider more revealing error counts when comparing correct and incorrect strings. For instance, a word spotting error count is a practical way of evaluating the performance of a system attempting to recognize continuous speech. The 
However, the combination of Li, Friedman, and McDermott thus far fails to teach a method for education level classification; wherein the training parameter comprises the following parameters: a type of reading; a type of joined activity; and a subscribed official account; wherein: the first category and the second category each comprises the following categories: a doctoral degree category; a master's degree category, a bachelor degree category; 2Application No. 16/286,894Atty Docket No. 514935.5000479 a college degree category; a high school degree category; a junior high school degree category; and an elementary school degree category; the doctoral degree category, the master's degree category, the bachelor degree category, the college degree category, the high school degree category, the junior high school degree category, and the elementary school degree category correspond to label values following a sequential order
Mont-Reynaud teaches a method for education level classification (Mont-Reynaud, Col 14 Lines 24-26, discloses:  “Several formulas exist that can power a reading level classifier 634, education level classifier 635, and English proficiency classifier 636.”)
wherein the training parameter comprises the following parameters: a type of reading (Mont-Reynaud, Col 14 Lines 24-51, discloses:  “Several formulas exist that can power a reading level classifier 634, education level classifier 635, and English proficiency classifier 636. In some English speaking countries or regions, there is a strong correlation between speech patterns and socio-economic status (SES). In one implementation, information from reading level, education level and English proficiency classifiers, as well as semantic features, may be used as input to a SES classifier 637. A reading level classifier 634 generates a numeric value derived from analyzing a body of text, which approximates the minimum American grade-level education that one would need to understand the text. Many such formulas have been developed, including “Flesh-Kincaid Grade Level”, “Dale-Chill Formula”, and “Fry Readability Graph”. These tools generally infer a readability rating of some given text based on its mean sentence length, mean syllables per word, and whether the words it contains are classified as “easy” or “hard”. In a typical implementation of this system, language classifiers may be based on one or more of these formulas to estimate the readability rating of user utterances. The value can be interpreted as an American school grade level (such as the value 8.0 for grade 8), or it can be used to estimate directly the attained education level of the user. 
Hence, reading level and education level may have similar values, and in one implementation, the reading level classifier 634 and education level classifier 635 may be one and the same or otherwise grouped together.”  Here, Mont-Reynaud discloses a “reading level classifier” and an “education level classifier” that “may be one and the same”, and the reading level classifier “generates a numeric value derived from analyzing a body of text, which approximates the minimum American grade-level education that one would need to understand the text.”  Thus, Mont-Reynaud discloses an education level classifier with a type of reading as one of the parameters.)
the high school degree category, the junior high school degree category, and the elementary school degree category correspond to label values following a sequential order (Mont-Reynaud, Col 14 Lines 24-51, discloses:  “The value can be interpreted as an American school grade level (such as the value 8.0 for grade 8)”.  Here, Mont-Reynaud discloses label values following a sequential order (“American school grade level”).  One of ordinary skill in the art will appreciate that these values of American school grade levels consist of 1-12, and thus these labels comprise at least one high school degree category, junior high school degree category, and elementary school degree category.)
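For context, one of the readability formulas named in the Mont-Reynaud passage can be sketched as follows. The coefficients are those of the published Flesch-Kincaid Grade Level formula; the function name and the choice to take pre-computed counts as inputs (rather than tokenizing text) are assumptions made for this example, and nothing here is taken from Mont-Reynaud's implementation.

```python
def flesch_kincaid_grade(total_words, total_sentences, total_syllables):
    """Flesch-Kincaid Grade Level from word, sentence, and syllable counts.

    The result approximates an American school grade level, e.g. 8.0 for grade 8.
    """
    mean_sentence_length = total_words / total_sentences
    mean_syllables_per_word = total_syllables / total_words
    return 0.39 * mean_sentence_length + 11.8 * mean_syllables_per_word - 15.59

# Hypothetical counts: 100 words in 5 sentences with 150 syllables.
grade = flesch_kincaid_grade(100, 5, 150)
```

The output is a single number on the grade-level scale, which is why such values naturally follow a sequential order.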
Mont-Reynaud and the combination of Li, Friedman, and McDermott are analogous art because they are both in the field of endeavor of machine learning.
It would have been obvious before the effective filing date of the claimed invention to combine the education level classifier of Mont-Reynaud with the distance-based classifier of the combination of Li, Friedman, and McDermott.  One of ordinary skill in the art would be motivated to do so in order to be able to profitably tailor content to a user (Mont-Reynaud, Col 2 Lines 8-12: “A user profile augmented in this way may be used by a variety of applications that tailor user interactions based on profile information. One kind of application that benefits from a rich user profile is one that selects advertising that would be relevant to the user.”)
However, the combination of Li, Friedman, McDermott, and Mont-Reynaud fails to teach wherein the training parameter comprises the following parameters: a type of joined activity; and a subscribed official account; and the following categories: a doctoral degree category; a master's degree category, a bachelor degree category; a college degree category.
Breck teaches wherein the training parameter comprises the following parameters: a type of joined activity; (Breck, Para [0043], discloses: "In one embodiment, the behavior analysis module 430 predicts an educational skill level of the user by applying a statistical attribution model to one or more indicators of user behavior". Breck indicates this can be done by machine learning in [0046]: "The statistical attribution model may be developed by using supervised machine learning techniques (e.g., support vector machines, neural networks, etc.) to train models to predict outcomes based on the features extracted”. Finally, Breck [0021] discloses: "As used herein, the behavior of the user refers to activities or usages of a user device (e.g., a mobile handheld device) by the user. The behavior may include activities related to the educational content item in addition to activities unrelated to the educational content item". Here, Breck discloses that one of the parameters is a "type of joined activity" ("activities related to the educational content item"). By using the educational content item, the user has "joined" the activities related to it, as one person still can "join" something.)
Breck and the combination of Li, Friedman, McDermott, and Mont-Reynaud are analogous art because they are both in the field of endeavor of machine learning.
It would have been obvious before the effective filing date of the claimed invention to combine the education level classifier of Breck with the distance based education classifier of the combination of Li, Friedman, McDermott, and Mont-Reynaud.  This would allow one to gauge the educational skill level of a user, and one of ordinary skill in the art would be motivated to do so in order to be able to measure the effectiveness of educational content (Breck [0007]: “Despite the new educational model stemming from the development of educational content tailored to mobile handheld devices, current systems lack an evaluation system to measure the effectiveness of such educational content.”)
However, the combination of Li, Friedman, McDermott, Mont-Reynaud, and Breck fails to teach wherein the training parameter comprises the following parameters: a subscribed official account; and the following categories: a doctoral degree category; a master's degree category, a bachelor degree category; a college degree category.
Yoo teaches wherein the training parameter comprises the following parameters: a subscribed official account (Yoo, Para [0064], discloses: "In another of these embodiments, the prediction engine 208 predicts a future level of expertise of the at least one of professional and the entity. For example, the analysis engine 204 may receive a profile of a mentor to a profiled professional and compare the mentor's profile with the generated profile. Based on the comparison, the prediction engine 208 may generate a prediction of a modification to the generated profile --for example, the analysis may indicate that every one of the mentor's previous mentees who attained a certain level of education went on to obtain jobs at a prestigious institution, as well as indicate that the profiled professional attained that level of education”. Here, Yoo discloses that a future level of education can be predicted based on a professional's profile. Yoo [0050] discloses the following profile information: "In some embodiments, the profile generator 202 accesses data including, without limitation, a level of education; an affiliation with an educational institution; a type of profession; an area of specialization within a profession; an identification of a professor; an identification of a mentor; an identification of an employer, publications, presentations, professional affiliations, memberships, types of clients, and of office buildings;" Here, Yoo discloses "an affiliation with an educational institution", "an identification of an employer", "professional affiliations", and "memberships", all of which may be considered "a subscribed official account". Yoo [0065] discloses the use of machine learning: "and use of a feedback loop and/or machine learning to improve the quality of the predictive model.")

It would have been obvious before the effective filing date of the claimed invention to combine the education level classifier of Yoo with the distance based education classifier of the combination of Li, Friedman, McDermott, Mont-Reynaud, and Breck.  This would allow one to gauge the educational level of a professional, and one of ordinary skill in the art would be motivated to do so in order to be able to measure up a professional against their peers, and make recommendations based on the results (Yoo [0058]: “In one embodiment, the analysis engine 204 analyzes a network of professionals to which the profiled professional belongs. The analysis engine 204 may identify ways in which the profiled professional stands out from peers in the network of professionals. The analysis engine 204 may identify characteristics that the profiled professional has in common with peers in the network of professionals. The analysis engine 204 may identify professionals in the network who are farther along in their careers than the profiled professional and compare and contrast the two. In some embodiments, the analysis engine 204 may analyze any or all of the data accessed by the profile generator 202 including, but not limited to, information listed above in connection with FIG. 2.”)
However, the combination of Li, Friedman, McDermott, Mont-Reynaud, Breck, and Yoo fails to teach the following categories: a doctoral degree category; a master's degree category, a bachelor degree category; a college degree category.
Carter teaches the following categories: a doctoral degree category; a master's degree category, a bachelor degree category; a college degree category; (Carter, Para [0183], discloses:  “Once an employer has entered a search term, if the automatic category is incorrect, the employer may select the correct category from a drop down menu (any parameter can be selected multiple times) from options including job title, location, key word/phrase to include, key word/phrase to exclude, skill, range, zip code, education level (Phd, masters, bachelors, associates, some college, high school) or others. When an employer starts typing a search term, an importance level may appear to the right. The default importance may be medium. The employer will be able to change the importance between medium, high, low, must have, or others.”  Here, Carter discloses a doctoral degree category (“Phd”), a master’s degree category, a bachelor degree category, and a college degree category (“associates”)).
The combination of Carter with the combination of Li, Friedman, McDermott, Mont-Reynaud, Breck, and Yoo, as a result, teaches the doctoral degree category, the master's degree category, the bachelor degree category, the college degree category, the high school degree category, the junior high school degree category, and the elementary school degree category correspond to label values following a sequential order.  Recall above that Mont-Reynaud teaches label values following a sequential order for American grade values 1-12, which comprises high school, junior high school, and elementary school.  One of ordinary skill in the art will appreciate that the set of integers can be extended (13, 14, 15…) to account for the higher levels of degrees.
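To make the notion of label values following a sequential order concrete, the sketch below assigns one integer per education category and derives a distance from the label difference. The specific integers 1 through 7, the dictionary name, and the helper function are hypothetical choices for this example; neither the cited references nor the claims are quoted as fixing these particular values.

```python
# Hypothetical sequential label values, lowest to highest education level.
EDUCATION_LABELS = {
    "elementary school": 1,
    "junior high school": 2,
    "high school": 3,
    "college": 4,
    "bachelor": 5,
    "master": 6,
    "doctoral": 7,
}

def label_distance(predicted, true):
    """Distance between two categories as the difference of their label values."""
    return abs(EDUCATION_LABELS[predicted] - EDUCATION_LABELS[true])

adjacent = label_distance("master", "doctoral")            # neighboring levels
distant = label_distance("elementary school", "doctoral")  # opposite ends
```

Because the labels are ordered, a prediction one level away from the true category yields a smaller distance than a prediction at the opposite end of the scale.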
Carter and the combination of Li, Friedman, McDermott, Mont-Reynaud, Breck, and Yoo are analogous art because the problem of Carter, which is using education level in employment analysis, is reasonably pertinent to the problem solved by the combination above, which includes Yoo’s professional analysis in which education level is predicted (see MPEP 
It would have been obvious before the effective filing date of the claimed invention to combine the education levels of Carter with the distance based education classifier of the combination of Li, Friedman, McDermott, Mont-Reynaud, Breck, and Yoo.  This would allow one to gauge the educational level of a professional, and one of ordinary skill in the art would be motivated to do so in order to make better hiring decisions, as education level may be of high importance to employers (Carter [0183]: “Once an employer has entered a search term, if the automatic category is incorrect, the employer may select the correct category from a drop down menu (any parameter can be selected multiple times) from options including job title, location, key word/phrase to include, key word/phrase to exclude, skill, range, zip code, education level (Phd, masters, bachelors, associates, some college, high school) or others. When an employer starts typing a search term, an importance level may appear to the right. The default importance may be medium. The employer will be able to change the importance between medium, high, low, must have, or others.”)

As per Claim 2, the combination of Li, Friedman, McDermott, Mont-Reynaud, Breck, Yoo, and Carter teaches the method according to claim 1.  Li teaches further comprising: receiving, by the device, to-be-classified data (Li, Page 48, discloses receiving to-be-classified data:  “Data Set” with “20000 data points”)
(Li, Page 48, discloses classifying the data: “Recognize the given hand written capital letter”.  Li, Page 71, discloses using the final classification model:  “Iterate until converge”)
and outputting, by the device, the classification result. (Li, Page 66, displays an output of the classification result, which is a graph of all 26 letters, showing “G” with the highest probability)

As per Claim 3, the combination of Li, Friedman, McDermott, Mont-Reynaud, Breck, Yoo, and Carter teaches the method according to claim 1.  Li teaches wherein: the performing classification training on an initial classification model by using the training parameter, to obtain the predicted classification comprises: performing, by the device, iterative calculation on the initial classification model by using the training parameter, to obtain a predicted classification generated by a classification model used in each iteration (Li, Page 48, discloses obtaining a training sample, as Li discloses a “Data Set” with “20000 data points”.  Li, Page 48, also discloses “16 features”, being incorporated on Page 49 into one “feature vector”.  In machine learning training, a weight, or parameter, must be applied to this feature, as it is this which is trained.  Li, Page 71, discloses starting with an initial classification model and performing an iterative calculation:  “Start with initial models…iterate until converge”.  Li discloses obtaining a predicted classification in each iteration in Pages 58-66, where Li shows the probability distribution at each iteration.)
the determining the residual between the true classification and the predicted classification according to the gradient loss function of the initial classification model (Li, Page 27, discloses residual:  “yi – F(xi) are called residuals”.  Li, Page 13, indicates that yi is the true value and F(xi) is the predicted value:  “You are given (x1; y1); (x2; y2); …; (xn; yn), and the task is to fit a model F(x)”.  Li, Page 68, discloses “Give any differentiable loss function L”.  Li, Page 70, discloses that differentiating this loss function L produces a gradient:  “calculate negative gradients for class A”, and discloses an equation that shows this gradient is produced by differentiating the loss function.  Li, Page 71, discloses that this negative gradient is equivalent to the true value minus the predicted value: “(YA(xi) – PA(xi))”, and this is known as a residual.)
and the modifying the initial classification model according to the residual to obtain the final classification model comprises: modifying, by the device according to a residual determined in the Mth iteration, a classification model used in the Mth iteration to obtain a classification model used in the (M+1)th iteration, and obtaining, by the device, the final classification model by means of at least one iterative modification, the classification model used in the Mth iteration being obtained by modifying a classification model used in the (M-1)th iteration according to a residual determined in the (M-1)th iteration, and M being a positive integer greater than 1.  (Li, Page 71, discloses “calculate negative gradients”, which are residuals.  Li, Page 71, also discloses modifying the initial classification model according to the residual:  “fit a regression tree hA to the negative gradients” and “FA := FA + rhoA*hA”.  Here, the assignment operator “:=” means that model F is updated for the next iteration M+1, based on the previous model F from the previous iteration M and the residual h also from the previous iteration M.  Li, Page 71, also discloses a final classification model, as they state “iterate until converge”.  Upon convergence, the updates stop and one has the final classification model. If the final model is considered iteration M, then it was created with the same assignment operator “:=”, with previous model F and residual h from previous iteration M-1).

As per Claim 4, the combination of Li, Friedman, McDermott, Mont-Reynaud, Breck, Yoo, and Carter teaches the method according to claim 3. Li teaches further comprising: receiving, by the device, to-be-classified data (Li, Page 48, discloses receiving to-be-classified data:  “Data Set” with “20000 data points”)
classifying, by the device, the to-be-classified data by using the final classification model;  (Li, Page 48, discloses classifying the data: “Recognize the given hand written capital letter”.  Li, Page 71, discloses using the final classification model:  “Iterate until converge”)
and outputting, by the device, the classification result. (Li, Page 66, displays an output of the classification result, which is a graph of all 26 letters, showing “G” with the highest probability)

As per Claim 5, the combination of Li, Friedman, McDermott, Mont-Reynaud, Breck, Yoo, and Carter teaches the method according to claim 3.  Li also teaches wherein the determining, according to the gradient loss function of the initial classification model, the residual between the true classification and the predicted classification generated in each iteration comprises: determining, by the device, according to the training parameter and the (Li, Page 27, discloses residual:  “yi – F(xi) are called residuals”.  Li, Page 13, indicates that yi is the true value and F(xi) is the predicted value:  “You are given (x1; y1); (x2; y2); …; (xn; yn), and the task is to fit a model F(x)”.  Li, Page 68, discloses “Give any differentiable loss function L”.  Li, Page 69, discloses a training parameter:  “a matrix of parameters to optimize”.  Li, Page 70, discloses that differentiating this loss function L produces a gradient:  “calculate negative gradients for class A”, and discloses an equation that shows this gradient is produced by differentiating the loss function.  Li, Page 71, discloses that this negative gradient is equivalent to the true value minus the predicted value: “(YA(xi) – PA(xi))”, and this is known as a residual.  Li, Page 71, discloses that this is done in each iteration:  “iterate until converge”.)
However, Li does not teach and the distance factor representing the difference between the category to which the true classification belongs and a category to which a predicted classification in each iteration belongs.
McDermott teaches and the distance factor representing the difference between the category to which the true classification belongs and a category to which a predicted classification in each iteration belongs. (McDermott, Pg 291 Abstract Lines 7-10, discloses:  “Furthermore, we define a new MCE/GPD loss function that can incorporate word spotting errors and other measures of symbolic distance between correct and incorrect categories.” Here, McDermott discloses that the loss function comprises a distance between a predicted category and the true category.  McDermott, Pg 291 Intro Lines 5-7, further discloses a classification problem in which gradient descent is used:  “GPD allows us to perform gradient descent on a classification loss measure that closely reflects the misclassification rate”. McDermott, pg 294 Top of left column, discloses more details on possible distance metrics:  “For a classification problem of M categories, we consider a matrix A of inter-category symbolic distances djk between categories j and k. These distances could be many kinds of distance, such as distances between syntactic parse trees”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of McDermott with Li for at least the reasons recited in Claim 1.

As per Claim 6, the combination of Li, Friedman, McDermott, Mont-Reynaud, Breck, Yoo, and Carter teaches the method according to claim 5.  Li teaches determining, according to the training parameter and the true classification, the residual between the true classification and the predicted classification model (Li, Page 27, discloses residual:  “yi – F(xi) are called residuals”.  Li, Page 13, indicates that yi is the true value and F(xi) is the predicted value:  “You are given (x1; y1); (x2; y2); …; (xn; yn), and the task is to fit a model F(x)”.  Li, Page 68, discloses “Give any differentiable loss function L”.  Li, Page 70, discloses that differentiating this loss function L produces a gradient:  “calculate negative gradients for class A”, and discloses an equation that shows this gradient is produced by differentiating the loss function.  Li, Page 71, discloses that this negative gradient is equivalent to the true value minus the predicted value: “(YA(xi) – PA(xi))”, and this is known as a residual).
However, Li does not teach wherein the determining, according to the training parameter and the true classification, the residual between the true classification and the 

[Equation image: media_image9.png]

x is the training parameter, i is a positive integer greater than 1, Yk is the true classification, Yik is the residual between the predicted classification generated in the kth iteration and the true classification, Pk(x) is a prediction probability function of the kth iteration, Fk(xi) is a prediction function of the kth iteration, Dyk is a distance factor representing a distance between the category to which the true classification belongs and a category to which the predicted classification of the kth iteration belongs, F'k(xi) is a modification prediction function of the kth iteration, F'l(xi) is a modification prediction function of the lth iteration, and a value of l ranges from 1 to K, wherein K is a quantity of classes of the true classification.
Friedman teaches wherein the determining, according to the training parameter and the true classification, the residual between the true classification and the predicted classification in each iteration comprises: determining, by the device, a residual between a predicted classification generated in the kth iteration and the true classification by using the following formulas:

[Equation image: media_image10.png]

x is the training parameter, i is a positive integer greater than 1, Yk is the true classification, Yik is the residual between the predicted classification generated in the kth iteration and the true classification, Pk(x) is a prediction probability function of the kth iteration, Fk(xi) is a prediction function of the kth iteration, Dyk is a distance factor representing a distance between the category to which the true classification belongs and a category to which the predicted classification of the kth iteration belongs, F'k(xi) is a modification prediction function of the kth iteration, F'l(xi) is a modification prediction function of the lth iteration, and a value of l ranges from 1 to K, wherein K is a quantity of classes of the true classification.
(First note that Li, Page 55, discloses under “Loss Function for each data point”:  “Step 3 Calculate the difference between the true probability distribution and the predicted probability distribution.  Here we use KL divergence.”
Examiner notes that KL divergence is defined as:

[Equation image: media_image1.png]

Which is equivalent to:

[Equation image: media_image2.png]

Friedman, Pg 10-11 Section 4.6 “Multi class logistic regression and classification”, teaches:
“Here we develop a gradient descent boosting algorithm for the K-class problem.  The loss function is”

[Equation image: media_image11.png]

When considering the K-class problem (such as Li’s character recognition, which is a 26-class problem), the true probability distribution is a vector containing a single 1 and 0’s elsewhere.  Therefore, when this distribution is applied to the Wikipedia DKL equation, one of ordinary skill in the art can see that the second term of the equation is always zero: for every class, either p(x) is 0, or p(x) is 1 and log p(x) is therefore 0, so the product p(x)*log p(x) is always 0.  What remains is the first term of the KL equation, which is Friedman’s loss function.  Therefore, Friedman is also using KL divergence in the loss function for the K-class problem.  It is therefore established that Li (“Here we use KL divergence”) is simply using Friedman’s method (and Li cites Friedman on Page 80).
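The reduction described in this paragraph can be written out as follows.  This is a sketch under assumptions rather than a reproduction of the equation images: p denotes the true distribution, q the predicted distribution, the sums run over the K classes, the convention 0 log 0 = 0 is used, and Friedman’s K-class loss is assumed to have the standard multinomial form with true labels y_k and predicted probabilities p_k(x).

```latex
\begin{align*}
D_{\mathrm{KL}}(p \,\|\, q)
  &= \sum_{k=1}^{K} p_k \log \frac{p_k}{q_k}
   = -\sum_{k=1}^{K} p_k \log q_k \;+\; \sum_{k=1}^{K} p_k \log p_k \\
\intertext{For a one-hot true distribution, each $p_k \in \{0, 1\}$, so every
product $p_k \log p_k$ is zero and the second sum vanishes:}
D_{\mathrm{KL}}(p \,\|\, q)
  &= -\sum_{k=1}^{K} p_k \log q_k \\
\intertext{Writing $y_k$ for the one-hot true label and $p_k(x)$ for the
predicted probability gives the assumed form of the K-class loss:}
L\bigl(\{y_k, F_k(x)\}_{1}^{K}\bigr)
  &= -\sum_{k=1}^{K} y_k \log p_k(x)
\end{align*}
```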
Li, Page 71, discloses “Calculate negative gradients”, and shows this to be equivalent to the residual:  “YA (xi) – PA(xi)”.  Friedman provides more detail on this residual:

[Equation image: media_image12.png]

Here, the residuals calculated using KL divergence of Li are shown to result in the claimed equations (without the claimed distance factor Dyk).

The combination of Li and Friedman thus far fails to teach that Dyk is a distance factor representing a distance between the category to which the true classification belongs and a category to which the predicted classification of the kth iteration belongs.

[Equation image: media_image13.png]

McDermott teaches and the distance factor representing the difference between the category to which the true classification belongs and the category to which the predicted classification in each iteration belongs

[Equation image: media_image14.png]

(McDermott, Pg 291 Abstract Lines 7-10, discloses:  “Furthermore, we define a new MCE/GPD loss function that can incorporate word spotting errors and other measures of symbolic distance between correct and incorrect categories.” Here, McDermott discloses that the loss function comprises a distance between a predicted category and the true category.  McDermott, Pg 291 Intro Lines 5-7, further discloses a classification problem in which gradient descent is used:  “GPD allows us to perform gradient descent on a classification loss measure that closely reflects the misclassification rate”. McDermott, pg 294 Top of left column, discloses more details on possible distance metrics:  “For a classification problem of M categories, we consider a matrix A of inter-category symbolic distances djk between categories j and k. These distances could be many kinds of distance, such as distances between syntactic parse trees”). 
McDermott, pg 294 3rd column, also discloses:  “Multiplying this expression, for each incorrect category a, by the inter-category symbolic distance djk, and summing over all incorrect categories thus gives an aggregate, weighted symbolic distance between the correct category k and all other categories. Multiplying this aggregate distance by the usual loss L() gives the new loss L2().”  Here, McDermott discloses scaling the Loss function to create a new Loss function, by multiplying by a distance between the correct category and incorrect categories.  Applying a scale factor Dyk to the loss function of Li and Friedman:
 
[Equation image: media_image15.png]

results in the claimed equation:

[Equation image: media_image14.png]
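For illustration only, one reading of a residual that incorporates a distance factor Dyk is sketched below.  The claimed formulas are not reproduced in the text, so the following are assumptions: each class score Fk is multiplied by the distance factor for the true class y and class k, a softmax over the scaled scores gives the predicted probabilities, and the residual is the true value minus the predicted probability.  The function name distance_scaled_residuals and the example distance matrix 1 + |j - k| are hypothetical.

```python
import numpy as np

def distance_scaled_residuals(scores, labels, distance):
    """Residuals y_k - p_k, with each score F_k scaled by distance[y, k] first."""
    num_classes = scores.shape[1]
    one_hot = np.eye(num_classes)[labels]   # true classification, one row per sample
    scaled = distance[labels] * scores      # row i uses distance[y_i, :]
    shifted = scaled - scaled.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=1, keepdims=True)
    return one_hot - probs

# Hypothetical distance matrix for K sequentially ordered labels: 1 + |j - k|.
K = 4
idx = np.arange(K)
distance = 1.0 + np.abs(idx[:, None] - idx[None, :])
```

With a distance matrix of all ones, the function reduces to the unscaled residual (true value minus softmax probability) discussed for Li and Friedman.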

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of McDermott with the combination of Li and Friedman for at least the reasons recited in Claim 1.

As per Claim 7, the combination of Li, Friedman, McDermott, Mont-Reynaud, Breck, Yoo, and Carter teaches the method according to claim 6.  Li teaches further comprising: receiving, by the device, to-be-classified data (Li, Page 48, discloses receiving to-be-classified data:  “Data Set” with “20000 data points”)
classifying, by the device, the to-be-classified data by using the final classification model;  (Li, Page 48, discloses classifying the data: “Recognize the given hand written capital letter”.  Li, Page 71, discloses using the final classification model:  “Iterate until converge”)
and outputting, by the device, the classification result. (Li, Page 66, displays an output of the classification result, which is a graph of all 26 letters, showing “G” with the highest probability)

As per Claim 8, Claim 8 is an apparatus claim corresponding to method Claim 1.  The difference is that it recites a memory and a processor.  (Li implies the use of a computer device, which comprises memory and a processor, since Li supplies a hyperlink to an online location for the data set, and also on Page 2 Li supplies a GitHub link to a code implementation. Such code implementation must be run on a computer.)  Claim 8 is rejected for the same reasons as Claim 1.

As per Claim 9, Claim 9 is an apparatus claim corresponding to method Claim 2.  The difference is that it recites a memory and a processor.  Claim 9 is rejected for the same reasons as Claim 2.

As per Claim 10, Claim 10 is an apparatus claim corresponding to method Claim 3.  The difference is that it recites a memory and a processor.  Claim 10 is rejected for the same reasons as Claim 3.

As per Claim 11, Claim 11 is an apparatus claim corresponding to method Claim 4.  The difference is that it recites a memory and a processor.  Claim 11 is rejected for the same reasons as Claim 4.

As per Claim 12, Claim 12 is an apparatus claim corresponding to method Claim 5.  The difference is that it recites a memory and a processor.  Claim 12 is rejected for the same reasons as Claim 5.

As per Claim 13, Claim 13 is an apparatus claim corresponding to method Claim 6.  The difference is that it recites a memory and a processor.  Claim 13 is rejected for the same reasons as Claim 6.

As per Claim 14, Claim 14 is an apparatus claim corresponding to method Claim 7.  The difference is that it recites a memory and a processor.  Claim 14 is rejected for the same reasons as Claim 7.

As per Claim 15, Claim 15 is a non-transitory computer readable storage medium claim corresponding to method Claim 1.  The difference is that it recites a non-transitory computer readable storage medium and a processor.  (Li implies the use of a computer device, which comprises a non-transitory computer readable storage medium and a processor, since Li supplies a hyperlink to an online location for the data set, and also on Page 2 Li supplies a GitHub link to a code implementation. Such code implementation must be run on a computer, and the code must be saved on a non-transitory computer readable storage medium.)  Claim 15 is rejected for the same reasons as Claim 1.

As per Claim 16, Claim 16 is a non-transitory computer readable storage medium claim corresponding to method Claim 2.  The difference is that it recites a non-transitory computer readable storage medium and a processor.  Claim 16 is rejected for the same reasons as Claim 2.

As per Claim 17, Claim 17 is a non-transitory computer readable storage medium claim corresponding to method Claim 3.  The difference is that it recites a non-transitory computer readable storage medium and a processor.  Claim 17 is rejected for the same reasons as Claim 3.

As per Claim 18, Claim 18 is a non-transitory computer readable storage medium claim corresponding to method Claim 5.  The difference is that it recites a non-transitory computer readable storage medium and a processor.  Claim 18 is rejected for the same reasons as Claim 5.

As per Claim 19, Claim 19 is a non-transitory computer readable storage medium claim corresponding to method Claim 6.  The difference is that it recites a non-transitory computer readable storage medium and a processor.  Claim 19 is rejected for the same reasons as Claim 6.

As per Claim 20, Claim 20 is a non-transitory computer readable storage medium claim corresponding to method Claim 7.  The difference is that it recites a non-transitory computer readable storage medium and a processor.  Claim 20 is rejected for the same reasons as Claim 7.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Baccianella et al. (“Evaluation Measures for Ordinal Regression”) discloses “ordinal regression” (or “ordinal classification”), a type of multi label classification in which the labels are sequential values, and the loss function is proportional to the distance between the true and predicted categories (“Mean Absolute Error”).
Kim et al. (“Ordinal Classification of Imbalanced Data with Application in Emergency and Disaster Information Service”) discloses on Page 53 a loss function that depends on distance between predicted and true sequential labels, “In ordinal classification, the class label order should be considered in the loss function: the penalty for misclassification should increase monotonically as the difference between the actual and predicted class labels increases. Therefore, the absolute error loss is a natural choice in ordinal classification.”
Liu et al. (“Ordinal Regression Analysis: Using Generalized Ordinal Logistic Regression Models to Estimate Educational Data”) discloses using ordinal regression to determine education level, and on Page 245 discloses sequential integer labels representing different education levels.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to LEONARD A SIEGER whose telephone number is (571)272-9710. The examiner can normally be reached M-F 8:00 am - 5:00 pm.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ann Lo, can be reached at (571) 272-9767. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/L.A.S./Examiner, Art Unit 2126     
/ANN J LO/Supervisory Patent Examiner, Art Unit 2126