DETAILED ACTION
Claims 1 – 20 have been presented for examination.
This office action is in response to submission of the application on 03/28/2020.
Ribeiro et al. ““Why Should I Trust You?” Explaining the Predictions of Any Classifier” is included in the IDS and relied upon in the instant Office Action.

Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


Claim 14 is rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.

With regard to claim 14, it recites “the ridge regression model” in “wherein the term biases the ridge regression model towards a minimization of a number of the input features”.  There is insufficient antecedent basis for this limitation in the claim since there is no previously recited ridge regression model in the parent independent claim 13.  Examiner notes that independent claim 1 recites a ridge regression model, and that dependent claim 18 recites a ridge regression model.  However, they are in different claim trees.  Examiner interprets claim 14 as incorporating the subject matter of claim 18 for the prior art search.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1 – 20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to a judicial exception (i.e., an abstract idea) without significantly more.

Independent claim 1 recites a statutory category (i.e., a process): a method, comprising: determining a feature contribution vector for the input features by locally approximating the machine learning model at the points and the point of interest using a ridge regression model, the feature contribution vector approximating for any point of the points and the point of interest a contribution of each input feature to the output score of the any point by the machine learning model; and identifying the input features that the machine learning model primarily used in providing the corresponding output score for the point of interest, based on the feature contribution vector.  The recited limitations, in part, amount to steps that, under their broadest reasonable interpretation, cover mathematical relationships or calculations (see MPEP 2106.04(a)(2)).  For example, the “determining a feature contribution vector” amounts to mathematical calculations or relationships between the input features and the machine learning model.  The recited limitations, in part, amount to performance of the limitations in the mind in combination with using a pen and paper (see MPEP 2106.04(a)(2)(III)).  For example, the “identifying the input features” amounts to generically evaluating the determined feature contribution vector to arrive at the input features “primarily used” using any analytical process based on the feature contribution vector.  Accordingly, the claim recites an abstract idea.
This judicial exception is not integrated into a practical application since the claimed invention further claims: by a computing device; and sampling a plurality of points around a point of interest, the points and the point of interest each having a value for each of a plurality of input features, the points and the point of interest each having a corresponding output score for a machine learning model.  The “computing device” is recited at a high level of generality such that it amounts to no more than mere application of the judicial exception using generic computer components, which does not amount to an improvement in computer functionality (see MPEP 2106.04(a)(I)).  The “sampling a plurality of points” is data gathering since it directly provides the points from which the approximation model is generated, and is not limited to a specific method for performing the sampling.  The claim is directed to an abstract idea.
The claim does not recite additional elements that, alone or in an ordered combination, are sufficient to amount to significantly more than the judicial exception.  As discussed above with respect to the integration of the abstract idea into a practical application, the recited “computing device” amount to no more than mere instructions to apply the judicial exception using generic computer components. The additional elements do not amount to a particular machine (see MPEP 2106.05(b)(I)).  Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept.  Further, the recited “sampling” amount(s) to insignificant data gathering, and further amounts to well-understood, routine, conventional activity (see Perumalla et al. (US 2021/0173896) Paragraph 1 “Finding nearest neighbors of a given vector in a sample set of vectors is a common problem in machine learning, particularly in applications that involve pattern matching”).  For at least these reasons, the claim is not patent eligible.
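
For illustration only, the steps recited in claim 1 (sampling points around a point of interest, locally fitting a ridge regression model, and identifying the primarily used input features) can be sketched as follows. This is a minimal sketch under stated assumptions and does not characterize the applicant's or any reference's implementation: the Gaussian sampling, the closed-form ridge solution, the ranking by coefficient magnitude, and every name in the code (including the stand-in black_box model) are hypothetical.

```python
import random

def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def sample_around(point, n_samples, scale, rng):
    """Assumed sampling scheme: Gaussian perturbations of the point of interest.
    The point of interest itself is kept as the first sample."""
    samples = [list(point)]
    for _ in range(n_samples):
        samples.append([v + rng.gauss(0.0, scale) for v in point])
    return samples

def fit_ridge(points, scores, lam):
    """Ridge fit on centered data: solve (X^T X + lam * I) beta = X^T y."""
    n, d = len(points), len(points[0])
    x_mean = [sum(p[j] for p in points) / n for j in range(d)]
    y_mean = sum(scores) / n
    xc = [[p[j] - x_mean[j] for j in range(d)] for p in points]
    yc = [s - y_mean for s in scores]
    xtx = [[sum(r[i] * r[j] for r in xc) + (lam if i == j else 0.0)
            for j in range(d)] for i in range(d)]
    xty = [sum(r[i] * y for r, y in zip(xc, yc)) for i in range(d)]
    return solve_linear_system(xtx, xty)

def top_features(beta, k):
    """Assumed identification rule: the k largest coefficients by magnitude."""
    return sorted(range(len(beta)), key=lambda j: abs(beta[j]), reverse=True)[:k]

def black_box(x):
    """Hypothetical stand-in for the machine learning model being explained."""
    return 3.0 * x[0] - 0.5 * x[1] + 0.01 * x[0] * x[1]

rng = random.Random(0)
point_of_interest = [1.0, 2.0, 3.0]
points = sample_around(point_of_interest, 200, 0.1, rng)
scores = [black_box(p) for p in points]
beta = fit_ridge(points, scores, lam=1e-3)   # feature contribution vector
primary = top_features(beta, 2)              # indices of primarily used features
```

In this sketch the third input feature has no effect on the stand-in model, so its fitted coefficient is close to zero and it is not selected.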

Dependent claim 2 - 9 recite(s) the same statutory category as the parent claim(s), and further recite(s): wherein locally approximating the machine learning model using the ridge regression model comprises fitting the ridge regression model to the points, the point of interest, and the corresponding output scores of the points and the point of interest in claim 2; wherein the ridge regression model has a loss function in claim 3; wherein the loss function defines a difference between the ridge regression model and the machine learning model over the points and the point of interest in claim 4; wherein locally approximating the machine learning model using the ridge regression model comprises minimizing the loss function in claim 5; wherein the loss function comprises a term that biases the ridge regression model away from a uniform distribution approximation of the contributions of the input features to the output score of the any point by the machine learning model in claim 6; wherein the loss function comprises a term that biases the ridge regression model towards a minimization of a number of the input features that maximally contribute to the output score of the any point by the machine learning model, in approximating the contributions of the input features to the output score of the any point by the machine learning model in claim 7; wherein the loss function comprises a Kullback-Leibler (KL) divergence term in claim 8; and wherein the feature contribution vector comprises a plurality of feature contribution coefficients corresponding to the input features, each feature contribution coefficient indicative of the contribution of the input feature to which the feature contribution coefficient corresponds in claim 9.  
The recited limitations in part, amount to steps that, under its broadest reasonable interpretation, cover mathematical relationships or calculations since they further limit the parent claim “determining a feature contribution vector” without changing the mathematical character of the limitation.  Accordingly, the claim(s) recite(s) an abstract idea.
This judicial exception is not integrated into a practical application since the claimed invention does not further recite any limitations.  The claim is directed to an abstract idea.
The claim(s) do not recite additional elements that, alone or in an ordered combination, are sufficient to amount to significantly more than the judicial exception since there are no further recited limitations.  For at least these reasons, the claim(s) are not patent eligible.
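
For illustration only: claim 8 recites that the loss function comprises a Kullback-Leibler (KL) divergence term, and the quoted claim language does not fix the form of that term. As one hypothetical reading, the sketch below computes the KL divergence of the normalized coefficient magnitudes from a uniform distribution, a quantity that is zero when every feature contributes equally and grows as the contributions concentrate on fewer features. The function names and the normalization are assumptions, not taken from the application.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions, using the convention 0 * log(0) = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def kl_from_uniform(coefficients):
    """Hypothetical term: KL divergence of normalized |coefficients| from uniform.
    Assumes at least one coefficient is non-zero."""
    magnitudes = [abs(c) for c in coefficients]
    total = sum(magnitudes)
    p = [m / total for m in magnitudes]
    uniform = [1.0 / len(p)] * len(p)
    return kl_divergence(p, uniform)

even = kl_from_uniform([1.0, -1.0, 1.0, 1.0])          # equal contributions
concentrated = kl_from_uniform([4.0, 0.0, 0.0, 0.0])   # a single dominant feature
```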

Dependent claim 10 - 11 recite(s) the same statutory category as the parent claim(s), and further recite(s): determining a raw contribution of the value of each input feature of the point of interest to the corresponding output score of the point of interest by the machine learning model, from the value of the input feature of the point of interest and from the feature contribution coefficient corresponding to the input feature; normalizing the raw contribution of each input feature of the point of interest to the corresponding output score of the point of interest by the machine learning model in claim 10; and after determining the raw contribution of the value of each input feature of the point of interest to the corresponding output score of the point of interest by the machine learning model, and before normalizing the raw contribution of each input feature of the point of interest to the corresponding output score of the point of interest by the machine learning model: winnowing the input features to a number thereof for which the raw contributions are highest in claim 11.  The recited limitations in part, amount to steps that, under its broadest reasonable interpretation, cover performance of the limitations in the mind in combination with using a pen and paper since they further limit the parent claim “identifying the input features” without excluding their performance in the mind.  Accordingly, the claim(s) recite(s) an abstract idea.
This judicial exception is not integrated into a practical application since the claimed invention further recite(s): outputting the raw contribution, as normalized, of each input feature of the point of interest to the corresponding output score of the point of interest by the machine learning model in claim 10.  For example, the “outputting the raw contribution” amounts to insignificant data outputting since it generically recites outputting the results of the normalized raw contribution of each input feature for any intended further use (see MPEP 2106.04(d) referencing MPEP 2106.05(g)).  The claim is directed to an abstract idea.
The claim(s) do not recite additional elements that, alone or in an ordered combination, are sufficient to amount to significantly more than the judicial exception.  As discussed above with respect to the integration of the abstract idea into a practical application, the “outputting the raw contribution” amount(s) to insignificant data outputting (see MPEP 2106.05(g)).  Further, “outputting the raw contribution” in combination with the parent claim “identifying, by the computing device, the input features” does not result in an improvement to computer functionality (see MPEP 2106.05(a)(I)(viii) “Arranging transactional information on a graphical user interface in a manner that assists traders in processing information more quickly, Trading Technologies v. IBG LLC, 921 F.3d 1084, 1093-94, 2019 USPQ2d 138290 (Fed. Cir. 2019)”).  For at least these reasons, the claim(s) are not patent eligible.
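
For illustration only, the order of operations recited in claims 10 and 11 (determine raw contributions, winnow to the highest ones, then normalize) can be sketched as follows. The claims as quoted do not specify the arithmetic, so the sketch assumes that a raw contribution is the product of a feature value and its coefficient, that "highest" means largest in magnitude, and that normalizing means dividing by the sum of the retained magnitudes. The function name explain_point is hypothetical.

```python
def explain_point(values, coefficients, keep):
    """Return normalized contributions for the `keep` strongest features."""
    # Raw contribution of each feature (assumed form: value times coefficient).
    raw = [v * c for v, c in zip(values, coefficients)]
    # Winnow before normalizing: retain the features with the largest |raw|.
    kept = sorted(range(len(raw)), key=lambda j: abs(raw[j]), reverse=True)[:keep]
    # Normalize so the retained magnitudes sum to 1 (assumed normalization).
    total = sum(abs(raw[j]) for j in kept)
    return {j: (raw[j] / total if total else 0.0) for j in kept}

result = explain_point([1.0, 2.0, 3.0, 4.0], [3.0, -0.5, 0.1, 0.0], keep=2)
# result == {0: 0.75, 1: -0.25}
```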

Dependent claim 12 recite(s) the same statutory category as the parent claim(s), and further recite(s): prior to sampling the points around the point of interest: transforming the input features to a plurality of simplified input features, wherein the feature contribution vector is determined for the input features as transformed to the simplified input features in claim 12.  The recited limitations, in part, amount to steps that, under their broadest reasonable interpretation, cover mathematical calculations (see MPEP 2106.04(a)(2)).  For example, the “transforming the input features” generically covers any method for quantitatively generating a new set of input features from the original set of input features.  Accordingly, the claim(s) recite(s) an abstract idea.
This judicial exception is not integrated into a practical application since there are no further recited limitations.  The claim is directed to an abstract idea.
The claim(s) do not recite additional elements that, alone or in an ordered combination, are sufficient to amount to significantly more than the judicial exception since there are no further recited limitations.  For at least these reasons, the claim(s) are not patent eligible.

Independent claim 13 recites a statutory category (i.e. a manufacture) non-transitory computer-readable data storage medium storing program code executable by a processor, comprising: determine a feature contribution vector for the input features by locally approximating the machine learning model at the points and the point of interest via minimization of a loss function between the machine learning model and an approximating localized model, the feature contribution vector approximating for any point of the points and the point of interest a contribution of each input feature to the output score of the any point by the machine learning model; and identify the input features that are most responsible for the machine learning model having provided the corresponding output score for the point of interest, based on the feature contribution vector; and wherein the loss function comprises a term that biases the approximating localized model towards a minimization of a number of the input features that maximally contribute to the output score of the any point by the machine learning model, in approximating the contributions of the input features to the output score of the any point by the machine learning model.  The recited limitations in part, amount to steps that, under its broadest reasonable interpretation, cover mathematical relationships or calculations (see MPEP 2106.04(a)(2)).  For example, the “determine a feature contribution vector” and “loss function comprises” amounts to mathematical calculations or relationships between the input features and the machine learning model.  More specifically, the feature contribution vector is determined using a numerical optimization based on minimizing a loss function comprising one or more desired terms.  The recited limitations in part, amount to performance of the limitations in the mind in combination with using a pen and paper (see MPEP 2106.04(a)(2)(III)).  
For example, the “identify the input features” amounts to generically evaluating the determined feature contribution vector to arrive at the input features “that are most responsible” using any analytical process based on the feature contribution vector.  Accordingly, the claim recites an abstract idea.
This judicial exception is not integrated into a practical application since the claimed invention further claims: executable by a processor; and sample a plurality of points around a point of interest, the points and the point of interest each having a value for each of a plurality of input features, the points and the point of interest each having a corresponding output score for a machine learning model.  The “processor” is recited at a high level of generality such that it amounts to no more than mere application of the judicial exception using generic computer components, which does not amount to an improvement in computer functionality (see MPEP 2106.04(a)(I)).  The “sample a plurality of points” amounts to data gathering since it directly provides the points from which the approximation model is generated, and is not limited to a specific method for performing the sampling.  The claim is directed to an abstract idea.
The claim does not recite additional elements that, alone or in an ordered combination, are sufficient to amount to significantly more than the judicial exception.  As discussed above with respect to the integration of the abstract idea into a practical application, the recited “processor” amount to no more than mere instructions to apply the judicial exception using generic computer components. The additional elements do not amount to a particular machine (see MPEP 2106.05(b)(I)).  Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept.  Further, the recited “sample” amount(s) to insignificant data gathering, and further amounts to well-understood, routine, conventional activity (see Perumalla et al. (US 2021/0173896) Paragraph 1 “Finding nearest neighbors of a given vector in a sample set of vectors is a common problem in machine learning, particularly in applications that involve pattern matching”).  For at least these reasons, the claim is not patent eligible.

Dependent claim 14 - 18 recite(s) the same statutory category as the parent claim(s), and further recite(s): wherein the term biases the ridge regression model towards a minimization of a number of the input features that maximally contribute to the output score of the any point by the machine learning model, in approximating the contributions of the input features to the output score of the any point by the machine learning model in claim 14; wherein the loss function comprises a Kullback-Leibler (KL) divergence term in claim 15; wherein the approximating localized model comprises a linear regression model in claim 16; wherein the linear regression model is a lasso regression model in claim 17; and wherein the linear regression model is a ridge regression model in claim 18.  The recited limitations in part, amount to steps that, under its broadest reasonable interpretation, cover mathematical relationships or calculations since they further limit the parent claim “determining a feature contribution vector” or “loss function comprises a term” without changing the mathematical character of the limitation.  Accordingly, the claim(s) recite(s) an abstract idea.
This judicial exception is not integrated into a practical application since the claimed invention does not further recite any limitations.  The claim is directed to an abstract idea.
The claim(s) do not recite additional elements that, alone or in an ordered combination, are sufficient to amount to significantly more than the judicial exception since there are no further recited limitations.  For at least these reasons, the claim(s) are not patent eligible.

Independent claim 19 recites a statutory category (i.e. a machine) computing device comprising: determine a feature contribution vector for the input features by locally approximating the machine learning model at the points and the point of interest using a ridge regression model having a loss function with a Kullback-Leibler (KL) divergence term, the feature contribution vector approximating for any point of the points and the point of interest a contribution of each input feature to the output score of the any point by the machine learning model; and identify the input features most responsible for the machine learning model having provided the corresponding output score for the point of interest, based on the feature contribution vector.  The recited limitations in part, amount to steps that, under its broadest reasonable interpretation, cover mathematical relationships or calculations (see MPEP 2106.04(a)(2)).  For example, the “determine a feature contribution vector” amounts to mathematical calculations or relationships between the input features and the machine learning model.  The recited limitations in part, amount to performance of the limitations in the mind in combination with using a pen and paper (see MPEP 2106.04(a)(2)(III)).  For example, the “identify the input features” amounts to generically evaluating the determined feature contribution vector to arrive at the input features “most responsible” using any analytical process based on the feature contribution vector.  Accordingly, the claim recites an abstract idea.
This judicial exception is not integrated into a practical application since the claimed invention further claims: a processor; and a memory storing instructions that the processor is to execute to; and sample a plurality of points around a point of interest, the points and the point of interest each having a value for each of a plurality of input features, the points and the point of interest each having a corresponding output score for a machine learning model.  The “processor” and “memory” are recited at a high-level of generality such that they amount to no more than mere application of the judicial exception using generic computer components which does not amount to an improvement in computer functionality (see MPEP 2106.04(a)(I)).  The “sample a plurality of points” amounts to data gathering since it directly provides the points from which the approximation model is generated, and is not limited to a specific method for performing the sampling.  The claim is directed to an abstract idea.
The claim does not recite additional elements that, alone or in an ordered combination, are sufficient to amount to significantly more than the judicial exception.  As discussed above with respect to the integration of the abstract idea into a practical application, the recited “processor” and “memory” amount to no more than mere instructions to apply the judicial exception using generic computer components. The additional elements do not amount to a particular machine (see MPEP 2106.05(b)(I)).  Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept.  Further, the recited “sample” amount(s) to insignificant data gathering, and further amounts to well-understood, routine, conventional activity (see Perumalla et al. (US 2021/0173896) Paragraph 1 “Finding nearest neighbors of a given vector in a sample set of vectors is a common problem in machine learning, particularly in applications that involve pattern matching”).  For at least these reasons, the claim is not patent eligible.

Dependent claim 20 recite(s) the same statutory category as the parent claim(s), and further recite(s): wherein the loss function defines a difference between the ridge regression model and the machine learning model over the points and the point of interest, and wherein the processor is to determine the feature contribution vector by locally approximating the machine learning model using the ridge regression model via minimizing the loss function in claim 20.  The recited limitations in part, amount to steps that, under its broadest reasonable interpretation, cover mathematical relationships or calculations since they further limit the parent claim “determine a feature contribution vector” without changing the mathematical character of the limitation.  Accordingly, the claim(s) recite(s) an abstract idea.
This judicial exception is not integrated into a practical application since the claimed invention does not further recite any limitations.  The claim is directed to an abstract idea.
The claim(s) do not recite additional elements that, alone or in an ordered combination, are sufficient to amount to significantly more than the judicial exception since there are no further recited limitations.  For at least these reasons, the claim(s) are not patent eligible.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
Determining the scope and contents of the prior art.
Ascertaining the differences between the prior art and the claims at issue.
Resolving the level of ordinary skill in the pertinent art.
Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1 - 6, 8 – 12 and 19 - 20 are rejected under 35 U.S.C. 103 as being unpatentable over Di et al. (US 2019/0197411) (henceforth “Di (411)”) in view of Ribeiro et al. ““Why Should I Trust You?” Explaining the Predictions of Any Classifier” (henceforth “Ribeiro”), and further in view of Dasgupta, S. “Variable selection using Kullback-Leibler divergence loss” (henceforth “Dasgupta”).  Di (411), Ribeiro and Dasgupta are analogous art because they solve the same problem of analyzing features used in a model, and since they are in the same field of machine learning. 

With regard to claim 1, Di (411) teaches a method comprising:
sampling, by a computing device, a plurality of points, the points each having a value for each of a plurality of input features, the points each having a corresponding output score for a machine learning model; determining, by the computing device, a feature contribution vector for the input features by locally approximating the machine learning model at the points using a regression model, the feature contribution vector approximating for any point of the points a contribution of each input feature to the output score of the any point by the machine learning model; and (Di (411) Paragraph 32 - 33 and Paragraph 45 based on any desired inputted feature values and output values (sampling a plurality of points each having a corresponding output score) a set of Beta coefficients are determined (determining a vector) which represents the linear effect of the feature on the output of the model (feature contribution vector by approximating the machine learning model, and feature contribution vector approximating a contribution of each input feature to the output score) “analysis apparatus 202 may use feature values 208 inputted into statistical model 206 to train an additive linear model 212 so that the output of linear model 212 estimates output 214 of statistical model 206.” 
[Image: media_image1.png]
, and Paragraph 28 statistical model can be any type of machine learning (for a machine learning model), and Paragraph 51 “Computer system 400 may include functionality to execute various components of the present embodiments”)
identifying, by the computing device, the input features that the machine learning model primarily used in providing the corresponding output score for a point of interest, based on the feature contribution vector. (Di (411) Paragraph 35 the highest contributing features can be selected (identifying the input features primarily used) based on the Beta coefficients (based on the feature contribution vector) “may rank features 222-224 in descending order of global contributions 230 (i.e., coefficients 220 of linear model 212) and select a first subset of features with the highest global contributions 230 from the ranking.”, and Paragraph 36 the features are analyzed with respect to a specific output value, where it is implicit that output values have associated set of input features (for a point of interest) “Thus, 10 features that are selected as factors for characterizing a given output 214 of statistical model 206 may include three features with the highest global contributions 230 to output 214 (i.e., the highest coefficients 220 in linear model 212).”)

Di (411) does not appear to explicitly disclose: sampling a plurality of points around a point of interest; that the plurality of points comprise the point of interest; and that the locally approximating uses a ridge regression model.

However Ribeiro teaches:
sampling a plurality of points around a point of interest, the points and the point of interest each having a value for each of a plurality of input features, the points and the point of interest each having a corresponding output score for a machine learning model (Ribeiro Page 3, Right 
[Image: media_image2.png]
 , and Figure 1 the original model can be a neural network (a machine learning model))
determining a feature contribution vector for the input features by locally approximating the machine learning model at the points and the point of interest using a regularized regression model, the feature contribution vector approximating for any point of the points and the point of interest a contribution of each input feature to the output score of the any point by the machine learning model; and identifying the input features that the machine learning model primarily used in providing the corresponding output score for the point of interest, based on the feature contribution vector. (Ribeiro Page 3, Right the g vector’s (determining a feature contribution vector) non-zero components identify (identifying the features primarily used) which input components best explain a model output (approximating a contribution of each input feature to the output score), where the approximation is linear for a locality (by locally approximating the machine learning model at the points), and the approximation includes a regularization term Sigma (regularized regression model) 
[Image: media_image3.png]
[Image: media_image4.png]
)
It would have been obvious for one of ordinary skill in the art before the filing date of the claimed invention to have combined the method of a global linear approximation of a machine learning model disclosed by Di (411) with the method of a local linear approximation of a machine learning model disclosed by Ribeiro.  One of ordinary skill in the art would have been motivated to make this modification in order to explain a machine learning model in a desired locality (Ribeiro Page 3, Left “We note that local fidelity does not imply global fidelity: features that are globally important may not be important in the local context, and vice versa.”).
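
For reference, the local surrogate formulation in the published Ribeiro paper can be written as below. The image excerpts cited in this action are not reproduced in this text, so the equations are restated from the published paper rather than transcribed from those excerpts. Here f is the model being explained, g is an interpretable model drawn from a class G, π_x is a proximity weighting around the instance x, z and z′ are a perturbed sample and its interpretable representation, and the regularization (complexity) term is denoted Ω(g) in the paper.

```latex
\xi(x) = \operatorname*{argmin}_{g \in G} \; \mathcal{L}(f, g, \pi_x) + \Omega(g),
\qquad
\mathcal{L}(f, g, \pi_x) = \sum_{z,\, z' \in \mathcal{Z}} \pi_x(z)\,\bigl(f(z) - g(z')\bigr)^2
```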

Di (411) in view of Ribeiro does not appear to explicitly disclose: that the locally approximating uses a ridge regression model.

However Dasgupta teaches:
locally approximating a machine learning model using a ridge regression model, (Dasgupta Page 155 - 156 ridge regression implies a specified penalization adding to a linear model square error 
[Image: media_image5.png]
, and Page 154 the features analyzed are related to further statistical modeling (a machine learning model) “Variable selection is also fundamental to high-dimensional statistical modeling.”)
It would have been obvious for one of ordinary skill in the art before the filing date of the claimed invention to have combined the method of a local linear approximation of a machine learning model disclosed by Di (411) in view of Ribeiro with the method of using L2 regularization in a linear regression disclosed by Dasgupta.  One of ordinary skill in the art would have been motivated to make this modification in order to improve the selection of the most important variables in a model (Dasgupta Page 155 “For variable selection … But there are some problems in the above methods, … An alternative strategy that emerged was penalizing the squared error loss,”)
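
For reference, the standard ridge regression estimator (squared error loss with an L2 penalty) is shown below. It is stated from general knowledge of the technique and is not transcribed from the Dasgupta excerpt, which is not reproduced in this text. The symbols are assumptions for illustration: X is the matrix of sampled points, y the vector of corresponding output scores, β the coefficient vector, and λ ≥ 0 the penalty weight.

```latex
\hat{\beta}_{\text{ridge}}
  = \operatorname*{argmin}_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2
  = \bigl(X^{\top} X + \lambda I\bigr)^{-1} X^{\top} y
```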

With regard to claim 19, Di (411) teaches a computing device comprising: a processor; and a memory storing instructions that the processor is to execute to: (Di (411) Paragraph 13 – 15)
sample a plurality of points, the points each having a value for each of a plurality of input features, the points each having a corresponding output score for a machine learning model; determine a feature contribution vector for the input features by locally approximating the machine learning model at the points using a linear regression model, the feature contribution vector approximating for any point of the points a contribution of each input feature to the output score of the any point by the machine learning model; and (Di (411) Paragraph 32 - 33 and Paragraph 45 based on any desired inputted feature values and output values (sampling a plurality of points each having a corresponding output score) a set of Beta coefficients are determined (determining a vector) which represents the linear effect of the feature on the output of the model (feature contribution vector by approximating the machine learning model, and feature contribution vector approximating a contribution of each input feature to the output score) “analysis apparatus 202 may use feature values 208 inputted into statistical model 206 to train an additive linear model 212 so that the output of linear model 212 estimates output 214 of statistical model 206.” 
[image: media_image1.png]
, and Paragraph 28 statistical model can be any type of machine learning (for a machine learning model))
identify the input features most responsible for the machine learning model having provided the corresponding output score for the point of interest, based on the feature contribution vector. (Di (411) Paragraph 35 the highest contributing features can be selected (identifying the input features primarily used) based on the Beta coefficients (based on the feature contribution vector) “may rank features 222-224 in descending order of global contributions 230 (i.e., coefficients 220 of linear model 212) and select a first subset of features with the highest global contributions 230 from the ranking.”, and Paragraph 36 the features are analyzed with respect to a specific output value, where it is implicit that output values have associated set of input features (for a point of interest) “Thus, 10 features that are selected as factors for characterizing a given output 214 of statistical model 206 may include three features with the highest global contributions 230 to output 214 (i.e., the highest coefficients 220 in linear model 212).”)

Di (411) does not appear to explicitly disclose: sample a plurality of points around a point of interest; that the plurality of points comprise the point of interest.

However Ribeiro teaches:
sample a plurality of points around a point of interest, the points and the point of interest each having a value for each of a plurality of input features, the points and the point of interest each having a corresponding output score for a machine learning model (Ribeiro Page 3, Right 
[image: media_image2.png]
 , and Figure 1 the original model can be a neural network (a machine learning model))
determine a feature contribution vector for the input features by locally approximating the machine learning model at the points and the point of interest using a regularized regression model, the feature contribution vector approximating for any point of the points and the point of interest a contribution of each input feature to the output score of the any point by the machine learning model; and identify the input features that the machine learning model primarily used in providing the corresponding output score for the point of interest, based on the feature contribution vector. (Ribeiro Page 3, Right the g vector’s (determining a feature contribution vector) non-zero components identify (identify the features primarily used) which input components best explain a model output (approximating a contribution of each input feature to the output score), where the approximation is linear for a locality (by locally approximating the machine learning model at the points), and the approximation includes a regularization term Sigma in combination with a squared loss L (regularized regression model) in an overall minimization 
[image: media_image3.png]
 
[image: media_image4.png]

[image: media_image6.png]
)
It would have been obvious for one of ordinary skill in the art before the filing date of the claimed invention to have combined the method of a global linear approximation of a machine learning model disclosed by Di (411) with the method of a local linear approximation of a machine learning model disclosed by Ribeiro.  One of ordinary skill in the art would have been motivated to make this modification in order to explain a machine learning model in a desired locality (Ribeiro Page 3, Left “We note that local fidelity does not imply global fidelity: features that are globally important may not be important in the local context, and vice versa.”)

Di (411) in view of Ribeiro does not appear to explicitly disclose: that the locally approximating uses a ridge regression model having a loss function with a Kullback-Leibler (KL) divergence term.

However Dasgupta teaches:
locally approximating a machine learning model using a ridge regression model having a loss function with a Kullback-Leibler (KL) divergence term, (Dasgupta Page 155 - 156 ridge regression implies a specified penalization adding to a linear model square error 
[image: media_image5.png]
, and Page 154 the features analyzed are related to further statistical modeling (a machine learning model) “Variable selection is also fundamental to high-dimensional statistical modeling.”, and Page 159 the squared error loss term in the regression model can be desirably replaced by a KL divergence term 
[image: media_image7.png]
, and Page 156 the ridge regression model has an explicitly squared error term 
[image: media_image8.png]
)
It would have been obvious for one of ordinary skill in the art before the filing date of the claimed invention to have combined the method of a local linear approximation of a machine learning model disclosed by Di (411) in view of Ribeiro with the method of using L2 regularization in a linear regression disclosed by Dasgupta.  One of ordinary skill in the art would have been motivated to make this modification in order to improve the selection of the most important variables in a model (Dasgupta Page 155 “For variable selection … But there are some problems in the above methods, … An alternative strategy that emerged was penalizing the squared error loss,”, and Page 159 “There are various theoretical reasons to defend the use of Kullback-Leibler distance, ranging from information theory to the relevance of logarithmic scoring rule and the location-scale invariance of the distance, as detailed in Bernardo and Smith (1994).”)
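
For illustration only, the combination discussed above (sampling around a point of interest, weighting samples by proximity, and fitting an L2-penalized linear model) can be sketched as follows. This is a minimal sketch and not the method of any cited reference: the function names, the Gaussian perturbation, the default parameter values, and the unpenalized intercept are all assumptions. The proximity weight exp(-d^2 / width^2) follows the exponential form noted for Ribeiro, so the point of interest itself receives weight exp(-0) = 1.

```python
import math
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def local_ridge_surrogate(model, point, n_samples=200, scale=0.5,
                          kernel_width=1.0, ridge_lambda=1e-3, seed=0):
    """Fit a proximity-weighted ridge model around `point` (hypothetical helper).

    Returns (intercept, coefficients); the coefficients play the role of a
    feature contribution vector in this sketch.
    """
    rng = random.Random(seed)
    # The point of interest is included among the fitted points.
    samples = [list(point)]
    for _ in range(n_samples):
        samples.append([v + rng.gauss(0.0, scale) for v in point])

    p = len(point) + 1  # one extra column for the intercept
    a = [[0.0] * p for _ in range(p)]
    b = [0.0] * p
    for z in samples:
        dist2 = sum((zi - xi) ** 2 for zi, xi in zip(z, point))
        w = math.exp(-dist2 / kernel_width ** 2)  # proximity weight
        row = [1.0] + z
        y = model(z)  # output score of the model being explained
        # Accumulate the weighted normal equations X^T W X and X^T W y.
        for i in range(p):
            b[i] += w * row[i] * y
            for j in range(p):
                a[i][j] += w * row[i] * row[j]
    # L2 (ridge) penalty on the coefficients; the intercept is left unpenalized.
    for i in range(1, p):
        a[i][i] += ridge_lambda
    beta = solve_linear_system(a, b)
    return beta[0], beta[1:]
```

When the model being explained is itself linear, the recovered coefficients should be close to the true ones, with a slight shrinkage that grows with `ridge_lambda`.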

With regard to claim 2, Di (411) in view of Ribeiro, and further in view of Dasgupta teaches all the elements of the parent claim 1, and further teaches:
wherein locally approximating the machine learning model using a regularized regression model comprises fitting the regularized regression model to the points, the point of interest, and the corresponding output scores of the points and the point of interest. (Ribeiro Page 3, Right elements around the point of interest are sampled randomly, where sampling the point of interest itself is possible and the weighting function is still defined for a distance of zero (i.e. exp^-0 = 1) 
[image: media_image9.png]
 
[image: media_image10.png]
)
wherein locally approximating the machine learning model using the ridge regression model comprises fitting the ridge regression model to desired points and the corresponding output scores of the desired points (Dasgupta Page 154 any desired set of points can be used in the creation of the ridge regression model 
[image: media_image11.png]
, and Page 155 - 156 ridge regression implies a specified regularization term adding to a linear model square error)

With regard to claim 3, Di (411) in view of Ribeiro, and further in view of Dasgupta teaches all the elements of the parent claim 1, and further teaches:
wherein the regularized regression model has a loss function (Ribeiro Page 4, Left the squared loss between the original model output and the regression is L (difference between the regression model and the machine learning model) 
[image: media_image12.png]
)
wherein the ridge regression model has a loss function. (Dasgupta Page 155 - 156 ridge regression implies a specified regularization term adding to a linear model square error)

With regard to claim 4, Di (411) in view of Ribeiro, and further in view of Dasgupta teaches all the elements of the parent claim 3, and further teaches:
wherein a loss function defines a difference between the regularized regression model and the machine learning model over the points and the point of interest. (Ribeiro Page 4, Left the squared loss between the original model output and the regression is L (difference between the regression model and the machine learning model) 
[image: media_image12.png]
, and Page 3, Right elements around the point of interest are sampled randomly, where sampling the point of interest itself is possible and the weighting function is still defined for a distance of zero (over the points and the point of interest) 
[image: media_image9.png]
 
[image: media_image10.png]
)
wherein the loss function defines a difference between the ridge regression model and the machine learning model over desired points (Dasgupta Page 155 - 156 ridge regression implies a specified regularization term adding to a linear model square error (difference between the regression model and the machine learning model), and Page 154 any desired set of points can be used in the creation of the ridge regression model 
[image: media_image11.png]
)

With regard to claim 5, Di (411) in view of Ribeiro, and further in view of Dasgupta teaches all the elements of the parent claim 3, and further teaches:
wherein locally approximating the machine learning model using the regularized regression model comprises minimizing the loss function. (Ribeiro Page 3, Right the squared error L function is minimized in combination with a regularization term Sigma 
[image: media_image4.png]
)
wherein locally approximating the machine learning model using the ridge regression model comprises minimizing the loss function. (Dasgupta Page 156 the ridge regression model defines a function to be minimized 
[image: media_image13.png]
)

With regard to claim 6, Di (411) in view of Ribeiro, and further in view of Dasgupta teaches all the elements of the parent claim 3, and further teaches:
wherein the loss function comprises a term that biases the regularized regression model away from a uniform distribution approximation of the contributions of the input features to the output score of the any point by the machine learning model. (Ribeiro Page 3, Left the regularization term Sigma acts to limit the number of non-zero weights (biases the regularized regression model away from a uniform distribution approximation of the contributions of the input features) 
[image: media_image14.png]
)
wherein the loss function comprises a term that biases the ridge regression model away from a uniform distribution approximation of the contributions of the input features to the output score of the any point by the machine learning model. (Dasgupta Page 156 the regularization term includes a parameter lambda, where it is implicit that adjusting this parameter changes the impact on the sparsity (biases the regularized regression model away from a uniform distribution approximation of the contributions of the input features) 
[image: media_image15.png]
)

With regard to claim 8, Di (411) in view of Ribeiro, and further in view of Dasgupta teaches all the elements of the parent claim 3, and further teaches:
wherein the loss function comprises a Kullback-Leibler (KL) divergence term. (Dasgupta Page 159 the squared error loss term in the regression model can be desirably replaced by a KL divergence term 
[image: media_image7.png]
, and Page 156 the ridge regression model has an explicitly squared error term 
[image: media_image8.png]
)
It would have been obvious for one of ordinary skill in the art before the filing date of the claimed invention to have combined the method of a local linear approximation of a machine learning model disclosed by Di (411) in view of Ribeiro with the method of using a KL divergence term in the loss function disclosed by Dasgupta.  One of ordinary skill in the art would have been motivated to make this modification in order to improve the optimization of the coefficients (Dasgupta Page 159 “There are various theoretical reasons to defend the use of Kullback-Leibler distance, ranging from information theory to the relevance of logarithmic scoring rule and the location-scale invariance of the distance, as detailed in Bernardo and Smith (1994).”)
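
For reference, the Kullback-Leibler divergence named in claim 8 can be computed for discrete distributions as in the sketch below. This shows only the standard definition D_KL(p || q) = sum_i p_i log(p_i / q_i); how such a term would enter the loss function of the regression model is not specified here, and the function name is hypothetical.

```python
import math


def kl_divergence(p, q):
    """Return D_KL(p || q) for two discrete distributions given as lists."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # terms with p_i = 0 contribute nothing
        if qi == 0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total
```

The divergence is zero when the two distributions coincide and is not symmetric in its arguments.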

With regard to claim 9, Di (411) in view of Ribeiro, and further in view of Dasgupta teaches all the elements of the parent claim 1, and further teaches:
wherein the feature contribution vector comprises a plurality of feature contribution coefficients corresponding to the input features, each feature contribution coefficient indicative of the contribution of the input feature to which the feature contribution coefficient corresponds. (Di (411) Paragraph 35 the highest contributing features can be selected based on the Beta coefficients (based on the feature contribution vector) “may rank features 222-224 in descending order of global contributions 230 (i.e., coefficients 220 of linear model 212) and select a first subset of features with the highest global contributions 230 from the ranking.”)

With regard to claim 10, Di (411) in view of Ribeiro, and further in view of Dasgupta teaches all the elements of the parent claim 9, and further teaches:
wherein identifying the input features that the machine learning model primarily used in providing the corresponding output score for the point of interest comprises:
determining a raw contribution of the value of each input feature of the point of interest to the corresponding output score of the point of interest by the machine learning model, from the value of the input feature of the point of interest and from the feature contribution coefficient corresponding to the input feature; (Di (411) Paragraph 34 
[image: media_image16.png]
)
normalizing the raw contribution of each input feature of the point of interest to the corresponding output score of the point of interest by the machine learning model; (Di (411) Paragraph 39 local contributions can be represented as quantile (normalized) “management apparatus 204 may determine a percentile or quantile associated with a feature value for a feature with a high local contribution toward a given output 214 value from statistical model 206.”)
outputting the raw contribution, as normalized, of each input feature of the point of interest to the corresponding output score of the point of interest by the machine learning model. (Di (411) Paragraph 39 weights/quantiles of the inputs features can be outputted with regard to a specific model output “Management apparatus 204 may also output statistics and/or metrics associated with subsets 218 of features 222-224 with high global contributions 230 and local contributions 232 toward a given output 214 value generated by statistical model 206”, and Paragraph 40 the outputted results can be for a user identification of key features “The users may also, or instead, identify key feature values for an entity that affect a corresponding output 214 value from statistical model 206 (e.g., a certain type of behavior that increases a customer's risk of churning from a product)”)

With regard to claim 11, Di (411) in view of Ribeiro, and further in view of Dasgupta teaches all the elements of the parent claim 10, and further teaches:
wherein identifying the input features that the machine learning model primarily used in providing the corresponding output score for the point of interest comprises, after determining the raw contribution of the value of each input feature of the point of interest to the corresponding output score of the point of interest by the machine learning model, and before normalizing the raw contribution of each input feature of the point of interest to the corresponding output score of the point of interest by the machine learning model:
winnowing the input features to a number thereof for which the raw contributions are highest. (Di (411) Figure 3 after the local contributions are determined, a subset of features are determined based on the highest local contributions, and Paragraph 36 a subset of features can be selected based on their local contributions according to a user specified proportion “First, management apparatus 204 may select subsets 218 of highest-ranked features in global contributions 230 and/or local contributions 232 based on one or more parameters 216 that specify proportions associated with each subset”, and Paragraph 37 the parameters 216 can also specify a specific number of features to select “For example, parameters 216 may specify that five features with high global contributions 230 be selected before 10 features with high local contributions 232”)
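
The sequence recited in claims 10 and 11 (determine raw contributions, winnow to the highest ones, then normalize) can be sketched as follows. This is an illustrative sketch under stated assumptions: the raw contribution is taken as the product of a feature's coefficient and its value, ranking is by absolute value, and normalization divides by the sum of absolute values of the retained contributions. None of these choices is confirmed by the cited references (Di (411), for example, is cited for a quantile representation), and the function name is hypothetical.

```python
def top_contributions(coefficients, point, keep=3):
    """Return [(feature_index, normalized_contribution), ...] for the top `keep` features."""
    # Raw contribution of each feature: coefficient times feature value (assumption).
    raw = [(i, c * v) for i, (c, v) in enumerate(zip(coefficients, point))]
    # Winnow to the features whose raw contributions are largest in magnitude.
    winnowed = sorted(raw, key=lambda item: abs(item[1]), reverse=True)[:keep]
    # Normalize so the absolute contributions of the retained features sum to 1.
    total = sum(abs(value) for _, value in winnowed)
    if total == 0:
        return [(i, 0.0) for i, _ in winnowed]
    return [(i, value / total) for i, value in winnowed]
```

For example, with coefficients [2.0, -1.0, 0.5, 0.0], feature values [1.0, 3.0, 2.0, 5.0], and keep=2, the raw contributions are 2, -3, 1, and 0, so features 1 and 0 are retained with normalized contributions -0.6 and 0.4.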

With regard to claim 12, Di (411) in view of Ribeiro, and further in view of Dasgupta teaches all the elements of the parent claim 1, and further teaches: 
prior to sampling the points around the point of interest: transforming the input features to a plurality of simplified input features, wherein the feature contribution vector is determined for the input features as transformed to the simplified input features. (Ribeiro Page 3, Right a perturbed sample z’ is first generated having a fraction of the nonzero elements of x’ (transforming the input features to a plurality of simplified input features), where the output values of the machine learning model are then evaluated in the original representation corresponding to the perturbed sample (prior to sampling the points around the point of interest) 
[image: media_image17.png]
, and Page 3 – 4 the perturbed sample z’ is used to generate the weights wg (wherein the feature contribution vector is determined on transformed simplified input features) 
[image: media_image18.png]
 
[image: media_image19.png]
)
It would have been obvious for one of ordinary skill in the art before the filing date of the claimed invention to have combined the method of a global linear approximation of a machine learning model disclosed by Di (411) with the method of using different input samples to determine the weights of the linear regression model disclosed by Ribeiro.  One of ordinary skill in the art would have been motivated to make this modification in order to explain a model in a particular location using a reduced representation (Ribeiro Page 3, Right 
[image: media_image20.png]

[image: media_image21.png]
)

With regard to claim 20, Di (411) in view of Ribeiro, and further in view of Dasgupta teaches all the elements of the parent claim 19, and further teaches:
wherein a loss function defines a difference between the regularized regression model and the machine learning model over the points and the point of interest. (Ribeiro Page 4, Left the squared loss between the original model output and the regression is L (difference between the regression model and the machine learning model) 
[image: media_image12.png]
, and Page 3, Right elements around the point of interest are sampled randomly, where sampling the point of interest itself is possible and the weighting function is still defined for a distance of zero (over the points and the point of interest) 
[image: media_image9.png]
 
[image: media_image10.png]
)
wherein the loss function defines a difference between the ridge regression model and the machine learning model over desired points (Dasgupta Page 155 - 156 ridge regression implies a specified regularization term adding to a linear model square error (difference between the regression model and the machine learning model), and Page 154 any desired set of points can be used in the creation of the ridge regression model 
[image: media_image11.png]
)
and wherein the processor is to determine the feature contribution vector by locally approximating the machine learning model using the regularized regression model via minimizing the loss function. (Ribeiro Page 3, Right the squared error L function is minimized in combination with a regularization term Sigma 
[image: media_image4.png]
)
and wherein the processor is to determine the feature contribution vector by locally approximating the machine learning model using the ridge regression model via minimizing the loss function. (Dasgupta Page 156 the ridge regression model defines a function to be minimized 
[image: media_image13.png]
)

Claims 13 – 14 and 16 - 18 are rejected under 35 U.S.C. 103 as being unpatentable over Di (411) in view of Ribeiro, and further in view of Muthukrishnan et al. “LASSO: A Feature Selection Technique In Predictive Modeling For Machine Learning” (henceforth “Muthukrishnan”).  Di (411), Ribeiro and Muthukrishnan are analogous art because they solve the same problem of analyzing features used in a model, and since they are in the same field of machine learning. 

With regard to claim 13, Di (411) teaches a non-transitory computer-readable data storage medium storing program code executable by a processor to: (Di (411) Paragraph 13 - 14)
sample a plurality of points, the points each having a value for each of a plurality of input features, the points each having a corresponding output score for a machine learning model; determine a feature contribution vector for the input features by locally approximating the machine learning model at the points, the feature contribution vector approximating for any point of the points a contribution of each input feature to the output score of the any point by the machine learning model; and (Di (411) Paragraph 32 - 33 and Paragraph 45 based on any desired inputted feature values and output values (sampling a plurality of points each having a corresponding output score) a set of Beta coefficients are determined (determining a vector) which represents the linear effect of the feature on the output of the model (feature contribution vector by approximating the machine learning model, and feature contribution vector approximating a contribution of each input feature to the output score) “analysis apparatus 202 may use feature values 208 inputted into statistical model 206 to train an additive linear model 212 so that the output of linear model 212 estimates output 214 of statistical model 206.” 
[image: media_image1.png]
, and Paragraph 28 statistical model can be any type of machine learning (for a machine learning model))
identify the input features that are most responsible for the machine learning model having provided the corresponding output score for a point of interest, based on the feature contribution vector (Di (411) Paragraph 35 the highest contributing features can be selected (identifying the input features primarily used) based on the Beta coefficients (based on the feature contribution vector) “may rank features 222-224 in descending order of global contributions 230 (i.e., coefficients 220 of linear model 212) and select a first subset of features with the highest global contributions 230 from the ranking.”, and Paragraph 36 the features are analyzed with respect to a specific output value, where it is implicit that output values have associated set of input features (for a point of interest) “Thus, 10 features that are selected as factors for characterizing a given output 214 of statistical model 206 may include three features with the highest global contributions 230 to output 214 (i.e., the highest coefficients 220 in linear model 212).”)

Di (411) does not appear to explicitly disclose: sample a plurality of points around a point of interest; that the plurality of points comprise the point of interest; and that the locally approximating is via minimization of a loss function between the machine learning model and an approximating localized model.

However Ribeiro teaches:
sample a plurality of points around a point of interest, the points and the point of interest each having a value for each of a plurality of input features, the points and the point of interest each having a corresponding output score for a machine learning model (Ribeiro Page 3, Right 
[image: media_image2.png]
 , and Figure 1 the original model can be a neural network (a machine learning model))
determine a feature contribution vector for the input features by locally approximating the machine learning model at the points and the point of interest via minimization of a loss function between the machine learning model and an approximation localized model, the feature contribution vector approximating for any point of the points and the point of interest a contribution of each input feature to the output score of the any point by the machine learning model; and identifying the input features that the machine learning model primarily used in providing the corresponding output score for the point of interest, based on the feature contribution vector. (Ribeiro Page 3, Right the g vector’s (determining a feature contribution vector) non-zero components identify (identifying the features primarily used) which input components best explain a model output (approximating a contribution of each input feature to the output score), where the approximation is linear for a locality (by locally approximating the machine learning model at the points), and the approximation includes a regularization term Sigma in combination with a squared loss L (regularized regression model) in an overall minimization (via minimization of a loss function between the machine learning model and an approximation localized model) 
[image: media_image3.png]
 
[image: media_image4.png]

[image: media_image6.png]
)
It would have been obvious for one of ordinary skill in the art before the filing date of the claimed invention to have combined the method of a global linear approximation of a machine learning model disclosed by Di (411) with the method of a local linear approximation of a machine learning model disclosed by Ribeiro.  One of ordinary skill in the art would have been motivated to make this modification in order to explain a machine learning model in a desired locality (Ribeiro Page 3, Left “We note that local fidelity does not imply global fidelity: features that are globally important may not be important in the local context, and vice versa.”)

	Di (411) in view of Ribeiro does not appear to explicitly disclose: wherein the loss function comprises a term that biases the approximating localized model towards a minimization of a number of the input features that maximally contribute to the output score of the any point by the machine learning model, in approximating the contributions of the input features to the output score of the any point by the machine learning model.

However Muthukrishnan teaches:
wherein a loss function comprises a term that biases an approximation localized model towards a minimization of a number of input features that maximally contribute to an output score of any point by a machine learning model, in approximating the contributions of the input features to the output score of the any point by the machine learning model (Muthukrishnan Page 19, Left coefficients in a linear regression are shrunk to zero based on a controllable parameter coefficient lambda on the regularization term 
[image: media_image22.png]
)
It would have been obvious for one of ordinary skill in the art before the filing date of the claimed invention to have combined the method of a local ridge regression approximation of a machine learning model disclosed by Di (411) in view of Ribeiro with the lambda parameter of the ridge regression model that shrinks coefficients to zero, as disclosed by Muthukrishnan.  One of ordinary skill in the art would have been motivated to make this modification in order to desirably shrink the number of non-zero coefficients in the ridge regression model (Muthukrishnan Page 19, Left)

With regard to claim 14, Di (411) in view of Ribeiro, and further in view of Muthukrishnan teaches all the elements of the parent claim 13, and further teaches:
wherein the term biases the ridge regression model towards a minimization of a number of the input features that maximally contribute to the output score of the any point by the machine learning model, in approximating the contributions of the input features to the output score of the any point by the machine learning model (Muthukrishnan Page 19, Left coefficients in a linear regression are shrunk to zero based on a controllable parameter coefficient lambda on the regularization term)

With regard to claim 16, Di (411) in view of Ribeiro, and further in view of Muthukrishnan teaches all the elements of the parent claim 13, and further teaches:
wherein the approximating localized model comprises a linear regression model (Di (411) Paragraph 45 the machine model is approximated by a linear representation) 

With regard to claim 17, Di (411) in view of Ribeiro, and further in view of Muthukrishnan teaches all the elements of the parent claim 16, and further teaches:
wherein the linear regression model is a lasso regression model. (Muthukrishnan Page 19 
[image: media_image23.png]
)
It would have been obvious for one of ordinary skill in the art before the filing date of the claimed invention to have combined the method of a local ridge regression approximation of a machine learning model disclosed by Di (411) in view of Ribeiro with the alternative linear regressions models disclosed by Muthukrishnan.  One of ordinary skill in the art would have been motivated to make this modification in order to improve the regression in the presence of errors (Muthukrishnan Page 19, Left)

With regard to claim 18, Di (411) in view of Ribeiro, and further in view of Muthukrishnan teaches all the elements of the parent claim 16, and further teaches:
wherein the linear regression model is a ridge regression model (Muthukrishnan Page 19, Left, and Abstract “The traditional procedures such as Ordinary Least Squares (OLS) regression, Stepwise regression and partial least squares regression are very sensitive to random errors. Many alternatives have been established in the literature during the past few decades such as Ridge regression and LASSO and its variants.”)
It would have been obvious for one of ordinary skill in the art before the filing date of the claimed invention to have combined the method of a local ridge regression approximation of a machine learning model disclosed by Di (411) in view of Ribeiro with the alternative linear regressions models disclosed by Muthukrishnan.  One of ordinary skill in the art would have been motivated to make this modification in order to improve the regression in the presence of errors (Muthukrishnan Page 19, Left)
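
For orientation, the effect of the lambda parameter under the two penalties named in Muthukrishnan (ridge and lasso) can be compared in the simplest one-coefficient case. The sketch below assumes an unpenalized least-squares estimate b and the objectives stated in the comments; the function names are hypothetical, and the closed forms hold for these specific objectives only.

```python
def ridge_shrink(b, lam):
    """Minimizer over x of 0.5 * (b - x)**2 + 0.5 * lam * x**2."""
    # The estimate is scaled toward zero by a factor of 1 / (1 + lam).
    return b / (1.0 + lam)


def lasso_shrink(b, lam):
    """Minimizer over x of 0.5 * (b - x)**2 + lam * abs(x) (soft-thresholding)."""
    # Estimates with magnitude at most lam are set exactly to zero.
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0
```

Under these objectives, an L1 penalty sets small coefficients exactly to zero, whereas an L2 penalty scales every coefficient toward zero by the same factor.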

Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Di (411) in view of Ribeiro, and further in view of Dasgupta, and further in view of Muthukrishnan.  Di (411), Ribeiro, Dasgupta and Muthukrishnan are analogous art because they solve the same problem of analyzing features used in a model, and since they are in the same field of machine learning.

With regard to claim 7, Di (411) in view of Ribeiro, and further in view of Dasgupta teaches all the elements of the parent claim 3, and does not appear to explicitly disclose: wherein the loss function comprises a term that biases the ridge regression model towards a minimization of a number of the input features that maximally contribute to the output score of the any point by the machine learning model, in approximating the contributions of the input features to the output score of the any point by the machine learning model.

However Muthukrishnan teaches:
wherein a loss function comprises a term that biases a ridge regression model towards a minimization of a number of input features that maximally contribute to an output score of any point by a machine learning model, in approximating the contributions of the input features to the output score of the any point by the machine learning model (Muthukrishnan Page 19, Left: coefficients in the ridge regression are shrunk to zero based on a controllable parameter [image: media_image22.png])
It would have been obvious for one of ordinary skill in the art before the filing date of the claimed invention to have combined the method of a local ridge regression approximation of a machine learning model disclosed by Di (411) in view of Ribeiro, and further in view of Dasgupta with the lambda parameter of the ridge regression model that shrinks coefficients to zero, as disclosed by Muthukrishnan.  One of ordinary skill in the art would have been motivated to make this modification in order to desirably shrink the number of non-zero coefficients in the ridge regression model (Muthukrishnan Page 19, Left).
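For illustration only, the effect of a controllable penalty parameter on ridge regression coefficients can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Muthukrishnan or from the application: the function name `ridge_coefficients`, the synthetic data, and the chosen penalty values are all hypothetical, and the closed-form solution used is the standard textbook one.

```python
import numpy as np

def ridge_coefficients(X, y, lam):
    """Standard closed-form ridge solution: (X^T X + lam * I)^{-1} X^T y.

    Hypothetical helper for illustration; `lam` is the penalty parameter.
    """
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Synthetic data (an assumption for this sketch, not data from any reference).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
true_w = np.array([2.0, -1.0, 0.5, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=50)

# Larger penalty values pull the fitted coefficients closer to zero.
lambdas = [0.0, 1.0, 10.0, 100.0]
norms = [float(np.linalg.norm(ridge_coefficients(X, y, lam))) for lam in lambdas]
for lam, norm in zip(lambdas, norms):
    print(f"lambda={lam:6.1f}  coefficient norm={norm:.4f}")
```

In this sketch the overall magnitude of the coefficient vector decreases as the penalty parameter grows, which is the behavior the cited passage attributes to the controllable parameter.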

Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Di (411) in view of Ribeiro, and further in view of Muthukrishnan, and further in view of Dasgupta.  Di (411), Ribeiro, Muthukrishnan and Dasgupta are analogous art because they solve the same problem of analyzing features used in a model, and since they are in the same field of machine learning.

With regard to claim 15, Di (411) in view of Ribeiro, and further in view of Muthukrishnan teaches all the elements of the parent claim 13, and does not appear to explicitly disclose: wherein the loss function comprises a Kullback-Leibler (KL) divergence term. 

However Dasgupta teaches:
wherein the loss function comprises a Kullback-Leibler (KL) divergence term. (Dasgupta Page 159: the squared error loss term in the regression model can be desirably replaced by a KL divergence term [image: media_image7.png], and Page 156: the ridge regression model has an explicitly squared error term [image: media_image8.png])
It would have been obvious for one of ordinary skill in the art before the filing date of the claimed invention to have combined the method of a local linear approximation of a machine learning model disclosed by Di (411) in view of Ribeiro, and further in view of Muthukrishnan with the method of using a KL divergence term in the loss function disclosed by Dasgupta.  One of ordinary skill in the art would have been motivated to make this modification in order to improve the optimization of the coefficients (Dasgupta Page 159 “There are various theoretical reasons to defend the use of Kullback-Leibler distance, ranging from information theory to the relevance of logarithmic scoring rule and the location-scale invariance of the distance, as detailed in Bernardo and Smith (1994).”)
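For illustration only, the difference between a squared error term and a KL divergence term can be sketched for two discrete score distributions. This is a minimal sketch under stated assumptions and is not the formulation given in Dasgupta: the function names, the `eps` clipping constant, and the example score vectors are hypothetical.

```python
import numpy as np

def squared_error(p, q):
    """Sum of squared differences between two score vectors."""
    return float(np.sum((p - q) ** 2))

def kl_divergence(p, q, eps=1e-12):
    """Discrete KL divergence sum(p * log(p / q)).

    Inputs are assumed to be probability vectors; `eps` clipping is a
    numerical-safety assumption of this sketch.
    """
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

# Hypothetical scores: a model's output and a surrogate's approximation of it.
model_scores = np.array([0.7, 0.2, 0.1])
surrogate_scores = np.array([0.6, 0.3, 0.1])

se = squared_error(model_scores, surrogate_scores)
kl = kl_divergence(model_scores, surrogate_scores)
print(f"squared error = {se:.4f}, KL divergence = {kl:.4f}")
```

Either quantity is zero when the two score vectors agree and positive otherwise; swapping one for the other changes how the mismatch is weighted, which is the substitution the cited passage describes.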

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ALFRED H. WECHSELBERGER whose telephone number is (571)272-8988. The examiner can normally be reached M - F, 10am to 6pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Rehana Perveen can be reached on 571-272-3676. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/ALFRED H. WECHSELBERGER/Examiner, Art Unit 2148



/REHANA PERVEEN/Supervisory Patent Examiner, Art Unit 2148