DETAILED ACTION
The applicant’s request for continued examination regarding application number 16/253,892, filed January 22, 2019 has been entered.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on March 4, 2022 has been entered.

Response to Amendments
The amendment filed March 4, 2022 has been entered. Examiner acknowledges receipt of Amendments to Application 16/253,892, which include: Amendments to the Claims, and Remarks containing Applicant’s amendments. 
Regarding Applicant’s Remarks on p.14 and the Amendments to the Claims, Examiner acknowledges Claims 1-2, 4, 10, 14-15, and 17-18 have been amended by the Applicant. Claims 1-20 remain pending in the application.
Regarding Applicant’s Remarks on p.14 and the Amendments to the Claims, Examiner acknowledges Applicant has resolved the objection in Claim 10 previously set forth in the Final Office Action mailed December 13, 2021, and therefore the earlier identified claim objection in Claim 10 is now withdrawn.

Response to Arguments
Examiner acknowledges receipt of Arguments to Application 16/253,892, which include: Remarks containing Applicant’s arguments. 
Regarding Applicant’s Remarks for Claims 1, 3-4, 14, and 16-20 under 35 U.S.C. 103 as being unpatentable over Bien et al., Classification by Set Cover: The Prototype Vector Machine, August 17 2009 [hereafter referred as Bien], in view of Parades et al., Learning prototypes and distances: A prototype reduction technique based on nearest neighbor error minimization, 2005 [hereafter referred as Parades], Examiner has considered Applicant’s arguments and finds them not persuasive. Examiner also notes that Applicant has amended the independent and dependent claims, which necessitates further examination and re-evaluation of the amended and related original claims. The updated claim mappings according to Applicant’s amended claims are provided in the relevant sections indicated below.
Examiner notes that the main scope of Applicant’s arguments is directed towards the amended claim limitations, which were not present in the original set of claims. However, Examiner also notes the following sub-arguments containing assertions that need to be addressed.
Examiner notes the first sub-argument (see underlined) from the Applicant:
“… Furthermore, the cited portions of Bien teach using K-means on each class's points (or on the training set as a whole) and adding L∙K centroids to a set of unlabeled points from which prototypes are selected (e.g., augmenting the set of observed points to include additional points). Id., p. 9, section 4.2.
Adding L∙K centroids to a set of unlabeled points from which prototypes are selected, as in Bien, is not the same as the above-cited limitations. In particular, at no point does Bien teach "generating a plurality of gradients ... , wherein a gradient of the plurality of gradients is based on two adjacent prototypes to a data point in the label space," as in claim 1. (Emphasis added). Rather, Bien merely describes using K-means on a class's points to add additional points to a set of unlabeled data points from which prototypes are selected. Indeed, Bien does not teach anything related to generating a gradient based on prototypes.”. 
Examiner has considered this argument and finds the argument to be not persuasive. Examiner notes that the above sub-argument is directed to the Bien reference, which was used to teach the following recited claim limitations from independent Claim 1:
“determining a set of prototypes by:
identifying, by at least one processor, the features of the plurality of data points used to generate a plurality of outputs via a machine learning model;
mapping, by the at least one processor, the features of the plurality of data points to a feature space and the plurality of outputs to a label space;
determining, by the at least one processor, distances between the plurality of data points in the feature space and the label space; and
determining, by the at least one processor, the set of prototypes from the plurality of data points based on the distances between the plurality of data points in the feature space and the label space …”
Examiner notes that Applicant appears to assert that the Bien reference was additionally used to teach the following limitation from independent Claim 1: “generating a plurality of gradients based on the plurality of data points and corresponding adjacent prototypes of the set of prototypes”. However, Examiner points to the Final Office Action mailed December 13, 2021, which clearly shows that the Parades reference, not the Bien reference, was used to teach the identified gradient limitation; hence that portion of the sub-argument is moot. For the remainder of this response, the Examiner will therefore focus on the above underlined sub-argument in relation to the Bien reference.
Examiner notes that Applicant’s sub-argument is directed to the definition of the term “label space”, where Applicant asserts that the L∙K centroids added to 𝒵 as taught in Bien do not teach a label space as recited in independent Claim 1. Examiner reminds Applicant that MPEP 2111 requires that the pending claims must be given their broadest reasonable interpretation consistent with the specification, and an Examiner must construe claim terms in the broadest reasonable manner during prosecution as is reasonably allowed in an effort to establish a clear record of what applicant intends to claim. Under its broadest reasonable interpretation, the term “a label space” is simply a set of labels. This interpretation is consistent with Applicant’s definition of a label space, as provided in paragraph [0005]: “… a label space (e.g., a space mapping output labels of the machine learning model).”. As indicated in the Final Office Action mailed December 13, 2021, the Bien reference is used to teach the above two limitations, where the Bien reference applies the set cover integer program to determine distances d(x_i, z_j) within an epsilon-ball radius ϵ between a plurality of data points in the feature space (Bien p.3 Section 1.1. The set cover integer program 1st paragraph: “Consider the two sets 𝒳 and 𝒵 … The goal is to find the smallest subset of points 𝒫 ⊆ 𝒵 such that every point x_i ∈ 𝒳 is within ϵ of some point in 𝒫 (i.e., there exists z_j ∈ 𝒫 with d(x_i, z_j) < ϵ). Let B_ϵ(x) = {x’ ∈ ℝ^p : d(x’, x) < ϵ} denote the ball of radius ϵ centered at x … From a machine learning point of view, set cover can be seen as a clustering problem in which we wish to find the smallest number of clusters such that every point is within ϵ of at least one cluster center.”; and p.3 Section 2 The prototype vector machine 1st paragraph: “The prototype vector machine is an extension of the set cover problem to the supervised learning context …”). Bien further augments 𝒵 to include L∙K centroid points (where L is defined as a set of class labels representing a label space (Bien p.1 Introduction 1st paragraph: “… corresponding class labels y_1, …, y_n ∈ {1, …, L} …”), with each class l being represented by K centroids as data points, hence representing a label space of L∙K centroids), and applies the same prototype vector machine/set cover integer program taught in the Bien reference to determine distances d(x_i, z_j) within an epsilon-ball radius ϵ for the label space (Bien p.9 Section 4.2 Prototypes not on training points, 2nd paragraph: “… Another inherent flexibility of the PVM is in the choice of 𝒵, the set of potential prototypes. While 𝒵 = 𝒳 is a standard choice, we have experimented with other possibilities as well. … 𝒵 may be further augmented to include other points … one could run K-means on each class's points individually (or on the training set as a whole) and add these L∙K centroids to 𝒵. … Another successful choice for 𝒵 is to sample uniformly within the convex hull of each class's training points.”). A person having ordinary skill in the art would understand that the above citations in the Bien reference, when taken together, demonstrate that the set of unlabeled points 𝒵 from which the set of prototypes is formed using the prototype vector machine can also include examination of data points in both the feature space and the label space. Hence, in view of the evidence provided above, Applicant’s argument is not persuasive, and the existing prior art claim rejection is maintained.
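For illustration only (this sketch is not part of the cited reference or of the record), the ϵ-ball covering idea quoted above from Bien Section 1.1 can be approximated in a few lines of Python. The sketch assumes Euclidean distance and substitutes the classic greedy set cover heuristic for solving the integer program itself; the function name greedy_epsilon_cover and the sample points are hypothetical.

```python
import math

def greedy_epsilon_cover(X, Z, eps):
    """Greedily choose candidates from Z so each point of X lies within eps of one.

    Hypothetical helper: a greedy stand-in for the set cover integer program,
    not the formulation used in the cited reference.
    Returns (indices of chosen candidates, indices of points left uncovered).
    """
    # cover[j] holds the indices of points in X strictly within eps of Z[j]
    cover = [{i for i, x in enumerate(X) if math.dist(x, z) < eps} for z in Z]
    uncovered = set(range(len(X)))
    chosen = []
    while uncovered:
        # pick the candidate that covers the most still-uncovered points
        j = max(range(len(Z)), key=lambda k: len(cover[k] & uncovered))
        gained = cover[j] & uncovered
        if not gained:
            break  # no candidate can cover the remaining points
        chosen.append(j)
        uncovered -= gained
    return chosen, uncovered

# Assumed toy data: two well-separated pairs of points, with Z = X
X = [(0.0, 0.0), (0.5, 0.0), (4.0, 0.0), (4.5, 0.0)]
chosen, uncovered = greedy_epsilon_cover(X, Z=X, eps=1.0)
```

With these assumed inputs, two candidates (indices 0 and 2) cover all four points, which mirrors the quoted goal of finding a small subset 𝒫 ⊆ 𝒵 whose ϵ-balls contain every point of 𝒳.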
Examiner notes the second sub-argument (see underlined) from the Applicant:
“… Paredes describes a nearest-neighbor classification method for selecting and evaluating prototypes. Paredes, Introduction. Specifically, Parades teaches simultaneously training a reduced set of prototypes and a local metric for the prototypes. Id., p. 180, col. 2, last paragraph. For example, Paredes teaches selecting prototypes via random selection and iteratively adjusting features of prototypes and corresponding local-metric weights. Id., p. 181, col. 1, first paragraph. Paredes also teaches determining a nearest-neighbor error based on weighted distances from vectors in the representation space to prototypes. Id., p. 181, col. 1, section 2-col. 2, section 2.1. Paredes further teaches utilizing gradient descent to minimize the nearest-neighbor error by visiting each prototype in a training set and updating positions and weights associated with same class and different-class nearest-neighbors of the prototype. Id., p. 181, col. 2, section 2.1-p.183, col. 1.
Utilizing nearest-neighbor classification to select and evaluate prototypes, as in Paredes, is not the same as the previously cited limitations. In particular, at no point does Paredes teach "generating a plurality of gradients ... , wherein a gradient of the plurality of gradients is based on two adjacent prototypes to a data point in the label space," as in claim 1. (Emphasis added). Rather, Paredes merely describes determining a nearest-neighbor error based on weighted distances from vectors in the representation space to prototypes. Additionally, Paredes merely describes utilizing gradient descent to minimize the error by visiting each prototype and updating positions and weights associated with same-class and different-class neighbors of a prototype.”
Examiner has considered this argument and finds the argument to be not persuasive. Examiner notes that the above sub-argument is directed to the Parades reference, which was used to teach the following recited claim limitations from independent Claim 1:
“determining, by the at least one processor using the set of prototypes, an impact of the features within the machine-learning model by:
generating a plurality of gradients based on the plurality of data points and corresponding adjacent prototypes of the set of prototypes; and
determining rank orders for the features of the plurality of data points according to locally sensitive directions of the features based on the plurality of gradients.”
Examiner notes that Applicant’s sub-argument is directed to the definition of “adjacent prototypes”, where Applicant asserts that identifying nearest-neighbor prototypes for generating gradients taught in the Parades reference is not equivalent to (i.e., within the same scope as) identifying corresponding adjacent prototypes for generating gradients as recited in independent Claim 1, but fails to explain this difference as recited in the earlier claim limitation. Examiner reminds Applicant that MPEP 2111 requires that the pending claims must be given their broadest reasonable interpretation consistent with the specification, and an Examiner must construe claim terms in the broadest reasonable manner during prosecution as is reasonably allowed in an effort to establish a clear record of what applicant intends to claim. Under its broadest reasonable interpretation, the term “adjacent” as defined by Merriam-Webster dictionary broadly indicates a relative position of an item of interest, such that it represents a “nearness” or “nearby” location to an item of interest. The term “nearest neighbor” is an art term used to describe and identify something within a certain proximity. Hence, these two terms “adjacent” and “nearest neighbor” are synonyms of each other, with both definitions used to represent the same general concept of a proximity or nearness of an item of interest, where in the context of the claim limitation, the item of interest is one or more data points or prototypes. Examiner further notes that the term “adjacent prototypes” used in the independent claim does not further limit or restrict the degree or extent of adjacency, but rather it is used to broadly indicate that the nearest/adjacent prototypes being analyzed and evaluated must be near or in proximity to an identified prototype of interest. 
Examiner further points to Applicant’s specification paragraphs [0062] and [0113], which use the terms “nearest” and “(k) nearest neighbors” interchangeably to describe the relative locations of these nearest/adjacent prototypes being evaluated ([0062]: “… the model analysis system 102 can determine, for a given test point {x_i, y_i}, the nearest prototype x_l with y_l < y_i (i.e., the closest prototype lower than the test point) and the nearest prototype x_u with y_u > y_i (i.e., the closest prototype higher than the test point). This can be viewed as similar to setting a classification decision boundary at g(x) = y_i … the model analysis system 102 can determine the k nearest neighbors of the chosen test point and look at the mean gradients of the outputs of the k nearest neighbors to determine the impact of the features within the machine-learning model.” and [0113]: “Act 710 can involve determining, for a selected data point of the plurality of data points, an adjacent prototype to the selected data point, and generating a gradient corresponding to the selected data point and the adjacent prototype. Act 710 can then involve determining, using the gradient, the impact of the features of the plurality of data points within the machine-learning model.”). Hence, Applicant’s sub-argument indicating that identifying nearest neighbors (as taught in Parades) is not the same as identifying adjacent prototypes is not persuasive.
Referring back to the Applicant’s claim limitation in independent Claim 1 (“generating a plurality of gradients based on the plurality of data points and corresponding adjacent prototypes of the set of prototypes”), the limitation broadly recites generating a plurality of gradients based on a plurality of data points and corresponding adjacent prototypes. As indicated earlier, Examiner points out that the limitation in the independent claim does not restrict or require that the adjacent prototypes are defined, constructed, or selected in a certain way to calculate the gradients, as it only indicates that the gradients are generated based on a plurality of data points and corresponding adjacent prototypes. As indicated in the Final Office Action mailed December 13, 2021, Parades teaches a method for learning prototypes and distances (LPD) that first performs nearest neighbor classification to define groupings of same-class nearest neighbors and different-class nearest neighbors, where these nearest neighbor prototypes from each group are then further selected through a gradient descent-based calculation that minimizes the nearest neighbor error between each nearest neighbor and the selected prototype, with the gradient updates eventually identifying a reduced set of prototypes with minimized error estimations that are sufficiently close to decision boundaries (Parades p.181 col.1 1st paragraph: “… learning prototypes and distances (LPD). It starts with an initial selection of a small number of randomly selected prototypes … it iteratively adjust both the position (features) of the prototypes themselves and the corresponding local-metric weights, so that the resulting combination of prototypes and metric minimizes a suitable estimation of the probability of classification error. The adjustment rules are derived by solving the minimization problem through gradient descent.”; p.181 col.1 last paragraph-col.2 2nd paragraph (Section 2. Approach): “… We seek to use T to obtain a reduced set of prototypes, P = {y_1, …, y_N} ⊂ E, n ≪ N, and a suitable weighted distance d: E × P → ℝ associated to P, which optimize the NN classification performance.”; p.182 col.2 Section 2.1 Learning the prototypes and their weights: “… Using these derivatives leads to the corresponding gradient descent update equations. A simple manner to implement these equations is by visiting each prototype x in T and updating the positions and the weights associated with the same-class and different-class NNs of x.”; p.181 Figure 1 Algorithm LPD; and p.183 col.1 2nd paragraph: “The effects of the update equations in the LPD algorithm are intuitively clear. … Since these update steps are weighted by the distance ratio, r(x), their importance depends upon the relative proximity of x to y_x^= or y_x^≠. … this way, only those prototypes (and their weights) which are sufficiently close to the decision boundaries are actually updated.”). As shown in the above recited sections, the Parades reference teaches generating gradients based on a plurality of data points and nearest neighbor prototypes through use of the LPD algorithm, where the LPD algorithm is an iterative algorithm that calculates gradient updates for prototypes from two groups of nearest neighbors (same-class and different-class nearest neighbors) through minimizing the weighted distances between the nearest neighbors, with the end result producing a reduced set of prototypes representing those prototypes closest to decision boundaries. Furthermore, as a side note, Applicant’s specification also describes a gradient generation method involving analyzing nearest neighbor prototypes and evaluating them based on distances, thus indicating that their gradient generation method is also based on a plurality of data points and corresponding nearest neighbor prototypes ([0062]: “… the model analysis system 102 can determine, for a given test point {x_i, y_i}, the nearest prototype x_l with y_l < y_i (i.e., the closest prototype lower than the test point) and the nearest prototype x_u with y_u > y_i (i.e., the closest prototype higher than the test point). This can be viewed as similar to setting a classification decision boundary at g(x) = y_i … the model analysis system can use ∇̂g(x_i) = … [see equation in [0062]] which is a pseudo gradient … the model analysis system 102 can determine the k nearest neighbors of the chosen test point and look at the mean gradients of the outputs of the k nearest neighbors to determine the impact of the features within the machine-learning model.”). Hence, in view of the evidence provided above, Applicant’s sub-argument asserting that the gradient generation based on data points and corresponding nearest neighbor prototypes taught in Parades is not within the same scope as the gradient generation recited in the claim limitation is not persuasive, and the existing prior art claim rejection is maintained.
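For illustration only, the same-class and different-class nearest neighbor prototypes and the distance ratio r(x) mentioned in the quoted passages from Parades can be sketched in Python. The sketch makes two assumptions that are not taken from the quoted text: r(x) is computed as the distance to the same-class nearest prototype divided by the distance to the different-class nearest prototype, and plain Euclidean distance replaces the weighted distance of the reference. The gradient descent update equations themselves are not reproduced, and the function name nn_distance_ratio is hypothetical.

```python
import math

def nn_distance_ratio(x, label, prototypes):
    """Find the same-class and different-class nearest prototypes of x.

    prototypes is a list of (position, class_label) pairs and must contain at
    least one prototype of each kind. Returns (y_same, y_diff, r), where r is
    the assumed ratio d(x, y_same) / d(x, y_diff) using unweighted Euclidean
    distance (a simplification of the weighted distance in the reference).
    """
    same = [p for p in prototypes if p[1] == label]
    diff = [p for p in prototypes if p[1] != label]
    y_same = min(same, key=lambda p: math.dist(p[0], x))
    y_diff = min(diff, key=lambda p: math.dist(p[0], x))
    r = math.dist(y_same[0], x) / math.dist(y_diff[0], x)
    return y_same, y_diff, r

# Assumed toy data: three prototypes from two classes on a line
prototypes = [((0.0, 0.0), "a"), ((4.0, 0.0), "b"), ((9.0, 0.0), "a")]
y_same, y_diff, r = nn_distance_ratio((1.0, 0.0), "a", prototypes)
```

Under these assumptions, a ratio well below 1 indicates a point much closer to its own class's prototype than to any other class's prototype, while a ratio near 1 indicates a point near a decision boundary, which is the region the quoted passage says the updates concentrate on.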
As noted above, Applicant’s remaining arguments are directed to the newly amended claim limitations, such that it necessitates further examination and re-evaluation of the amended and related original claims. The updated claim mappings according to the Applicant’s amended claims are provided in the relevant sections indicated below.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1, 3-4, 14, and 16-20 are rejected under 35 U.S.C. 103 as being unpatentable over Bien et al., Classification by Set Cover: The Prototype Vector Machine, arXiv:0908.2284v1, August 17, 2009, pp.1-24 [hereafter referred as Bien] in view of Parades et al., Learning prototypes and distances: A prototype reduction technique based on nearest neighbor error minimization, Pattern Recognition 39 (2006), Elsevier Ltd., 2005, pp.180-188 [hereafter referred as Parades].
Regarding amended Claim 1, 
Bien teaches
(Currently Amended) In a digital medium environment for machine-learning interpretation, a computer-implemented method of prototype selection and analysis to determine feature sensitivity comprising:
determining a set of prototypes by:
mapping, by the at least one processor, features of a plurality of data points to a feature space and a plurality of outputs generated via a machine learning model from the plurality of data points to a label space (Examiner’s note: Under its broadest reasonable interpretation, the terms “feature space” and “label space” broadly recite a set of features and a set of labels, where this limitation broadly recites identifying a set of data points to a feature space and identifying a plurality of outputs to a label space, respectively. Bien teaches generating the collection P_1, …, P_L from a prototype vector machine (PVM), where this collection represents a summary of the training set 𝒳 ⊂ ℝ^p (where ℝ^p corresponds to a feature space, and hence corresponds to “a plurality of data points to a feature space”) and the associated class labels y_1, …, y_n ∈ {1, …, L} (where the associated class labels {1, …, L} correspond to a label space), where Bien further teaches augmenting 𝒵 to include class label points, where these class label points correspond to “a plurality of outputs to a label space”. Hence, this process of identifying data points to a feature space and class labels to a label space represents a mapping of those respective data points and class labels to their respective spaces, such that this generated summary corresponds to the “mapping … the features of the plurality of data points to a feature space and the plurality of outputs to a label space” aspect in the context of “determining a set of prototypes” (Bien p.1 last paragraph-p.2 4th paragraph (Section 1. Introduction): “Suppose we are given a training set of points 𝒳 = {x_1, …, x_n} ⊂ ℝ^p with corresponding class labels y_1, …, y_n ∈ {1, …, L} and in addition, a set of unlabeled points 𝒵 = {z_1, …, z_m} ⊂ ℝ^p. Our goal is to choose a relatively small set of prototypes P_l ⊆ 𝒵 for each class l in such a way that the collection P_1, …, P_L represents a summary or distillation of the training set (i.e., someone given only P_1, …, P_L would have a good sense of the original training data, 𝒳 and y) … In this paper, we introduce the prototype vector machine (PVM), which describes a particular choice for the sets P_1, …, P_L.”; and p.9 Section 4.2. Prototypes not on training points, 2nd paragraph: “… 𝒵 may be further augmented to include other points. For example, one could run K-means on each class’s points individually (or on the training set as a whole) and add these L∙K centroids to 𝒵.”). Bien further teaches that this method involving the prototype vector machine (PVM) is used to perform analysis on several datasets and comparisons with other prototype methods using various R packages and machine learning datasets (Bien pp.10-17 Section 6. Examples on simulated and real data, Section 6.4. UCI data sets), where these R packages are code modules running on a computer (where a computer contains a processor and non-transitory memory) processing the inputs and outputs associated with the machine learning datasets, thus corresponding to the “by at least one processor” and “via a machine-learning model” aspects of the claim limitation.);
determining, by the at least one processor, distances between the plurality of data points in the feature space and the label space (Examiner’s note: Bien teaches using the prototype vector machine (which is an extension of the set cover integer program, Bien pp.3-6 Section 2. The prototype vector machine) to determine distances between data points in the feature space by analyzing a number of elements in 𝒵 (“plurality of data points”) that are within a distance ϵ of a given data point x_i, where the determination of these distances corresponds to “determining … distances between the plurality of data points in the feature space …” (Bien p.3 1st paragraph Section 1.1. The set cover integer program: “The goal is to find the smallest subset of points 𝒫 ⊆ 𝒵 such that every point x_i ∈ 𝒳 is within ϵ of some point in 𝒫 (i.e., there exists z_j ∈ 𝒫 with d(x_i, z_j) < ϵ). Let B_ϵ(x) = {x’ ∈ ℝ^p : d(x’, x) < ϵ} denote the ball of radius ϵ centered at x. … From a machine learning point of view, set cover can be seen as a clustering problem in which we wish to find the smallest number of clusters such that every point is within ϵ of at least one cluster center.”). Bien further teaches augmenting 𝒵 to include class label points and applying the same set cover integer program to determine distances for the label space, where the determination of these distances corresponds to “determining … distances between the plurality of data points in … a label space” (Bien p.9 Section 4.2. Prototypes not on training points, 2nd paragraph). As indicated earlier, Bien further teaches that this method involving the prototype vector machine (PVM) is used to perform analysis on several datasets and comparisons with other prototype methods using various R packages (Bien pp.10-17 Section 6. Examples on simulated and real data), where these R packages are code modules running on a computer (where a computer contains a processor and non-transitory memory), thus corresponding to the “by at least one processor” aspect of the claim limitation.); and
determining, by the at least one processor, the set of prototypes from the plurality of data points based on the distances between the plurality of data points in the feature space and the label space (Examiner’s note: Bien teaches implementing the prototype vector machine (PVM) to solve the set cover integer program by determining a minimum set of data points (“plurality of data points”; “one or more prototypes”), where according to Bien Figure 1, the data points within a prototype region are as close to (“adjacent”) each other as possible (where each prototype region containing adjacent data points is represented by an ϵ-ball) (Bien pp.3-5 Section 2. The prototype vector machine and Section 2.1 PVM as an integer program: “The PVM seeks a set of prototypes for each class that is optimal … that will be made precise in what follows. For a given choice of P_l ⊆ 𝒵, we consider the set of ϵ-balls centered at each z_j ∈ P_l (see Figure 1). A desirable prototype set for class l is one that induces a set of balls which (a) covers as many training points of class l as possible, (b) covers as few training points as possible of classes other than l, and (c) is sparse (i.e., uses as few prototypes as possible for the given ϵ). … We now express the three properties above as an integer program, taking as a starting point the set cover problem of Equation 2. … We define the PVM to be a solution to the following integer program: <Bien p.5 equations (3a) (3b)> …”). Bien further teaches a greedy algorithm that approximates the solution to the set cover integer program by iteratively (Bien p.8 algorithm, line 2 while loop) adding data points from 𝒵 (“prototypes from the plurality of data points”) represented by a feature-space/label-space pair (z_j, l) that have the least ratio of cost to number of points newly covered (Bien p.8 equations for ∆ξ, ∆η, and ∆Obj), where these calculations used for determining ∆Obj = ∆ξ − ∆η − λ in this greedy algorithm over the set of data points in 𝒵 correspond to “determining … a set of prototypes from the plurality of data points based on the distances between the plurality of data points in the feature space and the label space” (Bien pp.7-8 Section 3.2 A greedy approach: “At each step, we add the prototype that has the least ratio of cost to number of points newly covered. … At each step we find the z_j ∈ 𝒵 and class l for which adding z_j to P_l most decreases the objective function. That is, we find the (z_j, l) pair with the best tradeoff of covering previously uncovered training points of class l while avoiding covering points of other classes.”). Bien further teaches that this method involving the prototype vector machine (PVM) is used to perform analysis on several datasets and comparisons with other prototype methods using various R packages (Bien pp.10-17 Section 6. Examples on simulated and real data), where these R packages are code modules running on a computer (where a computer contains a processor and non-transitory memory), thus corresponding to the “by at least one processor” aspect of the claim limitation.);
determining, by the at least one processor using the set of prototypes, an impact of the features within the machine-learning model (Examiner’s note: According to applicant’s specification paragraph [0032], the term “impact” is defined as “a measure of change to an output of a machine-learning model as a result of a feature input to the machine-learning model … the model analysis system can determine impact using a variety of different measures, including … a number of prototypes within a label space”. Bien teaches that generating a set of prototypes allows for ease of interpretability, through the identification of a representative sample of data points for each class, as well as capturing a full spread of variation within a class and between other classes (Bien p.2 2nd paragraph (Section 1. Introduction): “Having a well-selected set of prototypes P_1, …, P_L ⊆ 𝒵 is advantageous for two main reasons: interpretability and classification. For domain specialists, examining a handful of representative examples of each class can be highly informative especially when n is large … a well-chosen set P_l ⊆ 𝒵 of prototypes for class l should capture the full spread of variation within this class while also taking into account how class l differs from other classes.”), where this interpretability enables domain specialists to further analyze and extract additional information from the prototypes providing a representative sample of data points for each class (corresponding to a number of prototypes within a label space), thus providing a method for “determining … an impact of the features within the machine-learning model”. As indicated earlier, Bien further teaches that this method involving the prototype vector machine (PVM) is used to perform analysis on several datasets and comparisons with other prototype methods using various R packages (Bien pp.10-17 Section 6. Examples on simulated and real data), where these R packages are code modules running on a computer (where a computer contains a processor and non-transitory memory), thus corresponding to the “by at least one processor” aspect of the claim limitation.) …
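For illustration only, the greedy selection that the cited passage of Bien Section 3.2 describes (repeatedly adding the (z_j, l) pair with the best tradeoff ∆Obj = ∆ξ − ∆η − λ) can be sketched as follows. This is a minimal sketch under stated assumptions and is not Bien's R implementation: the function name `greedy_prototypes`, the Euclidean distance, the single scalar penalty `lam`, and the stopping rule (stop when no pair has a positive gain) are all hypothetical choices made for this example.

```python
import math

def greedy_prototypes(X, y, Z, eps, lam):
    """Hypothetical sketch: greedily pick (candidate index, class) pairs.

    X: training points, y: their class labels, Z: candidate prototypes,
    eps: ball radius, lam: assumed per-prototype penalty.
    """
    labels = sorted(set(y))
    # covers[j] = indices of training points strictly inside the eps-ball of Z[j]
    covers = [{i for i, x in enumerate(X) if math.dist(x, z) < eps} for z in Z]
    covered = set()  # training points already covered by a same-class prototype
    prototypes = {l: [] for l in labels}
    while True:
        best, best_gain = None, 0.0
        for j in range(len(Z)):
            for l in labels:
                if j in prototypes[l]:
                    continue
                # newly covered points of class l
                d_xi = sum(1 for i in covers[j] if y[i] == l and i not in covered)
                # covered points belonging to other classes
                d_eta = sum(1 for i in covers[j] if y[i] != l)
                gain = d_xi - d_eta - lam
                if gain > best_gain:
                    best, best_gain = (j, l), gain
        if best is None:  # assumed stopping rule: no pair improves the objective
            break
        j, l = best
        prototypes[l].append(j)
        covered |= {i for i in covers[j] if y[i] == l}
    return prototypes
```

With two well-separated two-point classes and the training points themselves used as candidates, the sketch selects one prototype per class, which mirrors the sparsity property (c) quoted above.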
While Bien teaches using a prototype vector machine to determine an impact of features, as well as suggesting other related adaptive prototype methods such as learning vector quantization (LVQ) involving gradients (Bien p.10 3rd paragraph), Bien does not explicitly teach 
determining … by generating a plurality of gradients for the plurality of data points and corresponding adjacent prototypes of the set of prototypes …
… wherein a gradient of the plurality of gradients is based on two adjacent prototypes to a data point in the label space.
Parades teaches
determining … by generating a plurality of gradients for the plurality of data points and corresponding adjacent prototypes of the set of prototypes (Examiner’s note: Under its broadest reasonable interpretation, this limitation broadly recites generating a plurality of gradients based on a plurality of data points and corresponding adjacent prototypes, where the term “adjacent prototypes” broadly recites prototypes that are within a proximity or nearness to an identified data point. Parades teaches a nearest-neighbor classification based method for learning prototypes and distances (LPD) involving calculations of gradients based on applying a gradient descent procedure to minimize the estimate of the nearest-neighbor error ratio based on weighted distances of data points x to a prototype y, where the weight w represents a weight associated with a feature j for each prototype, resulting in these data points x and the same-class and different-class nearest-neighbors of y corresponding to “the plurality of data points and corresponding adjacent prototypes of the set of prototypes” (Parades p.181 col.1 Section 2. Approach including equation (1); and pp.181-182 Section 2.1 Learning the prototypes and their weights, including equations (4) and (5)). Parades further teaches approximating the nearest-neighbor error estimate to make it differentiable such that a gradient descent procedure can be applied (with the approximation shown in Parades p.182 equations (4) and (5), and with the corresponding derivatives shown in Parades p.182 equations (7) and (8)). 
Parades teaches that applying these derivatives leads to corresponding gradient descent update equations and the LPD algorithm shown in Parades p.182 Figure 1, where each prototype x in T is visited and the positions and weights associated with the same-class and different-class nearest neighbors of x are updated, eventually resulting in a reduced set of prototypes containing weighted data points (associated with a corresponding feature) that are close to decision boundaries around the given minimum error estimation, with each data point within the reduced set of prototypes reflecting an importance based on relative distances/proximities to same-class or different-class nearest neighbors (Parades p.183 col.1 2nd paragraph-last paragraph, where this data point importance represents an aspect of “determining an impact of the features within the machine-learning model”). Hence, the resulting derivative equations and corresponding LPD algorithm incorporating the gradient descent procedure based on weighted distances between data points and their corresponding nearest neighbor prototypes correspond to “determining … by generating a plurality of gradients for the plurality of data points and corresponding adjacent prototypes of the set of prototypes” (Parades p.181 col.1 last paragraph-col.2 2nd paragraph Section 2. Approach: “Let T = {x_1, …, x_N} be a training set; i.e., a collection of training vectors or class-labeled points x_i ∈ E, 1 ≤ i ≤ N in a suitable representation space E = ℝ^m. … We seek to use T to obtain a reduced set of prototypes, P = {y_1, …, y_n} ⊂ E, n ≪ N, and a suitable weighted distance d: E × P → ℝ associated to P, which optimize the NN classification performance.”; p.181 col.2-p.182 col.2 (Section 2.1 Learning the prototypes and their weights): “… As in previous work [7,12-14], a gradient descent procedure is proposed to minimize this index. … Using these derivatives leads to the corresponding gradient descent update equations. A simple manner to implement these equations is by visiting each prototype x in T and updating the positions and the weights associated with the same-class and different-class NNs of x. This is shown in the procedure presented in Fig.1.”; p.181 Figure 1 Algorithm LPD; and p.183 col.1 2nd paragraph: “The effects of the update equations in the LPD algorithm are intuitively clear. … Since these update steps are weighted by the distance ratio, r(x), their importance depends upon the relative proximity of x to y_x^= or y_x^≠. … this way, only those prototypes (and their weights) which are sufficiently close to the decision boundaries are actually updated.”).) …
… wherein a gradient of the plurality of gradients is based on two adjacent prototypes to a data point in the label space (Examiner’s note: Under its broadest reasonable interpretation, this limitation broadly recites the combination of teachings from the preceding claim limitations by focusing on the data points in the label space. As indicated earlier, Bien teaches augmenting 𝒵 to include class label points, where these class label points correspond to “a plurality of outputs to a label space”, and Bien further teaches using the prototype vector machine/set cover integer program to determine distances and identify a set of nearest prototypes using these class label data points and associated distances (Bien p.1 last paragraph-p.2 4th paragraph (Section 1. Introduction); p.9 Section 4.2. Prototypes not on training points, 2nd paragraph; p.3 1st paragraph Section 1.1. The set cover integer program; pp.3-5 Section 2. The prototype vector machine and Section 2.1 PVM as an integer program). As indicated earlier, Parades teaches calculating gradients based on distances between data points and their corresponding identified nearest neighbor prototypes using an LPD algorithm (Parades p.181 col.1 last paragraph-col.2 2nd paragraph Section 2. Approach; p.181 col.2-p.182 col.2 (Section 2.1 Learning the prototypes and their weights); and p.181 Figure 1 Algorithm LPD), where the selected nearest neighbor prototypes are chosen from a group of same-class nearest neighbors and a group of different-class nearest neighbors to produce a reduced set of prototypes with minimized error estimations that are sufficiently close to decision boundaries (Parades p.183 col.1 2nd paragraph and 5th paragraph).
Hence the combination of Bien (determining distances and identifying nearest prototypes on a set of class label points) and Parades (calculating gradients using distances between data points and their corresponding nearest neighbor prototypes) corresponds to an application of the method described in the preceding claim limitations, where “… a gradient of the plurality of gradients is based on two adjacent prototypes to a data point in the label space”.).
	Both Bien and Parades are analogous art since they both teach prototype selection and classification based on nearest neighbor analysis.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the prototype vector machine taught in Bien and enhance it to incorporate the gradient descent technique demonstrated in the LPD algorithm taught in Parades as a way to generate a plurality of gradients using a plurality of data points and corresponding adjacent prototypes of the set of prototypes to determine an impact of features. The motivation to combine is taught in Parades, since the LPD algorithm is intuitively similar to heuristic procedures identified in other nearest-neighbor algorithms such as LVQ, which have been shown to improve classification accuracy. Furthermore, the fact that the LPD algorithm is based on a mathematical derivation that guarantees convergence towards a local minimum also results in this algorithm producing an optimal solution, and as such, a system that implements this method will not only exhibit improved classification accuracy but also generate an optimal solution, thereby making the system more computationally efficient (Parades p.183 col.1 3rd-4th paragraphs: “The attentive reader will find the above prototype update rules closely related to the so-called reward–punishment rules heuristically introduced in such popular procedures as LVQ1, LVQ2 and DSM [16–18]. … It is remarkable that an intuitive interpretation of the formally derived LPD prototype update rules is so closely related with popular heuristics which, without formal proof of their potential usefulness, have proved quite helpful to improve accuracy in many practical situations. Nevertheless, the advantages of LPD are clear: not only the update policy for prototype positions, but also for the associated metric weights, along with the corresponding smoothing and windowing terms, come from a mathematical derivation which guarantees convergence towards an (approximate) local minimum of the empirical NN error estimation.”).
Regarding previously presented Claim 3, 
Bien in view of Parades teaches
(Previously Presented) The computer-implemented method as recited in claim 1, wherein determining the impact of the features within the machine-learning model comprises: 
determining a number of prototypes in a first region and a number of prototypes in a second region of the feature space or the label space (Examiner’s note: Bien teaches a set of prototypes $P_1$, … , $P_l$ (each corresponding to a prototype region, i.e., “a first region”, “a second region”), with each region consisting of a plurality of data points from a training set (Bien p.1 last paragraph – p.2 first paragraph: “Our goal is to choose a relatively small set of prototypes $P_l$ ⊆ 𝒵 for each class $l$ in such a way that the collection $P_1$, … , $P_l$ represents a summary or distillation of the training set…”). Bien further teaches using the set cover integer program to determine distances between data points in the feature space by analyzing a number of elements in 𝒵 that are within a distance ϵ of a given data point $x_i$, hence also defining a number of prototypes within each region defined by distance ϵ, thus corresponding to a method for “determining a number of prototypes in a first region and a number of prototypes in a second region of the feature space or the label space” (Bien p.3 1st paragraph Section 1.1 The set cover integer program: “The goal is to find the smallest subset of points 𝒫⊆𝒵 such that every point $x_i$ ∈ 𝒳 is within ϵ of some point in 𝒫 (i.e., there exists $z_j$ ∈ 𝒫 with $d(x_i, z_j) < \epsilon$). Let $B_\epsilon(x) = \{x' \in \mathbb{R}^p : d(x', x) < \epsilon\}$ denote the ball of radius ϵ centered at x. … From a machine learning point of view, set cover can be seen as a clustering problem in which we wish to find the smallest number of clusters such that every point is within ϵ of at least one cluster center.”).); and 
determining the impact of the features of the plurality of data points based on the number of prototypes in the first region and the second region (Examiner’s note: According to applicant’s specification paragraph [0032], the term “impact” is defined as “a measure of change to an output of a machine-learning model as a result of a feature input to the machine-learning model. … the model analysis system can determine impact using a variety of different measures, including … a number of prototypes within a label space”. Bien teaches that generating a set of prototypes allows for ease of interpretability, through the identification of a representative sample of data points for each class, as well as capturing a full spread of variation within a class and between other classes (Bien p.2 2nd paragraph (Section 1. Introduction): “Having a well-selected set of prototypes $P_1$, …, $P_L$ ⊆ 𝒵 is advantageous for two main reasons: interpretability and classification. For domain specialists, examining a handful of representative examples of each class can be highly informative especially when n is large … a well-chosen set $P_l$ ⊆ 𝒵 of prototypes for class l should capture the full spread of variation within this class while also taking into account how class l differs from other classes.”), where this interpretability enables domain specialists to further analyze and extract additional information from the prototypes providing a representative sample of data points for each class (corresponding to a number of prototypes within a label space), thus providing a method for “determining an impact of the features” for different sets of prototypes (corresponding to different regions), resulting in this method corresponding to a method for “determining the impact of the features of the plurality of data points based on the number of prototypes in the first region and the second region”.).
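For illustration only, the counting of prototypes inside an ϵ-ball, as described in the cited Bien Section 1.1 passage, may be sketched as follows. The function name `count_in_ball`, the Euclidean distance, and the sample coordinates are hypothetical assumptions chosen for this sketch; they are not taken from Bien, Parades, or the application.

```python
import math

def count_in_ball(center, prototypes, eps):
    """Count prototypes z with d(center, z) < eps.

    Euclidean distance is assumed here; the cited passage only requires
    some dissimilarity measure d.
    """
    return sum(1 for z in prototypes if math.dist(center, z) < eps)

# Hypothetical prototype set and two hypothetical regions (balls of radius 1).
prototypes = [(0.0, 0.0), (0.5, 0.0), (3.0, 3.0)]
n_first = count_in_ball((0.0, 0.0), prototypes, eps=1.0)   # first region
n_second = count_in_ball((3.0, 3.0), prototypes, eps=1.0)  # second region
print(n_first, n_second)
```

Comparing the two counts gives one simple way to contrast regions; whether such a comparison matches any particular measure of impact is not asserted here.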
Regarding amended Claim 4, 
Bien in view of Parades teaches
(Currently Amended) The computer-implemented method as recited in claim 1, wherein determining the impact of the features within the machine-learning model comprises: 
determining, for the data point in the label space, a plurality of prototypes comprising the two adjacent prototypes to the data point within the label space (Examiner’s note: Under its broadest reasonable interpretation, this limitation broadly recites that the plurality of prototypes includes the two adjacent prototypes used for generating a gradient as recited in independent Claim 1. As indicated earlier, Bien teaches augmenting 𝒵 to include class label points, where these class label points correspond to “a plurality of outputs to a label space”, and Bien further teaches using the prototype vector machine/set cover integer program to determine distances and identify a set of nearest prototypes using these class label data points and associated distances (Bien p.1 last paragraph-p.2 4th paragraph (Section 1. Introduction); p.9 Section 4.2. Prototypes not on training points, 2nd paragraph; p.3 1st paragraph Section 1.1. The set cover integer program; pp.3-5 Section 2. The prototype vector machine and Section 2.1 PVM as an integer program). As indicated earlier, Parades teaches calculating gradients based on distances between data points and their corresponding identified nearest neighbor prototypes using an LPD algorithm (Parades p.181 col.1 last paragraph-col.2 2nd paragraph Section 2. Approach; p.181 col.2-p.182 col.2 (Section 2.1 Learning the prototypes and their weights); and p.181 Figure 1 Algorithm LPD), where the selected nearest neighbor prototypes are chosen from a group of same-class nearest neighbors and a group of different-class nearest neighbors to produce a reduced set of prototypes with minimized error estimations that are sufficiently close to decision boundaries (Parades p.183 col.1 2nd paragraph and 5th paragraph). 
Hence the combination of Bien (determining distances and identifying nearest prototypes on a set of class label points) and Parades (calculating gradients using distances between data points and their corresponding nearest neighbor prototypes) corresponds to an application of the method described in the preceding claim limitations, where “a gradient of the plurality of gradients is based on two adjacent prototypes to a data point in the label space”. Parades further teaches the computed gradients are based on the selection of these prototypes from these two groups of nearest neighbors (Parades p.182 col.1 3rd paragraph-col.2 1st paragraph, including equations (7) and (8)), thus corresponding to the “determining … a plurality of prototypes comprising the two adjacent prototypes …” aspect of the limitation.); 
analyzing the plurality of prototypes to determine a mean and a variance of the plurality of prototypes (Examiner’s note: As indicated earlier, Parades teaches the learning prototypes and distances method (LPD) involving calculations of gradients based on applying a gradient descent procedure on an estimate of the nearest-neighbor error (Parades p.181 col.1 Section 2. Approach and p.181 col.2 Section 2.1 Learning the prototypes and their weights), shown in Parades p.181 equation (2) (where the nearest-neighbor error is based on a weighted distance of data points x to a prototype y, Parades p.181 equation (1), where the weight w represents a weight associated with a feature j for each prototype, and where the nearest-neighbor error is expressed as a ratio of a set of prototypes represented by same-class and different-class nearest neighbors of x, Parades p.182 equation (5), resulting in these data points x and the same-class and different-class nearest-neighbors of y). As indicated earlier, Parades further teaches approximating the nearest-neighbor error and performing a differentiation to obtain derivatives in order to generate corresponding gradient descent update equations and the LPD algorithm shown in Parades p.182 Figure 1, where each prototype x in T is visited and is updated based on the positions and weights associated with the same-class and different-class nearest neighbors of x, eventually resulting in a reduced set of prototypes containing weighted data points (associated with a corresponding feature) that are close to decision boundaries around the given minimum error estimation, with each data point within the reduced set of prototypes reflecting an importance based on relative distances/proximities to same-class or different-class nearest neighbors (Parades p.183 col.1 2nd paragraph-last paragraph). Parades teaches that the LPD algorithm requires the application of learning step factors $\mu_{ij}$ and $v_{ij}$ that need to be learned as part of analyzing the set of prototypes in order to update/converge the set of prototypes to an optimal form, where these factors $\mu_{ij}$ and $v_{ij}$ correspond to “a mean and variance” for given data point i and a feature j (Parades p.182 col.2 last paragraph: “Two sets of learning step factors, $\mu_{ij}$, $v_{ij}$, are needed by this algorithm. They can take just a fixed value for all i, j or may depend on i, j following simple rules; for instance, $\mu_{ij}$ may be inversely proportional to the variance of each feature j. In addition, for smoother (but slower) convergence, these values may be decreased along the successive iterations of the LPD while loop. Large values of $v_{ij}$ give more importance to the learning of the prototypes themselves while large values of $\mu_{ij}$ emphasize the learning of the distance associated to these prototypes.”) and that this LPD algorithm corresponds to similar heuristics found in LVQ algorithms (Parades p.183 col.1 3rd paragraph) as well as being analogous to similar well-known EM estimations of Gaussian mixtures involving mean and covariance matrices (Parades p.181 col.1 2nd paragraph), and as such, makes this LPD algorithm correspond to a process for “analyzing the plurality of prototypes to determine a mean and a variance of the plurality of prototypes”.); and 
determining, based on the mean and the variance of the plurality of prototypes in the label space, the impact of the features of the plurality of data points within the machine-learning model (Examiner’s note: As indicated earlier, Parades teaches the LPD method involving calculating and applying derivatives representing the gradient descent update equations and the LPD algorithm as shown in Parades p.182 Figure 1, where each prototype x in T is visited and is updated based on the positions and weights associated with the same-class and different-class nearest neighbors of x, eventually resulting in a reduced set of prototypes containing weighted data points (associated with a corresponding feature) that are close to decision boundaries around the given minimum error estimation, with each data point within the reduced set of prototypes reflecting an importance based on relative distances/proximities to same-class or different-class nearest neighbors and the application of learning step factors $\mu_{ij}$ and $v_{ij}$ that need to be learned as part of analyzing the set of prototypes in order to update/converge the set of prototypes to an optimal form, where these factors $\mu_{ij}$ and $v_{ij}$ correspond to “a mean and variance” for given data point i and a feature j, where this importance (representing an aspect of “determining an impact of the features within the machine-learning model”) is based on these learning factors (Parades p.182 col.2 last paragraph: “Two sets of learning step factors, $\mu_{ij}$, $v_{ij}$, are needed by this algorithm. They can take just a fixed value for all i, j or may depend on i, j following simple rules; for instance, $\mu_{ij}$ may be inversely proportional to the variance of each feature j. In addition, for smoother (but slower) convergence, these values may be decreased along the successive iterations of the LPD while loop. Large values of $v_{ij}$ give more importance to the learning of the prototypes themselves while large values of $\mu_{ij}$ emphasize the learning of the distance associated to these prototypes.”; and p.183 col.1 2nd paragraph (Section 2.1 Learning the prototypes and their weights): “The effects of the update equations in the LPD algorithm are intuitively clear. For each training vector x, its same-class NN, $y_i = y_x^{=}$, is moved towards x, while its different-class NN, $y_k = y_x^{\neq}$, is moved away from x. Similarly, the feature-dependent weights associated with $y_x^{=}$ are modified so as to make it appear closer to x in a feature-dependent manner, while those of $y_x^{\neq}$ are modified so that it will similarly appear farther from x. Since these update steps are weighted by the distance ratio, r(x), their importance depends upon the relative proximity of x to $y_x^{=}$ or $y_x^{\neq}$. This is further divided by the corresponding squared distance, thereby reducing the update importance for large distances. Finally, the resulting steps are windowed by the derivative of the sigmoid function applied to the distance ratio, r(x). This way, only those prototypes (and their weights) which are sufficiently close to the decision boundaries are actually updated.”).).
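For illustration only, the prototype position update described in the quoted Parades passage (same-class nearest neighbor moved towards x, different-class nearest neighbor moved away, with the step scaled by the distance ratio r(x), divided by the squared distance, and windowed by the sigmoid derivative) may be sketched as follows. This is a minimal sketch under stated assumptions and is not the LPD algorithm of Parades Figure 1: the sigmoid form, the parameter `beta`, the single scalar `step`, and all function names are hypothetical, and the companion weight updates are omitted.

```python
import math

def sigmoid_derivative(r, beta):
    # Assumed sigmoid S(r) = 1 / (1 + exp(beta * (1 - r))); returns dS/dr.
    s = 1.0 / (1.0 + math.exp(beta * (1.0 - r)))
    return beta * s * (1.0 - s)

def weighted_dist(x, y, w):
    # Feature-weighted Euclidean distance between point x and prototype y.
    return math.sqrt(sum((wj * (yj - xj)) ** 2 for xj, yj, wj in zip(x, y, w)))

def position_step(x, y_same, w_same, y_diff, w_diff, step=0.1, beta=10.0):
    """One illustrative update of the two nearest prototypes of x.

    Assumes both distances are nonzero. Returns the moved same-class
    prototype, the moved different-class prototype, and the ratio r(x).
    """
    d_same = weighted_dist(x, y_same, w_same)
    d_diff = weighted_dist(x, y_diff, w_diff)
    r = d_same / d_diff                      # distance ratio r(x)
    g = sigmoid_derivative(r, beta) * r      # windowing term times ratio
    new_same = [yj - step * g / d_same**2 * wj**2 * (yj - xj)   # towards x
                for xj, yj, wj in zip(x, y_same, w_same)]
    new_diff = [yj + step * g / d_diff**2 * wj**2 * (yj - xj)   # away from x
                for xj, yj, wj in zip(x, y_diff, w_diff)]
    return new_same, new_diff, r

# Hypothetical one-dimensional-motion example in a 2-D feature space.
x = [0.0, 0.0]
new_same, new_diff, r = position_step(x, [1.0, 0.0], [1.0, 1.0],
                                      [2.0, 0.0], [1.0, 1.0])
print(new_same, new_diff, r)
```

In this hypothetical example the same-class prototype ends slightly closer to x and the different-class prototype slightly farther, which is the qualitative behavior the quoted passage describes.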
Regarding amended Claim 14, 
Bien teaches
(Currently Amended) In a digital medium environment for machine-learning interpretation, a system for prototype selection and analysis to determine feature sensitivity comprising: 
at least one processor (Examiner’s note: Bien teaches a prototype selection method involving a prototype vector machine (PVM), where the PVM is used to perform analysis on several datasets and comparisons with other prototype methods using various R packages (Bien pp.10-17 Section 6. Examples on simulated and real data), where these R packages are code modules running on a computer (where a computer contains a processor and non-transitory memory), thus corresponding to the “at least one processor” aspect of the claim limitation.); and 
a non-transitory computer memory (Examiner’s note: Bien teaches a prototype selection method involving a prototype vector machine (PVM), where the PVM is used to perform analysis on several datasets and comparisons with other prototype methods using various R packages (Bien pp.10-17 Section 6. Examples on simulated and real data), where these R packages are code modules running on a computer (where a computer contains a processor and non-transitory memory), thus corresponding to the “a non-transitory computer memory” aspect of the claim limitation.) comprising: 
a plurality of data points used to generate a plurality of outputs via a machine-learning model (Examiner’s note: Bien teaches a set of training points 𝒳={$x_1$, …, $x_n$} and 𝒵={$z_1$, …, $z_m$} (“plurality of data points”) where the data points are a subset of $\mathbb{R}^p$, with associated class labels                         
                            
                                
                                    y
                                
                                
                                    1
                                
                            
                             
                        
                    , …,                         
                            
                                
                                    y
                                
                                
                                    n
                                
                            
                        
                     (“plurality of outputs”) where the data points are elements in the set {1, …, L}, and where the identification of the set of training points (where these training points are used in a machine-learning model to produce the outputs) and associated class labels corresponds to “a plurality of data points used to generate a plurality of outputs …” (Bien p.1 last paragraph-p.2 4th paragraph (Section 1. Introduction): “Suppose we are given a set of training set of points 𝒳={                        
                            
                                
                                    x
                                
                                
                                    1
                                
                            
                             
                        
                    , …,                         
                            
                                
                                    x
                                
                                
                                    n
                                
                            
                        
                    } ⊂                         
                            
                                
                                    R
                                
                                
                                    p
                                
                            
                        
                     with corresponding class labels                         
                            
                                
                                    y
                                
                                
                                    1
                                
                            
                             
                        
                    , …,                         
                            
                                
                                    y
                                
                                
                                    n
                                
                            
                        
                     ∈ {1, … L} and in addition, a set of unlabeled points 𝒵={                        
                            
                                
                                    z
                                
                                
                                    1
                                
                            
                             
                        
                    , …,                         
                            
                                
                                    z
                                
                                
                                    m
                                
                            
                        
                    } ⊂                         
                            
                                
                                    R
                                
                                
                                    p
                                
                            
                        
                    . Our goal is to choose a relatively small set of prototypes                         
                            
                                
                                    P
                                
                                
                                    l
                                
                            
                        
                    ⊆ 𝒵 for each class l in such a way that the collection                         
                            
                                
                                    P
                                
                                
                                    1
                                
                            
                        
                    , …,                         
                            
                                
                                    P
                                
                                
                                    L
                                
                            
                        
                    represents a summary or distillation of the training set (i.e., someone given only                         
                            
                                
                                    P
                                
                                
                                    1
                                
                            
                        
                    , …,                         
                            
                                
                                    P
                                
                                
                                    L
                                
                            
                        
                     would have a good sense of the original training data, 𝒳 and y) … In this paper, we introduce the prototype vector machine (PVM), which describes a particular choice for the sets                         
                            
                                
                                    P
                                
                                
                                    1
                                
                            
                        
                    , …,                         
                            
                                
                                    P
                                
                                
                                    L
                                
                            
                        
                    .”). Bien further teaches that this method involving the prototype vector machine (PVM) is used to perform analysis on several datasets and comparisons with other prototype methods using various R packages and machine learning datasets (Bien pp.10-17 Section 6. Examples on simulated and real data, Section 6.4. UCI data sets), where these R packages are code modules running on a computer (where a computer contains a processor and non-transitory memory) processing the inputs and outputs associated with the machine learning datasets, thus corresponding to the “via a machine-learning model” aspect of the claim limitation.), and 
instructions that, when executed by the at least one processor (Examiner’s note: As indicated earlier, Bien further teaches that this method involving the prototype vector machine (PVM) is used to perform analysis on several datasets and comparisons with other prototype methods using various R packages and machine learning datasets (Bien pp.10-17 Section 6. Examples on simulated and real data, Section 6.4. UCI data sets), where these R packages are code modules running on a computer (where a computer contains a processor and non-transitory memory), thus corresponding to “instructions that, when executed by the at least one processor”.), cause the system to: 
map features of the plurality of data points to a feature space and the plurality of outputs to a label space (This claim limitation is similar in scope to a corresponding claim limitation in Claim 1, and hence is rejected under similar rationale.);
identify a set of prototypes by: 
determining a subset of data points within a threshold distance relative to a first data point of the plurality of data points within the feature space (Examiner’s note: This claim limitation of “determining a subset of data points within a threshold distance relative to a first data point of the plurality of data points within the feature space” is similar in scope to the combined scope of two claim limitations recited in independent Claim 1: “determining, by the at least one processor, distances between the plurality of data points based on the distances between the plurality of data points in the feature space and label space; and determining, by the at least one processor, a set of prototypes from the plurality of data points based on the distances between the plurality of data points in the feature space and the label space”, where the end result is a set of prototypes (“a subset of data points”) that are within an epsilon ball of radius 𝛜 (where this radius 𝛜 corresponds to “a threshold distance relative to a first data point in the plurality of data points within the feature space”), and hence is rejected under similar rationale identified by those two claim limitations recited in independent Claim 1.);
adding the first data point to the set of prototypes based on distances between the first data point and the subset of data points in the label space (Examiner’s note: Bien teaches C_l(j) representing a cost based on the distances B_ϵ and a number of prototypes λ (corresponding to “determine a cost … based on the distances between the first data point and the subset of data points … and a total number of prototypes …”) (Bien p.5 2nd paragraph-p.6 2nd paragraph (Section 2.1 PVM as an integer program): “λ ≥ 0 is a parameter specifying the cost of adding a prototype. Its effect is to control the number of prototypes chosen … We generally choose λ = 1/n …where C_l(j) is the cost of adding z_j to P_l … C_l(j) = λ + |B_ϵ(z_j) ∩ (X \ X_l)|.”). As indicated earlier, the set of prototypes can be established in both feature space and label space based on the data points included in Z (Bien p.9 Section 4.2 Prototypes not on training points, 2nd paragraph). Bien further teaches approximating the solution to the set cover integer program using a greedy algorithm by iteratively (see Bien p.8 algorithm, line 2 while loop) adding data points from 𝒵 (corresponding to a set containing “the first data point”) represented by a feature-space/label-space pair (z_j, l) that have the least ratio of cost to number of points newly covered (Bien p.8 equations for ∆ξ, ∆η, and ∆Obj), where in line 2 the data point z* is added into the set of prototypes which includes the first data point as a prototype (corresponding to “adding the first data point to the set of prototypes based on distances between the first data point and the subset of data points in the label space”) (Bien pp.7-8 Section 3.2 A greedy approach: “At each step, we add the prototype that has the least ratio of cost to number of points newly covered. … At each step we find the z_j ∈ 𝒵 and class l for which adding z_j to P_l most decreases the objective function. That is, we find the (z_j, l) pair with the best tradeoff of covering previously uncovered training points of class l while avoiding covering points of other classes.”).);
determine, using the set of prototypes, an impact of the features within the machine-learning model (This claim limitation is similar in scope to a corresponding claim limitation in Claim 1, and hence is rejected under similar rationale.) …  
While Bien teaches using a prototype vector machine to determine an impact of features, as well as suggesting other related adaptive prototype methods such as learning vector quantization (LVQ) involving gradients (Bien p.10 3rd paragraph), Bien does not explicitly teach 
determine … by generating a plurality of gradients for the plurality of data points and corresponding adjacent prototypes of the set of prototypes … 
… wherein a gradient of the plurality of gradients is based on two adjacent prototypes to a data point in the label space.
	Parades teaches
determine … by generating a plurality of gradients for the plurality of data points and corresponding adjacent prototypes of the set of prototypes (This claim limitation is similar in scope to a corresponding claim limitation in Claim 1, and hence is rejected under similar rationale.) … 
… wherein a gradient of the plurality of gradients is based on two adjacent prototypes to a data point in the label space (This claim limitation is similar in scope to a corresponding claim limitation in Claim 1, and hence is rejected under similar rationale.).
	Both Bien and Parades are analogous art since they both teach prototype selection and classification based on nearest neighbor analysis.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the prototype vector machine taught in Bien and enhance it to incorporate the gradient descent technique demonstrated in the LPD algorithm taught in Parades as a way to generate a plurality of gradients using a plurality of data points and corresponding adjacent prototypes of the set of prototypes to determine an impact of features. The motivation to combine is taught in Parades, as provided in the prior art claim mapping of Claim 1.
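For illustration only, the greedy prototype selection that the mapping of Claim 14 attributes to Bien can be sketched as follows. This is a minimal sketch built solely from the passages quoted above (the cost C_l(j) = λ + |B_ϵ(z_j) ∩ (X \ X_l)| and the "best tradeoff" selection rule); it is not Bien's R implementation. The function names, the Euclidean metric, the closed ball (distance ≤ ϵ), the gain formula (newly covered points minus cost), and the stopping rule (stop when no candidate has positive gain) are all assumptions made for the example.

```python
import math


def in_ball(center, point, eps):
    """True if `point` is within distance eps of `center` (Euclidean, closed ball: assumptions)."""
    return math.dist(center, point) <= eps


def cost(j, label, Z, X, y, eps, lam):
    """C_l(j): lam plus the number of other-class training points inside the ball around Z[j]."""
    wrongly_covered = sum(
        1 for xi, yi in zip(X, y) if yi != label and in_ball(Z[j], xi, eps)
    )
    return lam + wrongly_covered


def greedy_prototypes(X, y, Z, eps, lam):
    """Greedily pick (z_j, l) pairs; returns {label: [indices into Z]}."""
    labels = sorted(set(y))
    prototypes = {l: [] for l in labels}
    covered = [False] * len(X)  # covered[i]: X[i] lies in the ball of a same-class prototype
    while True:
        best = None  # (gain, j, label, newly covered indices)
        for l in labels:
            for j in range(len(Z)):
                if j in prototypes[l]:
                    continue
                newly = [
                    i for i, (xi, yi) in enumerate(zip(X, y))
                    if yi == l and not covered[i] and in_ball(Z[j], xi, eps)
                ]
                # Assumed gain: points newly covered minus the cost of adding z_j to P_l.
                gain = len(newly) - cost(j, l, Z, X, y, eps, lam)
                if best is None or gain > best[0]:
                    best = (gain, j, l, newly)
        if best is None or best[0] <= 0:
            return prototypes  # assumed stopping rule: no remaining candidate helps
        _, j, l, newly = best
        prototypes[l].append(j)
        for i in newly:
            covered[i] = True
```

With two well-separated classes of two points each, candidates Z equal to the training points, ϵ = 1.0, and λ = 1/n = 0.25, the sketch selects one prototype per class, since a second prototype in either class would cover no new points and still incur the cost λ.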
Regarding original Claim 16, 
Claim 16 recites the system as recited in claim 14, wherein the instructions that cause the system to determine the impact of the features within the machine learning model cause the system to perform operations comprising claim limitations that are similar in scope to corresponding claim limitations in Claim 3, and hence Claim 16 is rejected under the similar rationale provided by Bien in view of Parades as indicated for Claim 3, in view of the rejection of Claim 14.
Regarding amended Claim 17, 
Claim 17 recites the system as recited in claim 14, wherein the instructions that cause the system to determine the impact of the features within the machine learning model cause the system to perform operations comprising claim limitations that are similar in scope to corresponding claim limitations in Claim 4, and hence Claim 17 is rejected under the similar rationale provided by Bien in view of Parades as indicated for Claim 4, in view of the rejection of Claim 14.
Regarding amended Claim 18, 
Bien in view of Parades teaches
(Currently Amended) The system as recited in claim 17, wherein the instructions that cause the system to determine the impact of the features within the machine-learning model cause the system to: 
determine a bias of the data point by determining a distance between the data point and the mean in the label space (Examiner’s note: According to Applicant’s specification paragraph [0068], the term “bias” indicates a variance (computed as a difference or distance between a data point represented by g(x) and a selected data point (i.e., the center of the set of prototypes)), and a mean squared bias determines an optimization for selecting an epsilon ball size (i.e., a set of prototypes with radius ϵ). Hence, in the context of the claims, the bias (or distance between a data point and a selected data point in a set of prototypes) is interpreted as a distance metric that is used to perform optimization of the set of prototypes centered around an epsilon ball size with radius ϵ (where the space around the center of a set of prototypes represents “a mean in the … space”). As indicated earlier, Parades teaches approximating the nearest-neighbor error and performing a differentiation to obtain derivatives in order to generate corresponding gradient descent update equations and the LPD algorithm shown in Parades p.182 Figure 1. As shown in Parades p.182 Figure 1, a conditional check involving a distance metric and a small constant ε (|λ’-λ| > ε) is used as a terminating condition that decides whether to continue with the LPD optimization or to terminate the optimization (where the result of the termination will yield an optimal set of prototypes), with the value λ updated during each while iteration towards identifying a smaller space dictated by the nearest neighbor error estimate J(P,W) (Parades p.182 equation (4), which corresponds to “a mean in the … space”). Hence, this distance metric |λ’-λ| corresponds to a step in which to “determine a bias of the selected data point by determining a distance between the selected data point and the mean in the … space”. 
When combined with the teachings of Bien, indicating that the set of data points can also include data points in the label space (Bien p.9 Section 4.2 Prototypes not on training points, 2nd paragraph), both Bien and Parades teach the limitations identified in this claim, which correspond to a process that “determine a bias of the selected data point by determining a distance between the selected data point and the mean in the label space”.); and 
determine the impact of the features of the plurality of data points based on the bias of the data point (Examiner’s note: As indicated earlier, Parades teaches the LPD algorithm as shown in Parades p.182 Figure 1, where each prototype x in T is visited and updated based on the positions and weights associated with the same-class and different-class nearest neighbors of x, eventually resulting in a reduced set of prototypes containing weighted data points (associated with a corresponding feature) that are close to decision boundaries around the given minimum error estimation, with each data point within the reduced set of prototypes reflecting an importance based on relative distances/proximities to same-class or different-class nearest neighbors (Parades p.183 col.1 2nd paragraph-last paragraph), where this importance (representing an aspect of “determining an impact of the features within the machine-learning model”) varies based on the relative proximity of the data x to y_x^= or y_x^≠ according to the effects of the gradient update equations (where the gradient update equations correspond to a mechanism that produces “locally sensitive directions of the features”). As shown in Parades p.182 Figure 1, a conditional check involving a distance metric and a small constant ε (|λ’-λ| > ε) is used as a terminating condition that decides whether to continue with the LPD optimization or to terminate the optimization (where the result of the termination will yield an optimal set of prototypes), with the value λ updated during each while iteration towards identifying a smaller space dictated by the nearest neighbor error estimate J(P,W) (Parades p.182 equation (4), which corresponds to “a mean in the … space”). This distance metric |λ’-λ| corresponds to a step in which to “determine a bias of the selected data point by determining a distance between the selected data point and the mean in the … space”. When combined with the teachings of Bien, indicating that the set of data points can also include data points in the label space (Bien p.9 Section 4.2 Prototypes not on training points, 2nd paragraph), both Bien and Parades teach the limitations identified in this claim, which correspond to a process that “determine a bias of the selected data point by determining a distance between the selected data point and the mean in the label space”. Hence, the usage of the LPD algorithm to determine a set of prototypes, where the set of prototypes contains data points reflecting an importance based on relative distances to same-class and different-class nearest neighbors corresponds to a process for “determining rank orders for the features of the plurality of data points according to locally sensitive directions of the features based on the plurality of gradients” (Parades p.183 col.1 2nd paragraph (Section 2.1 Learning the prototypes and their weights): “The effects of the update equations in the LPD algorithm are intuitively clear. For each training vector x, its same-class NN, y_i = y_x^=, is moved towards x, while its different-class NN, y_k = y_x^≠, is moved away from x. Similarly, the feature-dependent weights associated with y_x^= are modified so as to make it appear closer to x in a feature-dependent manner, while those of y_x^≠ are modified so that it will similarly appear farther from x. Since these update steps are weighted by the distance ratio, r(x), their importance depends upon the relative proximity of x to y_x^= or y_x^≠. This is further divided by the corresponding squared distance, thereby reducing the update importance for large distances. Finally, the resulting steps are windowed by the derivative of the sigmoid function applied to the distance ratio, r(x). This way, only those prototypes (and their weights) which are sufficiently close to the decision boundaries are actually updated.”), where the determination of importances corresponds to an aspect of “determining … the impact of the features of the plurality of data points”, where this determination is based on an LPD algorithm that determines a bias distance (corresponding to “based on the bias of the selected data point”).).  
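For illustration only, the qualitative behavior described in the Parades passage quoted above (same-class nearest neighbor pulled towards x, different-class nearest neighbor pushed away, steps weighted by the distance ratio r(x), divided by the squared distance, and windowed by the sigmoid derivative) can be sketched for a single training vector. This is a simplified sketch and not Parades' update equations: it uses a plain Euclidean distance and omits the feature-dependent weights, and the sigmoid form (centred at r = 1), the slope `beta`, the step size `mu`, and the function names are assumptions. It also assumes x coincides with neither neighbor.

```python
import math


def sigmoid_derivative(r, beta):
    """Derivative of S(r) = 1 / (1 + exp(beta * (1 - r))), an assumed sigmoid centred at r = 1."""
    s = 1.0 / (1.0 + math.exp(beta * (1.0 - r)))
    return beta * s * (1.0 - s)


def lpd_like_step(x, y_same, y_diff, mu=0.01, beta=10.0):
    """One simplified update: returns the moved same-class and different-class neighbors."""
    d_same = math.dist(x, y_same)
    d_diff = math.dist(x, y_diff)
    r = d_same / d_diff                      # distance ratio r(x)
    window = sigmoid_derivative(r, beta)     # largest near r = 1, i.e. near a decision boundary
    c_same = mu * window * r / d_same ** 2   # weighted by r(x), divided by squared distance
    c_diff = mu * window * r / d_diff ** 2
    # Same-class neighbor moves towards x; different-class neighbor moves away from x.
    new_same = tuple(p + c_same * (xi - p) for xi, p in zip(x, y_same))
    new_diff = tuple(p - c_diff * (xi - p) for xi, p in zip(x, y_diff))
    return new_same, new_diff
```

When x is equidistant from both neighbors (r = 1) the window is at its maximum and both neighbors move noticeably; when x is far from the boundary (r close to 0 or very large) the window is close to zero and the neighbors are left essentially unchanged, which matches the statement that only prototypes sufficiently close to the decision boundaries are actually updated.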
Regarding original Claim 19, 
Bien in view of Parades teaches
(Original) The system as recited in claim 14, wherein the instructions that cause the system to identify the set of prototypes cause the system to: 
determine a cost based on the distances between the first data point and the subset of data points in the label space and a total number of prototypes (Examiner’s note: Bien teaches C_l(j) representing a cost based on the distances B_ϵ and a number of prototypes λ (corresponding to “determine a cost … based on the distances between the first data point and the subset of data points … and a total number of prototypes …”) (Bien p.5 2nd paragraph-p.6 2nd paragraph (Section 2.1 PVM as an integer program): “λ ≥ 0 is a parameter specifying the cost of adding a prototype. Its effect is to control the number of prototypes chosen … We generally choose λ = 1/n … where C_l(j) is the cost of adding z_j to P_l … C_l(j) = λ + |B_ϵ(z_j) ∩ (X \ X_l)|.”). As indicated earlier, the set of prototypes can be established in both feature space and label space based on the data points included in Z (Bien p.9 Section 4.2 Prototypes not on training points, 2nd paragraph). Bien further teaches approximating the solution to the set cover integer program using a greedy algorithm by iteratively (see Bien p.8 algorithm, line 2 while loop) adding data points from 𝒵 (corresponding to a set containing “the first data point”) represented by a feature-space/label-space pair (z_j, l) that have the least ratio of cost to number of points newly covered (Bien p.8 equations for ∆ξ, ∆η, and ∆Obj), where in line 2 the data point z* is added into the set of prototypes which includes the first data point as a prototype (corresponding to “… between the first data point and the subset of data points …”) (Bien pp.7-8 Section 3.2 A greedy approach: “At each step, we add the prototype that has the least ratio of cost to number of points newly covered. … At each step we find the z_j ∈ 𝒵 and class l for which adding z_j to P_l most decreases the objective function. That is, we find the (z_j, l) pair with the best tradeoff of covering previously uncovered training points of class l while avoiding covering points of other classes.”).); and 
add the first data point to the set of prototypes based on the cost (Examiner’s note: Bien teaches C_l(j) representing a cost based on the distances B_ϵ and a number of prototypes λ (Bien p.5 2nd paragraph-p.6 2nd paragraph (Section 2.1 PVM as an integer program): “λ ≥ 0 is a parameter specifying the cost of adding a prototype. Its effect is to control the number of prototypes chosen … We generally choose λ = 1/n … where C_l(j) is the cost of adding z_j to P_l … C_l(j) = λ + |B_ϵ(z_j) ∩ (X \ X_l)|.”). Bien further teaches approximating the solution to the set cover integer program using a greedy algorithm by iteratively (see Bien p.8 algorithm, line 2 while loop) adding data points from 𝒵 (corresponding to “a first data point of the plurality of data points”) represented by a feature-space/label-space pair (z_j, l) that have the least ratio of cost to number of points newly covered (Bien p.8 equations for ∆ξ, ∆η, and ∆Obj), where in Bien p.8 algorithm line 2 the data point z* is added into the set of prototypes, thus corresponding to “add the first data point to the set of prototypes based on the cost” (Bien pp.7-8 Section 3.2 A greedy approach: “At each step, we add the prototype that has the least ratio of cost to number of points newly covered. … At each step we find the z_j ∈ 𝒵 and class l for which adding z_j to P_l most decreases the objective function. That is, we find the (z_j, l) pair with the best tradeoff of covering previously uncovered training points of class l while avoiding covering points of other classes.”).).  
Regarding original Claim 20, 
Bien in view of Parades teaches
(Original) The system as recited in claim 19, wherein the instructions that cause the system to identify the set of prototypes cause the system to: 
identify, for a second data point of the plurality of data points, a second subset of data points within the threshold distance relative to the second data point within the feature space (Examiner’s note: This claim limitation is similar in scope to a corresponding claim limitation from independent Claim 14: “identify a set of prototypes by: determining a subset of data points within a threshold distance relative to a first data point of the plurality of data points within the feature space”, where the end result is a set of prototypes (“a second subset of data points”) that are within an epsilon ball of radius 𝛜 (where this radius 𝛜 corresponds to “a threshold distance relative to the second data point within the feature space”), such that this claim as a whole is directed towards a mere iterative process of identifying a separate set of prototypes (a second subset of data points representing a second set of prototypes), and hence is rejected under similar rationale as indicated in independent Claim 14.); 
determine distances between the second data point and the second subset of data points within the label space (Examiner’s note: This claim limitation is similar in scope to a corresponding claim limitation from independent Claim 14: “identify a set of prototypes by: determining a subset of data points within a threshold distance relative to a first data point of the plurality of data points within the feature space”, which is of the same scope as the combined claim limitations recited in independent Claim 1: “determining, by the at least one processor, distances between the plurality of data points based on the distances between the plurality of data points in the feature space and label space; and determining, by the at least one processor, a set of prototypes from the plurality of data points based on the distances between the plurality of data points in the feature space and the label space”, where the end result is a set of prototypes (“a second subset of data points”) that are within an epsilon ball of radius 𝛜 (where this radius 𝛜 corresponds to “a distance between the second data point and the second subset of data points within the label space”), such that this claim as a whole is directed towards a mere iterative process of identifying a separate set of prototypes (a second subset of data points representing a second set of prototypes), and hence is rejected under similar rationale as indicated in independent Claim 14 and the combined claim limitations recited in independent Claim 1.); and 
determine a cost associated with adding the second data point to the set of prototypes based on the distances between the second data point and the second subset of data points and the total number of prototypes including the first data point as a prototype within the set of prototypes (Examiner’s note: This claim limitation is similar in scope to a combined scope of the corresponding claim limitations from dependent Claim 19: “determine a cost based on the distances between the first data point and the subset of data points in the label space and a total number of prototypes; and add the first data point to the set of prototypes based on the cost”, such that this claim as a whole is directed towards a mere iterative process of running the same greedy algorithm based on a cost function to identify and add more prototypes to an existing set of prototypes (a second data point with a second subset of data points within a distance ϵ), where this set of prototypes includes a prototype that was already added (a first data point as a prototype) in addition to a new prototype currently being added as part of this iteration (a second data point), and hence is rejected under similar rationale identified by those two claim limitations recited in dependent Claim 19.).  

Allowable Subject Matter



Claims 5-13 are identified as allowable over the prior art. The following is a statement of reasons for the indication of allowable subject matter.
Independent claim 5 recites the following claim limitation: 
… generating a plurality of gradients for the plurality of data points, the plurality of gradients comprising, for a selected data point of the plurality of data points, a gradient based on a first adjacent prototype with a first model prediction lower than a model prediction of the selected data point and a second adjacent prototype with a second model prediction higher than the model prediction of the selected data point; …
While the prior art teaches generating a plurality of gradients based on the plurality of data points and corresponding adjacent prototypes of the set of prototypes, the prior art does not teach generating a plurality of gradients based on two data points that are adjacent below and adjacent above a selected data point within the set of prototypes, such that it is possible that these two data points are not necessarily adjacent to each other, but only adjacent to a selected data point (i.e., a data point centered within the set of prototypes defined by a radius ϵ), thereby making this claim allowable over the prior art.
Claims 6-13 are dependent claims based on independent Claim 5, therefore these dependent claims are also allowable over the prior art.
Claims 2 and 15 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims. Dependent Claims 2 and 15 both contain aspects of similar allowable subject matter recited in independent Claim 5, since these dependent claims both recite limitations directed to the selection of a first adjacent prototype lower than (i.e., below) a data point, the selection of a second adjacent prototype higher than (i.e., above) a data point, and the generation of a gradient using these first and second adjacent prototypes.
As indicated in the earlier Final Office Action mailed December 13, 2021, in an effort to advance prosecution toward allowability, Examiner (in collaboration with the examiner’s supervisor and the applicant’s attorney) had proposed examiner amendments (option 1: identifying the claim limitations from independent Claim 5 that are similar in scope to dependent Claims 2 and 15, and incorporating them into independent Claims 1 and 14 to make all independent claims allowable; or option 2: incorporating Claims 2 and 15 into Claims 1 and 14). However, both proposals were respectfully declined by the Applicant, and hence no agreement was reached.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure.
Seo et al., Soft Nearest Prototype Classification, IEEE Transactions on Neural Networks, Vol.14, No.2, March 2003, where Seo teaches a soft nearest prototype classification method, first using an LVQ algorithm to select two nearest prototypes according to a distance measurement (p.391 Section II. NPC and LVQ), and then further using these prototypes in a gradient-based optimization procedure to implement a cost function that determines a set of prototypes in which prototypes with correct labels are attracted to a data point in proportion to a distance and weight factor, and prototypes with incorrect labels are repelled in proportion to a distance and weight factor (p.392 col.2 2nd paragraph). 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to WILLIAM WAI YIN KWAN whose telephone number is 303-297-4332. The examiner can normally be reached Monday-Friday 8:00am - 4:30pm PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li B Zhen can be reached on 571-272-3768. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/WILLIAM WAI YIN KWAN/Examiner, Art Unit 2121

/Li B. Zhen/Supervisory Patent Examiner, Art Unit 2121