DETAILED ACTION
Claims 1-21 are pending and have been examined.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.


Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):

(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 9 and 10 are rejected under 35 U.S.C. 112(b) or pre-AIA 35 U.S.C. 112, second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor, or for pre-AIA the applicant, regards as the invention.

Claim 9 recites the limitation “the output quantity of interest is y=f(x)=g(h(x)), and the intermediate layer is z=h(x).” The claim does not define the functions f, g and h; thus the claim is indefinite. For examination purposes, the examiner has interpreted f(x) as any neural network f, which can be represented as a nested function.
Claim 10 recites the equation

[Equation image: media_image1.png]

The claim does not define the variables s, j, f, P, g, and h(x); thus the claim is indefinite. For examination purposes, the examiner has interpreted s as a slice, j as an element, f as a function or a network, P as a distribution, g as a function, and h(x) as a function with input x.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-21 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1:  Is the claim to a process, machine, manufacture, or composition of matter? 
Yes. Claims 1 and 11 recite a non-transitory machine-readable medium, which is an article of manufacture and therefore a statutory category of invention; claim 17 recites a system comprising processing circuitry and a memory, which is a machine and therefore a statutory category of invention; claim 21 recites a method, which is a process and therefore a statutory category of invention.

Step 2A, prong One: Does the claim recite an abstract idea, law of nature or natural phenomenon? 
Yes, claims 1, 17 and 21 recite (1) computing, for each artificial neuron in the set of intermediate artificial neurons, an influence score based on an average gradient of an output quantity of interest with respect to the artificial neuron across a plurality of inputs weighted by a probability of each input (mathematical calculations in light of spec. [0085]); and (2) providing an output associated with the computed influence scores (an evaluation or judgement), which is a mental process.

Claim 11 recites (1) computing, for each artificial neuron in the set of intermediate artificial neurons, an influence score, wherein the influence score measures an influence of the artificial neuron on an output quantity of interest for a set of inputs of the deep neural network (mathematical calculations in light of spec. [0085]); (2) identifying, from the artificial neurons in the set of intermediate artificial neurons, a first subset of artificial neurons and a second subset of artificial neurons, wherein, for each artificial neuron in the first subset, the influence score exceeds a threshold value, and wherein, for each artificial neuron in the second subset, the influence score does not exceed the threshold value (an evaluation or judgement); (3) generating a new artificial neural network comprising the first subset of artificial neurons and lacking at least a portion of the second subset of artificial neurons (a judgement); and (4) providing an output representing the new artificial neural network (an evaluation or judgement), which is a mental process.

If a claim limitation, under its broadest reasonable interpretation, covers performance in the human mind, then it falls within the mental processes grouping of abstract ideas. Accordingly, claims 1, 11, 17 and 21 recite an abstract idea.

Step 2A, prong Two: Does the claim recite additional elements that integrate the judicial exception into a practical application? 
No, the judicial exception is not integrated into a practical application. Claims 1 and 11 recite “instructions, computing machines”; claim 17 recites “processing circuitry, a memory”; and claim 21 recites “computing machines.” These elements amount to mere instructions to implement an abstract idea on a computer, or merely use a computer as a tool to perform an abstract idea. See MPEP 2106.05(f).

Claims 1, 11, 17 and 21: The limitation of “accessing a set of intermediate artificial neurons in a deep neural network, wherein the deep neural network is fully or partially trained;” represents data gathering. A step of gathering data for use in a claimed process is a pre-solution activity; therefore the accessing step is an insignificant extra-solution activity. See MPEP 2106.05(g).

Accordingly, these additional elements do not provide a meaningful limitation to transform the abstract idea into a patent eligible application of the abstract idea. Claims 1, 11, 17 and 21 as a whole, considering all additional elements both individually and in combination, are directed to an abstract idea.

Step 2B: Does the claim recite additional elements that amount to significantly more than the judicial exception? 
No, claims 1, 11, 17 and 21 do not recite additional elements that amount to an inventive concept (significantly more than the recited judicial exception).

Claims 1 and 11 recite “instructions, computing machines”; claim 17 recites “processing circuitry, a memory”; and claim 21 recites “computing machines.” These elements amount to mere instructions to implement an abstract idea on a computer, or merely use a computer as a tool to perform an abstract idea. See MPEP 2106.05(f). The accessing step is an insignificant extra-solution activity that is well-understood, routine and conventional (WURC). See MPEP 2106.05(d)(II).

Further, the following limitations are well-understood, routine and conventional (WURC):
“wherein the deep neural network is fully or partially trained;” see prior art Cheng (Cheng, p. 3325 right col. "VGG16-based [known CNN] semantic segmentation network [23], [24] are typically pretrained on the large ImageNet object classification data set"). 
Accordingly, the additional elements, considered both individually and in combination and in view of the claim as a whole, do not provide significantly more than the abstract idea. These independent claims are not patent eligible.

Dependent claims 2 and 18 recite “… wherein the influence score measures an influence of the artificial neuron on the output quantity of interest for a set of inputs of the deep neural network.” In step 2A prong One, the limitation that the score measures an influence of the neuron on the output for a set of inputs is an evaluation or a judgement, therefore is a mental process. In step 2A prong Two and step 2B, the claim does not recite additional elements that amount to integrate the exception into a practical application or provide significantly more than judicial exception.

Dependent claims 3, 15 and 19 recite “…the operations further comprising: determining, based on at least a subset of the computed influence scores, an influence-directed explanation why a given set of inputs to the deep neural network corresponds to the output quantity of interest, wherein the output associated with the computed influence scores comprises the influence-directed explanation.” In step 2A prong One, the limitation of determining an explanation based on the scores, where the output comprises the explanation, is an evaluation or a judgement, therefore is a mental process. In step 2A prong Two and step 2B, the claim does not recite additional elements that amount to integrate the exception into a practical application or provide significantly more than judicial exception.

Dependent claims 4, 16 and 20 recite “… wherein the influence-directed explanation comprises a portion of the input responsible for the output quantity of interest.” In step 2A prong One, the limitation that the explanation comprises a part of the input responsible for the output is an evaluation or a judgement, therefore is a mental process. In step 2A prong Two and step 2B, the claim does not recite additional elements that amount to integrate the exception into a practical application or provide significantly more than judicial exception.

Dependent claim 5 recites “…the operations further comprising: determining that, for the given set of inputs to the deep neural network, the output quantity of interest comprises an error; and in response to the error and based on the influence-directed explanation, adjusting the deep neural network or providing additional training data or different preprocessing steps to the deep neural network.” In step 2A prong One, the limitation of determining that the output comprises an error and, in response to the error, adjusting the neural network or providing additional training data is an evaluation or a judgement, therefore is a mental process. In step 2A prong Two and step 2B, the claim does not recite additional elements that amount to integrate the exception into a practical application or provide significantly more than judicial exception.

Dependent claim 6 recites “…the operations further comprising: identifying, from the artificial neurons in the set of intermediate artificial neurons, a first subset of artificial neurons and a second subset of artificial neurons, wherein, for each artificial neuron in the first subset, the influence score exceeds a threshold value, and wherein, for each artificial neuron in the second subset, the influence score does not exceed the threshold value; generating a new artificial neural network comprising the first subset of artificial neurons and lacking at least a portion of the second subset of artificial neurons; and providing an output representing the new artificial neural network.” In step 2A prong One, the limitation of identifying a first subset of neurons and a second subset of neurons is an evaluation or judgment, generating a new neural network comprising the first subset and lacking the second subset is a judgement, and providing an output presenting the new neural network is an evaluation or judgment, therefore is a mental process. In step 2A prong Two and step 2B, the claim does not recite additional elements that amount to integrate the exception into a practical application or provide significantly more than judicial exception.

Dependent claims 7 and 12 recite “…the operations further comprising: using the new artificial neural network for inference to solve a same problem as the deep neural network.” In step 2A prong One, the limitation of using the new neural network for inference to solve the same problem as the neural network is an evaluation or judgement, therefore is a mental process. In step 2A prong Two and step 2B, the claim does not recite additional elements that amount to integrate the exception into a practical application or provide significantly more than judicial exception.

Dependent claims 8 and 13 recite “…wherein the new artificial neural network lacks each and every artificial neuron in the second subset of artificial neurons.” In step 2A prong One, the limitation of the new neural network lacking each and every neuron in the second subset is an evaluation or judgement, therefore is a mental process. In step 2A prong Two and step 2B, the claim does not recite additional elements that amount to integrate the exception into a practical application or provide significantly more than judicial exception.

Dependent claim 9 recites “…wherein: the set of intermediate artificial neurons comprises an intermediate layer, the input is x, the output quantity of interest is y=f(x)=g(h(x)), and the intermediate layer is z=h(x).” In step 2A prong One, the limitation of y=f(x)=g(h(x)) representing the relationship between the input and the output and z=h(x) representing the relationship between the input and an intermediate layer is a mathematical relationship, therefore is a mathematical concept. In step 2A prong Two and step 2B, the claim does not recite additional elements that amount to integrate the exception into a practical application or provide significantly more than judicial exception.

Dependent claim 10 recites “…wherein computing the influence score for a given artificial neuron zj in the intermediate layer comprises computing:

[Equation image: media_image1.png]

wherein: χ is the influence score, and
P(x) is the probability of the input x.”

In step 2A prong One, the limitation of the above equation representing the influence score for a given artificial neuron zj in the intermediate layer is mathematical calculations, therefore is a mathematical concept. In step 2A prong Two and step 2B, the claim does not recite additional elements that amount to integrate the exception into a practical application or provide significantly more than judicial exception.

Dependent claim 14 recites “… wherein the influence score is computed based on an average gradient of the output quantity of interest with respect to the artificial neuron across the set of inputs weighted by a probability of each input.” In step 2A prong One, the limitation that the score is computed based on an average gradient is mathematical calculations (in light of spec. [0085]), therefore is a mathematical concept. In step 2A prong Two and step 2B, the claim does not recite additional elements that amount to integrate the exception into a practical application or provide significantly more than judicial exception.
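
As an illustrative sketch only, the phrase “average gradient ... across the set of inputs weighted by a probability of each input” can be read as a probability-weighted sum of per-input gradients. The Python below assumes a toy network split as y=g(h(x)) with a two-neuron intermediate layer; the functions `h`, `g`, `grad_g` and `influence`, the inputs, and the probabilities are all hypothetical, and the gradient is estimated by finite differences. It does not reproduce the equation of claim 10.

```python
def h(x):
    """Hypothetical lower slice: input x -> intermediate layer z (two neurons)."""
    return [2.0 * x, x * x]

def g(z):
    """Hypothetical upper slice: intermediate layer z -> output quantity y."""
    return 3.0 * z[0] + z[1] * z[1]

def grad_g(z, j, eps=1e-6):
    """Central finite-difference estimate of dg/dz_j evaluated at z."""
    up = list(z)
    up[j] += eps
    down = list(z)
    down[j] -= eps
    return (g(up) - g(down)) / (2.0 * eps)

def influence(j, inputs, probs):
    """Sum over inputs of P(x) * dg/dz_j at z = h(x)."""
    return sum(p * grad_g(h(x), j) for x, p in zip(inputs, probs))

inputs = [1.0, 2.0, 3.0]    # assumed finite set of inputs
probs = [0.5, 0.25, 0.25]   # assumed probability of each input (sums to 1)
chi = [influence(j, inputs, probs) for j in range(2)]
```

In this toy case dg/dz_0 is the constant 3 and dg/dz_1 equals 2x², so the two scores come out near 3.0 and 7.5.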


Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1-4 and 17-21 are rejected under 35 U.S.C. 103 as being unpatentable over Selvaraju ("Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization") in view of Datta ("Algorithmic Transparency via Quantitative Input Influence: Theory and Experiments with Learning Systems").

In regard to claims 1, 17 and 21, Selvaraju teaches: A non-transitory machine-readable medium storing instructions which, when executed by one or more computing machines, cause the one or more computing machines to perform operations comprising: (Selvaraju, p. 618 "Our code is available at https://github.com/ramprs/grad-cam/ along with a demo on CloudCV [2] and video at youtu.be/COjUB9Izk6E."; code and a demo inherently teach the implementation on computing machines and medium storing instructions.)
accessing a set of intermediate artificial neurons in a deep neural network, wherein the deep neural network is fully or partially trained; (Selvaraju, p. 620 "we can expect the last convolutional layers to have the best compromise between high-level semantics and detailed spatial information. The neurons in these layers [a set of intermediate artificial neurons] look for semantic class-specific information in the image (say object parts). Grad-CAM uses the gradient information flowing into the last convolutional layer of the CNN to understand the importance of each neuron for a decision of interest."; "These gradients flowing back are global-average-pooled to obtain the neuron importance weights α_kc..."; p. 622 "We evaluate the pretrained off-the-shelf VGG-16 [trained CNN] [41] model from the Caffe [19] Model Zoo."; p. 623 "We finetune an ImageNet trained VGG-16 model for the task of classifying 'doctor' vs. 'nurse'.")

[Image: media_image2.png]
computing, for each artificial neuron in the set of intermediate artificial neurons, an influence score based on an average gradient of an output quantity of interest with respect to the artificial neuron (Selvaraju, p. 620 "Grad-CAM uses the gradient information flowing into the last convolutional layer of the CNN to understand the importance of each neuron for a decision of interest... As shown in Fig. 2, in order to obtain the class-discriminative localization map Grad-CAM L_c_Grad-CAM ∈ R^(u×v) of width u and height v for any class c, we first compute the gradient of the score for class c, yc (before the softmax), with respect to feature maps Ak of a convolutional layer, i.e. ∂yc / ∂Ak. These gradients flowing back are global-average-pooled to obtain the neuron importance weights α_kc: This weight α_kc represents a partial linearization of the deep network downstream from A, and captures the ‘importance’ of feature map k for a target class c."; α_kc is an influence score based on an average gradient (∂yc) of an output quantity of interest (yc) with respect to the neuron of a convolutional layer, c is a class of interest)
… providing an output associated with the computed influence scores. (Selvaraju, p. 621 "... Figures 1c, 1f and 1i, 1l show Grad-CAM visualizations for ‘tiger cat’ and ‘boxer (dog)’ [e.g. output] respectively. Ablation studies and more Grad-CAM visualizations can be found in [38]"; p. 618 "We propose a technique for producing ‘visual explanations’ for decisions from a large class of Convolutional Neural Network (CNN)-based models, making them more transparent. Our approach – Gradient-weighted Class Activation Mapping (Grad-CAM), uses the gradients of any target concept (say logits for ‘dog’ or even a caption), flowing into the final convolutional layer to produce a coarse localization map highlighting the important regions [e.g. output] in the image for predicting the concept."; visual explanations or the localization map are the output based on the score of the gradient information cited above.)


[Image: media_image3.png]
Selvaraju does not teach, but Datta teaches: computing... an influence score ... across a plurality of inputs weighted by a probability of each input; and (Datta, p. 600 "We are given an algorithm A. A operates on inputs (also referred to as features for ML systems), N={1,…,n}. Every i∈N, can take on various states, given by Xi."; p. 601 "One intended use of QII is to provide personalized transparency reports to users of data analytics systems... The influence measure is therefore... The above probability can be interpreted as the probability that feature i [a probability of each input] is pivotal to the classification of c(x). Computing the average of this quantity over X [across a plurality of inputs] yields:"; ΣPr(x) where x ϵ X is the summation across a plurality of inputs, and it is weighted by E(c(X)), a probability of each input, feature i to class c.)

It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Selvaraju to incorporate the teachings of Datta by including Quantitative Input Influence (QII) measures. Doing so would allow the system to capture the degree of influence of inputs on outputs of systems. (Datta, p. 598 "Specifically, we introduce a family of Quantitative Input Influence (QII) measures that capture the degree of influence of inputs on outputs of systems. These measures provide a foundation for the design of transparency reports that accompany system decisions (e.g., explaining a specific credit decision) and for testing tools useful for internal and external oversight (e.g., to detect algorithmic discrimination). Distinctively, our causal QII measures carefully account for correlated inputs while measuring influence.")

Claims 17 and 21 recite substantially the same limitations as claim 1; therefore the rejection applied to claim 1 also applies to claims 17 and 21. In addition, Selvaraju teaches: (claim 17) A system comprising: processing circuitry; and a memory storing instructions which, when executed by the processing circuitry, cause the processing circuitry to perform operations comprising:

In regard to claims 2 and 18, reference is made to the rejection of claims 1 and 17 respectively, and Selvaraju teaches: wherein the influence score measures an influence of the artificial neuron on the output quantity of interest (Selvaraju, p. 620 "we first compute the gradient of the score for class c, yc (before the softmax), with respect to feature maps Ak of a convolutional layer, i.e. ∂yc / ∂Ak . These gradients flowing back are global-average-pooled to obtain the neuron importance weights α_kc:... ∂yc / ∂Ak... This weight α_kc represents a partial linearization of the deep network downstream from A, and captures the ‘importance’ of feature map k for a target class c."; α_kc is an influence score on an output quantity of interest (yc), c is a class of interest) for a set of inputs of the deep neural network. (Selvaraju, p. 621 "We built our training dataset using the top 250 relevant images [e.g. a set of inputs] (for each class) from a popular image search engine."; p. 622 "In order to measure whether Grad-CAM helps distinguish between classes we select images from VOC 2007 [e.g. a set of inputs] val set that contain exactly two annotated categories and create visualizations for each one of them.")
 
In regard to claims 3 and 19, reference is made to the rejection of claims 1 and 17 respectively, and Selvaraju teaches: the operations further comprising: determining, based on at least a subset of the computed influence scores, an influence-directed explanation why a given set of inputs to the deep neural network corresponds to the output quantity of interest, (Selvaraju, p. 618 "We propose a technique for producing ‘visual explanations’ for decisions from a large class of Convolutional Neural Network (CNN)-based models, making them more transparent. Our approach – Gradient-weighted Class Activation Mapping (Grad-CAM), uses the gradients of any target concept [based on the gradient information/scores] (say logits for ‘dog’ or even a caption), flowing into the final convolutional layer to produce a coarse localization map highlighting the important regions in the image for predicting the concept. [explanation why a given set of inputs to the deep neural network corresponds to the output quantity of interest]") wherein the output associated with the computed influence scores comprises the influence-directed explanation. (Selvaraju, p. 621 "L_c_Grad-CAM... the neuron importance weights α_kc: This weight α_kc represents a partial linearization of the deep network downstream from A, and captures the ‘importance’ of feature map k for a target class c... We apply a ReLU to the linear combination of maps because we are only interested in the features that have a positive influence [e.g. influence-directed explanation] on the class of interest, i.e. pixels whose intensity should be increased in order to increase yc. Negative pixels are likely to belong to other categories in the image."; the higher the score, the higher the importance/influence of the explanation in the localization map L_c_Grad-CAM.)
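
As an illustrative sketch only, the two steps quoted from Selvaraju (global-average-pooling the gradients into neuron importance weights, then applying a ReLU to the weighted linear combination of feature maps) can be written in plain Python. The function names `neuron_importance` and `grad_cam` and the 2×2 feature and gradient maps are hypothetical; in practice the gradients would come from backpropagation through a trained CNN rather than being supplied by hand.

```python
def neuron_importance(grad_map):
    """Global-average-pool one u x v gradient map into a single weight."""
    cells = [value for row in grad_map for value in row]
    return sum(cells) / len(cells)

def grad_cam(feature_maps, grad_maps):
    """ReLU of the importance-weighted sum of the feature maps."""
    alphas = [neuron_importance(gm) for gm in grad_maps]
    rows, cols = len(feature_maps[0]), len(feature_maps[0][0])
    return [
        [max(0.0, sum(a * fm[i][j] for a, fm in zip(alphas, feature_maps)))
         for j in range(cols)]
        for i in range(rows)
    ]

# Two hypothetical 2x2 feature maps and the gradients of the class score
# with respect to each of them (values assumed for illustration).
feature_maps = [[[1.0, 0.0], [0.0, 2.0]],
                [[0.0, 3.0], [1.0, 0.0]]]
grad_maps = [[[0.5, 0.5], [0.5, 0.5]],
             [[-1.0, -1.0], [-1.0, -1.0]]]
heatmap = grad_cam(feature_maps, grad_maps)
```

Here the second feature map receives a negative weight, so the positions it dominates are clipped to zero by the ReLU, matching the quoted statement that only features with a positive influence on the class of interest are retained.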

In regard to claims 4 and 20, reference is made to the rejection of claims 3 and 19 respectively, and Selvaraju teaches: wherein the influence-directed explanation comprises a portion of the input responsible for the output quantity of interest. (Selvaraju, p. 618 "We propose a technique for producing ‘visual explanations’ for decisions from a large class of Convolutional Neural Network (CNN)-based models, making them more transparent. Our approach – Gradient-weighted Class Activation Mapping (Grad-CAM), uses the gradients of any target concept (say logits for ‘dog’ or even a caption), flowing into the final convolutional layer to produce a coarse localization map highlighting the important regions in the image [a portion of the input] for predicting the concept [responsible for the output quantity of interest]."; p. 625 "Figure 5: Interpreting image captioning models: We use our class-discriminative localization technique, Grad-CAM to find spatial support regions [a portion of input] for captions in images [responsible for the output quantity of interest]. Fig. 5a Visual explanations from image captioning model [23] highlighting image regions considered to be important for producing the captions.")  

Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Selvaraju in view of Datta in further view of Zeiler ("Visualizing and Understanding Convolutional Networks").

In regard to claim 5, reference is made to the rejection of claim 3, and Selvaraju and Datta do not teach, but Zeiler teaches: the operations further comprising: determining that, for the given set of inputs to the deep neural network, the output quantity of interest comprises an error; and (Zeiler, p. 822 "Using the model described in Section 3, we now use the deconvnet to visualize the feature activations on the ImageNet validation set [e.g. the given set of inputs]... but the visualizations reveal that this particular feature map focuses on the grass in the background, not the foreground objects [e.g. error in the output of interest] ... 4.1 Architecture Selection... While visualization of a trained model gives insight into its operation, it can also assist with selecting good architectures in the first place. By visualizing the first and second layers of Krizhevsky et al.’s architecture (Fig. 5(a) & (c)), various problems [e.g. error] are apparent."; errors in the visualization of output quantity of interest)
in response to the error and based on the influence-directed explanation, adjusting the deep neural network or providing additional training data or different preprocessing steps to the deep neural network. (Zeiler, p. 823 "By visualizing the first and second layers...  various problems [errors] are apparent. The first layer filters are a mix of extremely high and low frequency information, with little coverage of the mid frequencies [e.g. explanation]. Additionally, the 2nd layer visualization shows aliasing artifacts caused by the large stride 4 used in the 1st layer convolutions. To remedy these problems [in response to error and explanation], we (i) reduced the 1st layer filter size from 11x11 to 7x7 and (ii) made the stride of the convolution 2, rather than 4. [adjusting the deep neural network]")  

It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Selvaraju and Datta to include the error of the visualization output of Zeiler. Doing so would assist with selecting good neural network architectures. (Zeiler, p. 823 "While visualization of a trained model gives insight into its operation, it can also assist with selecting good architectures... This new architecture retains much more information in the 1st and 2nd layer features, as shown in Fig. 5(b) & (d). More importantly, it also improves the classification performance as shown in Section 5.1.")

Claims 6-9 are rejected under 35 U.S.C. 103 as being unpatentable over Selvaraju in view of Datta in further view of Yu ("NISP: Pruning Networks using Neuron Importance Score Propagation").


[Image: media_image4.png]
In regard to claim 6, reference is made to the rejection of claim 1, and Selvaraju and Datta do not teach, but Yu teaches:  the operations further comprising: identifying, from the artificial neurons in the set of intermediate artificial neurons, a first subset of artificial neurons and a second subset of artificial neurons, wherein, for each artificial neuron in the first subset, the influence score exceeds a threshold value, and wherein, for each artificial neuron in the second subset, the influence score does not exceed the threshold value; (Yu, p. 9196 "We define the neuron importance score as a non-negative value w.r.t. a neuron, and use sl to represent the vector of neuron importance scores in the l-th layer. Suppose Nl neurons are to be kept in the l-th layer after pruning; we define the neuron prune indicator of the l-th layer as a binary vector s∗l, computed based on neuron importance scores sl such that s∗l,i=1 if and only if sl,i is among top Nl values [e.g. threshold] in sl."; p. 9198 "Given target pruning ratios [e.g. threshold] for each layer, we propagate the importance scores, compute the prune indicator of neurons based on their importance scores and remove neurons with prune indicator value 0"; first subset are neurons with s*l,i=1, i.e. the top-ranked neurons, and second subset with s*l,i=0, i.e. the bottom-ranked neurons.)

generating a new artificial neural network comprising the first subset of artificial neurons and lacking at least a portion of the second subset of artificial neurons; and (Yu, p. 9194 "The CNN is pruned by removing neurons with least importance, and it is then fine-tuned to recover its predictive power."; p. 9196 "We propagate the neuron importance from the final response layer (FRL) to previous layers, and prune bottom-ranked neurons (with low importance scores shown in each node) given a pre-defined pruning ratio per layer in a single pass"; the pruned CNN is the new artificial neural network keeping important neurons and removing unimportant neurons.)
providing an output representing the new artificial neural network. (Yu, p. 9194 "The CNN is pruned by removing neurons with least importance, and it is then fine-tuned to recover its predictive power"; p. 9194 "Figure 1. We measure the importance of neurons in the final response layer (FRL), and derive Neuron Importance Score Propagation (NISP) to propagate the importance to the entire network. Given a pre-defined pruning ratio per layer, we prune the neurons/filters with lower importance score. We finally fine-tune the pruned model to recover its predictive accuracy"; p. 9196 "The motivation of our objective is that the difference between the responses produced by the original network and the one produced by the pruned network should be minimized w.r.t. important neurons."; see Fig. 1 Pruned network, the pruned CNN is an output representing the new artificial neural network.) 
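
As an illustrative sketch only, the prune indicator quoted from Yu (a binary vector equal to 1 for the top-N importance scores in a layer, with neurons marked 0 removed) can be written in plain Python. The function names `prune_indicator` and `prune_layer`, the neuron labels, and the score values are hypothetical, and ties are broken by neuron index, which is an assumption not stated in the quoted passages.

```python
def prune_indicator(scores, keep):
    """Binary vector: 1 for the `keep` highest-scoring neurons, else 0."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = set(ranked[:keep])
    return [1 if i in kept else 0 for i in range(len(scores))]

def prune_layer(neurons, scores, keep):
    """Return only the neurons whose prune indicator is 1."""
    mask = prune_indicator(scores, keep)
    return [n for n, m in zip(neurons, mask) if m == 1]

neurons = ["n0", "n1", "n2", "n3"]   # hypothetical neuron labels
scores = [0.2, 0.9, 0.05, 0.6]       # assumed non-negative importance scores
mask = prune_indicator(scores, keep=2)
pruned = prune_layer(neurons, scores, keep=2)
```

With these values the indicator is [0, 1, 0, 1], so the retained neurons correspond to the first subset and the removed neurons to the second subset under the mapping above.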

It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Selvaraju and Datta to include the pruned network of the NISP of Yu. Doing so would achieve significant acceleration and compression with negligible accuracy loss. (Yu, p. 9194 "NISP is evaluated on several datasets with multiple CNN models and demonstrated to achieve significant acceleration and compression with negligible accuracy loss.")
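For illustration only, and not as part of the cited disclosure, the top-N selection quoted from Yu (prune indicator equal to 1 if and only if the score is among the top N values in the layer) can be sketched as follows. The function names, the example scores, and the representation of a layer as a weight matrix with one row per neuron are assumptions made for this sketch; the fine-tuning step described in Yu is not shown.

```python
import numpy as np

def prune_indicator(scores, keep):
    """Return a binary vector: 1 for the `keep` highest scores, 0 otherwise."""
    scores = np.asarray(scores, dtype=float)
    indicator = np.zeros(scores.shape[0], dtype=int)
    if keep > 0:
        # argsort is ascending, so the last `keep` indices hold the top scores
        top = np.argsort(scores, kind="stable")[-keep:]
        indicator[top] = 1
    return indicator

def prune_layer(weights, indicator):
    """Keep only the rows (neurons) whose indicator value is 1."""
    return np.asarray(weights)[np.asarray(indicator).astype(bool)]

# Hypothetical layer with four neurons and made-up importance scores.
scores = np.array([0.9, 0.1, 0.5, 0.05])
indicator = prune_indicator(scores, keep=2)   # first subset: indicator 1
weights = np.arange(12).reshape(4, 3)         # one row per neuron
pruned = prune_layer(weights, indicator)      # second subset removed
```

In this sketch the neurons with indicator 1 correspond to the first subset, and the removed rows correspond to the second subset.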

In regard to claim 7, reference is made to the rejection of claim 6, and Selvaraju and Datta do not teach, but Yu teaches: the operations further comprising: using the new artificial neural network for inference to solve a same problem as the deep neural network. (Yu, p. 9196 "The motivation of our objective is that the difference between the responses produced by the original network and the one produced by the pruned network should be minimized w.r.t. important neurons."; solve a same problem using the original network and the pruned network.) 
The rationale for combining the teachings of Selvaraju, Datta and Yu is the same as set forth in the rejection of claim 6.

In regard to claim 8, reference is made to the rejection of claim 6, and Selvaraju and Datta do not teach, but Yu teaches: wherein the new artificial neural network lacks each and every artificial neuron in the second subset of artificial neurons. (Yu, p. 9198 "Given target pruning ratios for each layer, we propagate the importance scores, compute the prune indicator of neurons based on their importance scores and remove neurons with prune indicator value 0"; removing each and every neuron with s*l,i=0, i.e. each of the bottom-ranked neurons.)
The rationale for combining the teachings of Selvaraju, Datta and Yu is the same as set forth in the rejection of claim 6.

In regard to claim 9, reference is made to the rejection of claim 1, and Selvaraju and Datta do not teach, but Yu teaches: wherein: the set of intermediate artificial neurons comprises an intermediate layer, the input is x, the output quantity of interest is y=f(x)=g(h(x)), and the intermediate layer is z=h(x). (Yu, p. 9196 "Most neural networks can be represented as a nested function. Thus, we define a network with depth n as a function F^(n) = f^(n) ◦ f^(n−1) ◦ ··· ◦ f^(1)... [g(h(x))] f^(l)(x) [f(x)]… Suppose we have a dataset of M samples, and each is represented using x^(m) …"; x is the input)
The rationale for combining the teachings of Selvaraju, Datta and Yu is the same as set forth in the rejection of claim 6.

Claims 11-13 and 15-16 are rejected under 35 U.S.C. 103 as being unpatentable over Selvaraju in view of Yu.

In regard to claim 11, Selvaraju teaches: A non-transitory machine-readable medium storing instructions which, when executed by one or more computing machines, cause the one or more computing machines to perform operations comprising: (Selvaraju, p. 618 "Our code is available at https://github.com/ramprs/grad-cam/ along with a demo on CloudCV [2]1 and video at youtu.be/COjUB9Izk6E."; code and a demo inherently teach the implementation on computing machines and medium storing instructions.) 
accessing a set of intermediate artificial neurons in a deep neural network, wherein the deep neural network is fully or partially trained; (Selvaraju, p. 620 "we can expect the last convolutional layers to have the best compromise between high-level semantics and detailed spatial information. The neurons in these layers [a set of intermediate artificial neurons] look for semantic class-specific information in the image (say object parts). Grad-CAM uses the gradient information flowing into the last convolutional layer of the CNN to understand the importance of each neuron for a decision of interest."; "These gradients flowing back are global-average-pooled to obtain the neuron importance weights α_kc..."; p. 622 "We evaluate the pretrained off-the-shelf VGG-16 [trained CNN] [41] model from the Caffe [19] Model Zoo."; p. 623 "We finetune an ImageNet trained VGG-16 model for the task of classifying 'doctor' vs. 'nurse'.")
computing, for each artificial neuron in the set of intermediate artificial neurons, an influence score, wherein the influence score measures an influence of the artificial neuron on an output quantity of interest (Selvaraju, p. 620 "Grad-CAM uses the gradient information flowing into the last convolutional layer of the CNN to understand the importance of each neuron for a decision of interest... As shown in Fig. 2, in order to obtain the class-discriminative localization map Grad-CAM L_c_Grad-CAM ϵ Ru×v of width u and height v for any class c , we first compute the gradient of the score for class c, yc (before the softmax), with respect to feature maps Ak of a convolutional layer, i.e. ∂yc / ∂Ak . These gradients flowing back are global-average-pooled to obtain the neuron importance weights α_kc: … This weight α_kc represents a partial linearization of the deep network downstream from A, and captures the ‘importance’ of feature map k for a target class c."; α_kc is an influence score based on an average gradient (∂yc) of an output quantity of interest (yc) with respect to the neuron of a convolutional layer, c is a class of interest) for a set of inputs of the deep neural network; (Selvaraju, p. 621 "We built our training dataset using the top 250 relevant images [e.g. a set of inputs] (for each class) from a popular image search engine."; p. 622 "In order to measure whether Grad-CAM helps distinguish between classes we select images from VOC 2007 [e.g. a set of inputs] val set that contain exactly two annotated categories and create visualizations for each one of them.")

Selvaraju does not teach, but Yu teaches:  identifying, from the artificial neurons in the set of intermediate artificial neurons, a first subset of artificial neurons and a second subset of artificial neurons, wherein, for each artificial neuron in the first subset, the influence score exceeds a threshold value, and wherein, for each artificial neuron in the second subset, the influence score does not exceed the threshold value; (Yu, p. 9196 "We define the neuron importance score as a non-negative value w.r.t. a neuron, and use sl to represent the vector of neuron importance scores in the l-th layer. Suppose Nl neurons are to be kept in the l-th layer after pruning; we define the neuron prune indicator of the l-th layer as a binary vector s∗l, computed based on neuron importance scores sl such that s∗l,i=1 if and only if sl,i is among top Nl values [e.g. threshold] in sl."; p. 9198 "Given target pruning ratios [e.g. threshold] for each layer, we propagate the importance scores, compute the prune indicator of neurons based on their importance scores and remove neurons with prune indicator value 0"; first subset are neurons with s*l,i=1, i.e. the top-ranked neurons, and second subset with s*l,i=0, i.e. the bottom-ranked neurons.)
generating a new artificial neural network comprising the first subset of artificial neurons and lacking at least a portion of the second subset of artificial neurons; and (Yu, p. 9194 "The CNN is pruned by removing neurons with least importance, and it is then fine-tuned to recover its predictive power."; p. 9196 "We propagate the neuron importance from the final response layer (FRL) to previous layers, and prune bottom-ranked neurons (with low importance scores shown in each node) given a pre-defined pruning ratio per layer in a single pass"; the pruned CNN is the new artificial neural network keeping important neurons and removing unimportant neurons.)
providing an output representing the new artificial neural network. (Yu, p. 9194 "The CNN is pruned by removing neurons with least importance, and it is then fine-tuned to recover its predictive power"; p. 9194 "Figure 1. We measure the importance of neurons in the final response layer (FRL), and derive Neuron Importance Score Propagation (NISP) to propagate the importance to the entire network. Given a pre-defined pruning ratio per layer, we prune the neurons/filters with lower importance score. We finally fine-tune the pruned model to recover its predictive accuracy"; p. 9196 "The motivation of our objective is that the difference between the responses produced by the original network and the one produced by the pruned network should be minimized w.r.t. important neurons."; see Fig. 1 Pruned network, the pruned CNN is an output representing the new artificial neural network.)
The rationale for combining the teachings of Selvaraju and Yu is the same as set forth in the rejection of claim 6.

In regard to claim 12, reference is made to the rejection of claim 11, and Selvaraju does not teach, but Yu teaches: the operations further comprising: using the new artificial neural network for inference to solve a same problem as the deep neural network. (Yu, p. 9196 "The motivation of our objective is that the difference between the responses produced by the original network and the one produced by the pruned network should be minimized w.r.t. important neurons."; solve a same problem using the original network and the pruned network.)
The rationale for combining the teachings of Selvaraju and Yu is the same as set forth in the rejection of claim 6.

In regard to claim 13, reference is made to the rejection of claim 11, and Selvaraju does not teach, but Yu teaches: wherein the new artificial neural network lacks each and every artificial neuron in the second subset of artificial neurons. (Yu, p. 9198 "Given target pruning ratios for each layer, we propagate the importance scores, compute the prune indicator of neurons based on their importance scores and remove neurons with prune indicator value 0"; removing each and every neuron with s*l,i=0, i.e. each of the bottom-ranked neurons.)
The rationale for combining the teachings of Selvaraju and Yu is the same as set forth in the rejection of claim 6.

In regard to claim 15, reference is made to the rejection of claim 1, and Selvaraju teaches: the operations further comprising: determining, based on at least a subset of the computed influence scores, an influence-directed explanation why a given set of inputs to the deep neural network corresponds to the output quantity of interest; (Selvaraju, p. 618 "We propose a technique for producing ‘visual explanations’ for decisions from a large class of Convolutional Neural Network (CNN)-based models, making them more transparent. Our approach – Gradient-weighted Class Activation Mapping (Grad-CAM), uses the gradients of any target concept [based on the gradient information/scores] (say logits for ‘dog’ or even a caption), flowing into the final convolutional layer to produce a coarse localization map highlighting the important regions in the image for predicting the concept. [explanation why a given set of inputs to the deep neural network corresponds to the output quantity of interest]") and providing an additional output representing the influence-directed explanation. (Selvaraju, p. 621 "... Figures 1c, 1f and 1i, 1l [e.g. additional output] show Grad-CAM visualizations for ‘tiger cat’ and ‘boxer (dog)’ respectively. Ablation studies and more Grad-CAM visualizations can be found in [38]"; p. 621 "L_c_Grad-CAM... the neuron importance weights α_kc: This weight α_kc represents a partial linearization of the deep network downstream from A, and captures the ‘importance’ of feature map k for a target class c... We apply a ReLU to the linear combination of maps because we are only interested in the features that have a positive influence [e.g. influence-directed explanation] on the class of interest, i.e. pixels whose intensity should be increased in order to increase yc. Negative pixels are likely to belong to other categories in the image."; the higher the score, the higher the importance/influence of the explanation in the localization map L_c_Grad-CAM. The highlighted important region and Fig. 1 are examples of additional output.)

In regard to claim 16, reference is made to the rejection of claim 1, and Selvaraju teaches: wherein the influence-directed explanation comprises a portion of the input responsible for the output quantity of interest. (Selvaraju, p. 618 "We propose a technique for producing ‘visual explanations’ for decisions from a large class of Convolutional Neural Network (CNN)-based models, making them more transparent. Our approach – Gradient-weighted Class Activation Mapping (Grad-CAM), uses the gradients of any target concept (say logits for ‘dog’ or even a caption), flowing into the final convolutional layer to produce a coarse localization map highlighting the important regions in the image [a portion of the input] for predicting the concept [responsible for the output quantity of interest]."; p. 625 "Figure 5: Interpreting image captioning models: We use our class-discriminative localization technique, Grad-CAM to find spatial support regions [a portion of input] for captions in images [responsible for the output quantity of interest]. Fig. 5a Visual explanations from image captioning model [23] highlighting image regions considered to be important for producing the captions.")
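For illustration only, the Grad-CAM computation quoted from Selvaraju (gradients global-average-pooled into neuron importance weights α_kc, followed by a ReLU over the weighted combination of feature maps) can be sketched as follows. This is a minimal sketch, not the authors' code: the function name and the toy arrays are assumptions, and the gradients ∂yc/∂Ak are supplied as an input array rather than computed, since obtaining them requires an automatic differentiation framework that is outside this sketch.

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Sketch of the Grad-CAM map for one class.

    feature_maps: array of shape (K, u, v), the maps A_k of one layer.
    gradients:    array of shape (K, u, v), d y_c / d A_k (assumed given).
    Returns (alpha, cam), where alpha has shape (K,) and cam has shape (u, v).
    """
    A = np.asarray(feature_maps, dtype=float)
    dA = np.asarray(gradients, dtype=float)
    alpha = dA.mean(axis=(1, 2))            # global average pooling per map
    cam = np.tensordot(alpha, A, axes=1)    # sum_k alpha_k * A_k
    return alpha, np.maximum(cam, 0.0)      # ReLU keeps positive influence

# Toy example with two 2x2 feature maps and made-up gradients.
A = np.array([[[1.0, 1.0], [1.0, 1.0]],
              [[1.0, 2.0], [3.0, 4.0]]])
dA = np.array([np.full((2, 2), 0.5),
               np.full((2, 2), -0.25)])
alpha, cam = grad_cam(A, dA)
```

Positions where the weighted sum is negative are zeroed by the ReLU, which matches the quoted statement that only features with a positive influence on the class of interest are kept.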

Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Selvaraju in view of Yu in further view of Datta.

In regard to claim 14, reference is made to the rejection of claim 11, and Selvaraju teaches: wherein the influence score is computed based on an average gradient of the output quantity of interest with respect to the artificial neuron (Selvaraju, p. 620 "Grad-CAM uses the gradient information flowing into the last convolutional layer of the CNN to understand the importance of each neuron for a decision of interest... As shown in Fig. 2, in order to obtain the class-discriminative localization map Grad-CAM L_c_Grad-CAM ϵ Ru×v of width u and height v for any class c , we first compute the gradient of the score for class c, yc (before the softmax), with respect to feature maps Ak of a convolutional layer, i.e. ∂yc / ∂Ak . These gradients flowing back are global-average-pooled to obtain the neuron importance weights α_kc:… This weight α_kc represents a partial linearization of the deep network downstream from A, and captures the ‘importance’ of feature map k for a target class c."; α_kc is an influence score based on an average gradient (∂yc) of an output quantity of interest (yc) with respect to the neuron of a convolutional layer, c is a class of interest)

Selvaraju and Yu do not teach, but Datta teaches:  across the set of inputs weighted by a probability of each input. (Datta, p. 600 "We are given an algorithm A. A operates on inputs (also referred to as features for ML systems), N={1,…,n}. Every i∈N, can take on various states, given by Xi."; p. 601 "One intended use of QII is to provide personalized transparency reports to users of data analytics systems... The influence measure is therefore... The above probability can be interpreted as the probability that feature i [a probability of each input] is pivotal to the classification of c(x). Computing the average of this quantity over X [across a plurality of inputs] yields:"; ΣPr(x) where x ϵ X is the summation across a plurality of inputs, and it is weighted by E(c(X)), a probability of each input, feature i to class c.)
The rationale for combining the teachings of Selvaraju, Yu and Datta is the same as set forth in the rejection of claim 1.

See MPEP 2153.01(a). If, however, the application names fewer joint inventors than a publication (e.g., the application names as joint inventors A and B, and the publication names as authors A, B and C), it would not be readily apparent from the publication that it is by the inventor (i.e., the inventive entity) or a joint inventor and the publication would be treated as prior art under AIA 35 U.S.C. 102(a)(1).

Claims 1-4 and 9-10 are rejected under 35 U.S.C. 103 as being unpatentable over Selvaraju in view of Leino ("Influence-Directed Explanations for Deep Convolutional Networks").
In regard to claim 1, Selvaraju teaches: A non-transitory machine-readable medium storing instructions which, when executed by one or more computing machines, cause the one or more computing machines to perform operations comprising: (Selvaraju, p. 618 "Our code is available at https://github.com/ramprs/grad-cam/ along with a demo on CloudCV [2]1 and video at youtu.be/COjUB9Izk6E."; code and a demo inherently teach the implementation on computing machines and medium storing instructions.)
Selvaraju does not teach, but Leino teaches: accessing a set of intermediate artificial neurons in a deep neural network, wherein the deep neural network is fully or partially trained; (Leino, p. 2 left col. "... The slice parameter exposes the internals of a network, and allows one to compute influence with respect to intermediate neurons [for each intermediate artificial neuron]"; p. 2 left col. "Figure 1 demonstrates on a VGG16 (Simonyan & Zisserman, 2014) model trained [trained CNN] on the ImageNet dataset (Russakovsky et al., 2015) the capability of influence-directed explanations to extract meaningful insight about the network’s inner workings.")
computing, for each artificial neuron in the set of intermediate artificial neurons, an influence score based on an average gradient of an output quantity of interest with respect to the artificial neuron across a plurality of inputs weighted by a probability of each input; and (Leino, p. 2 left col. "Distributional influence [an influence score] is parameterized by a slice of the network (e.g. a particular layer), a quantity of interest, and a distribution of interest. The measure is the average partial derivative [average gradient] of the quantity of interest [the output quantity of interest] over the distribution of interest [the inputs distribution] at the slice... The slice parameter exposes the internals of a network, and allows one to compute influence with respect to intermediate neurons [for each intermediate artificial neuron]"; p. 7 "Axiom 3 (Distribution linearity (DL))... P(x) = integral of A g(a)P_a(x)da... We can show that the only influence measure that satisfies these three axioms is the weighted gradient of the input probability distribution [across the set of inputs weighted by a probability of each input]")
providing an output associated with the computed influence scores. (Leino, p. 2 right col. "The influence measure defined in Section 2 is parameterized by a distribution of interest P (Equation 1 [e.g. computed scores]) over which the measure is taken... Defining the distribution of interest with support over a larger set of instances will yield explanations [e.g. output] that capture the factors common to network behaviors across the corresponding population of instances. These explanations capture the 'essence' of what the network learned about that population, and can be used to identify the concepts that are most relevant to the network’s behavior on it.")
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Selvaraju to include influence-directed explanations of Leino. Doing so would allow the system to identify influential concepts that generalize across instances. (Leino, p. 1 “Our evaluation demonstrates that influence-directed explanations (1) identify influential concepts that generalize across instances…”)
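For illustration only, the limitation "an average gradient of an output quantity of interest with respect to the artificial neuron across a plurality of inputs weighted by a probability of each input" can be sketched for a finite set of inputs as follows. This is a discrete sketch under stated assumptions, not Leino's implementation: the toy slice (a linear map h and a sum-of-squares g with a hand-written gradient), the weight matrix W, the inputs, and the probabilities are all hypothetical.

```python
import numpy as np

def h(x, W):
    """Hypothetical intermediate layer z = h(x), here a linear map."""
    return W @ x

def grad_g(z):
    """Gradient of the hypothetical output g(z) = sum(z**2): dg/dz_j = 2*z_j."""
    return 2.0 * z

def influence_scores(inputs, probs, W):
    """Probability-weighted average of dg/dz_j, evaluated at z = h(x)."""
    inputs = np.asarray(inputs, dtype=float)
    probs = np.asarray(probs, dtype=float)
    # One gradient row per input, one column per intermediate neuron j.
    grads = np.stack([grad_g(h(x, W)) for x in inputs])
    return probs @ grads   # sum_x P(x) * dg/dz_j at h(x)

W = np.array([[1.0, 0.0],
              [0.0, 2.0]])
inputs = [[1.0, 1.0], [3.0, 0.0]]
probs = [0.25, 0.75]                 # made-up probabilities summing to 1
scores = influence_scores(inputs, probs, W)
```

Each entry of `scores` is the influence of one intermediate neuron; with a continuous distribution of interest the weighted sum would become an integral.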

In regard to claim 2, reference is made to the rejection of claim 1, and Selvaraju does not teach, but Leino teaches: wherein the influence score measures an influence of the artificial neuron on the output quantity of interest for a set of inputs of the deep neural network. (Leino, p. 2 left col. "Distributional influence [an influence score] is parameterized by a slice of the network (e.g. a particular layer), a quantity of interest, and a distribution of interest. The measure is the average partial derivative [average gradient] of the quantity of interest [the output quantity of interest] over the distribution of interest [the inputs distribution] at the slice... The slice parameter exposes the internals of a network, and allows one to compute influence with respect to intermediate neurons [for each intermediate artificial neuron]"; p. 7 "Axiom 3 (Distribution linearity (DL))... P(x) = integral of A g(a)P_a(x)da... We can show that the only influence measure that satisfies these three axioms is the weighted gradient of the input probability distribution [across the set of inputs weighted by a probability of each input]")
The rationale for combining the teachings of Selvaraju and Leino is the same as set forth in the rejection of claim 1.

In regard to claim 3, reference is made to the rejection of claim 1, and Selvaraju does not teach, but Leino teaches: the operations further comprising: determining, based on at least a subset of the computed influence scores, an influence-directed explanation why a given set of inputs to the deep neural network corresponds to the output quantity of interest, wherein the output associated with the computed influence scores comprises the influence-directed explanation. (Leino, p. 2 right col. "The influence measure defined in Section 2 is parameterized by a distribution of interest P (Equation 1 [e.g. computed scores]) over which the measure is taken... Defining the distribution of interest with support over a larger set of instances will yield explanations [e.g. output, an influence-directed explanation] that capture the factors common to network behaviors across the corresponding population of instances. These explanations capture the 'essence' of what the network learned about that population, and can be used to identify the concepts that are most relevant to the network’s behavior on it.")
The rationale for combining the teachings of Selvaraju and Leino is the same as set forth in the rejection of claim 1.

In regard to claim 4, reference is made to the rejection of claim 3, and Selvaraju does not teach, but Leino teaches: wherein the influence-directed explanation comprises a portion of the input responsible for the output quantity of interest. (Leino, p. 2 “… predict ‘sports car’ over ‘convertible’. The images in Figure 1(b) are computed by rendering the receptive field of the most influential map in the original feature space for the corresponding image in 1(a). The results coincide with an intuitive understanding of the distinction between these classes: the depicted interpretation highlights the portion of the image depicting the car’s top.”)
The rationale for combining the teachings of Selvaraju and Leino is the same as set forth in the rejection of claim 1.

In regard to claim 9, reference is made to the rejection of claim 1, and Selvaraju does not teach, but Leino teaches: wherein: the set of intermediate artificial neurons comprises an intermediate layer, the input is x, the output quantity of interest is y = f(x) = g(h(x)), and the intermediate layer is z = h(x). (Leino, p. 2 "a slice of a network f is a tuple of functions <g, h>, such that h: X-> z, and g: z-> R and f = g ∘ h [g(h(x))]. The internal representation for an instance x is given by z = h(x).")
The rationale for combining the teachings of Selvaraju and Leino is the same as set forth in the rejection of claim 1.


[Equation image omitted: media_image1.png]
In regard to claim 10, reference is made to the rejection of claim 9, and Selvaraju does not teach, but Leino teaches: wherein computing the influence score for a given artificial neuron zj in the intermediate layer comprises computing: wherein: χ is the influence score, and P(x) is the probability of the input x. (Leino, p. 2 "where X… and n is the number of inputs to f… a distribution of interest P… a slice s of a network f is a tuple of functions <g,h> such that … The internal representation for an instance x is given by z=h(x). In our setting, elements of z can be viewed as the activations of neurons at a particular layer. Definition 1. The influence of an element j in the internal representation defined by s = <g; h> is given by… Eq(1) [score] ")
The rationale for combining the teachings of Selvaraju and Leino is the same as set forth in the rejection of claim 1.
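The equation of claim 10 is not reproduced in the text of this action. For reference only, Definition 1 of Leino, which the rejection above cites as Eq. (1), is understood to take the following form; the notation is reconstructed from the cited paper and from the symbols listed in the rejection (s, j, f, P, g, h(x)), and should be checked against the claim as filed rather than read as the claim language.

```latex
% Influence of element j of the internal representation z = h(x),
% for a slice s = <g, h> of the network f = g \circ h and a
% distribution of interest P over the input space \mathcal{X}.
\chi^{s}_{j}(f, P) \;=\; \int_{\mathcal{X}}
  \left. \frac{\partial g}{\partial z_j} \right|_{h(x)} P(x)\, dx
```

Here χ is the influence score and P(x) is the probability of the input x, consistent with the "wherein" clause of claim 10.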

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SU-TING CHUANG whose telephone number is (408)918-7519.  The examiner can normally be reached on Monday - Thursday 8-5 PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki can be reached on (571)272-3719.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/S.C./Examiner, Art Unit 2122                 

/YING YU CHEN/Primary Examiner, Art Unit 2125