DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
The amendment filed 2022-10-07 has been entered. The status of the claims is as follows:
Claims 2, 4, 14, and 19 are cancelled.
Claims 1, 3, 5-13, 15-18, and 20-23 remain pending in the application.
Claims 1, 7, 11, 13, and 18 are amended.
Claims 21-23 are new.
Response to Arguments
Applicant’s arguments with respect to rejections under 35 USC 103 have been fully considered but they are not persuasive.
Applicant argues on Remarks Page 8 that “The Office Action appears to unreasonably broadly interpret ‘connections between nodes’ and ‘weights’ as any sort of connection between layers, essentially equating the filters of Luo, which deals with convolutional neural networks, with the node weights of the claims. The claims and specification are clear that the claimed DNN has connections between nodes with associated weights. The filters in a CNN connect combinations of nodes. The weights of the fully connected layers are what are sparsified in the claims, not filters as in Luo. Luo itself refers to prior methods that sparsify weights, distinguishing them from filter pruning. See Luo, Page 2, second column under ‘Network pruning’ indicating that ‘Small-weight connections below a threshold would be discarded, leading to a sparse architecture. But their method did not reduce the size of activation tensor, which would dominate the memory footprint when batch size is large. Thus some researchers focus their attention on filter pruning to reduce channel number of activation tensor.’ Section 3 describes Luo's method ‘to discard several unimportant filters...’”
Examiner respectfully disagrees.  While Applicant states, “The filters in a CNN connect combinations of nodes”, Examiner points out that connections between combinations of nodes are, nevertheless, still connections between nodes.  Examiner has used Luo to teach the limitation “and determining neural entropies of respective connections between nodes by calculating an area covered by the spreading signal”.  Examiner notes that the limitation does not state “connections between single nodes”.  Examiner also points out that a limitation which does imply the connection between single nodes: “determining a spreading signal for each connection between nodes in multiple adjacent layers of the DNN wherein the spreading signal for each connection is an element-wise multiplication of input activations to a node in a first layer connected to a nod
Regarding Applicant’s argument that Luo distinguishes their own work of pruning filters from pruning weights, Examiner points out that filters are a combination of weights, and are analogous to simple weights in a generic non-convolutional DNN.  This is further supported by other sources such as Sze et al. (“Efficient Processing of Deep Neural Networks: A Tutorial and Survey”; hereinafter “Sze”), Page 7 Bottom Left: “To align the terminology of CNNs with the generic DNN, filters are composed of weights (i.e., synapses)”.  In the passage cited by Applicant as distinguishing filters from weights, Luo expands on the previous work of Han et al. (“Learning both Weights and Connections for Efficient Neural Networks”; hereinafter “Han2”), who prunes individual weights having small magnitude from a CNN, by instead pruning entire filters having small entropy from a CNN.  Given what is already disclosed by the Luo, Erdogan, and Han references, Examiner determined that it is not necessary to include Han2 in the combination of arts applied.  Erdogan and Han teach pruning connections (weight connections), while Luo applies entropy to pruning connections (filter connections).
Examiner also points out that a CNN is considered a type of DNN.  Luo themselves mention DNN in Page 6 Top Right:  “Because the training process of deep neural networks is a highly non-convex, if we always fine-tuned the network until convergence after pruning every layer, the network is likely to be attracted to a set of poor values in the early stage.”  Furthermore, multiple other sources state as much, including Rastegari et al. (“XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”; hereinafter “Rastegari”) in Page 1:  “In computer vision, a particular type of DNN, known as Convolutional Neural Networks (CNN), have demonstrated state-of-the-art results in object recognition and detection.”
Applicant argues on Remarks Pages 8-9 that “The Office Action asserts that Luo teaches: ‘sparsifying the sorted weights for each layer to create multiple sparse model layers’ as recited in independent claims 1, 13, and 18. This assertion is respectfully traversed. The claim language specifically recites sparsifying the sorted weights by the use of multiple different sparsity levels for each layer followed by retraining of the multiple sparse model layers for each layer and the selection of one of the retrained sparse model layers for each corresponding layer to form the optimized DNN. This process is believed quite different from that described in Luo which sparsifies filters, not weights as claimed. While the filters in Luo are sorted, there is no teaching of pruning each layer at different sparsity levels to create multiple sparse model layers for each layer based on connection weights. As stated in the Office Action referring to Luo: ‘We prune each layer with 50% compression rate.’ Thus, only one sparsity level is created for each layer in Luo. Claim 1 clarifies that multiple sparse model layers with different sparsity are created, and ‘one of the retrained sparse model layers for each corresponding layer’ is selected for the DNN.”
Examiner respectfully disagrees.  Regarding the argument that Luo sparsifies filters and not weights, Examiner points to the discussion on this matter above.  Regarding the “different sparsity levels”, Luo was not relied upon to teach this limitation; instead, this limitation was taught by the additional reference Kim.
Applicant argues on Remarks Page 9 that “The Office Action also asserts that Luo teaches ‘comparing accuracy of the retained sparse model layers to model layers prior to sparsifying’ as claimed. This assertion is believed in error, as it references the last two paragraphs on page 5 of Luo. The first of such paragraphs refers to a former strategy and the comparison is done only after the CNN has been sparsified based on filters. Such comparison is not used to select one of the retrained sparse model layers for each corresponding layer as are recited in the last element of claim 1. The layers in Luo have already been selected, meaning that the layers were not selected based on the comparison. In addition, Luo does not create multiple sparsified layers on which to perform the comparison. The Office Action also refers to Tables 3 and 4 in support of the assertion that Luo teaches ‘comparing accuracy of the retained sparse model layers to model layers prior to sparsifying’ as claimed. Such Tables compare already tuned models to either other types of pruning (Table 3) or models already tuned using different pruning thresholds of 90%, 75% and 50% (Table 4). Neither table performs the selection of sparse model layers for each corresponding layer based on a comparing as claimed.”
Examiner respectfully disagrees.  First, Examiner points out that Examiner mapped this limitation to two different references, as Examiner felt they both taught it.  This was mapped to both Luo and Han.  No argument has been presented against Han.  Regarding Luo, Examiner points out that Luo Page 5 last paragraph:  “In this paper, we propose a novel learning schedule to trade-off between training speed and accuracy”, and Table 4 “Overall Performance of our approach on ResNet-50 with different compression rate” demonstrates that Luo indeed does “compare accuracy of the retrained sparse model layers to model layers prior to sparsifying” as claimed.  Table 4 shows that Luo is comparing the accuracy of the sparsified layers to the pre-sparsified layers, as there are rows for “Original” and then different levels of pruning.  Regarding the next limitation, “selecting one of the retrained sparse model layers for each corresponding layer as an optimized DNN based on the comparing”, Luo Page 5 last paragraph also discloses:  “Whenever one layer is pruned, the whole network is fine-tuned with one or two epochs to recover its performance slightly. Of course, this performance is not yet optimal, but is approaching to it closely.”  Here, Luo performs fine-tuning until the accuracy is acceptable.  Only then are the sparse layers accepted as the final model, and thus Luo discloses that the accuracy comparison is used to select the sparse layers as layers of the optimized DNN.  While Applicant argues that Luo’s accuracy information is based on “models already tuned using different pruning thresholds”, nevertheless, this accuracy comparison information is used in the subsequent algorithms for “two epochs” as a result, which produces sparsified networks of acceptable accuracy.  The layers are not selected as the final layers until after the second epoch.  
Finally, Examiner points out that Han, Page 3 Para 3, was also used to map the same two limitations, and they are presented in the claim rejections below.
Applicant argues on Remarks Page 9 that “The Office Action asserts that it would have been obvious to combine the layer-wise sparsity levels of Kim with the sparse network of Luo, Erdogan, and Han to improve efficiency and performance. This motivation is general in nature and not sufficient to support the combination as Kim utilizes an entirely different method to derive different sparsity levels as claimed. There is no explanation of how to combine the adaptive learning algorithm determined sparsity levels of Kim which are based on non-zero ratios, with the processes described in Luo, Erdogan, and Han that utilize sparsity level thresholds. In addition, Luo sparsifies based on filters, not connection weights as claimed. The rejection thus lacks sufficient detail to support the likelihood of success in making the combination which would change the fundamental sparsification process of Luo from a filter based sparsification to a weight based sparsification. The combination would change the principle of operation of Luo.”
Examiner respectfully disagrees.  In response to Applicant's argument above, the test for obviousness is not whether the features of a secondary reference may be bodily incorporated into the structure of the primary reference; nor is it that the claimed invention must be expressly suggested in any one or all of the references.  Rather, the test is what the combined teachings of the references would have suggested to those of ordinary skill in the art.  See In re Keller, 642 F.2d 413, 208 USPQ 871 (CCPA 1981).  The question above is why one would want to incorporate the different layer-specific sparsity levels of Kim, and not Kim’s entire pruning algorithm.  As disclosed in the motivation statement in the rejection below, Kim states that “explicit sparsity control improved the classification performance.”  One of ordinary skill in the art would then be motivated to apply this concept of layer-specific sparsity control to Luo, Erdogan, and Han, in order to achieve an optimized performance by applying different sparsity levels to different layers.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 3, 5-9, 11-13, 15-16, 18, and 20-21 are rejected under 35 U.S.C. 103 as being unpatentable over Luo et al. (“An Entropy-based Pruning Method for CNN Compression”; hereinafter “Luo”), in view of Erdogan et al. (“Measurement Criteria for Neural Network Pruning”; hereinafter “Erdogan”), Han et al. (“DSD: Dense-Sparse-Dense Training for Deep Neural Networks”; hereinafter “Han”), and Kim et al. (“Deep neural network with weight sparsity control and pre-training extracts hierarchical features and enhances classification performance: Evidence from whole-brain resting-state functional connectivity patterns of schizophrenia.”; hereinafter “Kim”).
As per claim 1, Luo teaches A computer implemented method of optimizing a neural network (Luo, Introduction Page 2 Left Column First Bullet, discloses “A simple yet effective framework is proposed, to accelerate and compress CNN models in both training and inference stage. Our method can compress the size of intermediate activations, reducing the run-time memory consumption dramatically, which is less concerned in previous works.”), the method including operations comprising:
obtaining a deep neural network (DNN) trained with a training dataset;  (Luo, Introduction 2nd Paragraph Line 7, discloses that the method is directed to deep neural networks: “One of the main issues of deep neural networks is its huge computational  and storage overhead.”  Luo, Page 4 Section 3.2 Left Column 3rd Paragraph Line 4, implies an already trained neural network:  “In order to calculate the entropy, more output values need to be collected, which can be obtained using an evaluation set. In practice, the evaluation set can be simply the original training set, or a subset of it.”)
and determining neural entropies of respective connections between nodes by calculating an area covered by the spreading signal. (Luo, Page 4 Section 3.2 Left Column Paragraph 3 discloses “We first use global average pooling to convert the output of layer i, which is a c × h × w tensor, into a 1 × c vector. In this way, each channel of I_{i+1} (activation of layer i / input of layer i + 1) has a corresponding score for one image. In order to calculate the entropy, more output values need to be collected, which can be obtained using an evaluation set. In practice, the evaluation set can be simply the original training set, or a subset of it. Finally, we get a matrix M ∈ R^{n × c}, where n is the number of images in the evaluation set, and c is the channel number. For each channel j, we would pay attention to the distribution of M_{:,j}. To compute the entropy value of this channel, we first divide it into m different bins, and calculate the probability of each bin. Finally, the entropy can be calculated as follows:  H_j = -Σ_{i=1}^{m} p_i log p_i”
Examiner’s Note:  Here, the 1 × c vector represents the “spreading signal”.  Luo runs n samples from an evaluation set through the NN, thereby constructing an n × c matrix, which is how Luo stores the n samples of each of the c elements of the spreading signal.  For each element of the spreading signal, the n samples are used to construct a probability distribution by dividing the values into bins.  Finally, a function of the probability distribution is calculated (f(p(x)) = p(x) * log(p(x))), and the area under the curve of this function is calculated by summing the function values for each bin.  The result of this calculation of the area under the curve of this function is the entropy, as noted by Luo (“Finally, the entropy can be calculated as follows:  H_j = -Σ_{i=1}^{m} p_i log p_i”).  This entropy is used to prune filters which act as connections between nodes (“Our goal is to prune the filters W_i”), and Luo is thus determining entropies of connections between nodes.)
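For illustration only, the binned entropy calculation quoted above can be sketched as follows. This is a minimal sketch and not Luo's implementation: the function names, the equal-width binning, and the default bin count are assumptions made for the example, and M is assumed to already hold the globally average-pooled scores (n images by c channels).

```python
import math

def channel_entropy(scores, num_bins):
    """Entropy H_j of one channel's pooled scores over an evaluation set.

    Equal-width bins between the min and max score are an assumption;
    the quoted passage only says the values are divided into m bins.
    """
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return 0.0  # every score lands in one bin, so p = 1 and H = 0
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for s in scores:
        # Clamp so the maximum score falls in the last bin.
        idx = min(int((s - lo) / width), num_bins - 1)
        counts[idx] += 1
    n = len(scores)
    # H_j = -sum_i p_i log p_i, skipping empty bins (0 log 0 taken as 0).
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)

def entropies_per_channel(M, num_bins=4):
    """M is an n x c list of lists; returns one entropy per channel (column)."""
    num_channels = len(M[0])
    return [channel_entropy([row[j] for row in M], num_bins)
            for j in range(num_channels)]
```

Under these assumptions, a channel whose score is the same for every image receives an entropy of zero, which corresponds to the low-entropy channels Luo treats as candidates for removal.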
in multiple adjacent layers of the DNN (Luo, Page 5 Section 3.3 Para 2, discloses:  “In order to reduce the running time, we prune the first 10 convolutional layers via our entropy-based method.”)
optimizing the DNN based on the determined neural entropies for the connections between nodes in the multiple adjacent layers  (Luo, Intro Page 2 Left Column Bullet 1, discloses optimizing a CNN (a type of DNN): “A simple yet effective framework is proposed, to accelerate and compress CNN models in both training and inference stage. Our method can compress the size of intermediate activations, reducing the run-time memory consumption dramatically, which is less concerned in previous works.”  Luo, Page 5 Section 3.3 Para 2, discloses multiple adjacent layers: “In order to reduce the running time, we prune the first 10 convolutional layers via our entropy-based method.”  Luo, Page 4 Section 3.2 Right Column First Full Paragraph, discloses that the optimizing is based on neural entropies:  “Hence, our entropy-based method can be used for evaluating the importance of each channel. A smaller score of Hj means channel j is less important in this layer, thus could be removed.”)
Luo does not explicitly teach determining a spreading signal for each connection between nodes in multiple adjacent layers of the DNN wherein the spreading signal for each connection is an element-wise multiplication of input activations to a node in a first layer connected to a nodlayers prior to sparsifying; and selecting one of the retrained sparse model layers for each corresponding layer as an optimized DNN based on the comparing. 
Erdogan teaches determining a spreading signal for each connection between nodes in multiple adjacent layers of the DNN wherein the spreading signal for each connection is an element-wise multiplication of input activations to a node in a first layer connected to a nod(Erdogan, Section 2 Page 84 top of left column, discloses:  “the activity of a hidden node i, is obtained by Y_i = A_i(net_i) where A_i is the activation function and net_i, is defined by net_i = Σ_{k=1}^{N_k} V_ik X_k where X_k is an input pattern, V_ik is the weight connection from input node k to hidden node i and N_k is the number of input nodes. The activity Y_i is normalized as follows: p_i = Y_i / Σ_{j=1}^{N_k} Y_j where N_k is the number of hidden nodes.  By using this normalized activity, an entropy function can be formulated by H_j = -Σ_{i=1}^{N_k} p_i log p_i.”  
Examiner’s Note:  Here “input pattern” is the activation from the previous layer’s node, and V_ik X_k is an element of the said element-wise multiplication of activations and weights, and is thus an element of the “spreading signal”. This spreading signal is subsequently used by Erdogan in a series of calculations that results in entropy used for pruning nodes of the neural network.)
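For illustration only, the quantities in the quoted Erdogan passage can be sketched as follows. This is a minimal sketch under stated assumptions, not Erdogan's implementation: the function names are hypothetical, and a sigmoid is assumed as the activation function so that the activities are positive and can be normalized.

```python
import math

def spreading_signal(x, v_i):
    """Element-wise products V_ik * X_k for the connections into hidden node i."""
    return [v * xk for v, xk in zip(v_i, x)]

def sigmoid(z):
    """Assumed activation function A_i; the quoted passage does not fix one."""
    return 1.0 / (1.0 + math.exp(-z))

def hidden_activities(x, V, activation=sigmoid):
    """Y_i = A_i(net_i), where net_i is the sum over node i's spreading signal."""
    return [activation(sum(spreading_signal(x, v_i))) for v_i in V]

def activity_entropy(Y):
    """Normalize activities to p_i = Y_i / sum_j Y_j and return -sum_i p_i log p_i."""
    total = sum(Y)
    p = [y / total for y in Y]
    return -sum(pi * math.log(pi) for pi in p if pi > 0)
```

Here `V` is a list of weight rows, one row per hidden node, and `x` is one input pattern; the individual products returned by `spreading_signal` are the terms the Examiner's Note identifies as elements of the spreading signal.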
Luo and Erdogan are analogous art because they are both directed to using entropy to prune neural networks.  It would have been obvious to one of ordinary skill in the art before the effective filing date to combine Luo’s method of calculating entropy in a neural network based on a probability distribution with Erdogan’s method of calculating the entropy of nodes in a neural network.  The modification would have been obvious because one of ordinary skill in the art would be motivated to measure the relevance of hidden nodes in order to prune redundant nodes, as suggested by Erdogan (Erdogan: Abstract).
Luo further teaches wherein optimizing the DNN comprises: sorting the weights for each layer based on their corresponding neural entropies  (Luo, Page 4 End of Section 3.2, discloses:  “All the filters are sorted in the descending order according to their entropy scores”.  Luo discloses performing the pruning process for each layer in Page 6 Section 4.1 Para 2:  “We prune each layer with 50% compression rate”).
and iteratively for each layer, based on the sorted weights: 
sparsifying the sorted weights for each layer to create multiple sparse model layers (Luo Page 4 End of Section 3.2, discloses:  “All the filters are sorted in the descending order according to their entropy scores, and only the top k filters are preserved. Of course, the corresponding channels in Wi+1 are removed too.”  Here, Luo discloses removing filters (weights) and corresponding channels, which will result in a sparse layer.)
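For illustration only, the sort-and-preserve step quoted from Luo can be sketched as follows. This is a minimal sketch: the function name and the rule that the number of removed filters is the compression rate times the filter count, truncated to an integer, are assumptions for the example.

```python
def keep_top_k_by_entropy(entropies, compression_rate):
    """Return the indices of the filters preserved in one layer.

    Filters are ranked by entropy in descending order and the top k are kept.
    The number removed is int(n * compression_rate), an assumed rounding rule.
    """
    n = len(entropies)
    num_removed = int(n * compression_rate)
    k = n - num_removed
    order = sorted(range(n), key=lambda j: entropies[j], reverse=True)
    # Return the kept indices in their original order so the matching
    # input channels of the next layer can be kept in the same way.
    return sorted(order[:k])
```

With a 50% compression rate and four filters, the two filters with the highest entropy scores are preserved and the other two, along with their corresponding channels in the next layer, would be removed.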
retraining the sparse model layers (Luo, Page 5 Final Paragraph, discloses:  “Whenever one layer is pruned, the whole network is fine-tuned with one or two epochs to recover its performance slightly.”  Fine-tuning is another way of saying retraining.)
comparing accuracy of the retrained sparse model layers to model layers prior to sparsifying (Luo, Page 5, last 2 paragraphs, discloses:  “We demonstrate in our experiments that the former strategy is not feasible since pruning too many layers may drop the accuracy significantly” and “In this paper, we propose a novel learning schedule to trade-off between training speed and accuracy”.  Here, Luo discloses comparing accuracy after pruning layers compared to original, as shown in Luo tables 3 and 4 on Page 7.)
and selecting one of the retrained sparse model layers for each corresponding layer as an optimized DNN based on the comparing (Luo, Page 5 Last 2 paragraphs, discloses:  “In this paper, we propose a novel learning schedule to trade-off between training speed and accuracy. Whenever one layer is pruned, the whole network is fine-tuned with one or two epochs to recover its performance slightly. Of course, this performance is not yet optimal, but is approaching to it closely.”  Here, based on the comparing (“accuracy”) Luo retrains sparse model layers and after several rounds of fine-tuning, selects an optimized DNN, which comprises at least one retrained sparse model layer.)
However, the combination of Luo and Erdogan thus far fails to teach optimizing the DNN based on the absolute values of the weights for the connections between nodes in the multiple adjacent layers, and multiple sparse model layers having different levels of sparsity.
Han teaches optimizing the DNN based on the absolute values of the weights for the connections between nodes in the multiple adjacent layers.  (Han, Bottom of Page 2 to Page 3, discloses:  “We use the simple heuristic to quantify the importance of the weights using their absolute value.  Sparse Training: The S step prunes the low-weight connections and trains a sparse network”.  Here, Han discloses optimizing the DNN (“prunes”) based on the absolute values of the weights (“to quantify the importance of the weights using their absolute value”) for the connections between nodes in the multiple adjacent layers (“prunes the low-weight connections”)).
The combination of Han with Luo results in the claimed limitation of optimizing the DNN based on the determined neural entropies and absolute values of the weights for the connections between nodes in the multiple adjacent layers, as Luo suggests comparing two different methods in the last paragraph of Section 4.2:  “In [24], Li et al. proposed a similar framework for filter pruning. They calculated the absolute weight sum of each filter as the channel selection metric. They pruned ResNet-34 on ImageNet to get a smaller model. However, their method cannot prune too many filters. Otherwise, the generalization ability of pruned model would be damaged greatly. In [24], they only pruned 10.8% parameters with 1.06% drop in the top-1 accuracy. As a comparison, our model “Pruned-75” pruned 16% parameters and slightly increased both top-1 and top-5 accuracy rates.”  Here, Luo compares the results of their model (which, as shown above, is based on entropy), with the results of the previous method (“the absolute weight sum”).  Han explicitly discloses pruning based on the absolute values of the weights.  Luo’s description is analogous to Instant Specification [0055-0056], [0047], [0049], [0031], where Applicant acknowledges that absolute value of the weights was the previous method used, and subsequently compares the performance of the entropy method to the older absolute value of weights method.
Luo, Erdogan, and Han are analogous art because they are all directed to optimizing neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the entropy-based neural network pruning of the combination of Luo and Erdogan, with the dense-sparse-dense training of Han.  The modification would have been obvious because one of ordinary skill in the art would be motivated to achieve superior optimization performance (Han: Abstract).
Note that Han, like Luo, further teaches wherein optimizing the DNN comprises: sorting the weights for each layer based on their corresponding [neural entropies] absolute values (Recall above that Luo discloses neural entropies.  Han, Bottom of Page 2, discloses:  “We use the simple heuristic to quantify the importance of the weights using their absolute value”.  Then continues on Page 3 Para 1, “For each layer W with N parameters, we sorted the parameters”.  Thus, Han discloses sorting the weights (“parameters”) for each layer based on their corresponding absolute values.)
and iteratively, based on the sorted weights: 
sparsifying the sorted weights for each layer to create sparse model layers (Han, Page 3 Para 2, discloses:  “picked the k-th largest one λ = Sk as the threshold where k = N *(1 - sparsity),  and generated a binary mask to remove all the weights smaller than λ.”  Here, Han discloses sparsifying (“remove all the weights”) the sorted weights (“k-th largest…remove all weights smaller than λ”).  As disclosed above, Han does this for each layer.  Removing weights will result in sparse layers.)
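For illustration only, the threshold-and-mask step quoted from Han can be sketched as follows. This is a minimal sketch with hypothetical function names; the layer is represented as a flat list of weights, and weights whose magnitude ties with the threshold are kept, which is an assumption the quoted passage does not address.

```python
def magnitude_mask(weights, sparsity):
    """Binary mask keeping the k largest-magnitude weights, k = N * (1 - sparsity)."""
    n = len(weights)
    k = int(n * (1.0 - sparsity))
    if k <= 0:
        return [0] * n  # everything is pruned
    sorted_abs = sorted((abs(w) for w in weights), reverse=True)
    lam = sorted_abs[k - 1]  # threshold: the k-th largest magnitude
    # Weights smaller in magnitude than the threshold are masked out.
    return [1 if abs(w) >= lam else 0 for w in weights]

def apply_mask(weights, mask):
    """Zero the masked weights; retraining would re-apply this after each update."""
    return [w * m for w, m in zip(weights, mask)]
```

For example, with four weights and a sparsity of 0.5, k is 2, the threshold is the second-largest magnitude, and the two smaller weights are removed.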
retraining the sparse model layers  (Han, Page 3 Para 3, discloses:  “Retraining while enforcing the binary mask in each iteration, we converted a dense network into a sparse network that has a known sparsity support.”  Here, Han discloses retraining the sparse model layers.  “While enforcing the binary mask” refers to the sparsified network, with some weights being “masked”, meaning they are removed.  Note that this is also taught by Luo, who in Section 3.4 Right Column Paragraph 2 Lines 2-4, discloses fine-tuning (i.e., retraining) the sparse DNN after pruning:  “Whenever one layer is pruned, the whole network is fine-tuned with one or two epochs to recover its performance slightly.”)
comparing accuracy of the retrained sparse model layers to model layers prior to sparsifying; (Han, Page 3 Para 3, discloses: “Retraining while enforcing the binary mask in each iteration, we converted a dense network into a sparse network that has a known sparsity support and can fully recover or even increase the original accuracy of initial dense model under the sparsity constraint.”  Here, Han suggests comparing the accuracy of the sparse model layers to the original layers, as they state “recover or even increase the original accuracy of initial dense model”).
and selecting one of the retrained sparse model layers for each corresponding layer as an optimized DNN based on the comparing (Han, Page 3 Para 3, discloses:  “Retraining while enforcing the binary mask in each iteration, we converted a dense network into a sparse network that has a known sparsity support and can fully recover or even increase the original accuracy of initial dense model under the sparsity constraint. The sparsity is the same for all the layers and can be tuned using validation. We find a sparsity value between 25% and 50% generally works well in our experiments.”  Here, Han discloses comparing the accuracy (“fully recover or even increase the original accuracy”) and based on the resulting “sparsity constraint” from the comparing, states “the sparsity is the same for all layers”.  Thus, Han discloses selecting all the layers, which includes at least one of the layers.)
However, the combination of Luo, Erdogan, and Han fails to teach multiple sparse model layers having different levels of sparsity.
Kim teaches multiple sparse model layers having different levels of sparsity.  (Kim, Page 131 Below Eq 4, discloses:  “Various target non-zero ratios (ρ(J + 1,J) = 0.3, 0.5, 0.7, or 1.0) were tested for each hidden layer to reflect the potentially different optimal sparsity level in each layer”, and in the final paragraph of Page 134 states:  “For example, the average non-zero ratios for the DNN with three hidden layers were 0.52 for the first hidden layer, 0.72 for the second hidden layer, and 0.85 for the third hidden layer.”)
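For illustration only, testing several non-zero ratios per layer and keeping a different ratio for each layer might be sketched as follows. This sketch is not Kim's method: it substitutes simple magnitude pruning for Kim's L1-norm sparsity control, and the `score` callback, the `tolerance` parameter, and the rule of keeping the sparsest candidate that stays within tolerance of the dense score are all hypothetical stand-ins for retraining and validation.

```python
def sparsify_to_ratio(weights, non_zero_ratio):
    """Keep the largest-magnitude weights so about non_zero_ratio of them remain."""
    keep = int(round(len(weights) * non_zero_ratio))
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    kept = set(order[:keep])
    return [w if i in kept else 0.0 for i, w in enumerate(weights)]

def select_ratio_per_layer(layers, ratios, score, tolerance):
    """For each layer, try the candidate ratios and pick one independently.

    `score` is a hypothetical callback standing in for the accuracy measured
    after retraining; higher is better.
    """
    chosen = []
    for weights in layers:
        baseline = score(weights)  # score of the layer before sparsifying
        best = 1.0                 # fall back to the dense layer
        for r in sorted(ratios):   # sparsest candidate first
            if score(sparsify_to_ratio(weights, r)) >= baseline - tolerance:
                best = r
                break
        chosen.append(best)
    return chosen
```

The point of the sketch is only that the selected ratio can differ from layer to layer, as in the per-layer non-zero ratios Kim reports.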
Kim and the combination of Luo, Erdogan, and Han are analogous art because they are in the field of endeavor of optimizing neural networks.
It would have been obvious before the effective filing date of the claimed invention to combine the layer-wise sparsity levels of Kim with the sparse network of Luo, Erdogan, and Han.  One of ordinary skill in the art would be motivated to do so in order to improve efficiency and performance (Kim, Page 140 right column:  “The key findings of this investigation are summarized as follows: (1) the L1-norm regularization of the DNN weights via explicit sparsity control improved the classification performance.”)

As per claim 3, the combination of Luo, Erdogan, Han, and Kim teaches the method of claim 1 wherein sparsifying the sorted weights for each layer comprises pruning the connections between nodes as a function of increasing sparsity levels (Han, Bottom of Page 2, discloses:  “We use the simple heuristic to quantify the importance of the weights using their absolute value”.  Then continues on Page 3 Para 1, “For each layer W with N parameters, we sorted the parameters”.  Thus, Han discloses sorting the weights (“parameters”) for each layer based on their corresponding absolute values.  Han, Page 3 Para 3, discloses:  “Retraining while enforcing the binary mask in each iteration, we converted a dense network into a sparse network that has a known sparsity support and can fully recover or even increase the original accuracy of initial dense model under the sparsity constraint. The sparsity is the same for all the layers and can be tuned using validation. We find a sparsity value between 25% and 50% generally works well in our experiments.”  Here, Han discloses sparsifying the sorted weights for each layer by pruning the connections between nodes.  Han discloses finding a sparsity constraint that can “fully recover or even increase the original accuracy of initial dense model under the sparsity constraint”.  Han discloses that the correct constraint can be found by tuning:  “The sparsity is the same for all the layers and can be tuned using validation. We find a sparsity value between 25% and 50% generally works well in our experiments.”  Thus, Han discloses a function of increasing sparsity levels, as Han has tried several values of sparsity constraints between 25% and 50%.)

As per claim 5, the combination of Luo, Erdogan, Han, and Kim as shown above teaches the method of claim 3.  Han teaches and further comprising increasing a density of the sparse DNN by adding connections between nodes while retraining the sparse DNN.  (Han, Abstract Lines 4-9, discloses adding connections while retraining a sparse DNN:  “In the first D (Dense) step, we train a dense network to learn connection weights and importance. In the S (Sparse) step, we regularize the network by pruning the unimportant connections with small weights and retraining the network given the sparsity constraint. In the final D (re-Dense) step, we increase the model capacity by removing the sparsity constraint, re-initialize the pruned parameters from zero and retrain the whole dense network.”)
It would have been obvious before the effective filing date of the claimed invention to combine the teachings of Han with Luo, Erdogan, and Kim for at least the reasons recited in Claim 1.

As per Claim 6, the combination of Luo, Erdogan, Han, and Kim as shown above teaches the method of claim 5 as well as sorted weights and optimizing the DNN (see Rejection to Claim 1).  Luo teaches wherein the [sorted] weights are imported into a parameter matrix for optimizing the DNN. (Recall above Han teaches sorted weights.  One of ordinary skill in the art will appreciate that neural network values are typically calculated with the parameters/weights in a matrix structure.  Thus, Luo teaches in Page 2 Section 2 Para 2:  “In most deep models, the parameters of each layer form a large and dense matrix, which leads to both storage and computational difficulties.”)

As per claim 7, the combination of Luo, Erdogan, Han, and Kim as shown above teaches the method of claim 1.  Luo teaches optimizing the DNN as a function of the neural entropies (Luo discloses removing connections based on entropies.  Luo, Page 4 Section 3.2 Para 1 Last Sentence, discloses:  “Our goal is to prune the filters Wi”.  Luo, Figure 1, shows the filters Wi as being connections between layers, which comprise nodes.  Luo, Page 4 Section 3.2 under Eq 1, discloses:  “Where, pi is the probability of bin i, Hj is the entropy of channel j. In general, if some layers are weak enough, e.g., most of their activation are zeros, their entropy are relatively small. Hence, our entropy-based method can be used for evaluating the importance of each channel. A smaller score of Hj means channel j is less important in this layer, thus could be removed.”  Here, Luo discloses that the pruning is done based on entropies.)
However, Luo fails to teach wherein optimizing the DNN comprises regularization of the DNN during retraining. 
Han teaches wherein optimizing the DNN comprises regularization of the DNN during retraining. (Han, Abstract Lines 2-7, discloses removing unimportant connections (i.e., nuisance variables) while retraining the DNN:  “We propose DSD, a dense-sparse-dense training flow, for regularizing deep neural networks and achieving better optimization performance. In the first D (Dense) step, we train a dense network to learn connection weights and importance. In the S (Sparse) step, we regularize the network by pruning the unimportant connections with small weights and retraining the network given the sparsity constraint.”)
It would have been obvious before the effective filing date of the claimed invention to combine the teachings of Han with Luo, Erdogan, and Kim for at least the reasons recited in Claim 1.
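For illustration only, the entropy score quoted from Luo above (p_i is the probability of bin i, H_j is the entropy of channel j) can be sketched as follows. This is a minimal sketch and is not code from Luo; the function name `channel_entropy`, the equal-width binning, the default of four bins, and the assumption that the input holds one activation value per sample for a single channel are all choices made for the example.

```python
import math


def channel_entropy(activations, num_bins=4):
    """Entropy H_j = -sum_i p_i * log(p_i) of one channel's activation values.

    `activations` is assumed to hold one value per sample for channel j.
    Bins are equal-width between the minimum and maximum value (an assumption).
    """
    lo, hi = min(activations), max(activations)
    if hi == lo:
        return 0.0  # every value lands in a single bin, so the entropy is zero
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for a in activations:
        # Clamp so that the maximum value falls into the last bin.
        idx = min(int((a - lo) / width), num_bins - 1)
        counts[idx] += 1
    total = len(activations)
    # Empty bins contribute nothing to the sum.
    return -sum((c / total) * math.log(c / total) for c in counts if c)


# A channel whose activations are mostly zero scores lower than a spread-out one.
mostly_zero = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
spread_out = [0.0, 1.0, 2.0, 3.0]
print(channel_entropy(mostly_zero), channel_entropy(spread_out))
```

The comparison at the end mirrors the quoted observation that channels with mostly zero activations have relatively small entropy and are candidates for removal.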

As per claim 8, the combination of Luo, Erdogan, Han, and Kim teaches the method of claim 7.  Luo teaches entropic thresholding. (Luo discloses removing connections based on entropies.  Luo, Page 4 Section 3.2 Para 1 Last Sentence, discloses:  “Our goal is to prune the filters Wi”.  Luo, Page 4 Figure 1, shows the filters Wi as being connections between layers, which comprise nodes.  Luo, Page 4 Section 3.2 under Eq 1, discloses:  “Where, pi is the probability of bin i, Hj is the entropy of channel j. In general, if some layers are weak enough, e.g., most of their activation are zeros, their entropy are relatively small. Hence, our entropy-based method can be used for evaluating the importance of each channel. A smaller score of Hj means channel j is less important in this layer, thus could be removed.”  Here, Luo discloses that the pruning is done based on entropies.)
However, Luo does not teach wherein regularization comprises: reducing a dimensionality of a DNN.
Han teaches wherein regularization comprises: reducing a dimensionality of a DNN (Han, Abstract Lines 5-7, discloses regularization to prune (i.e., reduce dimensionality of) the DNN:  “In the S (Sparse) step, we regularize the network by pruning the unimportant connections with small weights and retraining the network given the sparsity constraint”.)
It would have been obvious before the effective filing date of the claimed invention to combine the teachings of Han with Luo, Erdogan, and Kim for at least the reasons recited in Claim 1.

As per claim 9, the combination of Luo, Erdogan, Han, and Kim teaches the method of claim 7.  Luo teaches wherein regularization comprises: pruning connections between nodes based on the neural entropies (Luo discloses removing connections based on entropies.  Luo, Page 4 Section 3.2 Para 1 Last Sentence, discloses:  “Our goal is to prune the filters Wi”.  Luo, Figure 1, shows the filters Wi as being connections between layers, which comprise nodes.  Luo, Page 4 Section 3.2 under Eq 1, discloses:  “Where, pi is the probability of bin i, Hj is the entropy of channel j. In general, if some layers are weak enough, e.g., most of their activation are zeros, their entropy are relatively small. Hence, our entropy-based method can be used for evaluating the importance of each channel. A smaller score of Hj means channel j is less important in this layer, thus could be removed.”  Here, Luo discloses that the pruning is done based on entropies.)
However, Luo does not teach pruning least important connections between nodes to induce network sparsity; fine tuning the DNN after pruning by sparsely retraining the DNN; removing a sparsity constraint; and retraining the DNN while including all the removed connections between nodes.
Han teaches pruning least important connections between nodes to induce network sparsity (Han, Abstract Lines 5-7, discloses pruning least important connections to induce network sparsity:  “In the S (Sparse) step, we regularize the network by pruning the unimportant connections with small weights and retraining the network given the sparsity constraint.”)
fine tuning the DNN after pruning by sparsely retraining the DNN; (Luo, Page 5 Section 3.4 Right Column Paragraph 2 Lines 2-4, discloses fine tuning (i.e., retraining) the network after pruning (i.e., while network is sparse):  “Whenever one layer is pruned, the whole network is fine-tuned with one or two epochs to recover its performance slightly.”)
removing a sparsity constraint;  (Han, Abstract Lines 7-9, discloses removing a sparsity constraint:  “In the final D (re-Dense) step, we increase the model capacity by removing the sparsity constraint, re-initialize the pruned parameters from zero and retrain the whole dense network.”)
and retraining the DNN while including all the removed connections between nodes. (Han, Abstract Lines 7-9, discloses retraining a network after bringing back removed connections:  “In the final D (re-Dense) step, we increase the model capacity by removing the sparsity constraint, re-initialize the pruned parameters from zero and retrain the whole dense network.”)
It would have been obvious before the effective filing date of the claimed invention to combine the teachings of Han with Luo, Erdogan, and Kim for at least the reasons recited in Claim 1.
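For illustration only, the dense-sparse-dense flow quoted from Han above (train densely, prune small weights and retrain under the sparsity constraint, then remove the constraint and retrain with the pruned parameters restarted from zero) can be sketched on a toy linear model. This is a minimal sketch and is not code from Han; the names `train` and `dsd`, the squared-error objective, the learning rate, the epoch count, and the toy data are all assumptions made for the example.

```python
def train(weights, data, mask=None, lr=0.1, epochs=200):
    """Gradient descent on mean squared error for a linear model y = w . x.

    If `mask` is given, masked-out weights are forced back to zero after every
    step, which plays the role of the sparsity constraint in this toy setting.
    """
    w = list(weights)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for i, xi in enumerate(x):
                grad[i] += 2 * err * xi / len(data)
        w = [wi - lr * g for wi, g in zip(w, grad)]
        if mask is not None:
            w = [wi * m for wi, m in zip(w, mask)]
    return w


def dsd(data, n_weights, sparsity=0.5):
    """Run the three phases and return the weights after each one."""
    # D: dense training from scratch.
    dense = train([0.0] * n_weights, data)
    # S: prune the smallest-magnitude weights, then retrain under the mask.
    n_prune = int(n_weights * sparsity)
    order = sorted(range(n_weights), key=lambda i: abs(dense[i]))
    mask = [1] * n_weights
    for i in order[:n_prune]:
        mask[i] = 0
    sparse = train([w * m for w, m in zip(dense, mask)], data, mask=mask)
    # D: constraint removed; pruned weights start from zero and all weights train.
    redense = train(sparse, data)
    return dense, mask, sparse, redense


# Toy data generated by y = 2.0 * x0 + 0.1 * x1 (hypothetical).
data = [((1.0, 0.0), 2.0), ((0.0, 1.0), 0.1), ((1.0, 1.0), 2.1)]
dense, mask, sparse, redense = dsd(data, n_weights=2)
print(mask, sparse, redense)
```

In this toy run the small second weight is pruned in the sparse phase and is recovered once the mask is dropped in the final phase.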

As per claim 11, the combination of Luo, Erdogan, Han, and Kim teaches the method of claim 1.  Han teaches wherein optimizing the DNN comprises removing nuisance variables within the DNN as a function of the determined entropies while retraining the DNN. (Han, Abstract Lines 2-7, discloses removing unimportant connections (i.e., nuisance variables) while retraining the DNN:  “We propose DSD, a dense-sparse-dense training flow, for regularizing deep neural networks and achieving better optimization performance. In the first D (Dense) step, we train a dense network to learn connection weights and importance. In the S (Sparse) step, we regularize the network by pruning the unimportant connections with small weights and retraining the network given the sparsity constraint.”)
It would have been obvious before the effective filing date of the claimed invention to combine the teachings of Han with Luo, Erdogan, and Kim for at least the reasons recited in Claim 1.

As per claim 12, the combination of Luo, Erdogan, Han, and Kim teaches the method of claim 1.  Luo teaches wherein optimizing the DNN comprises determining a size of each layer of the DNN (Luo, Page 4 Section 3.2 Right Column Last Paragraph, discloses determining a size of each layer: “All the filters are sorted in the descending order according to their entropy scores, and only the top k filters are preserved.”)
However, Luo does not teach wherein optimizing the DNN comprises guiding training of the DNN.
Han teaches wherein optimizing the DNN comprises guiding training of the DNN to determine a size of each layer of the DNN. (Han, Abstract Lines 2-7, discloses removing unimportant connections (i.e., nuisance variables) while training the DNN:  “We propose DSD, a dense-sparse-dense training flow, for regularizing deep neural networks and achieving better optimization performance. In the first D (Dense) step, we train a dense network to learn connection weights and importance. In the S (Sparse) step, we regularize the network by pruning the unimportant connections with small weights and retraining the network given the sparsity constraint.”)
It would have been obvious before the effective filing date of the claimed invention to combine the teachings of Han with Luo, Erdogan, and Kim for at least the reasons recited in Claim 1.

As per claim 13, claim 13 is a device claim corresponding to method claim 1. The difference is that the device claim recites a memory and a processor.  (Luo, Abstract Lines 7-10, discloses a memory:  “Our method can reduce the size of intermediate activations, which would dominate most memory footprint during model training stage but is less concerned in previous compression methods.”  Luo, Page 8 Section 4.4 Right Column Last Paragraph, discloses that a processor is used:  “Since the parameters and FLOPs of our pruned model have been dramatically reduced, we think this accuracy degradation is acceptable.”  Note that FLOP stands for Floating Point Operation, and is a term directed to the amount of operations executed by a processor.)  Claim 13 is rejected for the same reasons as claim 1.

As per claim 15, claim 15 is a device claim corresponding to method claim 3. The difference is that the device claim recites a memory and a processor.  (Luo, Abstract Lines 7-10, discloses a memory:  “Our method can reduce the size of intermediate activations, which would dominate most memory footprint during model training stage but is less concerned in previous compression methods.”  Luo, Section 4.4 Right Column Last Paragraph, discloses that a processor is used:  “Since the parameters and FLOPs of our pruned model have been dramatically reduced, we think this accuracy degradation is acceptable.”  Note that FLOP stands for Floating Point Operation, and is a term directed to the amount of operations executed by a processor.)  Claim 15 is rejected for the same reasons as claim 3.

As per claim 16, claim 16 is a device claim corresponding to method claim 8. The difference is that the device claim recites a memory and a processor.  (Luo, as shown above, discloses a memory and a processor.)  Claim 16 is rejected for the same reasons as claim 8.

As per claim 18, claim 18 is a computer readable medium claim corresponding to method claim 1. The difference is that the computer readable medium claim recites a computer readable medium and a processor.  (Luo, Introduction Paragraph 2, discloses that their method is directed to minimizing storage space to overcome deployment on machine readable media on devices:  “In spite of its great success, a typical deep model is hard to be deployed on resource constrained devices, e.g., mobile phones and embedded gadgets. A resource constrained scenario means a computing task must be accomplished with limited resource supply, such as computing time, storage space, battery power, computing capability and so on. One of the main issues of deep neural networks is its huge computational and storage overhead, which constitutes a serious challenge for a mobile device with limited computing resource.”  Luo, Section 4.4 Right Column Last Paragraph, discloses that a processor is used:  “Since the parameters and FLOPs of our pruned model have been dramatically reduced, we think this accuracy degradation is acceptable.”  Note that FLOP stands for Floating Point Operation, and is a term directed to the amount of operations executed by a processor.)  Claim 18 is rejected for the same reasons as claim 1.

As per claim 20, claim 20 is a computer readable medium claim corresponding to method claim 3. The difference is that the computer readable medium claim recites a computer readable medium and a processor.  Claim 20 is rejected for the same reasons as claim 3.

As per claim 21, the combination of Luo, Erdogan, Han, and Kim teaches the method of claim 1.  Han teaches wherein the DNN comprises a fully connected DNN. (Han, Figure 1, discloses starting with a fully connected DNN:

    [Han, Figure 1 (image not reproduced)]

Furthermore, Han discloses on Page 7 Para 2, fully connected layers of a non-convolutional DNN (an LSTM):  “In the sparse phase, weights are pruned in the Fully Connected layers and the Bidirectional Recurrent layer only (they are the majority of the weights.)”)
It would have been obvious before the effective filing date of the claimed invention to combine the teachings of Han with Luo, Erdogan, and Kim for at least the reasons recited in Claim 1.

As per claim 22, claim 22 is a device claim corresponding to method claim 21. The difference is that the device claim recites a memory and a processor.  Claim 22 is rejected for the same reasons as claim 21.

As per claim 23, claim 23 is a computer readable medium claim corresponding to method claim 21. The difference is that the computer readable medium claim recites a computer readable medium and a processor.  Claim 23 is rejected for the same reasons as claim 21.

Claims 10 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Luo, Erdogan, Han, and Kim, further in view of Majumdar et al. (US PGPub 2014/0046885 A1; hereinafter “Majumdar”).
As per claim 10, the combination of Luo, Erdogan, Han, and Kim as shown above teaches the method of claim 1. Luo teaches wherein optimizing the DNN comprises: determining a maximum pruning rate for each layer of the DNN while enforcing a total number of parameters in each layer and [a number of bits to represent each parameter] (Luo, Page 4 Section 3.2 Right Column Last Paragraph, discloses determining a compression rate (i.e., maximum pruning rate):  “The next issue is how to decide the pruning boundary. One feasible method is to specify a threshold value, all channels with score below this threshold are removed from the network. However, this threshold value is a hyperparameter, which is hard to be specified. Another more practical method is using a constant compression rate. All the filters are sorted in the descending order according to their entropy scores, and only the top k filters are preserved. Of course, the corresponding channels in Wi+1 are removed too.”  Luo, Page 5 Section 3.4 Right Column Final Paragraph Lines 6-7, indicates that this is done for each layer of the DNN:  “Only after the final layer has been pruned, the network is fine-tuned carefully with many epochs.” Examiner’s Note:  Here, enforcing a total number of parameters for each layer is indicated by “only the top k filters are preserved.”) *Majumdar below teaches a number of bits to represent each parameter.
pruning layers of the DNN in accordance with the maximum pruning rate;  (Luo, Page 4 Section 3.2 Right Column Last Paragraph Lines 6-8, discloses leaving a fixed number of filters behind (i.e., a maximum pruning rate):  “All the filters are sorted in the descending order according to their entropy scores, and only the top k filters are preserved.”)
and re-training the pruned DNN. (Luo, Page 5 Section 3.4 Right Column Paragraph 2 Lines 2-4, discloses fine tuning (i.e., retraining) after pruning:  “Whenever one layer is pruned, the whole network is fine-tuned with one or two epochs to recover its performance slightly.”)
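For illustration only, the constant-compression-rate selection quoted from Luo above (sort the filters in descending order of entropy score and preserve only the top k) can be sketched as follows. This is a minimal sketch and is not code from Luo; the function name `keep_top_k`, the reading of the compression rate as the fraction of filters removed, and the example scores are assumptions made for the example.

```python
def keep_top_k(entropy_scores, compression_rate):
    """Return the indices of the filters that survive pruning, in ascending order.

    `compression_rate` is assumed to be the fraction of filters removed, so
    k = n - int(n * compression_rate) filters are preserved in the layer.
    """
    if not 0.0 <= compression_rate < 1.0:
        raise ValueError("compression_rate must be in [0, 1)")
    n = len(entropy_scores)
    k = n - int(n * compression_rate)
    # Rank filter indices by entropy score, highest first.
    ranked = sorted(range(n), key=lambda j: entropy_scores[j], reverse=True)
    return sorted(ranked[:k])


# Hypothetical entropy scores for six filters in one layer.
scores = [1.2, 0.1, 0.9, 0.05, 1.5, 0.4]
print(keep_top_k(scores, compression_rate=0.5))
```

Because k is fixed by the rate rather than by a score threshold, the number of filters (and hence parameters) remaining in the layer is known in advance.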
The combination of Luo, Erdogan, Han, and Kim fails to teach a number of bits to represent each parameter.  Majumdar teaches a number of bits to represent each parameter. (Majumdar, Para [0004] First Sentence, discloses a number of bits to represent each parameter:  “Neural signals and parameters of a neural system (e.g., synaptic weights, neural states, etc) can be represented in quantized form with a pre-defined bit precision and stored in a system memory for further use.”)
Luo, Erdogan, Han, Kim, and Majumdar are analogous art because they are all directed to optimizing neural networks.  It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the entropy-based neural network pruning of the combination of Luo, Erdogan, Han, and Kim, with the bit sizes for parameters of Majumdar.  The modification would have been obvious because one of ordinary skill in the art would be motivated to save the memory space of the neural system (Majumdar: Para [0004]).
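For illustration only, representing a parameter in quantized form with a pre-defined bit precision, as in the passage quoted from Majumdar above, can be sketched as follows. This is a minimal sketch and is not taken from Majumdar; the uniform quantization scheme, the value range of [-1, 1], and the names `quantize`, `dequantize`, and `storage_bits` are assumptions made for the example.

```python
def quantize(value, num_bits, lo=-1.0, hi=1.0):
    """Map a real value to an integer code using `num_bits` bits (uniform grid)."""
    levels = (1 << num_bits) - 1          # highest integer code
    clipped = min(max(value, lo), hi)     # values outside the range saturate
    return round((clipped - lo) / (hi - lo) * levels)


def dequantize(code, num_bits, lo=-1.0, hi=1.0):
    """Recover the approximate real value represented by an integer code."""
    levels = (1 << num_bits) - 1
    return lo + code / levels * (hi - lo)


def storage_bits(num_parameters, num_bits):
    """Memory needed for a layer: parameter count times bits per parameter."""
    return num_parameters * num_bits


weight = 0.3
code = quantize(weight, num_bits=8)
print(code, dequantize(code, num_bits=8), storage_bits(1000, 8))
```

The `storage_bits` helper shows why the parameter count per layer and the bits per parameter together bound the memory footprint.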

As per claim 17, claim 17 is a device claim corresponding to method claim 10. The difference is that the device claim recites a memory and a processor.  (Luo, Abstract Lines 7-10, discloses a memory:  “Our method can reduce the size of intermediate activations, which would dominate most memory footprint during model training stage but is less concerned in previous compression methods.”  Luo, Section 4.4 Right Column Last Paragraph, discloses that a processor is used:  “Since the parameters and FLOPs of our pruned model have been dramatically reduced, we think this accuracy degradation is acceptable.”  Note that FLOP stands for Floating Point Operation, and is a term directed to the amount of operations executed by a processor.)  Claim 17 is rejected for the same reasons as claim 10.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Sze et al. (“Efficient Processing of Deep Neural Networks: A Tutorial and Survey”) discloses on Page 7 Bottom Left: “To align the terminology of CNNs with the generic DNN, filters are composed of weights (i.e., synapses)”
Rastegari et al. (“XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”) discloses on Page 1:  “In computer vision, a particular type of DNN, known as Convolutional Neural Networks (CNN), have demonstrated state-of-the-art results in object recognition and detection.”
Han et al. (“Learning both Weights and Connections for Efficient Neural Networks”) discloses pruning weights/connections, as disclosed in Abstract:  “First, we train the network to learn which connections are important. Next, we prune the unimportant connections. Finally, we retrain the network to fine tune the weights of the remaining connections.”
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to LEONARD A SIEGER whose telephone number is (571)272-9710. The examiner can normally be reached M-F 8:00 am - 5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ann Lo, can be reached at (571) 272-9767. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/L.A.S./Examiner, Art Unit 2126   
/ANN J LO/Supervisory Patent Examiner, Art Unit 2126