Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 01/03/2018 is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.
Response to Amendment
The amendment filed 2021-05-07 has been entered. Claims 1-20 remain pending in the application. 
Applicant’s amendment to the Drawings has overcome the objection previously set forth in the Non-Final Office Action mailed 2021-03-02.
Applicant’s amendment to the Specification has overcome the immediate objection, but has resulted in another objection, see “Specification” section below.  
Applicant’s amendments to the claims have overcome some of the objections, but some objections remain.  See “Claim Objections” below.
Applicant’s amendment to claim 18 has not fully overcome the 101 rejection, as Applicant has amended to “computer-readable medium”, and Specification [0074] states:  “The terms computer-readable medium and storage device do not include carrier waves to the extent carrier waves are deemed too transitory”.  The claim must unequivocally exclude signals, and not just “to the extent” that they are “deemed too transitory”.  Amending the claim to “non-transitory computer-readable medium” will overcome this rejection.

Response to Arguments
In response to Applicant’s argument that Luo’s 1xc vector is not a “spreading signal”, one cannot show nonobviousness by attacking references individually where the rejections are based on combinations of references.  See In re Keller, 642 F.2d 413, 208 USPQ 871 (CCPA 1981); In re Merck & Co., 800 F.2d 1091, 231 USPQ 375 (Fed. Cir. 1986).  In the Office Action, Examiner relies on Erdogan for the second limitation of Claim 1 that specifies the nature of the “spreading signal”.  In the rejection of the third limitation of Claim 1 over Luo, Luo’s 1xc vector is fulfilling the functional role (and not necessarily the definition) of the “spreading signal”, which is to be calculated based on input activations and weights and then to be sampled through iterations to form a probability distribution from which an entropy is calculated, said entropy being used to prune connections (Luo’s “filters”, which lie between layers:  “Our goal is to prune the filters Wi”) between nodes. The 1xc vector does correspond to input activations, as it is the result of an average of a cxhxw tensor (“We first use global average pooling to convert the output of layer i, which is a cxhxw tensor, into a 1xc vector”) wherein the tensor is calculated as the result of a convolution operation that comprises multiplications between input activations and weights (“We use a triplet <Ii; Wi; *> to denote the convolution in layer i, where Ii e Rcxhxw is the input tensor, Wi e Rdxcxkxk is a set of filter weights, * denote the convolution operation.”). 
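For illustration only, the computation described in the quoted Luo passages (global average pooling of a layer output into one score per channel, collection of those scores over an evaluation set, binning, and the entropy sum) can be sketched as follows. The function names, the bin count of 10, the natural logarithm, and the random test data are assumptions made for this sketch and are not taken from Luo or from the instant application.

```python
import numpy as np

def global_average_pool(feature_map):
    """Collapse a (c, h, w) tensor into a length-c vector of channel means."""
    return feature_map.mean(axis=(1, 2))

def channel_entropies(score_matrix, num_bins=10):
    """Histogram entropy of each column of an (n, c) score matrix."""
    n, c = score_matrix.shape
    entropies = np.zeros(c)
    for j in range(c):
        # Divide the n scores of channel j into bins and estimate bin probabilities.
        counts, _ = np.histogram(score_matrix[:, j], bins=num_bins)
        p = counts / n
        p = p[p > 0]  # empty bins contribute nothing (0 * log 0 is taken as 0)
        entropies[j] = -np.sum(p * np.log(p))
    return entropies

# Hypothetical evaluation set: n layer outputs, each of shape (c, h, w).
rng = np.random.default_rng(0)
n, c, h, w = 200, 4, 8, 8
outputs = rng.normal(size=(n, c, h, w))
outputs[:, 0] = 0.0  # a channel that gives the same response for every input

M = np.stack([global_average_pool(x) for x in outputs])  # (n, c) matrix
H = channel_entropies(M)                                 # one entropy per channel
```

In this sketch the constant channel places all of its scores in a single bin and so receives an entropy of zero, which is the kind of channel the quoted passage describes as less important.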
In response to Applicant’s argument that adding probabilities in bins does not relate to the area covered by the spreading signal, and that no analysis was done for this, Examiner will reiterate and expand on what was stated in the Office Action.  Note that Examiner was i), pi log pi, which is the area under a function of the probability distribution, and a discrete version of the continuous integral in Instant Specification [0042].
In response to Applicant’s argument that Luo fails to teach “an exponent of an area covered by the spreading signal”, this is also inconsistent with Specification [0042], as what is calculated is the area under f(x)*log(f(x)), which is the area covered by [the probability distribution of the spreading signal multiplied by the logarithm of the probability distribution of the spreading signal].  There is no exponent.  Regardless, Examiner could still map the prior art to this, as since there is no specific exponent detailed in the Specification, Examiner can assume an exponent of 1, resulting in the same value.
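The relationship between the continuous area and the binned sum discussed above can be written out as a short worked equation. This is a sketch and not a reproduction of the formula in Specification [0042]: the symbol A, the bin width Δ, and the equal-width-bin approximation are assumptions introduced here.

```latex
% Area under f log f for a probability density f
A = \int f(x)\,\log f(x)\,dx

% Luo's binned entropy for channel j, as quoted in this Office Action
H_j = -\sum_{i=1}^{m} p_i \log p_i

% Assumption: equal-width bins of width \Delta, so that p_i \approx f(x_i)\,\Delta
\sum_{i=1}^{m} p_i \log p_i \;\approx\; \sum_{i=1}^{m} f(x_i)\,\log f(x_i)\,\Delta \;+\; \log\Delta

% An exponent of 1 leaves the value unchanged
A^{1} = A
```

Under that assumption the binned sum is a Riemann-sum approximation of the area, offset by a constant that depends only on the bin width.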
In response to Applicant’s argument that Erdogan does not perform an element-wise multiplication of input activations with corresponding weights to determine a spreading signal, Examiner respectfully disagrees.  Erdogan, in Equation 2, sums up VikXk, where Xk is an “input pattern” and Vik is a weight.  It is true that Erdogan uses this sum to plug into the next layer’s activation function to determine the activations of the next layer, but this does not change the fact that Xk, which is the input activation from the previous layer (and which Erdogan calls “input pattern”), is multiplied by each of the weights Vik, and this is done from each previous node k to next node i.  In fact, this is true of every neural network, and many references would have worked, but Erdogan was chosen because it is more analogous in that it applies to entropy.  It is a basic property of all neural networks that the so-called “spreading signal” is intermediately calculated, as in order to calculate the activations of the next layer, one must multiply the previous layer’s activations with the weights of respective connections in an element-wise fashion.
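The intermediate products referred to above can be made concrete with a small numerical sketch. The layer sizes and the values of X and V below are hypothetical, and the variable names are chosen here for illustration; only the relation neti = Sum over k of Vik Xk comes from the quoted Erdogan passage.

```python
import numpy as np

# Hypothetical sizes: 3 input nodes (index k) and 2 hidden nodes (index i).
X = np.array([0.5, -1.0, 2.0])      # X_k, activations of the previous layer
V = np.array([[0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6]])     # V_ik, weight from input node k to hidden node i

# Element-wise products, one per connection: products[i, k] = V_ik * X_k.
products = V * X

# Summing over k gives net_i, the value passed to the activation function.
net = products.sum(axis=1)
```

Here `products` holds one value per connection between the two layers, and `net` is the same result a matrix-vector product `V @ X` would give.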

Specification
The disclosure is objected to because of the following informality: 
Specification Para [0040] states:  “A low variance, for example variance 262, implies a considerable uncertainty in a particular connection meaning that firing of that connection is highly dependent on the data that passes through, whereas a low variance indicates that a particular connection is always on or off regardless of the data (low amount of information is carried through that connection).”  Applicant here states “A low variance implies a considerable uncertainty…whereas a low variance…,” which is not true and also repeats “low variance”.  Examiner recommends the following correction:  “A high variance, for example variance 264…”  Appropriate correction is required.

Claim Objections
Claim 9 is objected to because of the following informalities: In line 4, Claim 9 recites the limitation "the network". There is insufficient antecedent basis for this limitation in the claim.  Examiner suggests "the DNN".  Appropriate correction is required.
Claim 12 is objected to because of the following informalities:  Line 2 of Claim 12 recites the limitation "the multi-layer DNN".  There is insufficient antecedent basis for this limitation in the claim.  Examiner suggests "the DNN".  Appropriate correction is required.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claim 18 is rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter.  The claim does not fall within at least one of the four categories of patent eligible subject matter because Claim 18 recites a computer-readable medium, and Specification [0074] states:  “The terms computer-readable medium and storage device do not include carrier waves to the extent carrier waves are deemed too transitory”.  The claim must unequivocally exclude signals, and not just “to the extent” that they are “deemed too transitory”.  The broadest reasonable interpretation (BRI) of computer-readable medium in this case may encompass non-statutory transitory forms of signal transmission, such as a propagating electrical or electromagnetic signal per se.  Therefore, the claim is not patent eligible subject matter.  See MPEP 2106.03 (II).


Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1, 13, 18, and 19 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.
Claims 1, 13, and 18 recite the limitation “determining a spreading signal for each connection between nodes in multiple adjacent layers of the DNN wherein the spreading signal is an element-wise multiplication of input activations between the nodes in a first layer to nodes in a second next layer with corresponding weights of connections between such nodes.”  Applicant states a “spreading signal for each connection”, implying multiple scalar values, but then recites “element-wise multiplication”, implying a single matrix. Also, it is unclear to the examiner what is meant by “Input activations between the neurons in a first layer to neurons in a second next layer”, as there is no “…and” after the “between”, so it is unclear if an implied relationship is intra- or inter- layer.  “Input activations” is also unclear to the examiner because if the relationship is inter-layer, the activations are outbound from the first layer, rather than 
Claims 1, 13, and 18 recite the limitation “calculating an exponent of an area covered by the spreading signal.” It is unclear what is meant by “an area covered by the spreading signal” because the “spreading signal” is defined as an element-wise multiplication, which is a matrix.  An area of a matrix is undefined.  Examiner is interpreting the limitation as “calculating an area bounded by a curve of a function of a probability distribution of each element of the spreading signal.”

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 2, 3, 4, 13, 14, 18, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Luo et al. (“An Entropy-based Pruning Method for CNN Compression”; hereinafter “Luo”), in view of Erdogan et al. (“Measurement Criteria for Neural Network Pruning”; hereinafter “Erdogan”).
As per claim 1, Luo teaches A computer implemented method of optimizing a neural network (Luo, Introduction Page 2 Left Column First Bullet, discloses “A simple yet effective framework is proposed, to accelerate and compress CNN models in both training and inference stage. Our method can compress the size of intermediate activations, reducing the run-time memory consumption dramatically, which is less concerned in previous works.”), the method including operations comprising:
obtaining a deep neural network (DNN) trained with a training dataset;  (Luo, Introduction 2nd Paragraph Line 7, discloses that the method is directed to deep neural networks: “One of the main issues of deep neural networks is its huge computational  and storage overhead.”  Luo, Section 3.2 Left Column 3rd Paragraph Line 4, implies an already trained neural network:  “In order to calculate the entropy, more output values need to be collected, which can be obtained using an evaluation set. In practice, the evaluation set can be simply the original training set, or a subset of it.”)
and determining neural entropies of respective connections between nodes by calculating an exponent of an area covered by the spreading signal. (Examiner is interpreting “calculating an exponent of an area covered by the spreading signal” as “calculating an area bounded by a curve of a function of a probability distribution of each element of the spreading signal.”  Luo, Section 3.2 Left Column Paragraph 3 discloses “We first use global average pooling to convert the output of layer i, which is a c x h x w tensor, into a 1 x c vector. In this way, each channel of Ii+1 (activation of layer i / input of layer i + 1) has a corresponding score for one image. In order to calculate the entropy, more output values need to be collected, which can be obtained using an evaluation set. In practice, the evaluation set can be simply the original training set, or a subset of it. Finally, we get a matrix M e Rn x c, where n is the number of images in the evaluation set, and c is the channel number. For each channel j, we would pay attention to the distribution of M:,j. To compute the entropy value of this channel, we first divide it into m different bins, and calculate the probability of each bin. Finally, the entropy can be calculated as follows:  Hj = -Sumi=1-m (pi log pi)”
Examiner’s Note:  Here, the 1 x c vector is currently representing the “spreading signal”.  Luo runs n samples from an evaluation set through the NN, thereby constructing an n x c matrix, which is just how Luo is storing the n samples of each of the c elements of the spreading signal.  For each element of the spreading signal, the n elements are used to construct a probability distribution by dividing the values into bins.  Finally, a function of the probability distribution is calculated (f(p(x)) = p(x) * log (p(x)) and the area under the curve of this function is calculated by summing up the function values for each bin.  The result of this calculation of the area under the curve of this function is the entropy, as noted by Luo (“Finally, the entropy can be calculated as follows:  Hj = -Sumi=1-m (pi log pi)”).  This entropy is used to prune filters which act as connections between nodes (“Our goal is to prune the filters Wi”), and Luo is thus determining entropies of connections between nodes.  Also, an “exponent” of this area is calculated, as this area raised to the power of 1 is the same value.)
in multiple adjacent layers of the DNN (Luo, Section 3.3 Para 2, discloses:  “In order to reduce the running time, we prune the first 10 convolutional layers via our entropy-based method.”)
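Luo does not explicitly teach determining a spreading signal for each connection between nodes in adjacent layers of the DNN wherein the spreading signal is an element-wise multiplication of input activations between the nodes in a first layer to nodes in a second next layer with corresponding weights of connections between such neurons.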
Erdogan teaches determining a spreading signal for each connection between nodes in adjacent layers of the DNN wherein the spreading signal is an element-wise multiplication of input activations between the nodes in a first layer to nodes in a second next layer with corresponding weights of connections between such neurons (Examiner is interpreting the limitation as  “determining a spreading signal for a pair of adjacent layers of the DNN, wherein the spreading signal is an element-wise multiplication of activations sent along connections from the nodes in a first layer to nodes in a second next layer with corresponding weights of the connections between such nodes”.  Erdogan, Section 2 Page 84 top of left column, discloses:  “the activity of a hidden node i, is obtained by Yi = Ai(neti) where Ai is the activation function and neti, is defined by neti = Sumk=1-Nk (Vik Xk) where Xk is an input pattern, Vik is the weight connection from input node k to hidden node i and Nk is the number of input nodes. The activity Yi is normalized as follows: pi = Yi / Sumj=1-Nk (Yj) where Nk is the number of hidden nodes.  By using this normalized activity, an entropy function can be formulated by Hj = -Sumi=1-Nk (pi log pi).”  
Examiner’s Note:  Here “input pattern” is the activation from the previous layer’s node, and Vik Xk is an element of the said element-wise multiplication of activations and weights, and is thus an element of the “spreading signal”. This spreading signal is subsequently used by Erdogan in a series of calculations that results in entropy used for pruning nodes of the neural network.)
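The chain of calculations in the quoted Erdogan passage (weighted sum, activation function, normalization, entropy) can be sketched as follows. The sigmoid activation, the natural logarithm, the function name, and the example values are assumptions made for this sketch; the quoted passage does not fix the activation function Ai.

```python
import numpy as np

def hidden_activity_entropy(V, X):
    """Entropy of the normalized hidden-node activities for one input pattern."""
    net = V @ X                      # net_i = sum_k V_ik X_k
    Y = 1.0 / (1.0 + np.exp(-net))   # Y_i = A_i(net_i); sigmoid is an assumption
    p = Y / Y.sum()                  # p_i = Y_i / sum_j Y_j
    H = -np.sum(p * np.log(p))       # H = -sum_i p_i log p_i
    return H, p

# Hypothetical layer with 3 input nodes and 4 hidden nodes.
X = np.array([0.5, -1.0, 2.0])
V = np.array([[0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6],
              [-0.3, 0.1, 0.0],
              [0.2, -0.2, 0.5]])
H, p = hidden_activity_entropy(V, X)

# With all-zero weights every hidden node has the same activity,
# so the normalized activities are uniform and the entropy is log(4).
H_uniform, p_uniform = hidden_activity_entropy(np.zeros((4, 3)), np.ones(3))
```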


As per claim 2, the combination of Luo and Erdogan teaches the method of claim 1 and further comprising optimizing the DNN based on the determined neural entropies for the connections between nodes in the multiple adjacent layers.  (Luo, Intro Page 2 Left Column Bullet 1, discloses optimizing a CNN (a type of DNN): “A simple yet effective framework is proposed, to accelerate and compress CNN models in both training and inference stage. Our method can compress the size of intermediate activations, reducing the run-time memory consumption dramatically, which is less concerned in previous works.”  Luo, Section 3.3 Para 2, discloses multiple adjacent layers: “In order to reduce the running time, we prune the first 10 convolutional layers via our entropy-based method.”  Luo, Section 3.2 Right Column First Full Paragraph, discloses that the optimizing is based on neural entropies:  “Hence, our entropy-based method can be used for evaluating the importance of each channel. A smaller score of Hj means channel j is less important in this layer, thus could be removed.”)

As per claim 3, the combination of Luo and Erdogan teaches the method of claim 2 wherein optimizing the DNN comprises pruning the connections between nodes as a function of the neural entropies to create a sparse DNN.  (Luo, Section 3.2 Para 1 Last Sentence, discloses:  “Our goal is to prune the filters Wi”.  Luo, Figure 1, shows the filters Wi as being connections between layers, which comprise nodes.  Luo, Section 3.2 under Eq 1, discloses:  “Where, pi is the probability of bin i, Hj is the entropy of channel j. In general, if some layers are weak enough, e.g., most of their activation are zeros, their entropy are relatively small. Hence, our entropy-based method can be used for evaluating the importance of each channel. A smaller score of Hj means channel j is less important in this layer, thus could be removed.”  Here, Luo discloses that the pruning is done based on entropies.  Examiner’s Note:  Removing connections results in a sparse DNN.)

As per claim 4, the combination of Luo and Erdogan teaches the method of claim 3 and further comprising retraining the sparse DNN. (Luo, Section 3.4 Right Column Paragraph 2 Lines 2-4, discloses fine-tuning (i.e., retraining) the sparse DNN after pruning:  “Whenever one layer is pruned, the whole network is fine-tuned with one or two epochs to recover its performance slightly.”)

As per claim 13, claim 13 is a device claim corresponding to method claim 1. The difference is that the device claim recites a memory and a processor.  (Luo, Abstract Lines 7-10 discloses a memory:  “Our method can reduce the size of intermediate activations, which would dominate most memory footprint during model training stage but is less concerned in previous compression methods.”  Luo, Section 4.4 Right Column Last Paragraph, discloses that a processor is used:  “Since the parameters and FLOPs of our pruned model have been dramatically reduced, we think this accuracy degradation is acceptable.”  Note that FLOP stands for Floating Point Operation, and is a term directed to the amount of operations executed by a processor.  Claim 13 is rejected for the same reasons as claim 1.)

As per claim 14, claim 14 is a device claim corresponding to method claim 2. The difference is that the device claim recites a memory and a processor.  (Luo, as shown in claim 13, discloses a memory and a processor.)  Claim 14 is rejected for the same reasons as claim 2.

As per claim 18, claim 18 is a machine readable medium claim corresponding to method claim 1. The difference is that the machine readable medium claim recites a machine readable medium and a processor.  (Luo, Introduction Paragraph 2, discloses that their method is directed to minimizing storage space to overcome deployment on machine readable media on devices:  “In spite of its great success, a typical deep model is hard to be deployed on resource constrained devices, e.g., mobile phones and embedded gadgets. A resource constrained scenario means a computing task must be accomplished with limited resource supply, such as computing time, storage space, battery power, computing capability and so on. One of the main issues of deep neural networks is its huge computational and storage overhead, which constitutes a serious challenge for a mobile device with limited computing resource.”  Luo, Section 4.4 Right Column Last Paragraph, discloses that a processor is used:  “Since the parameters and FLOPs of our pruned model have been dramatically reduced, we think this accuracy degradation is acceptable.”  Note that FLOP stands for Floating Point Operation, and is a term directed to the amount of operations executed by a processor.  Claim 18 is rejected for the same reasons as claim 1.)

As per claim 19, claim 19 is a device claim, dependent upon a machine readable medium claim, corresponding to method claim 2. The difference is that the claim recites a machine readable medium, a memory and a processor.  (Luo, Introduction Paragraph 2, discloses that their method is directed to minimizing storage space to overcome deployment on machine readable media on devices:  “In spite of its great success, a typical deep model is hard to be deployed on resource constrained devices, e.g., mobile phones and embedded gadgets. A resource constrained scenario means a computing task must be accomplished with limited resource supply, such as computing time, storage space, battery power, computing capability and so on. One of the main issues of deep neural networks is its huge computational and storage overhead, which constitutes a serious challenge for a mobile device with limited computing resource.” Luo, Abstract Lines 7-10 discloses a memory:  “Our method can reduce the size of intermediate activations, which would dominate most memory footprint during model training stage but is less concerned in previous compression methods.”  Luo, Section 4.4 Right Column Last Paragraph, discloses that a processor is used:  “Since the parameters and FLOPs of our pruned model have been dramatically reduced, we think this accuracy degradation is acceptable.”  Note that FLOP stands for Floating Point Operation, and is a term directed to the amount of operations executed by a processor. Claim 19 is rejected for the same reasons as claim 2.)

Claims 5, 7, 8, 9, 11, 12, 15, 16, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Luo and Erdogan, further in view of Han et al. (“DSD: Dense-Sparse-Dense Training for Deep Neural Networks”; hereinafter “Han”).

As per claim 5, the combination of Luo and Erdogan as shown above teaches the method of claim 4.  However, the combination of Luo and Erdogan fails to teach and further comprising increasing a density of the sparse DNN by adding connections between nodes while retraining the sparse DNN.  
Han teaches and further comprising increasing a density of the sparse DNN by adding connections between nodes while retraining the sparse DNN.  (Han, Abstract Lines 4-9, discloses adding connections while retraining a sparse DNN:  “In the first D (Dense) step, we train a dense network to learn connection weights and importance. In the S (Sparse) step, we regularize the network by pruning the unimportant connections with small weights and retraining the network given the sparsity constraint. In the final D (re-Dense) step, we increase the model capacity by removing the sparsity constraint, re-initialize the pruned parameters from zero and retrain the whole dense network.”)
Luo, Erdogan, and Han are analogous art because they are all directed to optimizing neural networks.  It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the entropy-based neural network pruning of the combination 

As per claim 7, the combination of Luo and Erdogan as shown above teaches the method of claim 2.  Luo teaches optimizing the DNN as a function of the neural entropies (Luo discloses removing connections based on entropies.  Luo, Section 3.2 Para 1 Last Sentence, discloses:  “Our goal is to prune the filters Wi”.  Luo, Figure 1, shows the filters Wi as being connections between layers, which comprise nodes.  Luo, Section 3.2 under Eq 1, discloses:  “Where, pi is the probability of bin i, Hj is the entropy of channel j. In general, if some layers are weak enough, e.g., most of their activation are zeros, their entropy are relatively small. Hence, our entropy-based method can be used for evaluating the importance of each channel. A smaller score of Hj means channel j is less important in this layer, thus could be removed.”  Here, Luo discloses that the pruning is done based on entropies.)
However, the combination of Luo and Erdogan fails to teach wherein optimizing the DNN comprises regularization of the DNN during training. 
Han teaches wherein optimizing the DNN comprises regularization of the DNN during training. (Han, Abstract Lines 2-3, discloses regularization during training:  “We propose DSD, a dense-sparse-dense training flow, for regularizing deep neural networks and achieving better optimization performance”.)

As per claim 8, the combination of Luo, Erdogan, and Han teaches the method of claim 7.  Luo teaches entropic thresholding. (Luo discloses removing connections based on entropies.  Luo, Section 3.2 Para 1 Last Sentence, discloses:  “Our goal is to prune the filters Wi”.  Luo, Figure 1, shows the filters Wi as being connections between layers, which comprise nodes.  Luo, Section 3.2 under Eq 1, discloses:  “Where, pi is the probability of bin i, Hj is the entropy of channel j. In general, if some layers are weak enough, e.g., most of their activation are zeros, their entropy are relatively small. Hence, our entropy-based method can be used for evaluating the importance of each channel. A smaller score of Hj means channel j is less important in this layer, thus could be removed.”  Here, Luo discloses that the pruning is done based on entropies.)
However, Luo does not teach wherein regularization comprises: reducing a dimensionality of a DNN.
Han teaches wherein regularization comprises: reducing a dimensionality of a DNN (Han, Abstract Lines 5-7, discloses regularization to prune (i.e., reduce dimensionality of) the DNN:  “In the S (Sparse) step, we regularize the network by pruning the unimportant connections with small weights and retraining the network given the sparsity constraint”.)

As per claim 9, the combination of Luo, Erdogan, and Han teaches the method of claim 7.  Luo teaches wherein regularization comprises: pruning connections between nodes based on the neural entropies (Luo discloses removing connections based on entropies.  Luo, Section 3.2 Para 1 Last Sentence, discloses:  “Our goal is to prune the filters Wi”.  Luo, Figure 1, shows the filters Wi as being connections between layers, which comprise nodes.  Luo, Section 3.2 under Eq 1, discloses:  “Where, pi is the probability of bin i, Hj is the entropy of channel j. In general, if some layers are weak enough, e.g., most of their activation are zeros, their entropy are relatively small. Hence, our entropy-based method can be used for evaluating the importance of each channel. A smaller score of Hj means channel j is less important in this layer, thus could be removed.”  Here, Luo discloses that the pruning is done based on entropies.)
However, Luo does not teach pruning least important connections between nodes to induce network sparsity; fine tuning the DNN after pruning by sparsely retraining the network; removing a sparsity constraint; and retraining the DNN while including all the removed connections between nodes
Han teaches pruning least important connections between nodes to induce network sparsity (Han, Abstract Lines 5-7, discloses pruning least important connections to induce network sparsity:  “In the S (Sparse) step, we regularize the network by pruning the unimportant connections with small weights and retraining the network given the sparsity constraint.”)
fine tuning the DNN after pruning by sparsely retraining the network; (Luo, Section 3.4 Right Column Paragraph 2 Lines 2-4, discloses fine tuning (i.e., retraining) the network after pruning (i.e., while network is sparse):  “Whenever one layer is pruned, the whole network is fine-tuned with one or two epochs to recover its performance slightly.”)
removing a sparsity constraint;  (Han, Abstract Lines 7-9, discloses removing a sparsity constraint:  “In the final D (re-Dense) step, we increase the model capacity by removing the sparsity constraint, re-initialize the pruned parameters from zero and retrain the whole dense network.”)
and retraining the DNN while including all the removed connections between nodes. (Han, Abstract Lines 7-9, discloses retraining a network after bringing back removed connections:  “In the final D (re-Dense) step, we increase the model capacity by removing the sparsity constraint, re-initialize the pruned parameters from zero and retrain the whole dense network.”)
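The three steps quoted from Han's abstract (dense training, pruning of small weights followed by retraining under the sparsity constraint, and removal of the constraint followed by dense retraining with the pruned parameters starting from zero) can be sketched on a toy problem. The linear model, mean squared error loss, plain gradient descent, synthetic data, and the choice to keep k = 3 weights are all assumptions of this sketch and are not taken from Han or from the instant application.

```python
import numpy as np

def train(W, X, y, mask=None, steps=200, lr=0.05):
    """Gradient descent on mean squared error; a mask holds pruned weights at zero."""
    W = W.copy()
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ W - y) / len(y)
        if mask is not None:
            grad = grad * mask  # no updates flow to pruned weights
        W -= lr * grad
    return W

# Hypothetical data generated by a weight vector with three zero entries.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
true_W = np.array([2.0, -3.0, 0.0, 0.0, 1.5, 0.0])
y = X @ true_W

# D (Dense): train all weights.
W_dense = train(rng.normal(size=6) * 0.1, X, y)

# S (Sparse): prune the smallest-magnitude weights, then retrain under the mask.
k = 3
threshold = np.sort(np.abs(W_dense))[-k]
mask = (np.abs(W_dense) >= threshold).astype(float)
W_sparse = train(W_dense * mask, X, y, mask=mask)

# D (re-Dense): drop the mask; the pruned weights restart from zero.
W_redense = train(W_sparse, X, y, mask=None)
```

Note that this sketch ranks connections by weight magnitude, as in the quoted Han passage, rather than by entropy.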

As per claim 11, the combination of Luo, Erdogan, and Han teaches the method of claim 2 wherein optimizing the DNN comprises removing nuisance variables within the DNN as a function of the determined entropies while training the DNN. (Han, Abstract Lines 2-7, discloses removing unimportant connections (i.e., nuisance variables) while training the DNN:  “We propose DSD, a dense-sparse-dense training flow, for regularizing deep neural networks and achieving better optimization performance. In the first D (Dense) step, we train a dense network to learn connection weights and importance. In the S (Sparse) step, we regularize the network by pruning the unimportant connections with small weights and retraining the network given the sparsity constraint.”)

As per claim 12, the combination of Luo, Erdogan, and Han teaches the method of claim 2.  Luo teaches wherein optimizing the DNN comprises determining a size of each layer of the DNN (Luo, Section 3.2 Right Column Last Paragraph, discloses determining a size of each layer: “All the filters are sorted in the descending order according to their entropy scores, and only the top k filters are preserved.”)
However, Luo does not teach wherein optimizing the DNN comprises guiding training of the multi-layer DNN
Han teaches wherein optimizing the DNN comprises guiding training of the multi-layer DNN to determine a size of each layer of the DNN. (Han, Abstract Lines 2-7, discloses removing unimportant connections (i.e., nuisance variables) while training the DNN:  “We propose DSD, a dense-sparse-dense training flow, for regularizing deep neural networks and achieving better optimization performance. In the first D (Dense) step, we train a dense network to learn connection weights and importance. In the S (Sparse) step, we regularize the network by pruning the unimportant connections with small weights and retraining the network given the sparsity constraint.”)

As per claim 15, claim 15 is a device claim corresponding to method claim 5. The difference is that the device claim recites a memory and a processor.  (Luo, Abstract Lines 7-10 discloses a memory:  “Our method can reduce the size of intermediate activations, which would dominate most memory footprint during model training stage but is less concerned in previous compression methods.”  Luo, Section 4.4 Right Column Last Paragraph, discloses that a processor is used:  “Since the parameters and FLOPs of our pruned model have been dramatically reduced, we think this accuracy degradation is acceptable.”  Note that FLOP stands for Floating Point Operation, and is a term directed to the amount of operations executed by a processor.  Claim 15 is rejected for the same reasons as claim 5.)

As per claim 16, claim 16 is a device claim corresponding to method claim 8. The difference is that the device claim recites a memory and a processor.  (Luo, as shown above, discloses a memory and a processor.)  Claim 16 is rejected for the same reasons as claim 8.

As per claim 20, claim 20 is a duplicate of claim 15.  Claim 20 is rejected for the same reasons as claim 15.

Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Luo and Erdogan, further in view of Kadav et al. (US PGPub 2017/0337471 A1; hereinafter “Kadav”).
As per claim 6, the combination of Luo and Erdogan as shown above teaches the method of claim 3.  The combination of Luo and Erdogan further teaches wherein pruning is performed [using a greedy layer-wise pruning] based on entropic ranking to remove less entropic connections. (Luo, Section 3.2 Right Column Last Paragraph Lines 6-8, discloses pruning connections based on entropic ranking:  “All the filters are sorted in the descending order according to their entropy scores, and only the top k filters are preserved”.  Examiner’s Note:  In Luo, “filters” are being removed, which are the connections between layers of neurons.  This is shown in Luo, Section 3.2, Paragraph 1, where Ii represents the input tensor for a given layer, and Wi represents the filter weights.)  *Kadav, below, teaches greedy layer-wise pruning.
However, the combination of Luo and Erdogan fails to teach using a greedy layer-wise pruning.  Kadav teaches using a greedy layer-wise pruning. (Kadav, Para [0026] Lines 2-9, discloses greedy layer-wise pruning:  “For deep networks, pruning and retraining on a layer-by-layer basis can be very time consuming. Pruning layers across the network gives a holistic view of the robustness of the network, resulting in a smaller network. In particular, a “greedy” pruning accounts for filters that have been removed in previous layers without considering the kernels for the previously pruned feature maps when calculating the sum of absolute weights.”)

Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Luo and Erdogan, further in view of Majumdar et al. (US PGPub 2014/0046885 A1; hereinafter “Majumdar”).
As per claim 10, the combination of Luo and Erdogan as shown above teaches the method of claim 2. The combination of Luo and Erdogan further teaches wherein optimizing the DNN comprises: determining a maximum pruning rate for each layer of the DNN while enforcing a total number of parameters in each layer and [a number of bits to represent each parameter];  (Luo, Section 3.2 Right Column Last Paragraph, discloses determining a compression rate (i.e., maximum pruning rate):  “The next issue is how to decide the pruning boundary. One feasible method is to specify a threshold value, all channels with score below this threshold are removed from the network. However, this threshold value is a hyperparameter, which is hard to be specified. Another more practical method is using a constant compression rate. All the filters are sorted in the descending order according to their entropy scores, and only the top k filters are preserved. Of course, the corresponding channels in Wi+1 are removed too.”  Luo, Section 3.4 Right Column Final Paragraph Lines 6-7, indicates that this is done for each layer of the DNN:  “Only after the final layer has been pruned, the network is fine-tuned carefully with many epochs.” Examiner’s Note:  Here, enforcing a total number of parameters for each layer is indicated by “only the top k filters are preserved.”) *Majumdar below teaches a number of bits to represent each parameter.
pruning layers of the DNN in accordance with the maximum pruning rate;  (Luo, Section 3.2 Right Column Last Paragraph Lines 6-8, discloses leaving a fixed number of filters behind (i.e., a maximum pruning rate):  “All the filters are sorted in the descending order according to their entropy scores, and only the top k filters are preserved.”)
and re-training the pruned DNN. (Luo, Section 3.4 Right Column Paragraph 2 Lines 2-4, discloses fine tuning (i.e., retraining) after pruning:  “Whenever one layer is pruned, the whole network is fine-tuned with one or two epochs to recover its performance slightly.”)
The combination of Luo and Erdogan fails to teach a number of bits to represent each parameter.  Majumdar teaches a number of bits to represent each parameter. (Majumdar, Para [0004] First Sentence, discloses a number of bits to represent each parameter:  “Neural signals and parameters of a neural system (e.g., synaptic weights, neural states, etc) can be represented in quantized form with a pre-defined bit precision and stored in a system memory for further use.”)
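
For illustration only, representing a parameter “in quantized form with a pre-defined bit precision”, as quoted from Majumdar above, can be sketched as follows. This is a minimal sketch assuming uniform quantization over a fixed value range; the names quantize and dequantize, the range [-1.0, 1.0], and the bit widths are hypothetical and are not taken from Majumdar.

```python
def quantize(value, bits, lo=-1.0, hi=1.0):
    """Map a real-valued parameter to an integer code of `bits` bits."""
    levels = (1 << bits) - 1          # largest representable code
    clipped = max(lo, min(hi, value))  # values outside the range saturate
    return round((clipped - lo) / (hi - lo) * levels)


def dequantize(code, bits, lo=-1.0, hi=1.0):
    """Recover the approximate real value from its integer code."""
    levels = (1 << bits) - 1
    return lo + code / levels * (hi - lo)


# A synaptic weight stored with 8-bit precision (assumed toy value).
code = quantize(0.3, bits=8)
restored = dequantize(code, bits=8)
```

With a fixed number of parameters per layer and a fixed number of bits per parameter, the storage for a layer in this sketch is simply the product of the two.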
Luo, Erdogan, and Majumdar are analogous art because they are all directed to optimizing neural networks.  It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the entropy-based neural network pruning of the 

As per claim 17, claim 17 is a device claim corresponding to method claim 10. The difference is that the device claim recites a memory and a processor.  (Luo, Abstract Lines 7-10 discloses a memory:  “Our method can reduce the size of intermediate activations, which would dominate most memory footprint during model training stage but is less concerned in previous compression methods.”  Luo, Section 4.4 Right Column Last Paragraph, discloses that a processor is used:  “Since the parameters and FLOPs of our pruned model have been dramatically reduced, we think this accuracy degradation is acceptable.”  Note that FLOP stands for Floating Point Operation, and is a term directed to the number of operations executed by a processor.)  Claim 17 is rejected for the same reasons as claim 10.

Conclusion

THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to LEONARD A SIEGER whose telephone number is (571) 272-9710.  The examiner can normally be reached on M-F 8:00 am - 5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ann Lo, can be reached on (571) 272-9767.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-




/L.A.S./Examiner, Art Unit 2126           
/ANN J LO/Supervisory Patent Examiner, Art Unit 2126