Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 05/24/2019 is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.
Drawings
The drawings are objected to as failing to comply with 37 CFR 1.84(p)(5) because they do not include the following reference sign(s) mentioned in the description: 264, as first found in Page 7 Para [0040] Line 2 of the specification.  Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.
Specification
The disclosure is objected to because of the following informalities: 
Specification Page 7 Para [0040] Line 3 states: "A high variance, for example variance 262".  However, in the Drawings, label 262 displays a distribution with a low variance, as it shows most of the probability tightly clustered around the mean.  
The end of the Specification includes “Examples” which appear to be analogous to the Claims.  However, Example 21 does not map to any claim or any other disclosure in the Specification.
Appropriate correction is required.
Claim Objections
Claims 2, 4, and 5 are objected to because of the following informalities:  In line 1, "and further comprising" should be changed to read ", further comprising".  Appropriate correction is required.
Claim 3 is objected to because of the following informalities:  "pruning neurons" should be changed to read "pruning the neurons".  Appropriate correction is required.
Claim 8 is objected to because of the following informalities:  In line 2, "dimensionality of a DNN" should be changed to read "dimensionality of the DNN".  Appropriate correction is required.
Claim 9 is objected to because of the following informalities:  In line 4, Claim 9 recites the limitation "the pruned network".  There is insufficient antecedent basis for this limitation in the claim.  Examiner suggests "the DNN after pruning."  In lines 4 and 6, Claim 9 recites the limitation "the network". There is insufficient antecedent basis for this limitation in the claim.  Examiner suggests "the DNN".  Appropriate correction is required.
Claim 11 is objected to because of the following informalities:  Lines 2 and 3 of Claim 11 recite the limitation "the DL network". There is insufficient antecedent basis for this limitation in the claim.  Examiner suggests "the DNN".  Appropriate correction is required.
Claim 12 is objected to because of the following informalities:  Line 2 of Claim 12 recites the limitation "the multi-layer DNN".  There is insufficient antecedent basis for this limitation in the claim.  Examiner suggests "the DNN".  Appropriate correction is required.
Claims 13 and 18 are objected to because of the following informalities:  In Claim 13 Line 5 and Claim 18 Line 3, “deep neural network (DNN) with a training dataset” should be changed to read “deep neural network (DNN) trained with a training dataset” to be consistent with the equivalent limitation of Claim 1.  Appropriate correction is required.
Claim 20 is objected to because of the following informalities:  Claim 20 follows after an independent claim other than the independent claim it depends on.  Claim 20 ultimately depends on independent Claim 13, but comes after independent Claim 18. Also, Claim 20 is a duplicate of Claim 15.  Examiner suggests removing the claim.  Appropriate correction is required.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claim 18 is rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter.  The claim(s) does/do not fall within at least one of the four categories of patent eligible subject matter because Claim 18 recites a machine readable 
Claim 19 is rejected for the same reason, as it is dependent upon Claim 18.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1, 13, 18, and 19 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or, for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.
Claims 1, 13, and 18 recite the limitation “the spreading signal is an element-wise multiplication of input activations between the neurons in a first layer to neurons in a second next layer with a corresponding weight matrix of connections between such neurons.”  It is unclear to the examiner what is meant by “Input activations between the neurons in a first layer to neurons in a second next layer”, as there is no “…and” after the “between”, so it is unclear if an implied relationship is intra- or inter- layer.  “Input activations” is also unclear to 
Claims 1, 13, and 18 recite the limitation “calculating an exponent of a volume of an area covered by the spreading signal.” It is unclear what is meant by “a volume of an area”, which implies a 3D calculation of a 2D space.  Also, “area covered by the spreading signal” is unclear because the “spreading signal” is defined as a matrix multiplication, which is in itself a matrix.  An area of a matrix is undefined.  Examiner is interpreting the limitation as “calculating an area bounded by a curve of a function of a probability distribution of each element of the spreading signal.”
Claim 19 recites the limitation "the computing device of claim 18" in Line 1.  There is insufficient antecedent basis for this limitation in the claim.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1, 2, 3, 4, 13, 14, 18, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Luo et al. (“An Entropy-based Pruning Method for CNN Compression”; hereinafter “Luo”), in view of Erdogan et al. (“Measurement Criteria for Neural Network Pruning”; hereinafter “Erdogan”).
As per claim 1, Luo teaches A computer implemented method of optimizing a neural network (Luo, Introduction Page 2 Left Column First Bullet, discloses “A simple yet effective framework is proposed, to accelerate and compress CNN models in both training and inference stage. Our method can compress the size of intermediate activations, reducing the run-time memory consumption dramatically, which is less concerned in previous works.”), the method including operations comprising:
obtaining a deep neural network (DNN) trained with a training dataset;  (Luo, Introduction 2nd Paragraph Line 7, discloses that the method is directed to deep neural networks: “One of the main issues of deep neural networks is its huge computational  and storage overhead.”  Luo, Section 3.2 Left Column 3rd Paragraph Line 4, implies an already trained neural network:  “In order to calculate the entropy, more output values need to be collected, which can be obtained using an evaluation set. In practice, the evaluation set can be simply the original training set, or a subset of it.”)
and determining neural entropies of respective connections between neurons by calculating an exponent of a volume of an area covered by the spreading signal. (Examiner is interpreting “calculating an exponent of a volume of an area covered by the spreading signal” as “calculating an area bounded by the curve of a function of a probability distribution of the spreading signal.”  Luo, Section 3.2 Left Column Paragraph 3 discloses “We first use global average pooling to convert the output of layer i, which is a c x h x w tensor, into a 1 x c vector. In this way, each channel of $I_{i+1}$ (activation of layer i / input of layer i + 1) has a corresponding score for one image. In order to calculate the entropy, more output values need to be collected, which can be obtained using an evaluation set. In practice, the evaluation set can be simply the original training set, or a subset of it. Finally, we get a matrix $M \in \mathbb{R}^{n \times c}$, where n is the number of images in the evaluation set, and c is the channel number. For each channel j, we would pay attention to the distribution of $M_{:,j}$. To compute the entropy value of this channel, we first divide it into m different bins, and calculate the probability of each bin. Finally, the entropy can be calculated as follows:  $H_j = -\sum_{i=1}^{m} p_i \log p_i$”)
Examiner’s Note:  Here, the “spreading signal” is the 1 x c vector.  Luo runs n samples from an evaluation set through the NN, thereby constructing an n x c matrix, which is just how Luo is storing the n samples of each of the c elements of the spreading signal.  For each element of the spreading signal, the n elements are used to construct a probability distribution by dividing the values into bins.  Finally, a function of the probability distribution is calculated ($f(p(x)) = p(x) \log p(x)$) and the area under the curve of this function is calculated by summing up the function values for each bin.  The result of this calculation of the area under the curve of this function is the entropy, as noted by Luo (“Finally, the entropy can be calculated as follows:  $H_j = -\sum_{i=1}^{m} p_i \log p_i$”).
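For illustration only, the entropy computation quoted from Luo above can be sketched in Python as follows. This is a minimal sketch, not Luo's implementation: the function names `global_average_pool` and `channel_entropies`, the equal-width binning, the default of four bins, and the use of the natural logarithm are all assumptions made for the example.

```python
import math


def global_average_pool(activation):
    """Reduce a c x h x w activation (nested lists) to a list of c means."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in activation
    ]


def channel_entropies(scores, num_bins=4):
    """Compute one entropy value per channel.

    scores is an n x c list of lists: one pooled score per evaluation
    image (row) and per channel (column). Equal-width bins are an
    assumption of this sketch.
    """
    n = len(scores)
    c = len(scores[0])
    entropies = []
    for j in range(c):
        column = [scores[i][j] for i in range(n)]
        lo, hi = min(column), max(column)
        width = (hi - lo) / num_bins
        counts = [0] * num_bins
        for value in column:
            # A constant channel has zero width; place every sample in bin 0.
            if width == 0:
                b = 0
            else:
                b = min(int((value - lo) / width), num_bins - 1)
            counts[b] += 1
        # H_j = -sum_i p_i log p_i, with empty bins contributing nothing.
        h = -sum((k / n) * math.log(k / n) for k in counts if k > 0)
        entropies.append(h)
    return entropies
```

With four evaluation images, a channel whose pooled scores fall one per bin yields an entropy of log 4, while a channel with a constant score yields an entropy of zero.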
Luo does not explicitly teach determining a spreading signal between neurons in multiple adjacent layers of the DNN wherein the spreading signal is an element-wise multiplication of input activations between the neurons in a first layer to neurons in a second next layer with a corresponding weight matrix of connections between such neurons.
Erdogan teaches determining a spreading signal between neurons in multiple adjacent layers of the DNN wherein the spreading signal is an element-wise multiplication of input activations between the neurons in a first layer to neurons in a second next layer with a corresponding weight matrix of connections between such neurons;  (Examiner is interpreting “the spreading signal is an element-wise multiplication of input activations between the neurons in a first layer to neurons in a second next layer with a corresponding weight matrix of connections between such neurons” as “the spreading signal comprises multiplications of weights of connections between nodes of adjacent layers with the connections’ respective activations from originating nodes”.  Erdogan, Section 2 Page 84 top of left column, discloses:  “the activity of a hidden node i is obtained by $Y_i = A_i(\mathrm{net}_i)$ where $A_i$ is the activation function and $\mathrm{net}_i$ is defined by $\mathrm{net}_i = \sum_{k=1}^{N_k} V_{ik} X_k$ where $X_k$ is an input pattern, $V_{ik}$ is the weight connection from input node k to hidden node i and $N_k$ is the number of input nodes. The activity $Y_i$ is normalized as follows: $p_i = Y_i / \sum_{j=1}^{N_k} Y_j$ where $N_k$ is the number of hidden nodes.  By using this normalized activity, an entropy function can be formulated by $H_j = -\sum_{i=1}^{N_k} p_i \log p_i$.”)
Examiner’s Note:  Here “input pattern” is the activation from the previous layer’s node, and $V_{ik} X_k$ is the said multiplication.  The calculation of the “spreading signal” comprises this multiplication, and each $p_i$ of the spreading signal is then used to calculate the entropy.
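For illustration only, the hidden-node computation quoted from Erdogan above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the quoted passage does not name the activation function, so a logistic sigmoid is assumed, and the function names `sigmoid` and `hidden_activity_entropy` and the natural logarithm are likewise choices made for the example.

```python
import math


def sigmoid(x):
    """Logistic activation; assumed here, not specified in the quoted text."""
    return 1.0 / (1.0 + math.exp(-x))


def hidden_activity_entropy(V, x, activation=sigmoid):
    """Entropy of the normalized hidden-node activities for one input pattern.

    V[i][k] is the weight from input node k to hidden node i.
    x[k] is the input pattern.
    """
    # net_i = sum_k V_ik * X_k, then Y_i = A_i(net_i)
    Y = [activation(sum(v_ik * x_k for v_ik, x_k in zip(row, x))) for row in V]
    # p_i = Y_i / sum_j Y_j
    total = sum(Y)
    p = [y / total for y in Y]
    # H = -sum_i p_i log p_i, skipping zero terms
    return -sum(p_i * math.log(p_i) for p_i in p if p_i > 0)
```

When all weights are zero, every hidden node has the same activity, so the normalized activities are uniform and the entropy equals the logarithm of the number of hidden nodes.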


As per claim 2, the combination of Luo and Erdogan teaches the method of claim 1 and further comprising optimizing the DNN based on the determined neural entropies between the neurons in the multiple adjacent layers.  (Luo, Intro Page 2 Left Column Bullet 1, discloses optimizing a CNN (a type of DNN): “A simple yet effective framework is proposed, to accelerate and compress CNN models in both training and inference stage. Our method can compress the size of intermediate activations, reducing the run-time memory consumption dramatically, which is less concerned in previous works.”  Luo, Section 3.2 Left Column First Full Paragraph, discloses multiple adjacent layers: “Note that each filter corresponds to a single channel of its activation tensor $I_{i+1}$ (the activation of layer i and at the same time input of layer i + 1, shown as the middle green block in Figure 1), the discriminative ability of each filter is closely related to its activation channel.”  Luo, Section 3.2 Right Column First Full Paragraph, discloses that the optimizing is based on neural entropies:  “Hence, our entropy-based method can be used for evaluating the importance of each channel. A smaller score of $H_j$ means channel j is less important in this layer, thus could be removed.”)

As per claim 3, the combination of Luo and Erdogan teaches the method of claim 2 wherein optimizing the DNN comprises pruning neurons as a function of the neural entropies to create a sparse DNN.  (Erdogan, Section 2.1 Line 4, discloses that entropy is being used to evaluate nodes (i.e., neurons):  “A state with minimum entropy means that most nodes are operating in the nonlinear region near the extreme values.”  Erdogan, Section 2.1 Lines 15-18, discloses that low entropy nodes are pruned to optimize the DNN:  “These inactive nodes can then be eliminated without affecting the performance of the original network to obtain an optimal neural network.”  Examiner’s Note:  Pruning nodes results in a sparse DNN.)

As per claim 4, the combination of Luo and Erdogan teaches the method of claim 3 and further comprising retraining the sparse DNN. (Luo, Section 3.4 Right Column Paragraph 2 Lines 2-4, discloses fine-tuning (i.e., retraining) the sparse DNN after pruning:  “Whenever one layer is pruned, the whole network is fine-tuned with one or two epochs to recover its performance slightly.”)

As per claim 13, claim 13 is a device claim corresponding to method claim 1. The difference is that the device claim recites a memory and a processor.  (Luo, Abstract Lines 7-10 discloses a memory:  “Our method can reduce the size of intermediate activations, which would dominate most memory footprint during model training stage but is less concerned in previous compression methods.”  Luo, Section 4.4 Right Column Last Paragraph, discloses that a processor is used:  “Since the parameters and FLOPs of our pruned model have been dramatically reduced, we think this accuracy degradation is acceptable.”  Note that FLOP stands for Floating Point Operation, and is a term directed to the number of operations executed by a processor.  Claim 13 is rejected for the same reasons as claim 1.)

As per claim 14, claim 14 is a device claim corresponding to method claim 2. The difference is that the device claim recites a memory and a processor.  (Luo, as shown in claim 13, discloses a memory and a processor.)  Claim 14 is rejected for the same reasons as claim 2.

As per claim 18, claim 18 is a machine readable medium claim corresponding to method claim 1. The difference is that the machine readable medium claim recites a machine readable medium and a processor.  (Luo, Introduction Paragraph 2, discloses that their method is directed to minimizing storage space to overcome deployment on machine readable media on devices:  “In spite of its great success, a typical deep model is hard to be deployed on resource constrained devices, e.g., mobile phones and embedded gadgets. A resource constrained scenario means a computing task must be accomplished with limited resource supply, such as computing time, storage space, battery power, computing capability and so on. One of the main issues of deep neural networks is its huge computational and storage overhead, which constitutes a serious challenge for a mobile device with limited computing resource.”  Luo, Section 4.4 Right Column Last Paragraph, discloses that a processor is used:  “Since the parameters and FLOPs of our pruned model have been dramatically reduced, we think this accuracy degradation is acceptable.”  Note that FLOP stands for Floating Point Operation, and is a term directed to the number of operations executed by a processor.  Claim 18 is rejected for the same reasons as claim 1.)

As per claim 19, claim 19 is a device claim, dependent upon a machine readable medium claim, corresponding to method claim 2. The difference is that the claim recites a machine readable medium, a memory and a processor.  (Luo, Introduction Paragraph 2, discloses that their method is directed to minimizing storage space to overcome deployment on machine readable media on devices:  “In spite of its great success, a typical deep model is hard to be deployed on resource constrained devices, e.g., mobile phones and embedded gadgets. A resource constrained scenario means a computing task must be accomplished with limited resource supply, such as computing time, storage space, battery power, computing capability and so on. One of the main issues of deep neural networks is its huge computational and storage overhead, which constitutes a serious challenge for a mobile device with limited computing resource.” Luo, Abstract Lines 7-10 discloses a memory:  “Our method can reduce the size of intermediate activations, which would dominate most memory footprint during model training stage but is less concerned in previous compression methods.”  Luo, Section 4.4 Right Column Last Paragraph, discloses that a processor is used:  “Since the parameters and FLOPs of our pruned model have been dramatically reduced, we think this accuracy degradation is acceptable.”  Note that FLOP stands for Floating Point Operation, and is a term directed to the number of operations executed by a processor. Claim 19 is rejected for the same reasons as claim 2.)

Claims 5, 7, 8, 9, 11, 12, 15, 16, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Luo and Erdogan, further in view of Han et al. (“DSD: Dense-Sparse-Dense Training for Deep Neural Networks”; hereinafter “Han”).

As per claim 5, the combination of Luo and Erdogan as shown above teaches the method of claim 4.  However, the combination of Luo and Erdogan fails to teach and further comprising increasing a density of the sparse DNN by adding neurons while retraining the sparse DNN.  
Han teaches and further comprising increasing a density of the sparse DNN by adding neurons while retraining the sparse DNN.  (Han, Abstract Lines 4-9, discloses adding neurons while retraining a sparse DNN:  “In the first D (Dense) step, we train a dense network to learn connection weights and importance. In the S (Sparse) step, we regularize the network by pruning the unimportant connections with small weights and retraining the network given the sparsity constraint. In the final D (re-Dense) step, we increase the model capacity by removing the sparsity constraint, re-initialize the pruned parameters from zero and retrain the whole dense network.”)
Luo, Erdogan, and Han are analogous art because they are all directed to optimizing neural networks.  It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the entropy-based neural network pruning of the combination of Luo and Erdogan, with the dense-sparse-dense training of Han.  The modification would have 

As per claim 7, the combination of Luo and Erdogan as shown above teaches the method of claim 2.  However, the combination of Luo and Erdogan fails to teach wherein optimizing the DNN comprises regularization of the DNN during training as a function of the neural entropies. 
Han teaches wherein optimizing the DNN comprises regularization of the DNN during training [as a function of the neural entropies]. (Han, Abstract Lines 2-3, discloses regularization during training:  “We propose DSD, a dense-sparse-dense training flow, for regularizing deep neural networks and achieving better optimization performance”.   *Erdogan discloses that the sparsing step is done by removing neurons with low entropies.  Erdogan, Section 2.1 Line 4, discloses that entropy is being used to evaluate nodes (i.e., neurons):  “A state with minimum entropy means that most nodes are operating in the nonlinear region near the extreme values.”  Erdogan, Section 2.1 Lines 15-18, discloses that low entropy nodes are pruned to optimize the DNN:  “These inactive nodes can then be eliminated without affecting the performance of the original network to obtain an optimal neural network.”)

As per claim 8, the combination of Luo, Erdogan, and Han teaches the method of claim 7 wherein regularization comprises: reducing a dimensionality of a DNN based on entropic thresholding; (Han, Abstract Lines 5-7, discloses regularization to prune (i.e., reduce dimensionality of) the DNN:  “In the S (Sparse) step, we regularize the network by pruning the unimportant connections with small weights and retraining the network given the sparsity constraint”.   Erdogan discloses removing neurons with low entropies (i.e., below a given threshold).  Erdogan, Section 2.1 Line 4, discloses that entropy is being used to evaluate nodes (i.e., neurons):  “A state with minimum entropy means that most nodes are operating in the nonlinear region near the extreme values.”  Erdogan, Section 2.1 Lines 15-18, discloses that low entropy nodes are pruned to optimize the DNN:  “These inactive nodes can then be eliminated without affecting the performance of the original network to obtain an optimal neural network.”)

As per claim 9, the combination of Luo, Erdogan, and Han teaches the method of claim 7 wherein regularization comprises: 
pruning least important neurons based on the neural entropies to induce network sparsity; (Han, Abstract Lines 5-7, discloses pruning least important neurons to induce network sparsity:  “In the S (Sparse) step, we regularize the network by pruning the unimportant connections with small weights and retraining the network given the sparsity constraint.”  Erdogan discloses removing neurons with low entropies (i.e., below a given threshold).  Erdogan, Section 2.1 Line 4, discloses that entropy is being used to evaluate nodes (i.e., neurons):  “A state with minimum entropy means that most nodes are operating in the nonlinear region near the extreme values.”  Erdogan, Section 2.1 Lines 15-18, discloses that low entropy nodes are pruned to optimize the DNN:  “These inactive nodes can then be eliminated without affecting the performance of the original network to obtain an optimal neural network.”)
fine tuning the pruned network by sparsely retraining the network; (Luo, Section 3.4 Right Column Paragraph 2 Lines 2-4, discloses fine tuning (i.e., retraining) the network after pruning (i.e., while network is sparse):  “Whenever one layer is pruned, the whole network is fine-tuned with one or two epochs to recover its performance slightly.”)
removing a sparsity constraint;  (Han, Abstract Lines 7-9, discloses removing a sparsity constraint:  “In the final D (re-Dense) step, we increase the model capacity by removing the sparsity constraint, re-initialize the pruned parameters from zero and retrain the whole dense network.”)
and retraining the network while including all the removed neurons. (Han, Abstract Lines 7-9, discloses retraining a network after bringing back removed neurons:  “In the final D (re-Dense) step, we increase the model capacity by removing the sparsity constraint, re-initialize the pruned parameters from zero and retrain the whole dense network.”)
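For illustration only, the sequence of steps mapped above (prune, retrain under a sparsity constraint, remove the constraint, retrain with the pruned parameters restored from zero) can be sketched in Python on a toy parameter vector. This is a minimal sketch and not Han's implementation: the functions `train` and `dense_sparse_dense`, the plain gradient-descent loop, the caller-supplied `grad_fn`, and the magnitude-based selection of the parameters to prune (following the "small weights" wording of the Han quotation rather than an entropy score) are all assumptions made for the example.

```python
def train(weights, grad_fn, mask=None, lr=0.1, steps=100):
    """Plain gradient descent; entries whose mask is False are held at zero."""
    w = list(weights)
    for _ in range(steps):
        g = grad_fn(w)
        for i in range(len(w)):
            if mask is None or mask[i]:
                w[i] -= lr * g[i]
            else:
                w[i] = 0.0  # sparsity constraint
    return w


def dense_sparse_dense(weights, grad_fn, sparsity=0.5):
    """Toy dense, sparse, re-dense flow on a flat list of parameters."""
    # D step: train the dense model.
    dense = train(weights, grad_fn)
    # S step: prune the smallest-magnitude parameters, then retrain
    # while the mask keeps the pruned entries at zero.
    k = int(len(dense) * sparsity)
    order = sorted(range(len(dense)), key=lambda i: abs(dense[i]))
    pruned = set(order[:k])
    mask = [i not in pruned for i in range(len(dense))]
    start = [w if keep else 0.0 for w, keep in zip(dense, mask)]
    sparse = train(start, grad_fn, mask=mask)
    # Re-dense step: drop the mask; pruned entries restart from zero.
    redense = train(sparse, grad_fn, mask=None)
    return dense, mask, sparse, redense
```

In this sketch the pruned entries are exactly zero after the sparse step and become trainable again once the mask is dropped, which mirrors the "re-initialize the pruned parameters from zero" wording of the quotation.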

As per claim 11, the combination of Luo, Erdogan, and Han teaches the method of claim 2 wherein optimizing the DNN comprises removing nuisance variables within the DL network as a function of the determined entropies while training the DL network. (Han, Abstract Lines 2-7, discloses removing unimportant connections (i.e., nuisance variables) while training the DNN:  “We propose DSD, a dense-sparse-dense training flow, for regularizing deep neural networks and achieving better optimization performance. In the first D (Dense) step, we train a dense network to learn connection weights and importance. In the S (Sparse) step, we regularize the network by pruning the unimportant connections with small weights and retraining the network given the sparsity constraint.”)

As per claim 12, the combination of Luo, Erdogan, and Han teaches the method of claim 2 wherein optimizing the DNN comprises guiding training of the multi-layer DNN to determine a size of each layer. (Han, Abstract Lines 2-7, discloses removing unimportant connections (i.e., nuisance variables) while training the DNN:  “We propose DSD, a dense-sparse-dense training flow, for regularizing deep neural networks and achieving better optimization performance. In the first D (Dense) step, we train a dense network to learn connection weights and importance. In the S (Sparse) step, we regularize the network by pruning the unimportant connections with small weights and retraining the network given the sparsity constraint.”   Luo, Section 3.2 Right Column Last Paragraph, discloses determining a size of each layer: “All the filters are sorted in the descending order according to their entropy scores, and only the top k filters are preserved.”)

As per claim 15, claim 15 is a device claim corresponding to method claim 5. The difference is that the device claim recites a memory and a processor.  (Luo, Abstract Lines 7-10 discloses a memory:  “Our method can reduce the size of intermediate activations, which would dominate most memory footprint during model training stage but is less concerned in previous compression methods.”  Luo, Section 4.4 Right Column Last Paragraph, discloses that a processor is used:  “Since the parameters and FLOPs of our pruned model have been dramatically reduced, we think this accuracy degradation is acceptable.”  Note that FLOP stands for Floating Point Operation, and is a term directed to the number of operations executed by a processor.  Claim 15 is rejected for the same reasons as claim 5.)

As per claim 16, claim 16 is a device claim corresponding to method claim 8. The difference is that the device claim recites a memory and a processor.  (Luo, as shown above, discloses a memory and a processor.)  Claim 16 is rejected for the same reasons as claim 8.

As per claim 20, claim 20 is a duplicate of Claim 15.  Claim 20 is rejected for the same reasons as claim 15.

Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Luo and Erdogan, further in view of Kadav et al. (US PGPub 2017/0337471 A1; hereinafter “Kadav”).
As per claim 6, the combination of Luo and Erdogan as shown above teaches the method of claim 3.  The combination of Luo and Erdogan further teaches wherein pruning is performed [using a greedy layer-wise pruning] based on entropic ranking to remove less entropic connections. (Luo, Section 3.2 Right Column Last Paragraph Lines 6-8, discloses pruning connections based on entropic ranking:  “All the filters are sorted in the descending order according to their entropy scores, and only the top k filters are preserved”.  Examiner’s Note:  In Luo, “filters” are being removed, which are the connections between layers of neurons.  This is shown in Luo, Section 3.2, Paragraph 1, where $I_i$ represents the input tensor for a given layer, and $W_i$ represents the filter weights.  *Kadav, below, teaches greedy layer-wise pruning.)
However, the combination of Luo and Erdogan fails to teach using a greedy layer-wise pruning.  Kadav teaches using a greedy layer-wise pruning. (Kadav, Para [0026] Lines 2-9, discloses greedy layer-wise pruning:  “For deep networks, pruning and retraining on a layer-by-layer basis can be very time consuming. Pruning layers across the network gives a holistic view of the robustness of the network, resulting in a smaller network. In particular, a “greedy” pruning accounts for filters that have been removed in previous layers without considering the kernels for the previously pruned feature maps when calculating the sum of absolute weights.”)
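For illustration only, the entropic ranking quoted from Luo above (sort the filters by entropy score, keep the top k, and remove the corresponding input channels of the next layer) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function name `prune_filters_by_entropy` and the nested-list layout of the weights are hypothetical, and the sketch does not implement the greedy cross-layer accounting described in Kadav.

```python
def prune_filters_by_entropy(entropies, W_i, W_next, k):
    """Keep the k filters of layer i with the highest entropy scores.

    entropies[j] is the score of filter j in layer i.
    W_i is a list of filters, one per output channel of layer i.
    W_next is a list of filters of layer i + 1, each indexed by input channel.
    """
    # Rank filter indices by entropy score, highest first.
    ranked = sorted(range(len(entropies)), key=lambda j: entropies[j], reverse=True)
    # Keep the top k, restored to their original channel order.
    keep = sorted(ranked[:k])
    pruned_W_i = [W_i[j] for j in keep]
    # Drop the matching input channels from every filter of the next layer.
    pruned_W_next = [[f[j] for j in keep] for f in W_next]
    return keep, pruned_W_i, pruned_W_next
```

A constant compression rate of the kind Luo mentions corresponds to choosing k as a fixed fraction of the number of filters in each layer.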

Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Luo and Erdogan, further in view of Majumdar et al. (US PGPub 2014/0046885 A1; hereinafter “Majumdar”).
As per claim 10, the combination of Luo and Erdogan as shown above teaches the method of claim 2. The combination of Luo and Erdogan further teaches wherein optimizing the DNN comprises: determining a maximum pruning rate for each layer of the DNN while enforcing a total number of parameters in each layer and [a number of bits to represent each parameter];  (Luo, Section 3.2 Right Column Last Paragraph, discloses determining a compression rate (i.e., maximum pruning rate):  “The next issue is how to decide the pruning boundary. One feasible method is to specify a threshold value, all channels with score below this threshold are removed from the network. However, this threshold value is a hyperparameter, which is hard to be specified. Another more practical method is using a constant compression rate. All the filters are sorted in the descending order according to their entropy scores, and only the top k filters are preserved. Of course, the corresponding channels in $W_{i+1}$ are removed too.”  Luo, Section 3.4 Right Column Final Paragraph Lines 6-7, indicates that this is done for each layer of the DNN:  “Only after the final layer has been pruned, the network is fine-tuned carefully with many epochs.” Examiner’s Note:  Here, enforcing a total number of parameters for each layer is indicated by “only the top k filters are preserved.”) *Majumdar below teaches a number of bits to represent each parameter.
pruning layers of the DNN in accordance with the maximum pruning rate;  (Luo, Section 3.2 Right Column Last Paragraph Lines 6-8, discloses leaving a fixed number of filters behind (i.e., a maximum pruning rate):  “All the filters are sorted in the  descending order according to their entropy scores, and only the top k filters are preserved.”)
and re-training the pruned DNN. (Luo, Section 3.4 Right Column Paragraph 2 Lines 2-4, discloses fine tuning (i.e., retraining) after pruning:  “Whenever one layer is pruned, the whole network is fine-tuned with one or two epochs to recover its performance slightly.”)
The combination of Luo and Erdogan fails to teach a number of bits to represent each parameter.  Majumdar teaches a number of bits to represent each parameter. (Majumdar, Para [0004] First Sentence, discloses a number of bits to represent each parameter:  “Neural signals and parameters of a neural system (e.g., synaptic weights, neural states, etc) can be represented in quantized form with a pre-defined bit precision and stored in a system memory for further use.”)
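For illustration only, representing a parameter "in quantized form with a pre-defined bit precision" can be sketched in Python as follows. This is a minimal sketch and not Majumdar's method: uniform quantization over a fixed range, the default range of -1.0 to 1.0, and the function names `quantize` and `dequantize` are assumptions made for the example.

```python
def quantize(value, num_bits, lo=-1.0, hi=1.0):
    """Map a value in [lo, hi] to an integer code that fits in num_bits bits."""
    levels = (1 << num_bits) - 1            # highest integer code
    clipped = min(max(value, lo), hi)       # clamp out-of-range values
    return round((clipped - lo) / (hi - lo) * levels)


def dequantize(code, num_bits, lo=-1.0, hi=1.0):
    """Recover the approximate value represented by an integer code."""
    levels = (1 << num_bits) - 1
    return lo + code / levels * (hi - lo)
```

Under these assumptions, the storage cost of a layer is the number of retained parameters multiplied by `num_bits`, and the round-trip error is at most half of one quantization step.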


As per claim 17, claim 17 is a device claim corresponding to method claim 10. The difference is that the device claim recites a memory and a processor.  (Luo, Abstract Lines 7-10 discloses a memory:  “Our method can reduce the size of intermediate activations, which would dominate most memory footprint during model training stage but is less concerned in previous compression methods.”  Luo, Section 4.4 Right Column Last Paragraph, discloses that a processor is used:  “Since the parameters and FLOPs of our pruned model have been dramatically reduced, we think this accuracy degradation is acceptable.”  Note that FLOP stands for Floating Point Operation, and is a term directed to the number of operations executed by a processor.  Claim 17 is rejected for the same reasons as claim 10.)

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Takeda et al. ("Node pruning based on Entropy of Weights and Node Activity for Small-footprint Acoustic Model based on Deep Neural Networks") discloses pruning neural networks based on entropies of both nodes and weights.
Xing et al. (CN Pub 107784360 A) discloses pruning a neural network based on the variance of neurons.
Goel et al. (US PGPub 2016/0307098 A1), Para [0027], discloses a regularization procedure, such as dropout training, that may promote neurons that have high average entropy.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to LEONARD A SIEGER whose telephone number is (571)272-9710.  The examiner can normally be reached on M-F 8:00 am - 5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ann Lo, can be reached at (571) 272-9767.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished 
/L.A.S./ Examiner, Art Unit 2126
/ANN J LO/ Supervisory Patent Examiner, Art Unit 2126