Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Examiner’s Note
Providing supporting paragraph(s) with a clear explanation for each limitation of amended/new claim(s) in Remarks is strongly requested for clear and definite claim interpretations by Examiner.

Priority
Acknowledgment is made of applicant's claim for the provisional application filed on 05/23/2018.

Drawings
The drawings are objected to because it appears that the drawing(s) in figs 1-4, 5C, 6-7, 15 is/are color drawings or photographs and are not black and white line drawings. Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. The figure or figure number of an amended drawing should not be labeled as “amended.” If a drawing figure is to be canceled, the appropriate figure must be removed from the replacement sheet, and where necessary, the remaining figures must be renumbered and appropriate changes made to the brief description of the several views of the drawings for consistency. Additional replacement sheets may be necessary to show the renumbering of the remaining figures. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.
Note that Applicant is allowed to have color drawings according to MPEP 608.01(f). However, Applicant needs to file a petition to have the color drawings accepted, meet the requirements of 37 CFR 1.84, and pay a fee.

Claim Objections
Claim(s) 1 is/are objected to because of the following informalities: in order to improve claim language readability, clarity and consistency, “each of the layers” may need to read “each of the plurality of layers” in line 4. Appropriate correction is suggested.
Claim(s) 1 is/are objected to because of the following informalities: in order to improve claim language readability, clarity and consistency, “the set of channels” may need to read “the respective set of channels” in lines 7-9. Appropriate correction is suggested.
Claim(s) 1 is/are objected to because of the following informalities: in order to improve claim language readability, clarity and consistency, “the layer” may need to read “the respective layer” in lines 7, 10, 12, 15-16. Appropriate correction is suggested.
Claim(s) 2 recite(s) the limitation “wherein generating the pruned version of at least a subset of the plurality of layers comprises” in lines 1-2. However, it appears that a new feature is introduced instead of referring to an existing limitation. Thus, it appears that it may need to read “further comprising generating a pruned version of at least a subset of the plurality of layers comprises” or something else. Appropriate correction is required.
Claim(s) 3 recite(s) the limitation “wherein performing the weight pruning on a respective thinned version of one of the subset of layers comprises:” in lines 1-2, and it appears that the claim language refers to an existing limitation in claim 2: “performing weight pruning on a corresponding thinned version of the layer in the subset”. Thus, it appears that “wherein performing the weight pruning on a respective thinned version of one of the subset of layers comprises” may need to read “wherein performing the weight pruning on a corresponding thinned version of the layer in the subset comprises” or something else. Appropriate correction is required.
Claim(s) 5 is/are objected to because of the following informalities: “statistics value” should read “statistic values” in line 1 since claim 5 has two statistic values of a mean and a standard deviation. Appropriate correction is required.
Claim(s) 5 is/are objected to because of the following informalities: “the absolute values of the weights” should read “absolute values of the weights” in line 2. Appropriate correction is required.
Claim(s) 7 is/are objected to because of the following informalities: “pre-trained neural network” should read “a pre-trained neural network” in line 1. Appropriate correction is required.
Claim(s) 9 is/are objected to because of the following informalities: “the sum of absolute values” should read “a sum of absolute values” in lines 1-2. Appropriate correction is required.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claim(s) 1-20 is/are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention.
Claim(s) 1 recite(s) the limitation “pruned versions of each of the plurality of layers” in line 5.  However, in lines 5-6, it recites “wherein determining the pruned version of a respective layer” meaning that each layer has a single pruned version. Thus, it is not clear if each layer has multiple pruned versions or a single pruned version or something else. For the purposes of examination, “a pruned version of each of the plurality of layers” is used. In addition, for the same reason, in the last limitation, “the respective pruned versions of each of the plurality of layers” should read “the pruned version of each of the plurality of layers”. Furthermore, claim(s) 16-17 is/are rejected for the same reason.
Claim(s) 2 recite(s) the limitation “the layer” in line 3. There is insufficient antecedent basis for this limitation in the claim. It is not clear which layer in the subset is indicated. In addition, claim 1 recites “the layer” and claim 2 recites “the layer in the subset”, and thus it is unclear whether those are intended to be the same or different layers. It appears that it may need to read “another layer” or something else. For the purposes of examination, “another layer” is used.
Claim(s) 4 recite(s) the limitation “the at least one of the thinned versions” in line 1. There is insufficient antecedent basis for this limitation in the claim. For the purposes of examination, “at least one of the thinned versions” is used.
Claim(s) 4 recite(s) the limitation “the weight threshold” in lines 2-3. There is insufficient antecedent basis for this limitation in the claim. It is not clear which of the weight thresholds of “the respective weight threshold” in claim 3 they indicate. It appears that they may need to read “a first weight threshold” and “a second weight threshold” or something else. For the purposes of examination, “a first weight threshold” and “a second weight threshold” are used.
Claim(s) 12 recite(s) the limitation “a respective iteration of the thinned neural network is generated to test thinned versions of each of the plurality of layers” in lines 1-2. However, in lines 2-4, it recites “each iteration of the thinned neural network comprises only one thinned version of the plurality of layers” meaning that each iteration has a single thinned version for a layer. Thus, it is not clear if each iteration has multiple thinned versions or a single thinned version or something else. For the purposes of examination, “a thinned version of each of the plurality of layers” is used.
Claims 1-2, 4, 12, 16-17 each recite limitations that raise issues of indefiniteness as set forth above, and dependent claims 3, 5-11, 13-15, 18-20 are rejected at least based on their direct and/or indirect dependency from independent claims. Appropriate explanation and/or amendment is required.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1-15 are rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter. The claims do not fall within at least one of the four categories of patent eligible subject matter.
The claimed “machine accessible storage medium” is a propagating signal when viewed in light of as-filed specification paragraphs 90 and 104. The specification reads “In an example the storage 1158 may be implemented via a solid state disk drive (SSDD). Other devices that may be used for the storage 1158 include flash memory cards, such as SD cards, microSD cards, xD picture cards, and the like, and USB flash drives. In low power implementations, the storage 1158 may be on-die memory or registers associated with the processor 1152. However, in some examples, the storage 1158 may be implemented using a micro hard disk drive (HDD)” and “In an example, the instructions 1182 provided via the memory 1154, the storage 1158, or the processor 1152 may be embodied as a non-transitory, machine readable medium 1160 including code to direct the processor 1152 to perform electronic operations in the IoT device 1150. The processor 1152 may access the non-transitory, machine readable medium 1160 over the interconnect 1156. For instance, the non-transitory, machine readable medium 1160 may be embodied by devices described for the storage 1158 of FIG. 11 or may include specific storage units such as optical disks, flash drives, or any number of other hardware devices.” However, the claimed “machine accessible storage medium” can include transitory forms of signal transmission (often referred to as “signals per se”), such as a propagating electrical or electromagnetic signal or carrier wave. A signal per se does not fall within at least one of the four statutory categories. In addition, the specification does not define the claimed “machine accessible storage medium” or provide a disavowal. The examiner suggests using the term “non-transitory machine accessible storage medium”.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

(Note: Hereinafter, if a limitation has brackets (i.e. [·]) around claim languages, the bracketed claim languages indicate that they have not been taught yet by the current prior art reference but they will be taught by another prior art reference afterwards.)

Claim(s) 1, 6-7, 9-13, 17 is/are rejected under 35 U.S.C. 103 as being unpatentable over POLYAK et al. (Channel-Level Acceleration of Deep Face Representations) in view of Li et al. (PRUNING FILTERS FOR EFFICIENT CONVNETS), further in view of Babaeizadeh et al. (NoiseOut: A Simple Way to Prune Neural Networks).

Regarding claim 1
POLYAK teaches 
At least one machine accessible storage medium having instructions stored thereon, wherein the instructions when executed on a machine, cause the machine to: 
(POLYAK [sec(s) I] “In our implementation, we use the android port of a deep learning framework called Torch7 [2], in which computation is optimized using vectorized convolution code.”; e.g., “Torch7” read(s) on “storage medium” and “machine” since code is run on a computer.)

access data comprising a definition of a neural network, wherein the neural network comprises a plurality of layers, and each of the layers comprises a respective set of channels; 
(POLYAK [fig(s) 1] [table(s) 1] “scratch model” [sec(s) III] “For the purpose of benchmarking on the LFW benchmark [34], we use the scratch network in order to extract face feature representation. This representation is the collection of the 320 activations of the Avg Pool layer. Afterwards, we train a Joint Bayesian model [35] based on these extracted features for the face verification task.”; e.g., table 1 read(s) on “data comprising a definition of a neural network”.)

determine pruned versions of each of the plurality of layers, wherein determining the pruned version of a respective layer comprises: 
(POLYAK [fig(s) 1] [table(s) 1] “scratch model” [sec(s) III-IV] “The Inbound Prune approach that we suggest focuses on reducing the number of channels each filter uses by eliminating channels that do not contribute significantly to the information the filter extracts. The amount of information each channel contributes is measured by the variance of the specific channel activation output. We do not consider directly the model accuracy during the pruning process. Instead, we fine-tune the model obtained after each prune in order to allow it to adapt. The pruning of the entire network is done sequentially on the network layers: from lower layers to the top ones, pruning is followed by fine-tuning, which is followed by pruning of the next layer.”;)

sorting the set of channels of the layer based on respective weight values of each channel in the set of channels; 
(POLYAK [fig(s) 1] [table(s) 1] “scratch model” [sec(s) III-IV] “The inbound prune method targets the number of channels that each filter operates on. Our hypothesis is that each input channel has a different level of contribution to the feature map outputted by each filter. As a result, we may omit the computation of the filter on channels with low contribution, and suffer only a minor decrease in accuracy. In order to detect such low importance channels, we leverage the pruning scheme used previously on single connections [22]. Specifically, the notion of smallest contribution variance – min(σ) is used. Originally, [22] define scores for the contribution of a single weight to the activation of a single neuron. Below, we generalize this measure to the contribution of each channel to the filter activation by using the notion of channel activation. … The contribution variance of channel s in filter t is defined to be σts = var(||Wts ∗ Xs||F) where Xs is the s channel of the input, sampled from the training dataset. … Once we compute the contribution variance for all channels in all filters, we prune all filter connections to a given channel where the contribution variance is below a threshold τ (see below). As a result, the pruned layer operation is defined by 
[equation image: media_image1.png] where [symbol image: media_image2.png]
 is the set of channels filter t operates on after the prune operation.”; e.g., “the contribution variance is below a threshold” read(s) on “sorting”.)
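For illustration only, the thresholding step quoted above (σts = var(||Wts ∗ Xs||F), with connections pruned where σts falls below τ) might be sketched as follows. The function name, the nested-list input layout, and the use of population variance are assumptions made for this sketch and are not taken from POLYAK; the norms ||Wts ∗ Xs||F are assumed to be computed elsewhere.

```python
from statistics import pvariance


def kept_channels(activation_norms, tau):
    """Return, for each filter t, the list of channel indices s that survive pruning.

    activation_norms[t][s] is a list of per-sample values ||W_ts * X_s||_F.
    This layout is hypothetical; computing the norms is out of scope here.
    """
    kept = []
    for per_channel in activation_norms:
        # Contribution variance of each channel over the sampled inputs
        # (population variance is an assumption of this sketch).
        sigmas = [pvariance(samples) for samples in per_channel]
        # Keep channel s only when its contribution variance reaches tau.
        kept.append([s for s, sigma in enumerate(sigmas) if sigma >= tau])
    return kept


# Toy data: two filters, two input channels, three sampled inputs each.
norms = [
    [[1.0, 3.0, 5.0], [2.0, 2.1, 2.0]],  # filter 0: channel 1 is nearly constant
    [[0.5, 0.5, 0.5], [1.0, 4.0, 7.0]],  # filter 1: channel 0 is constant
]
print(kept_channels(norms, tau=0.5))
```

In this sketch each channel is compared against τ individually, so no ordering of the channels is computed.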

pruning a first percentage of the set of channels based on the sorting to form a thinned version of the layer; 
(POLYAK [fig(s) 1] [table(s) 1] “scratch model” [sec(s) III-IV] “The inbound prune method targets the number of channels that each filter operates on. Our hypothesis is that each input channel has a different level of contribution to the feature map outputted by each filter. As a result, we may omit the computation of the filter on channels with low contribution, and suffer only a minor decrease in accuracy. In order to detect such low importance channels, we leverage the pruning scheme used previously on single connections [22]. Specifically, the notion of smallest contribution variance – min(σ) is used. Originally, [22] define scores for the contribution of a single weight to the activation of a single neuron. Below, we generalize this measure to the contribution of each channel to the filter activation by using the notion of channel activation. … The contribution variance of channel s in filter t is defined to be σts = var(||Wts ∗ Xs||F) where Xs is the s channel of the input, sampled from the training dataset. … Once we compute the contribution variance for all channels in all filters, we prune all filter connections to a given channel where the contribution variance is below a threshold τ (see below). As a result, the pruned layer operation is defined by 
[equation image: media_image1.png] where [symbol image: media_image2.png]
 is the set of channels filter t operates on after the prune operation.”; e.g., “Once we compute the contribution variance for all channels in all filters, we prune all filter connections to a given channel where the contribution variance is below a threshold” read(s) on “pruning a first percentage of the set of channels”.)

providing input data to a thinned version of the neural network in a test, wherein the thinned version of the neural network includes the thinned version of the layer; 
(POLYAK [fig(s) 1] [table(s) 1] “scratch model” [sec(s) III-IV] “fine-tuning allows us to achieve three goals: (i) Prior to pruning, the model is at a local optima in parameter space. Once pruned, the model is no longer at a local optima in the parameter space. By fine-tuning, we search a new local optima while retaining the pruned model structure. (ii) Although we attempt to prune layers with minimal effect, the output of pruned layers is still changed. Fine-tuning allows the next layer in line to adapt to the pruned layer output. (iii) In our method, pruning is done sequentially. Therefore, adapting the model after each modification is crucial in order to reduce the accumulated error. The inbound pruning scheme for a single layer is summarized in Algorithm 1. As mentioned, for whole model acceleration, we prune each layer sequentially from the lowest layers to the topmost layers.” [sec(s) VI] “The model was fine-tuned using SGD with a learning rate of 0.01, momentum of 0.9 and batch size of 128. Each model was fine-tuned for a maximum of 30 epochs or until there were 5 successive epochs of no error improvement.”;)

determining accuracy of the thinned version of the neural network based on an output of the neural network in the test; 
(POLYAK [fig(s) 1] [table(s) 1, 5] “model classification accuracy before and after fine-tuning … Each row indicates pruning done up to the indicated layer.” [sec(s) VI] “During our experiments, the threshold τ was chosen empirically based on the accuracy achieved after pruning the model, prior to fine-tuning. We do not let the validation accuracy on the CASIA dataset drop below 84%. The model was fine-tuned using SGD with a learning rate of 0.01, momentum of 0.9 and batch size of 128. Each model was fine-tuned for a maximum of 30 epochs or until there were 5 successive epochs of no error improvement.” [sec(s) VII] “The Effect of Fine-Tuning: Next, we verify that fine-tuning reduces the accumulation of error rate between layers by comparing the accuracy of the models with and without fine-tuning after prune. Indeed, Table 5 shows that the detrimental effect of pruning is mitigated by performing fine-tuning. However, this effect is visible only in higher layers. At layer Conv21, post pruning results are 94% on the LFW benchmark for both methods. At Conv41, fine-tuning results are 93% compared to 88.43% – a difference of 4.5% in error rate. The increasing gap can be explained by the accumulated error caused by pruning all previous layers.”;)

adopting the thinned version of the layer to generate the pruned version of the layer based on the accuracy of the thinned version of the neural network [exceeding a threshold accuracy] value; and 
(POLYAK [fig(s) 1] [table(s) 1, 5] “model classification accuracy before and after fine-tuning … Each row indicates pruning done up to the indicated layer.” [sec(s) VI] “During our experiments, the threshold τ was chosen empirically based on the accuracy achieved after pruning the model, prior to fine-tuning. We do not let the validation accuracy on the CASIA dataset drop below 84%. The model was fine-tuned using SGD with a learning rate of 0.01, momentum of 0.9 and batch size of 128. Each model was fine-tuned for a maximum of 30 epochs or until there were 5 successive epochs of no error improvement.” [sec(s) VII] “The Effect of Fine-Tuning: Next, we verify that fine-tuning reduces the accumulation of error rate between layers by comparing the accuracy of the models with and without fine-tuning after prune. Indeed, Table 5 shows that the detrimental effect of pruning is mitigated by performing fine-tuning. However, this effect is visible only in higher layers. At layer Conv21, post pruning results are 94% on the LFW benchmark for both methods. At Conv41, fine-tuning results are 93% compared to 88.43% – a difference of 4.5% in error rate. The increasing gap can be explained by the accumulated error caused by pruning all previous layers.”;)

generate a pruned version of the neural network to comprise the respective pruned versions of each of the plurality of layers.
(POLYAK [fig(s) 1] [table(s) 1] “scratch model” [sec(s) III-IV] “The Inbound Prune approach that we suggest focuses on reducing the number of channels each filter uses by eliminating channels that do not contribute significantly to the information the filter extracts. The amount of information each channel contributes is measured by the variance of the specific channel activation output. We do not consider directly the model accuracy during the pruning process. Instead, we fine-tune the model obtained after each prune in order to allow it to adapt. The pruning of the entire network is done sequentially on the network layers: from lower layers to the top ones, pruning is followed by fine-tuning, which is followed by pruning of the next layer.”;)

However, POLYAK does not appear to explicitly teach:
[sorting] the set of channels of the layer based on respective weight values of each channel in the set of channels;
pruning a first percentage of the set of channels based on the [sorting] to form a thinned version of the layer;
adopting the thinned version of the layer to generate the pruned version of the layer based on the accuracy of the thinned version of the neural network [exceeding a threshold accuracy] value; 

Li teaches
sorting the set of channels of the layer based on respective weight values of each channel in the set of channels;
pruning a first percentage of the set of channels based on the sorting to form a thinned version of the layer;
(Li [fig(s) 1-2] [sec(s) 3] “Our method prunes the less useful filters from a well-trained model for computational efficiency while minimizing the accuracy drop. We measure the relative importance of a filter in each layer by calculating the sum of its absolute weights Σ|Fi,j|, i.e., its l1-norm ||Fi,j||1. Since the number of input channels, ni, is the same across filters, Σ|Fi,j| also represents the average magnitude of its kernel weights. This value gives an expectation of the magnitude of the output feature map. Filters with smaller kernel weights tend to produce feature maps with weak activations as compared to the other filters in that layer. Figure 2(a) illustrates the distribution of filters’ absolute weights sum for each convolutional layer in a VGG-16 network trained on the CIFAR-10 dataset, where the distribution varies significantly across layers. We find that pruning the smallest filters works better in comparison with pruning the same number of random or largest filters (Section 4.4). Compared to other criteria for activation-based feature map pruning (Section 4.5), we find l1-norm is a good criterion for data-free filter selection.

[figure image: media_image4.png]
”; Note that POLYAK teaches “the set of channels of the layer based on respective weight values of each channel in the set of channels; pruning a first percentage of the set of channels based on the … to form a thinned version of the layer”.)
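For illustration only, the ranking described in the quoted passage of Li (sum of absolute kernel weights per filter, with the smallest filters pruned first) might be sketched as follows. The function name, the flat-list weight layout, and the rounding of the prune count toward zero are assumptions of this sketch and are not taken from Li.

```python
def prune_smallest_filters(filters, prune_fraction):
    """Rank filters by L1 norm and drop the smallest prune_fraction of them.

    filters: list of filters, each a flat list of kernel weights
             (hypothetical layout chosen for this sketch).
    Returns the indices of the filters that are kept, in original order.
    """
    # L1 norm of each filter: sum of absolute kernel weights.
    l1 = [sum(abs(w) for w in f) for f in filters]
    # Sort filter indices in ascending order of L1 norm.
    order = sorted(range(len(filters)), key=lambda i: l1[i])
    # Number of filters to remove (rounded toward zero; an assumption).
    n_prune = int(len(filters) * prune_fraction)
    pruned = set(order[:n_prune])
    return [i for i in range(len(filters)) if i not in pruned]


filters = [
    [0.9, -0.8, 0.7],    # L1 norm about 2.4
    [0.01, -0.02, 0.0],  # L1 norm about 0.03
    [0.3, 0.2, -0.1],    # L1 norm about 0.6
    [-1.0, 1.0, 0.5],    # L1 norm 2.5
]
print(prune_smallest_filters(filters, 0.5))
```

Here the sort produces an explicit ordering, and the prune fraction selects how many of the lowest-ranked filters are removed.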

POLYAK teaches a system that prunes channels that do not contribute significantly to the information that filters extract based on convolutional neural networks. In addition, Li teaches a system that sorts tensors based on the sum of absolute tensor weights since pruning the smallest tensors outperforms pruning random tensors for most of the layers at different pruning ratios.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the neural network pruning of POLYAK with the tensor sorting of Li. 
One of ordinary skill in the art would have been motivated to combine in order to provide a better accuracy than the random tensor pruning.
(Li [sec(s) 4] “We compare our approach with pruning random filters and largest filters. As shown in Figure 8, pruning the smallest filters outperforms pruning random filters for most of the layers at different pruning ratios. For example, smallest filter pruning has better accuracy than random filter pruning for all layers with the pruning ratio of 90%. The accuracy of pruning filters with the largest l1-norms drops quickly as the pruning ratio increases, which indicates the importance of filters with larger l1-norms.”)

However, the combination of POLYAK and Li does not appear to explicitly teach:
adopting the thinned version of the layer to generate the pruned version of the layer based on the accuracy of the thinned version of the neural network [exceeding a threshold accuracy] value; 

Babaeizadeh teaches
adopting the thinned version of the layer to generate the pruned version of the layer based on the accuracy of the thinned version of the neural network exceeding a threshold accuracy value; 
(Babaeizadeh [fig(s) 1-2] [sec(s) 2] “NoiseOut simply repeats this process to compress the network. The pruning ends when the accuracy of the network drops below some given threshold. Note that the pruning process is happening while training. Algorithm 1 shows the final NoiseOut algorithm. For the sake of readability, this algorithm has been shown for networks with only one hidden layer. But the same algorithm can be applied to networks with more that one hidden layer by performing the same pruning on all the hidden layers independently. It can also be applied to convolutional neural networks that use dense layers, in which we often see over 90% of the network parameters [13].”; e.g., “accuracy of the network” read(s) on “accuracy of the thinned version of the neural network”. Note that POLYAK teaches “adopting the thinned version of the layer to generate the pruned version of the layer based on the accuracy of the thinned version of the neural network [exceeding a threshold accuracy] value;”.)
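For illustration only, a stopping rule of the kind quoted above (pruning ends when accuracy drops below a given threshold) might be sketched as follows. The function name and both callbacks are hypothetical placeholders; the sketch keeps the last candidate that still meets the threshold, which is an assumption of this example, and it does not model the training-time pruning that Babaeizadeh describes.

```python
def prune_until_threshold(model, prune_step, evaluate, min_accuracy):
    """Apply prune_step repeatedly; return the last model at or above min_accuracy.

    prune_step(model) -> smaller candidate model, or None when nothing is left
    evaluate(model)   -> accuracy in [0, 1]
    Both callbacks are hypothetical and supplied by the caller.
    """
    adopted = model
    while True:
        candidate = prune_step(adopted)
        # Stop when nothing remains to prune or accuracy falls below the threshold.
        if candidate is None or evaluate(candidate) < min_accuracy:
            return adopted
        adopted = candidate


# Toy stand-in: a "model" is a neuron count, and accuracy falls as neurons are removed.
toy_accuracy = {8: 0.95, 7: 0.94, 6: 0.93, 5: 0.90, 4: 0.80}
result = prune_until_threshold(
    model=8,
    prune_step=lambda n: n - 1 if n - 1 in toy_accuracy else None,
    evaluate=lambda n: toy_accuracy[n],
    min_accuracy=0.92,
)
print(result)
```

With the toy accuracies above, the candidate with 5 neurons falls below 0.92, so the 6-neuron candidate is the one retained.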

POLYAK teaches a system that prunes channels that do not contribute significantly to the information that filters extract based on convolutional neural networks. In addition, Li teaches a system that sorts tensors based on the sum of absolute tensor weights since pruning the smallest tensors outperforms pruning random tensors for most of the layers at different pruning ratios. Furthermore, Babaeizadeh teaches a system that ends pruning when the accuracy of the network drops below some given threshold and provides a simple but effective pruning method.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the neural network pruning of POLYAK and Li with the threshold comparison of Babaeizadeh.
One of ordinary skill in the art would have been motivated to combine in order to provide a simple but effective pruning method to reduce the number of parameters in the dense layers of neural networks by removing neurons with correlated activation during training.
(Babaeizadeh [sec(s) 6] “In this paper, we have presented NoiseOut, a simple but effective pruning method to reduce the number of parameters in the dense layers of neural networks by removing neurons with correlated activation during training. We showed how adding noise outputs to the network can increase the correlation between neurons in the hidden layer and hence result to more effective pruning. The experimental results on different networks and various datasets validate this approach, achieving state-of-the-art compression rates without loss of accuracy.”)

Regarding claim 6
The combination of POLYAK, Li, Babaeizadeh teaches claim 1.

POLYAK further teaches 
the neural network comprises additional layers outside the plurality of layers and the additional layers are unpruned in the pruned version of the neural network.
(POLYAK [table(s) 1] “The scratch model by the authors of [12], which is the baseline model in our experiments.” [table(s) 2] “Incoming prune architectures produced by the pruning scheme” [sec(s) VI] “During our experiments, the threshold τ was chosen empirically based on the accuracy achieved after pruning the model, prior to fine-tuning. We do not let the validation accuracy on the CASIA dataset drop below 84%. The model was fine-tuned using SGD with a learning rate of 0.01, momentum of 0.9 and batch size of 128. Each model was fine-tuned for a maximum of 30 epochs or until there were 5 successive epochs of no error improvement. Table 2 shows the architecture of the pruned models, each column reports the number of inbound channels per layer.”; e.g., tables 1-2 read(s) on “additional layers are unpruned” since there are unpruned layers.)

Regarding claim 7
The combination of POLYAK, Li, Babaeizadeh teaches claim 1.

POLYAK further teaches 
the neural network is pre-trained neural network using a particular data set, and the input data corresponds to the particular data set.
(POLYAK [sec(s) VI] “Once we train a model with satisfying accuracy, we opt to speed the model without loss of accuracy while avoiding the need to train it from scratch. For the task at hand, pruning is an appealing method due to the heavy reliance on training that was already performed, as long as the pruning is done without reducing the model accuracy.”;)

Regarding claim 9
The combination of POLYAK, Li, Babaeizadeh teaches claim 1.

POLYAK further teaches 
the channels are to be sorted based on the sum of [absolute] values of weights of the channel.
(POLYAK [fig(s) 1] [algorithm(s) 1] “For each filter t and each channel s if σts ≥ τ then 
[symbol image: media_image2.png]
 ← s” [table(s) 1] “scratch model” [sec(s) III-IV] “The inbound prune method targets the number of channels that each filter operates on. Our hypothesis is that each input channel has a different level of contribution to the feature map outputted by each filter. As a result, we may omit the computation of the filter on channels with low contribution, and suffer only a minor decrease in accuracy. In order to detect such low importance channels, we leverage the pruning scheme used previously on single connections [22]. Specifically, the notion of smallest contribution variance – min(σ) is used. Originally, [22] define scores for the contribution of a single weight to the activation of a single neuron. Below, we generalize this measure to the contribution of each channel to the filter activation by using the notion of channel activation. … The contribution variance of channel s in filter t is defined to be σts = var(||Wts ∗ Xs||F) where Xs is the s channel of the input, sampled from the training dataset. … Once we compute the contribution variance for all channels in all filters, we prune all filter connections to a given channel where the contribution variance is below a threshold τ (see below). As a result, the pruned layer operation is defined by 
[equation image: media_image1.png]
 where 
[symbol image: media_image2.png]
 is the set of channels filter t operates on after the prune operation.”; e.g., “the contribution variance is below a threshold” read(s) on “sorted”. Examiner notes that par(s) 45 of the Instant Specification describe(s) “the channels in a layer may be sorted based on the respective sum of the absolute values of the weights in the channel.”.)
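For illustration only, the thresholding step quoted from POLYAK above may be sketched as follows. The sketch assumes the per-sample Frobenius norms ||Wts ∗ Xs||F have already been computed; the function names, the sample values, and the use of population variance are hypothetical and do not appear in POLYAK.

```python
from statistics import pvariance

def contribution_variance(activation_norms):
    """Variance over input samples of ||W_ts * X_s||_F for one (filter, channel) pair."""
    # Assumption: population variance; the quoted passage does not name an estimator.
    return pvariance(activation_norms)

def inbound_prune(norms_per_channel, tau):
    """Indices of the channels one filter keeps: those with sigma_ts >= tau."""
    return [s for s, norms in enumerate(norms_per_channel)
            if contribution_variance(norms) >= tau]

# Hypothetical norms for one filter: three input channels, four samples each.
sample_norms = [
    [1.0, 1.0, 1.0, 1.0],  # constant response, variance 0
    [0.0, 2.0, 0.0, 2.0],  # variance 1
    [1.0, 3.0, 5.0, 7.0],  # variance 5
]
kept_channels = inbound_prune(sample_norms, tau=0.5)
print(kept_channels)  # [1, 2]
```

Channel 0 is dropped because its contribution variance falls below the threshold τ, while channels 1 and 2 are retained.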

Li further teaches
the channels are to be sorted based on the sum of absolute values of weights of the channel.
(Li [fig(s) 1-2] [sec(s) 3] “Our method prunes the less useful filters from a well-trained model for computational efficiency while minimizing the accuracy drop. We measure the relative importance of a filter in each layer by calculating the sum of its absolute weights 
Σ
|Fi,j|, i.e., its l1-norm ||Fi,j||1. Since the number of input channels, ni, is the same across filters, 
Σ
|Fi,j| also represents the average magnitude of its kernel weights. This value gives an expectation of the magnitude of the output feature map. Filters with smaller kernel weights tend to produce feature maps with weak activations as compared to the other filters in that layer. Figure 2(a) illustrates the distribution of filters’ absolute weights sum for each convolutional layer in a VGG-16 network trained on the CIFAR-10 dataset, where the distribution varies significantly across layers. We find that pruning the smallest filters works better in comparison with pruning the same number of random or largest filters (Section 4.4). Compared to other criteria for activation-based feature map pruning (Section 4.5), we find l1-norm is a good criterion for data-free filter selection.

[figure image: media_image4.png]
”;)

The combination of POLYAK, Li, and Babaeizadeh is combinable with the further teachings of Li for the same rationale as set forth above with respect to claim 1.
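For illustration only, sorting channels by the sum of absolute values of their weights, as quoted from Li above, may be sketched as follows. The nested-list weight layout, the function names, and the pruning fraction are hypothetical and are not taken from Li or from the Instant Specification.

```python
def l1_norm(weights):
    """Sum of absolute values over an arbitrarily nested list of weights."""
    if isinstance(weights, (int, float)):
        return abs(weights)
    return sum(l1_norm(w) for w in weights)

def sort_channels_by_l1(channels):
    """Channel indices ordered from smallest to largest sum of absolute weights."""
    return sorted(range(len(channels)), key=lambda i: l1_norm(channels[i]))

def prune_smallest(channels, fraction):
    """Indices of the channels kept after pruning the lowest-ranked fraction."""
    order = sort_channels_by_l1(channels)
    n_prune = int(len(channels) * fraction)
    pruned = set(order[:n_prune])
    return [i for i in range(len(channels)) if i not in pruned]

# Hypothetical 2x2 kernels for four channels.
channels = [
    [[0.5, -0.5], [0.1, 0.0]],
    [[2.0, -1.0], [0.0, 3.0]],
    [[0.0, 0.1], [-0.1, 0.0]],
    [[1.0, 1.0], [-1.0, 0.0]],
]
order = sort_channels_by_l1(channels)
kept = prune_smallest(channels, fraction=0.5)
print(order, kept)  # [2, 0, 3, 1] [1, 3]
```

Pruning the first half of the sorted order removes the two channels with the smallest absolute weight sums and leaves channels 1 and 3.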

Regarding claim 10
The combination of POLYAK, Li, Babaeizadeh teaches claim 1.

POLYAK further teaches 
the thinned version of the layer generated through pruning the first percentage of the set of channels comprises (see the rejections of claim 1) a first iteration of the thinned version of the layer, 
the thinned version of the neural network with the first iteration of the thinned version of the layer comprises a first iteration of the thinned version of the neural network, and 
determining the pruned version of a respective layer further comprises: 
(POLYAK [fig(s) 1] [algorithm(s) 1] “For each filter t and each channel s if σts ≥ τ then 
[symbol image: media_image2.png]
 ← s” [table(s) 1] “scratch model” [sec(s) III-IV] “The inbound prune method targets the number of channels that each filter operates on. Our hypothesis is that each input channel has a different level of contribution to the feature map outputted by each filter. As a result, we may omit the computation of the filter on channels with low contribution, and suffer only a minor decrease in accuracy. … The contribution variance of channel s in filter t is defined to be σts = var(||Wts ∗ Xs||F) where Xs is the s channel of the input, sampled from the training dataset. … Once we compute the contribution variance for all channels in all filters, we prune all filter connections to a given channel where the contribution variance is below a threshold τ (see below). As a result, the pruned layer operation is defined by 
[equation image: media_image1.png]
 where 
[symbol image: media_image2.png]
 is the set of channels filter t operates on after the prune operation.”; e.g., “Once we compute the contribution variance for all channels in all filters, we prune all filter connections to a given channel where the contribution variance is below a threshold” and “contribution variance of channel s in filter t” read(s) on “first iteration”.)

determining that [accuracy] of the first iteration of the thinned version of the neural network [exceeds the threshold accuracy] value; 
(POLYAK [fig(s) 1] [table(s) 1, 5] “model classification accuracy before and after fine-tuning … Each row indicates pruning done up to the indicated layer.” [sec(s) VI] “During our experiments, the threshold τ was chosen empirically based on the accuracy achieved after pruning the model, prior to fine-tuning. We do not let the validation accuracy on the CASIA dataset drop below 84%. The model was fine-tuned using SGD with a learning rate of 0.01, momentum of 0.9 and batch size of 128. Each model was fine-tuned for a maximum of 30 epochs or until there were 5 successive epochs of no error improvement.” [sec(s) VII] “The Effect of Fine-Tuning: Next, we verify that fine-tuning reduces the accumulation of error rate between layers by comparing the accuracy of the models with and without fine-tuning after prune. Indeed, Table 5 shows that the detrimental effect of pruning is mitigated by performing fine-tuning. However, this effect is visible only in higher layers. At layer Conv21, post pruning results are 94% on the LFW benchmark for both methods. At Conv41, fine-tuning results are 93% compared to 88.43% – a difference of 4.5% in error rate. The increasing gap can be explained by the accumulated error caused by pruning all previous layers.”;)

pruning additional channels from the first iteration of the thinned version of the layer based on the sorting to form a second iteration of the thinned version of the layer; and 
(POLYAK [fig(s) 1] [algorithm(s) 1] “Output: 
[symbol image: media_image5.png]
” and “For each filter t and each channel s if σts ≥ τ then 
[symbol image: media_image2.png]
 ← s” [table(s) 1, 5] “model classification accuracy before and after fine-tuning … Each row indicates pruning done up to the indicated layer.” [sec(s) III-IV] “The inbound prune method targets the number of channels that each filter operates on. Our hypothesis is that each input channel has a different level of contribution to the feature map outputted by each filter. As a result, we may omit the computation of the filter on channels with low contribution, and suffer only a minor decrease in accuracy. In order to detect such low importance channels, we leverage the pruning scheme used previously on single connections [22]. Specifically, the notion of smallest contribution variance – min(σ) is used. Originally, [22] define scores for the contribution of a single weight to the activation of a single neuron. Below, we generalize this measure to the contribution of each channel to the filter activation by using the notion of channel activation. … The contribution variance of channel s in filter t is defined to be σts = var(||Wts ∗ Xs||F) where Xs is the s channel of the input, sampled from the training dataset. … Once we compute the contribution variance for all channels in all filters, we prune all filter connections to a given channel where the contribution variance is below a threshold τ (see below). As a result, the pruned layer operation is defined by 
[equation image: media_image1.png]
 where 
[symbol image: media_image2.png]
 is the set of channels filter t operates on after the prune operation.”; e.g., “For each filter t and each channel s if σts ≥ τ then 
[symbol image: media_image2.png]
 ← s” read(s) on “pruning additional channels from the first iteration of the thinned version of the layer based on the sorting to form a second iteration of the thinned version of the layer”. In addition, e.g., “Once we compute the contribution variance for all channels in all filters, we prune all filter connections to a given channel where the contribution variance is below a threshold” and “contribution variance of channel s in filter t” read(s) on “second iteration”.)

testing [accuracy] of a second iteration of the thinned neural network, wherein the second iteration of the thinned neural network comprises the second iteration of the thinned layer.
(POLYAK [fig(s) 1] [algorithm(s) 1] “Output: 
[symbol image: media_image5.png]
” and “For each filter t and each channel s if σts ≥ τ then 
[symbol image: media_image2.png]
 ← s” [table(s) 1, 5] “model classification accuracy before and after fine-tuning … Each row indicates pruning done up to the indicated layer.” [sec(s) III-IV] as cited above;)

Li further teaches
pruning additional channels from the first iteration of the thinned version of the layer based on the sorting to form a second iteration of the thinned version of the layer; and
(Li [fig(s) 1-2] [sec(s) 3] “Our method prunes the less useful filters from a well-trained model for computational efficiency while minimizing the accuracy drop. We measure the relative importance of a filter in each layer by calculating the sum of its absolute weights 
Σ
|Fi,j|, i.e., its l1-norm ||Fi,j||1. Since the number of input channels, ni, is the same across filters, 
Σ
|Fi,j| also represents the average magnitude of its kernel weights. This value gives an expectation of the magnitude of the output feature map. Filters with smaller kernel weights tend to produce feature maps with weak activations as compared to the other filters in that layer. Figure 2(a) illustrates the distribution of filters’ absolute weights sum for each convolutional layer in a VGG-16 network trained on the CIFAR-10 dataset, where the distribution varies significantly across layers. We find that pruning the smallest filters works better in comparison with pruning the same number of random or largest filters (Section 4.4). Compared to other criteria for activation-based feature map pruning (Section 4.5), we find l1-norm is a good criterion for data-free filter selection.

[figure image: media_image4.png]
”; Note that POLYAK teaches “pruning additional channels from the first iteration of the thinned version of the layer based on the … to form a second iteration of the thinned version of the layer”.)

The combination of POLYAK, Li, and Babaeizadeh is combinable with the further teachings of Li for the same rationale as set forth above with respect to claim 1.

Babaeizadeh further teaches
determining that accuracy of the first iteration of the thinned version of the neural network exceeds the threshold accuracy value; 
testing accuracy of a second iteration of the thinned neural network, wherein the second iteration of the thinned neural network comprises the second iteration of the thinned layer.
(Babaeizadeh [fig(s) 1-2] [sec(s) 2] “NoiseOut simply repeats this process to compress the network. The pruning ends when the accuracy of the network drops below some given threshold. Note that the pruning process is happening while training. Algorithm 1 shows the final NoiseOut algorithm. For the sake of readability, this algorithm has been shown for networks with only one hidden layer. But the same algorithm can be applied to networks with more that one hidden layer by performing the same pruning on all the hidden layers independently. It can also be applied to convolutional neural networks that use dense layers, in which we often see over 90% of the network parameters [13].”; e.g., “accuracy of the network” read(s) on “accuracy”. Note that POLYAK teaches “determining that [accuracy] of the first iteration of the thinned version of the neural network [exceeds the threshold accuracy] value; testing [accuracy] of a second iteration of the thinned neural network, wherein the second iteration of the thinned neural network comprises the second iteration of the thinned layer”.)

The combination of POLYAK, Li, and Babaeizadeh is combinable with the further teachings of Babaeizadeh for the same rationale as set forth above with respect to claim 1.

Regarding claim 11
The combination of POLYAK, Li, Babaeizadeh teaches claim 10.

POLYAK further teaches 
determining the pruned version of a respective layer further comprises: (see the rejections of claim 10)
	
determining that [accuracy] of the second iteration of the thinned version of the neural network [falls below the threshold accuracy] value; 
(POLYAK [fig(s) 1] [table(s) 1, 5] “model classification accuracy before and after fine-tuning … Each row indicates pruning done up to the indicated layer.” [sec(s) VI] “During our experiments, the threshold τ was chosen empirically based on the accuracy achieved after pruning the model, prior to fine-tuning. We do not let the validation accuracy on the CASIA dataset drop below 84%. The model was fine-tuned using SGD with a learning rate of 0.01, momentum of 0.9 and batch size of 128. Each model was fine-tuned for a maximum of 30 epochs or until there were 5 successive epochs of no error improvement.” [sec(s) VII] “The Effect of Fine-Tuning: Next, we verify that fine-tuning reduces the accumulation of error rate between layers by comparing the accuracy of the models with and without fine-tuning after prune. Indeed, Table 5 shows that the detrimental effect of pruning is mitigated by performing fine-tuning. However, this effect is visible only in higher layers. At layer Conv21, post pruning results are 94% on the LFW benchmark for both methods. At Conv41, fine-tuning results are 93% compared to 88.43% – a difference of 4.5% in error rate. The increasing gap can be explained by the accumulated error caused by pruning all previous layers.”;)

adopting the first iteration of the thinned version of the layer as the thinned version of the layer to be used to generate the pruned version of the neural network based on the [accuracy] of the second iteration of the thinned version of the neural network [falling below the threshold accuracy] value.
(POLYAK [fig(s) 1] [algorithm(s) 1] “For each filter t and each channel s if σts ≥ τ then 
[symbol image: media_image2.png]
 ← s” [table(s) 1] “scratch model” [sec(s) III-IV] “The inbound prune method targets the number of channels that each filter operates on. Our hypothesis is that each input channel has a different level of contribution to the feature map outputted by each filter. As a result, we may omit the computation of the filter on channels with low contribution, and suffer only a minor decrease in accuracy. … The contribution variance of channel s in filter t is defined to be σts = var(||Wts ∗ Xs||F) where Xs is the s channel of the input, sampled from the training dataset. … Once we compute the contribution variance for all channels in all filters, we prune all filter connections to a given channel where the contribution variance is below a threshold τ (see below). As a result, the pruned layer operation is defined by 
[equation image: media_image1.png]
 where 
[symbol image: media_image2.png]
 is the set of channels filter t operates on after the prune operation.”; e.g., “Once we compute the contribution variance for all channels in all filters, we prune all filter connections to a given channel where the contribution variance is below a threshold” and “contribution variance of channel s in filter t” and “For each filter t and each channel s if σts ≥ τ then 
[symbol image: media_image2.png]
 ← s” read(s) on “iteration of the thinned version of the neural network”.)

Babaeizadeh further teaches 
determining that accuracy of the second iteration of the thinned version of the neural network falls below the threshold accuracy value; 
adopting the first iteration of the thinned version of the layer as the thinned version of the layer to be used to generate the pruned version of the neural network based on the accuracy of the second iteration of the thinned version of the neural network falling below the threshold accuracy value.
(Babaeizadeh [fig(s) 1-2] [sec(s) 2] “NoiseOut simply repeats this process to compress the network. The pruning ends when the accuracy of the network drops below some given threshold. Note that the pruning process is happening while training. Algorithm 1 shows the final NoiseOut algorithm. For the sake of readability, this algorithm has been shown for networks with only one hidden layer. But the same algorithm can be applied to networks with more that one hidden layer by performing the same pruning on all the hidden layers independently. It can also be applied to convolutional neural networks that use dense layers, in which we often see over 90% of the network parameters [13].”; e.g., “accuracy of the network” read(s) on “accuracy”. Note that POLYAK teaches “determining that [accuracy] of the second iteration of the thinned version of the neural network [falls below the threshold accuracy] value; adopting the first iteration of the thinned version of the layer as the thinned version of the layer to be used to generate the pruned version of the neural network based on the [accuracy] of the second iteration of the thinned version of the neural network [falling below the threshold accuracy] value”.)

The combination of POLYAK, Li, and Babaeizadeh is combinable with the further teachings of Babaeizadeh for the same rationale as set forth above with respect to claim 1.
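For illustration only, the iterative limitations of claims 10-11 (prune further while accuracy meets the threshold, then adopt the prior iteration once accuracy falls below it) may be sketched as follows. The `evaluate` callback, the step size, and the toy accuracy model are hypothetical; they are not taken from POLYAK, Babaeizadeh, or the Instant Specification.

```python
def thin_layer(sorted_channels, evaluate, threshold, step):
    """Prune `step` more of the lowest-ranked channels per iteration and return
    the last set of kept channels whose accuracy stayed at or above `threshold`."""
    adopted = list(sorted_channels)       # the unpruned layer is the fallback
    n_pruned = step
    while n_pruned < len(sorted_channels):
        candidate = sorted_channels[n_pruned:]
        if evaluate(candidate) < threshold:
            break                          # accuracy fell below the threshold
        adopted = candidate                # adopt this iteration, then prune further
        n_pruned += step
    return adopted

# Hypothetical accuracy model: accuracy grows with the number of kept channels.
def toy_accuracy(kept):
    return 0.80 + 0.02 * len(kept)

channels_sorted = list(range(8))           # index 0 is the lowest-ranked channel
adopted = thin_layer(channels_sorted, toy_accuracy, threshold=0.90, step=2)
print(adopted)  # [2, 3, 4, 5, 6, 7]
```

In this toy run the first iteration (six channels kept) meets the threshold, the second iteration (four channels kept) falls below it, and the first iteration is therefore adopted.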

Regarding claim 12
The combination of POLYAK, Li, Babaeizadeh teaches claim 1.

POLYAK further teaches 
a respective iteration of the thinned neural network is generated to test thinned versions of each of the plurality of layers, and each iteration of the thinned neural network comprises only one thinned version of the plurality of layers.
(POLYAK [fig(s) 1] [algorithm(s) 1] “Output: 
[symbol image: media_image5.png]
”, “The algorithm outputs 
[symbol image: media_image5.png]
 - a set of channels left after pruning for each filter.” “For each filter t and each channel s if σts ≥ τ then 
[symbol image: media_image2.png]
 ← s” [table(s) 1] “scratch model” [sec(s) III-IV] “The Inbound Prune approach that we suggest focuses on reducing the number of channels each filter uses by eliminating channels that do not contribute significantly to the information the filter extracts. The amount of information each channel contributes is measured by the variance of the specific channel activation output. We do not consider directly the model accuracy during the pruning process. Instead, we fine-tune the model obtained after each prune in order to allow it to adapt. The pruning of the entire network is done sequentially on the network layers: from lower layers to the top ones, pruning is followed by fine-tuning, which is followed by pruning of the next layer. … The inbound prune method targets the number of channels that each filter operates on. Our hypothesis is that each input channel has a different level of contribution to the feature map outputted by each filter. As a result, we may omit the computation of the filter on channels with low contribution, and suffer only a minor decrease in accuracy. In order to detect such low importance channels, we leverage the pruning scheme used previously on single connections [22]. … The contribution variance of channel s in filter t is defined to be σts = var(||Wts ∗ Xs||F) where Xs is the s channel of the input, sampled from the training dataset. … Once we compute the contribution variance for all channels in all filters, we prune all filter connections to a given channel where the contribution variance is below a threshold τ (see below). As a result, the pruned layer operation is defined by 
[equation image: media_image1.png]
 where 
[symbol image: media_image2.png]
 is the set of channels filter t operates on after the prune operation.”; e.g., “contribution variance of channel s in filter t” and “For each filter t and each channel s” for each layer read(s) on “thinned versions of each of the plurality of layers”. In addition, e.g., “The algorithm outputs 
[symbol image: media_image5.png]
 - a set of channels left after pruning for each filter” and “Output: 
[symbol image: media_image5.png]
” read(s) on “each iteration of the thinned neural network comprises only one thinned version of the plurality of layers”.)

Regarding claim 13
The combination of POLYAK, Li, Babaeizadeh teaches claim 1.

POLYAK further teaches 
the neural network comprises a convolutional neural network, and the plurality of layers comprise hidden layers of the convolutional neural network.
(POLYAK [fig(s) ] [table(s) 1] “The scratch model by the authors of [12], which is the baseline model in our experiments. The network starts with a gray scale input image of size 1 × 100 × 100 pixels, and runs through 10 convolutional layers interleaved with max pooling layers. Following a spatial average pooling at the end of the process, a representation of size 320 is obtained.” [sec(s) III] “In this section, we provide an overview of our methods for accelerating deep convolutional neural networks; the following sections would provide the necessary details.”;)

Regarding claim 17
The claim is a system claim corresponding to the machine accessible storage medium claim 1, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejection of the machine accessible storage medium claim. 
Note that POLYAK teaches a data processing apparatus and a memory (POLYAK [sec(s) I] “In our implementation, we use the android port of a deep learning framework called Torch7 [2], in which computation is optimized using vectorized convolution code.”; e.g., “Torch7” read(s) on “data processing apparatus” and “memory” since code is run on a computer.)

Claim(s) 2-4, 8, 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over POLYAK et al. (Channel-Level Acceleration of Deep Face Representations) in view of Li et al. (PRUNING FILTERS FOR EFFICIENT CONVNETS), further in view of Babaeizadeh et al. (NoiseOut: A Simple Way to Prune Neural Networks), and further in view of Guo et al. (Dynamic Network Surgery for Efficient DNNs, 2016).

Regarding claim 2
The combination of POLYAK, Li, Babaeizadeh teaches claim 1.

POLYAK further teaches 
generating the pruned version of at least a subset of the plurality of layers comprises performing weight [pruning] on a corresponding thinned version of the layer in the subset.
(POLYAK [fig(s) 1] [table(s) 1] “scratch model” [sec(s) III-IV] “The Inbound Prune approach that we suggest focuses on reducing the number of channels each filter uses by eliminating channels that do not contribute significantly to the information the filter extracts. The amount of information each channel contributes is measured by the variance of the specific channel activation output. We do not consider directly the model accuracy during the pruning process. Instead, we fine-tune the model obtained after each prune in order to allow it to adapt. The pruning of the entire network is done sequentially on the network layers: from lower layers to the top ones, pruning is followed by fine-tuning, which is followed by pruning of the next layer.”; POLYAK teaches fine-tuning weights on a thinned version of a layer.)

However, the combination of POLYAK, Li, Babaeizadeh does not appear to explicitly teach:
generating the pruned version of at least a subset of the plurality of layers comprises performing weight [pruning] on a corresponding thinned version of the layer in the subset.

Guo teaches 
generating the pruned version of at least a subset of the plurality of layers comprises performing weight pruning on a corresponding thinned version of the layer in the subset.
(Guo [fig(s) 1-2] [algorithm(s) 1] “Dynamic network surgery: the SGD method for solving optimization problem (1):” and “Forward propagation” [sec(s) 3] “Once matrix Wk and Tk are updated, they shall be applied to re-calculate the whole network activations and loss function gradient. Repeat these steps iteratively, the sparse model will be able to produce excellent accuracy. The above procedure is summarized in Algorithm 1. Note that, the dynamic property of our method is shown in two aspects. On one hand, pruning operations can be performed whenever the existing connections seem to become unimportant. Yet, on the other hand, the mistakenly pruned connections shall be re-established if they once appear to be important. The latter operation plays a dual role of network pruning, and thus it is called "network splicing" in this paper. Pruning and splicing constitute a circular procedure by constantly updating the connection weights and setting different entries in Tk, which is analogical to the synthesis of excitatory and inhibitory neurotransmitter in human nervous system [17]. See Figure 2 for the overview of our method and the method pipeline can be found in Figure 1(a).”; e.g., “Pruning and splicing constitute a circular procedure by constantly updating the connection weights” along with “Repeat these steps iteratively” read(s) on “weight pruning on a corresponding thinned version of the layer in the subset”.)

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the neural network pruning of POLYAK, Li, and Babaeizadeh with the weight pruning of Guo.
Doing so would make the whole pruning process more flexible, both by better approaching the compression limit and by improving the learning efficiency.
(Guo [sec(s) 1] “In fact, the above strategies help to make the whole process flexible. They are beneficial not only to better approach the compression limit, but also to improve the learning efficiency, which will be validated in Section 4. In our method, pruning and splicing naturally constitute a circular procedure and dynamically divide the network connections into two categories, akin to the synthesis of excitatory and inhibitory neurotransmitter in human nervous systems [17].”)

Regarding claim 3
The combination of POLYAK, Li, Babaeizadeh, Guo teaches claim 2.

performing the weight pruning on a respective thinned version of one of the subset of layers comprises: (see the rejections of claim 2)

POLYAK further teaches 
determining a respective descriptive statistic value from weights of the thinned version of the layer;
(POLYAK [fig(s) 1] [table(s) 1] “scratch model” [sec(s) III-IV] “The Inbound Prune approach that we suggest focuses on reducing the number of channels each filter uses by eliminating channels that do not contribute significantly to the information the filter extracts. The amount of information each channel contributes is measured by the variance of the specific channel activation output. We do not consider directly the model accuracy during the pruning process. Instead, we fine-tune the model obtained after each prune in order to allow it to adapt. The pruning of the entire network is done sequentially on the network layers: from lower layers to the top ones, pruning is followed by fine-tuning, which is followed by pruning of the next layer.” [sec(s) VI] “The model was fine-tuned using SGD with a learning rate of 0.01, momentum of 0.9 and batch size of 128. Each model was fine-tuned for a maximum of 30 epochs or until there were 5 successive epochs of no error improvement.”; e.g., “fine-tuned using SGD” read(s) on “statistic value”.)

Guo further teaches 
determining a respective descriptive statistic value from weights of the thinned version of the layer; 
determining a respective weight threshold for the thinned version of the layer based on the respective descriptive statistic value; and 
(Guo [fig(s) 1-2] [algorithm(s) 1] “Dynamic network surgery: the SGD method for solving optimization problem (1):” and “Forward propagation” [sec(s) 3] “Once matrix Wk and Tk are updated, they shall be applied to re-calculate the whole network activations and loss function gradient. Repeat these steps iteratively, the sparse model will be able to produce excellent accuracy. The above procedure is summarized in Algorithm 1. Note that, the dynamic property of our method is shown in two aspects. On one hand, pruning operations can be performed whenever the existing connections seem to become unimportant. Yet, on the other hand, the mistakenly pruned connections shall be re-established if they once appear to be important. The latter operation plays a dual role of network pruning, and thus it is called "network splicing" in this paper. Pruning and splicing constitute a circular procedure by constantly updating the connection weights and setting different entries in Tk, which is analogical to the synthesis of excitatory and inhibitory neurotransmitter in human nervous system [17]. See Figure 2 for the overview of our method and the method pipeline can be found in Figure 1(a). … That is, the parameters with relatively small magnitude are temporarily pruned, while the others with large magnitude are kept or spliced in each iteration of Algorithm 1. Obviously, the threshold values have a significant impact on the final compression rate. For a certain layer, a single threshold can be set based on the average absolute value and variance of its connection weights.”; e.g., “average absolute value and variance of its connection weights” read(s) on “statistic value”. In addition, e.g., “single threshold can be set based on the average absolute value and variance of its connection weights” read(s) on “determining a respective weight threshold for the thinned version of the layer based on the respective descriptive statistic value”.)

pruning weights in the thinned version of the layer with values below the respective weight threshold for the layer to generate the pruned version of the layer.
(Guo [fig(s) 1-2] [algorithm(s) 1] “Dynamic network surgery: the SGD method for solving optimization problem (1):” and “Forward propagation” [sec(s) 3] “That is, the parameters with relatively small magnitude are temporarily pruned, while the others with large magnitude are kept or spliced in each iteration of Algorithm 1. Obviously, the threshold values have a significant impact on the final compression rate. For a certain layer, a single threshold can be set based on the average absolute value and variance of its connection weights. However, to improve the robustness of our method, we use two thresholds ak and bk by importing a small margin t and set bk as ak + t in Equation (3). For the parameters out of this range, we set their function outputs as the corresponding entries in Tk, which means these parameters will neither be pruned nor spliced in the current iteration. 
[image: media_image6.png] (3)”;)
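
For illustration only, the thresholding described in the quoted Guo passage can be sketched as follows. The function names, the `scale` parameter, and the specific formula (mean plus standard deviation of the absolute weights) are assumptions made for this sketch; the quoted passage states only that a single threshold "can be set based on" the average absolute value and variance of the connection weights.

```python
import numpy as np

def layer_threshold(weights, scale=1.0):
    """Derive a magnitude threshold from one layer's own weight statistics.

    Assumed formula: scale * (mean(|w|) + std(|w|)). Hypothetical, for illustration.
    """
    magnitudes = np.abs(weights)
    return scale * (magnitudes.mean() + magnitudes.std())

def prune_below(weights, threshold):
    """Zero every weight whose magnitude is below the threshold."""
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

# Made-up weights for a single layer.
w = np.array([[0.05, -0.9], [0.4, -0.02]])
t = layer_threshold(w, scale=0.5)
pruned, mask = prune_below(w, t)
```

In this made-up example the two small-magnitude weights (0.05 and -0.02) fall below the computed threshold and are zeroed, while -0.9 and 0.4 are kept.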

The combination of POLYAK, Li, Babaeizadeh is combinable with Guo for the same rationale as set forth above with respect to claim 2.

Regarding claim 4
The combination of POLYAK, Li, Babaeizadeh, Guo teaches claim 3.

Guo further teaches 
the at least one of the thinned versions of the plurality of layers comprises a plurality of the thinned versions of the plurality of layers, the weight threshold determined for a first one of the plurality of layers is different from the weight threshold determined for a second one of the plurality of layers.
(Guo [fig(s) 1-2] [algorithm(s) 1] “Dynamic network surgery: the SGD method for solving optimization problem (1):” and “Forward propagation” [sec(s) 3] “Once matrix Wk and Tk are updated, they shall be applied to re-calculate the whole network activations and loss function gradient. Repeat these steps iteratively, the sparse model will be able to produce excellent accuracy. The above procedure is summarized in Algorithm 1. Note that, the dynamic property of our method is shown in two aspects. On one hand, pruning operations can be performed whenever the existing connections seem to become unimportant. Yet, on the other hand, the mistakenly pruned connections shall be re-established if they once appear to be important. The latter operation plays a dual role of network pruning, and thus it is called "network splicing" in this paper. Pruning and splicing constitute a circular procedure by constantly updating the connection weights and setting different entries in Tk, which is analogical to the synthesis of excitatory and inhibitory neurotransmitter in human nervous system [17]. See Figure 2 for the overview of our method and the method pipeline can be found in Figure 1(a). … That is, the parameters with relatively small magnitude are temporarily pruned, while the others with large magnitude are kept or spliced in each iteration of Algorithm 1. Obviously, the threshold values have a significant impact on the final compression rate. For a certain layer, a single threshold can be set based on the average absolute value and variance of its connection weights. However, to improve the robustness of our method, we use two thresholds ak and bk by importing a small margin t and set bk as ak + t in Equation (3). For the parameters out of this range, we set their function outputs as the corresponding entries in Tk, which means these parameters will neither be pruned nor spliced in the current iteration. 
[image: media_image6.png]
 (3)”; e.g., “single threshold can be set based on the average absolute value and variance of its connection weights” and ak and bk read(s) on “different”.)
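
As a hedged sketch of the two-threshold scheme quoted above, the mask update with ak and bk = ak + t might look like the following. The function name `update_mask` and the handling of weights that lie between the two thresholds (they keep their previous mask entry) are assumptions, because Equation (3) is not reproduced in this text. The per-layer thresholds at the end use the mean absolute weight only as a stand-in statistic.

```python
import numpy as np

def update_mask(weights, mask, a_k, margin):
    """Update a binary mask T_k using thresholds a_k and b_k = a_k + margin."""
    b_k = a_k + margin
    magnitudes = np.abs(weights)
    new_mask = mask.copy()
    new_mask[magnitudes < a_k] = 0   # small magnitude: pruned
    new_mask[magnitudes >= b_k] = 1  # large magnitude: kept or spliced back
    return new_mask                  # in-between entries are left unchanged (assumption)

w = np.array([0.05, 0.25, 0.6])
old_mask = np.array([1, 0, 0])
new_mask = update_mask(w, old_mask, a_k=0.2, margin=0.2)

# Thresholds computed per layer differ whenever the layers' statistics differ.
layers = {"conv1": np.array([0.5, -0.1, 0.3]), "conv2": np.array([0.05, -0.01, 0.03])}
thresholds = {name: float(np.abs(v).mean()) for name, v in layers.items()}
```

With these made-up values, the first weight is pruned, the second keeps its previous (pruned) state, and the third is spliced back in; the two hypothetical layers receive different thresholds.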

The combination of POLYAK, Li, Babaeizadeh is combinable with Guo for the same rationale as set forth above with respect to claim 2.

Regarding claim 8
The combination of POLYAK, Li, Babaeizadeh teaches claim 1.

Guo teaches 
the test comprises performing forward propagation on the thinned version of the neural network.
(Guo [fig(s) 1-2] [algorithm(s) 1] “Dynamic network surgery: the SGD method for solving optimization problem (1):” and “Forward propagation” [sec(s) 3] “Once matrix Wk and Tk are updated, they shall be applied to re-calculate the whole network activations and loss function gradient. Repeat these steps iteratively, the sparse model will be able to produce excellent accuracy. The above procedure is summarized in Algorithm 1. Note that, the dynamic property of our method is shown in two aspects. On one hand, pruning operations can be performed whenever the existing connections seem to become unimportant. Yet, on the other hand, the mistakenly pruned connections shall be re-established if they once appear to be important. The latter operation plays a dual role of network pruning, and thus it is called "network splicing" in this paper. Pruning and splicing constitute a circular procedure by constantly updating the connection weights and setting different entries in Tk, which is analogical to the synthesis of excitatory and inhibitory neurotransmitter in human nervous system [17]. See Figure 2 for the overview of our method and the method pipeline can be found in Figure 1(a).”;)
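
For illustration, forward propagation through a thinned network, as in the quoted step where the updated Wk and Tk are "applied to re-calculate the whole network activations", can be sketched as below. The dense-layer form, the ReLU activation, and the helper names `forward` and `accuracy` are assumptions made for this sketch and do not come from Guo.

```python
import numpy as np

def forward(x, layers):
    """Forward propagation through dense layers with masked weights (W * T)."""
    h = x
    for i, (W, T, b) in enumerate(layers):
        h = h @ (W * T) + b
        if i < len(layers) - 1:
            h = np.maximum(h, 0.0)  # ReLU on hidden layers (assumed activation)
    return h

def accuracy(x, labels, layers):
    """Fraction of samples whose arg-max output matches the label."""
    return float((forward(x, layers).argmax(axis=1) == labels).mean())

# Made-up data: a single identity layer, dense versus thinned.
x = np.array([[2.0, 1.0], [0.0, 3.0]])
labels = np.array([0, 1])
W, b = np.eye(2), np.zeros(2)
dense = [(W, np.ones((2, 2)), b)]
thinned = [(W, np.array([[1.0, 0.0], [0.0, 0.0]]), b)]
```

Comparing `accuracy(x, labels, dense)` with `accuracy(x, labels, thinned)` is one way such a forward pass could serve as a test of a thinned network.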

The combination of POLYAK, Li, Babaeizadeh is combinable with Guo for the same rationale as set forth above with respect to claim 2.

Regarding claim 20
The combination of POLYAK, Li, Babaeizadeh teaches claim 17.

POLYAK further teaches 
determining the pruned version of a respective layer further comprises performing weight-[pruning] to the adopted thinned version of the layer.
(POLYAK [fig(s) 1] [table(s) 1] “scratch model” [sec(s) III-IV] “The Inbound Prune approach that we suggest focuses on reducing the number of channels each filter uses by eliminating channels that do not contribute significantly to the information the filter extracts. The amount of information each channel contributes is measured by the variance of the specific channel activation output. We do not consider directly the model accuracy during the pruning process. Instead, we fine-tune the model obtained after each prune in order to allow it to adapt. The pruning of the entire network is done sequentially on the network layers: from lower layers to the top ones, pruning is followed by fine-tuning, which is followed by pruning of the next layer.”; POLYAK teaches fine-tuning weights on a thinned version of a layer.)

However, the combination of POLYAK, Li, Babaeizadeh does not appear to explicitly teach:
determining the pruned version of a respective layer further comprises performing weight-[pruning] to the adopted thinned version of the layer.

Guo teaches
determining the pruned version of a respective layer further comprises performing weight-pruning to the adopted thinned version of the layer.
(Guo [fig(s) 1-2] [algorithm(s) 1] “Dynamic network surgery: the SGD method for solving optimization problem (1):” and “Forward propagation” [sec(s) 3] “Once matrix Wk and Tk are updated, they shall be applied to re-calculate the whole network activations and loss function gradient. Repeat these steps iteratively, the sparse model will be able to produce excellent accuracy. The above procedure is summarized in Algorithm 1. Note that, the dynamic property of our method is shown in two aspects. On one hand, pruning operations can be performed whenever the existing connections seem to become unimportant. Yet, on the other hand, the mistakenly pruned connections shall be re-established if they once appear to be important. The latter operation plays a dual role of network pruning, and thus it is called "network splicing" in this paper. Pruning and splicing constitute a circular procedure by constantly updating the connection weights and setting different entries in Tk, which is analogical to the synthesis of excitatory and inhibitory neurotransmitter in human nervous system [17]. See Figure 2 for the overview of our method and the method pipeline can be found in Figure 1(a).”; e.g., “Pruning and splicing constitute a circular procedure by constantly updating the connection weights” along with “Repeat these steps iteratively” read(s) on “weight-pruning”.)
	
	The combination of POLYAK, Li, Babaeizadeh is combinable with Guo for the same rationale as set forth above with respect to claim 2.

Claim(s) 5 is/are rejected under 35 U.S.C. 103 as being unpatentable over POLYAK et al. (Channel-Level Acceleration of Deep Face Representations) in view of Li et al. (PRUNING FILTERS FOR EFFICIENT CONVNETS), further in view of Babaeizadeh et al. (NoiseOut: A Simple Way to Prune Neural Networks), further in view of Guo et al. (Dynamic Network Surgery for Efficient DNNs, 2016), further in view of Hervás et al. (Optimization of Computational Neural Network for Its Application in the Prediction of Microbial Growth in Foods).

Regarding claim 5
The combination of POLYAK, Li, Babaeizadeh, Guo teaches claim 3.

Guo further teaches 
the descriptive statistics value comprises a mean of the absolute values of the weights of the thinned version of the layer and a standard deviation of the [absolute] values of the weights of the thinned version of the layer.
(Guo [fig(s) 1-2] [algorithm(s) 1] “Dynamic network surgery: the SGD method for solving optimization problem (1):” and “Forward propagation” [sec(s) 3] “That is, the parameters with relatively small magnitude are temporarily pruned, while the others with large magnitude are kept or spliced in each iteration of Algorithm 1. Obviously, the threshold values have a significant impact on the final compression rate. For a certain layer, a single threshold can be set based on the average absolute value and variance of its connection weights. However, to improve the robustness of our method, we use two thresholds ak and bk by importing a small margin t and set bk as ak + t in Equation (3). For the parameters out of this range, we set their function outputs as the corresponding entries in Tk, which means these parameters will neither be pruned nor spliced in the current iteration. 
[image: media_image6.png]
 (3)”; e.g., “average absolute value” read(s) on “mean of the absolute values”.)

	However, the combination of POLYAK, Li, Babaeizadeh, Guo does not appear to explicitly teach:
the descriptive statistics value comprises a mean of the absolute values of the weights of the thinned version of the layer and a standard deviation of the [absolute] values of the weights of the thinned version of the layer.

Hervás teaches
the descriptive statistics value comprises a mean of the absolute values of the weights of the thinned version of the layer and a standard deviation of the absolute values of the weights of the thinned version of the layer.
(Hervás [fig(s) 1-2] [sec(s) MATERIALS AND METHODS] “The connections associated with the “nonfrozen” weights, i.e., those that had not been previously pruned with a pfro probability, were eliminated when reaching a specific number of training epochs, and once the algorithm had converged, the following two conditions were verified: 
[image: media_image7.png] (2) and [image: media_image8.png] (3) where [image: media_image9.png]
 are the mean and the variance of the absolute values of the weights in a network, 
[image: media_image10.png]
 are the mean and the variance of the absolute values of the derivatives, ghiw and ghider are heuristic parameters to be determined experimentally.”; Note that Guo teaches “the descriptive statistics value comprises a mean of the absolute values of the weights of the thinned version of the layer and a standard deviation of the [absolute] values of the weights of the thinned version of the layer”.)
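
For reference, the relationship between a variance of absolute weight values and the corresponding standard deviation can be written with the usual definitions. The symbols below (w_i, n, \mu_{|w|}, \sigma_{|w|}) are illustrative notation and are not taken from Hervás; the 1/n normalization is likewise an assumption, since the cited equations are not reproduced in this text.

```latex
\mu_{|w|} = \frac{1}{n}\sum_{i=1}^{n} |w_i|, \qquad
\sigma^2_{|w|} = \frac{1}{n}\sum_{i=1}^{n} \bigl(|w_i| - \mu_{|w|}\bigr)^2, \qquad
\sigma_{|w|} = \sqrt{\sigma^2_{|w|}}
```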

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the neural network pruning of POLYAK, Li, Babaeizadeh, Guo with the mean and variance of the absolute values of the weights of Hervás.
Doing so would lead to pruning the net connections, obtaining an improvement in the generalization and a decrease in the number of necessary patterns for the training.
(Hervás [sec(s) Abs] “The architecture of CNN was designed to contain three input parameters in the input layer and one output parameter in the output layer. For their optimization, algorithms were developed to prune the net connections, obtaining an improvement in the generalization and a decrease in the number of necessary patterns for the training. The standard error of prediction (%SEP) obtained was under 5% using twenty inputs to the net, and the result was significantly smaller than the one obtained using regression equations. Therefore, the usefulness of CNN for modeling microbial growth is appealing, and its improvement promises results that will be better than those obtained by other estimation methods up to now.”)

Claim(s) 14-15 is/are rejected under 35 U.S.C. 103 as being unpatentable over POLYAK et al. (Channel-Level Acceleration of Deep Face Representations) in view of Li et al. (PRUNING FILTERS FOR EFFICIENT CONVNETS), further in view of Babaeizadeh et al. (NoiseOut: A Simple Way to Prune Neural Networks), further in view of Zhao et al. (An FPGA Implementation of a Convolutional Auto-Encoder).

Regarding claim 14
The combination of POLYAK, Li, Babaeizadeh teaches claim 1.

generating the pruned version of the layer comprises (see the rejections of claim 1) 

POLYAK further teaches 
[rounding] a number of unpruned channels to a multiple corresponding to a hardware architecture.
(POLYAK [fig(s) 1] [algorithm(s) 1] “Output: 
[image: media_image5.png]
”, “The algorithm outputs 
[image: media_image5.png]
 - a set of channels left after pruning for each filter.” “For each filter t and each channel s if σts ≥ τ then 
[image: media_image2.png]
 ← s” [table(s) 1] “scratch model” [sec(s) III-IV] “The Inbound Prune approach that we suggest focuses on reducing the number of channels each filter uses by eliminating channels that do not contribute significantly to the information the filter extracts. The amount of information each channel contributes is measured by the variance of the specific channel activation output. … The contribution variance of channel s in filter t is defined to be σts = var(||Wts ∗ Xs||F) where Xs is the s channel of the input, sampled from the training dataset. … Once we compute the contribution variance for all channels in all filters, we prune all filter connections to a given channel where the contribution variance is below a threshold τ (see below). As a result, the pruned layer operation is defined by 
[image: media_image1.png] where [image: media_image2.png]
 is the set of channels filter t operates on after the prune operation.” [sec(s) VII] “the model efficiency is captured by measuring the running time on a Samsung Galaxy S6 device which is our target platform.”; e.g., “The algorithm outputs 
[image: media_image5.png]
 - a set of channels left after pruning for each filter” and “Output: 
[image: media_image5.png]
” read(s) on “unpruned channels”. 
Examiner notes that par(s) 58 of the Instant Specification describe(s) “Further, in this example, channel rounding is applied during coarse-grained pruning such that the surviving channels in each layer are a multiple of 4 for optimal hardware utilization” and par(s) 49 of the Instant Specification describe(s) “the number of preserved (i.e., unpruned) channels may be rounded up or down to a number corresponding to the architecture of the system that is to use and perform calculations based on the neural network model. For instance, the preserved channels may be selected to be a number corresponding to the number of multiply-accumulate (MAC) circuits, the number of cores, or another number corresponding to a hardware architecture (e.g., by rounding the number of unpruned channels in a given layer up or down such that it is a multiple of 4, 8, 16, etc.)”)

However, the combination of POLYAK, Li, Babaeizadeh does not appear to explicitly teach:
[rounding] a number of unpruned channels to a multiple corresponding to a hardware architecture.

Zhao teaches
rounding a number of unpruned channels to a multiple corresponding to a hardware architecture.
(Zhao [fig(s) 1-3] [sec(s) 2] “Convolutional encoding is the most important module in the framework; therefore, its structure is further discussed below. To effectively enhance the compatibility of our framework and control the utilization of hardware resources reasonably, the number of input channels is fixed to eight rather than the maximum number of neurons in a certain layer. … The operational process of the method is shown in Figure 3. Four identical processing elements (PEs) are used to process the input data, and the ones with identical sequence number are the same. These PEs are independent of each other so they can operate simultaneously. Therefore, all the convolutional results of the eight channels can be calculated simultaneously by inputting the data in sequence. The RAM1 and RAM2 are used to cache the last two-channel data of every eight inputs, which can avoid loading repetition of the input and improve the efficiency of data utilization. In Figure 3, 8 × i + j (i = 1, . . . , 
[image: media_image11.png]
 − 1 and j = 1, 2, ..., 8) represents the index of channel. Meanwhile, M represents the total number of rows of zero-padding matrix, and the symbol 
[image: media_image12.png]
 represents a rounding-up operation.”; Note that POLYAK teaches “[rounding] a number of unpruned channels to a multiple corresponding to a hardware architecture”.)
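
For illustration, rounding a count of unpruned channels to a multiple that suits a hardware architecture (for example 4, 8, or 16, as described in the quoted paragraph of the Instant Specification) can be sketched as below. The function name `round_channels`, its `mode` parameter, and the choice never to round down to zero channels are assumptions made for this sketch.

```python
def round_channels(n_unpruned, multiple, mode="up"):
    """Round a channel count up or down to a multiple of `multiple`."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    if mode == "up":
        return -(-n_unpruned // multiple) * multiple  # ceiling division
    if mode == "down":
        # Assumed design choice: keep at least one full group of channels.
        return max(multiple, (n_unpruned // multiple) * multiple)
    raise ValueError("mode must be 'up' or 'down'")

rounded_up = round_channels(13, 4)            # 13 surviving channels, groups of 4
rounded_down = round_channels(13, 4, "down")
```

Here 13 surviving channels become 16 when rounded up and 12 when rounded down; a count that is already a multiple is returned unchanged.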

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the neural network pruning of POLYAK, Li, Babaeizadeh with the hardware-matched channel count and rounding-up operation of Zhao.
Doing so would lead to avoiding loading repetition of the input and improving the efficiency of data utilization.
(Zhao [sec(s) 2] “Therefore, all the convolutional results of the eight channels can be calculated simultaneously by inputting the data in sequence. The RAM1 and RAM2 are used to cache the last two-channel data of every eight inputs, which can avoid loading repetition of the input and improve the efficiency of data utilization. In Figure 3, 8 × i + j (i = 1, . . . , 
[image: media_image11.png]
 − 1 and j = 1, 2, ..., 8) represents the index of channel. Meanwhile, M represents the total number of rows of zero-padding matrix, and the symbol 
[image: media_image12.png]
 represents a rounding-up operation.”)

Regarding claim 15
The combination of POLYAK, Li, Babaeizadeh, Zhao teaches claim 14.

POLYAK further teaches 
the hardware architecture comprises hardware architecture of a resource constrained computing device.
(POLYAK [sec(s) I] “In our implementation, we use the android port of a deep learning framework called Torch7 [2], in which computation is optimized using vectorized convolution code.” [sec(s) VII] “We use the scratch model [12], as depicted in Table 1, as our baseline for the evaluation of our methods. Models are evaluated in two different ways. First, we measure the model accuracy by the classification accuracy on the CASIA dataset which we split to 90% training and 10% test. Second, we measure the score on the LFW benchmark [34] in the unrestricted mode. LFW results are mean and Standard Error estimated over fixed ten cross-validation splits. In addition, the model efficiency is captured by measuring the running time on a Samsung Galaxy S6 device which is our target platform.”)

Claim(s) 16 is/are rejected under 35 U.S.C. 103 as being unpatentable over POLYAK et al. (Channel-Level Acceleration of Deep Face Representations) in view of Li et al. (PRUNING FILTERS FOR EFFICIENT CONVNETS), further in view of Liu et al. (Learning Efficient Convolutional Networks through Network Slimming), further in view of Babaeizadeh et al. (NoiseOut: A Simple Way to Prune Neural Networks).

Regarding claim 16
POLYAK teaches 

A method comprising: 
accessing a neural network, wherein the neural network comprises a plurality of layers, and each of the layers comprises a respective set of channels; 
(POLYAK [fig(s) 1] [table(s) 1] “scratch model” [sec(s) III] “For the purpose of benchmarking on the LFW benchmark [34], we use the scratch network in order to extract face feature representation. This representation is the collection of the 320 activations of the Avg Pool layer. Afterwards, we train a Joint Bayesian model [35] based on these extracted features for the face verification task.”; e.g., table 1 read(s) on “neural network”.)

determining pruned versions of each of the plurality of layers, wherein determining the pruned version of a respective layer comprises: 
(POLYAK [fig(s) 1] [table(s) 1] “scratch model” [sec(s) III-IV] “The Inbound Prune approach that we suggest focuses on reducing the number of channels each filter uses by eliminating channels that do not contribute significantly to the information the filter extracts. The amount of information each channel contributes is measured by the variance of the specific channel activation output. We do not consider directly the model accuracy during the pruning process. Instead, we fine-tune the model obtained after each prune in order to allow it to adapt. The pruning of the entire network is done sequentially on the network layers: from lower layers to the top ones, pruning is followed by fine-tuning, which is followed by pruning of the next layer.”;)

sorting the set of channels of the layer based on respective weight values of each channel in the set of channels; 
(POLYAK [fig(s) 1] [table(s) 1] “scratch model” [sec(s) III-IV] “The inbound prune method targets the number of channels that each filter operates on. Our hypothesis is that each input channel has a different level of contribution to the feature map outputted by each filter. As a result, we may omit the computation of the filter on channels with low contribution, and suffer only a minor decrease in accuracy. In order to detect such low importance channels, we leverage the pruning scheme used previously on single connections [22]. Specifically, the notion of smallest contribution variance – min(σ) is used. Originally, [22] define scores for the contribution of a single weight to the activation of a single neuron. Below, we generalize this measure to the contribution of each channel to the filter activation by using the notion of channel activation. … The contribution variance of channel s in filter t is defined to be σts = var(||Wts ∗ Xs||F) where Xs is the s channel of the input, sampled from the training dataset. … Once we compute the contribution variance for all channels in all filters, we prune all filter connections to a given channel where the contribution variance is below a threshold τ (see below). As a result, the pruned layer operation is defined by 
[image: media_image1.png] where [image: media_image2.png]
 is the set of channels filter t operates on after the prune operation.”; e.g., “the contribution variance is below a threshold” read(s) on “sorting”.)
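
As a hedged illustration of the quoted definition σts = var(||Wts ∗ Xs||F), the sketch below computes a contribution variance per filter and channel over a batch of sampled inputs and then keeps the channels at or above a threshold τ. The array layouts, the use of "valid" cross-correlation for the ∗ operator, and the names `contribution_variance` and `kept_channels` are assumptions; the ascending sort is included only to show an explicit ordering step and is not stated in the quoted passage.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def contribution_variance(W, X):
    """W: (filters, channels, kh, kw); X: (samples, channels, height, width).

    Returns sigma with shape (filters, channels).
    """
    kh, kw = W.shape[2:]
    # windows: (samples, channels, out_h, out_w, kh, kw)
    windows = sliding_window_view(X, (kh, kw), axis=(2, 3))
    # responses: (samples, filters, channels, out_h, out_w)
    responses = np.einsum("ncijkl,tckl->ntcij", windows, W)
    norms = np.sqrt((responses ** 2).sum(axis=(3, 4)))  # Frobenius norm per sample
    return norms.var(axis=0)                            # variance across samples

def kept_channels(sigma, tau):
    """Per filter, channels ordered by ascending variance with sigma >= tau."""
    order = np.argsort(sigma, axis=1)
    return [[int(s) for s in order[t] if sigma[t, s] >= tau]
            for t in range(sigma.shape[0])]

# Made-up inputs: channel 0 is identical in every sample, channel 1 varies.
X = np.ones((4, 2, 3, 3))
X[:, 1] *= np.arange(4).reshape(4, 1, 1)
W = np.ones((1, 2, 2, 2))
sigma = contribution_variance(W, X)
kept = kept_channels(sigma, tau=1.0)
```

In this made-up example channel 0 produces the same response norm for every sample, so its contribution variance is zero and it is dropped, while channel 1 is retained.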

iteratively pruning [different-sized] portions of the set of channels based on the sorting to form iterations of a thinned version of the layer; 
(POLYAK [fig(s) 1] [table(s) 1] “scratch model” [sec(s) III-IV] “The inbound prune method targets the number of channels that each filter operates on. Our hypothesis is that each input channel has a different level of contribution to the feature map outputted by each filter. As a result, we may omit the computation of the filter on channels with low contribution, and suffer only a minor decrease in accuracy. In order to detect such low importance channels, we leverage the pruning scheme used previously on single connections [22]. Specifically, the notion of smallest contribution variance – min(σ) is used. Originally, [22] define scores for the contribution of a single weight to the activation of a single neuron. Below, we generalize this measure to the contribution of each channel to the filter activation by using the notion of channel activation. … The contribution variance of channel s in filter t is defined to be σts = var(||Wts ∗ Xs||F) where Xs is the s channel of the input, sampled from the training dataset. … Once we compute the contribution variance for all channels in all filters, we prune all filter connections to a given channel where the contribution variance is below a threshold τ (see below). As a result, the pruned layer operation is defined by 
[image: media_image1.png] where [image: media_image2.png]
 is the set of channels filter t operates on after the prune operation.”; e.g., “Once we compute the contribution variance for all channels in all filters, we prune all filter connections to a given channel where the contribution variance is below a threshold” and “contribution variance of channel s in filter t” read(s) on “iterations of a thinned version of the layer”.)

generating iterations of a thinned version of the neural network, wherein each iteration substitutes an original version of the layer with one of the iterations of the thinned version of the layer; 
(POLYAK [fig(s) 1] [algorithm(s) 1] “For each filter t and each channel s if σts ≥ τ then 
[image: media_image2.png]
 ← s” [table(s) 1] “scratch model” [sec(s) III-IV] “The inbound prune method targets the number of channels that each filter operates on. Our hypothesis is that each input channel has a different level of contribution to the feature map outputted by each filter. As a result, we may omit the computation of the filter on channels with low contribution, and suffer only a minor decrease in accuracy. In order to detect such low importance channels, we leverage the pruning scheme used previously on single connections [22]. Specifically, the notion of smallest contribution variance – min(σ) is used. Originally, [22] define scores for the contribution of a single weight to the activation of a single neuron. Below, we generalize this measure to the contribution of each channel to the filter activation by using the notion of channel activation. … The contribution variance of channel s in filter t is defined to be σts = var(||Wts ∗ Xs||F) where Xs is the s channel of the input, sampled from the training dataset. … Once we compute the contribution variance for all channels in all filters, we prune all filter connections to a given channel where the contribution variance is below a threshold τ (see below). As a result, the pruned layer operation is defined by 
[image: media_image1.png] where [image: media_image2.png]
 is the set of channels filter t operates on after the prune operation.”; e.g., “Once we compute the contribution variance for all channels in all filters, we prune all filter connections to a given channel where the contribution variance is below a threshold” and “contribution variance of channel s in filter t” and “For each filter t and each channel s if σts ≥ τ then 
[image: media_image2.png]
 ← s” read(s) on “iterations of a thinned version of the neural network”.)

testing [each of] the iterations of the thinned version of the neural network to determine whether [accuracy] of the iteration of the thinned version of the neural network [exceeds a threshold accuracy] value; 
(POLYAK [fig(s) 1] [table(s) 1, 5] “model classification accuracy before and after fine-tuning … Each row indicates pruning done up to the indicated layer.” [sec(s) VI] “During our experiments, the threshold τ was chosen empirically based on the accuracy achieved after pruning the model, prior to fine-tuning. We do not let the validation accuracy on the CASIA dataset drop below 84%. The model was fine-tuned using SGD with a learning rate of 0.01, momentum of 0.9 and batch size of 128. Each model was fine-tuned for a maximum of 30 epochs or until there were 5 successive epochs of no error improvement.” [sec(s) VII] “The Effect of Fine-Tuning: Next, we verify that fine-tuning reduces the accumulation of error rate between layers by comparing the accuracy of the models with and without fine-tuning after prune. Indeed, Table 5 shows that the detrimental effect of pruning is mitigated by performing fine-tuning. However, this effect is visible only in higher layers. At layer Conv21, post pruning results are 94% on the LFW benchmark for both methods. At Conv41, fine-tuning results are 93% compared to 88.43% – a difference of 4.5% in error rate. The increasing gap can be explained by the accumulated error caused by pruning all previous layers.”;)

determining that a particular iteration of the thinned version of the layer has a highest percentage of pruned channels amongst iterations of the thinned version of the layer included in iterations of the thinned version of the neural network tested to have an [accuracy in excess of the threshold accuracy] value; and 
(POLYAK [fig(s) 1] [algorithm(s) 1] “Output: 
[image: media_image5.png]
” and “For each filter t and each channel s if σts ≥ τ then 
[image: media_image2.png]
 ← s” [table(s) 1, 5] “model classification accuracy before and after fine-tuning … Each row indicates pruning done up to the indicated layer.” [sec(s) VI] “During our experiments, the threshold τ was chosen empirically based on the accuracy achieved after pruning the model, prior to fine-tuning. We do not let the validation accuracy on the CASIA dataset drop below 84%. The model was fine-tuned using SGD with a learning rate of 0.01, momentum of 0.9 and batch size of 128. Each model was fine-tuned for a maximum of 30 epochs or until there were 5 successive epochs of no error improvement.” [sec(s) VII] “The Effect of Fine-Tuning: Next, we verify that fine-tuning reduces the accumulation of error rate between layers by comparing the accuracy of the models with and without finetuning after prune. Indeed, Table 5 shows that the detrimental effect of pruning is mitigated by performing fine-tuning. However, this effect is visible only in higher layers. At layer Conv21, post pruning results are 94% on the LFW benchmark for both methods. At Conv41, fine-tuning results are 93% compared to 88.43% – a difference of 4.5% in error rate. The increasing gap can be explained by the accumulated error caused by pruning all previous layers.”; e.g., “Output:
[image: media_image5.png]
” read(s) on “particular iteration of the thinned version of the layer has a highest percentage of pruned channels”.)
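
For illustration of the selection step recited in this limitation, the sketch below tests several candidate prune fractions for one layer and returns the largest fraction whose thinned network still meets an accuracy floor. The function name, the `evaluate` callback, and the accuracy figures are hypothetical; only the 84% floor echoes the quoted POLYAK passage, and the comparison uses "at or above" where the claim recites "exceeds".

```python
def select_prune_fraction(fractions, evaluate, min_accuracy):
    """Return the largest prune fraction meeting min_accuracy, or None."""
    best = None
    for fraction in sorted(fractions):
        # Every candidate is tested; the last passing one is the largest.
        if evaluate(fraction) >= min_accuracy:
            best = fraction
    return best

# Made-up accuracies for thinned networks at each prune fraction.
measured = {0.1: 0.90, 0.3: 0.87, 0.5: 0.85, 0.7: 0.80}
chosen = select_prune_fraction(measured.keys(), measured.get, min_accuracy=0.84)
```

With these made-up measurements, pruning 50% of the channels is the most aggressive setting that stays at or above the floor; if no candidate passes, the function returns `None` and the layer would be left unpruned.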

using the particular iteration of the thinned version of the layer to generate the pruned version of the layer; and 
(POLYAK [fig(s) 1] [algorithm(s) 1] “Output: [image: media_image5.png]” and “For each filter t and each channel s if σts ≥ τ then [image: media_image2.png] ← s” [table(s) 1, 5] “model classification accuracy before and after fine-tuning … Each row indicates pruning done up to the indicated layer.” [sec(s) VI] “During our experiments, the threshold τ was chosen empirically based on the accuracy achieved after pruning the model, prior to fine-tuning. We do not let the validation accuracy on the CASIA dataset drop below 84%. The model was fine-tuned using SGD with a learning rate of 0.01, momentum of 0.9 and batch size of 128. Each model was fine-tuned for a maximum of 30 epochs or until there were 5 successive epochs of no error improvement.” [sec(s) VII] “The Effect of Fine-Tuning: Next, we verify that fine-tuning reduces the accumulation of error rate between layers by comparing the accuracy of the models with and without finetuning after prune. Indeed, Table 5 shows that the detrimental effect of pruning is mitigated by performing fine-tuning. However, this effect is visible only in higher layers. At layer Conv21, post pruning results are 94% on the LFW benchmark for both methods. At Conv41, fine-tuning results are 93% compared to 88.43% – a difference of 4.5% in error rate. The increasing gap can be explained by the accumulated error caused by pruning all previous layers.”; e.g., “Output: [image: media_image5.png]” read(s) on “particular iteration of the thinned version of the layer”.)

generating a pruned version of the neural network to comprise the respective pruned versions of each of the plurality of layers.
(POLYAK [fig(s) 1] [table(s) 1] “scratch model” [sec(s) III-IV] “The Inbound Prune approach that we suggest focuses on reducing the number of channels each filter uses by eliminating channels that do not contribute significantly to the information the filter extracts. The amount of information each channel contributes is measured by the variance of the specific channel activation output. We do not consider directly the model accuracy during the pruning process. Instead, we fine-tune the model obtained after each prune in order to allow it to adapt. The pruning of the entire network is done sequentially on the network layers: from lower layers to the top ones, pruning is followed by fine-tuning, which is followed by pruning of the next layer.”;)
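
For illustration only, the variance criterion quoted above from POLYAK can be sketched as follows: for each filter, only the input channels whose activation variance meets a threshold τ are kept. The function name `select_channels_by_variance`, the nested-list data layout, and the sample values are assumptions made for this sketch and are not taken from POLYAK or from the present application.

```python
from statistics import pvariance


def select_channels_by_variance(activations, tau):
    """For each filter t, return the indices of channels s with variance >= tau.

    activations[t][s] is a list of sampled activation values that input
    channel s contributes to filter t (hypothetical layout for this sketch).
    """
    kept = []
    for per_filter in activations:
        kept.append([s for s, samples in enumerate(per_filter)
                     if pvariance(samples) >= tau])
    return kept


# Hypothetical activations: two filters, two input channels each.
example = [
    [[0.0, 0.0, 0.0], [1.0, 3.0, 5.0]],  # filter 0: channel 0 is constant
    [[2.0, 4.0, 6.0], [1.0, 1.0, 1.1]],  # filter 1: channel 1 is nearly constant
]
kept_channels = select_channels_by_variance(example, tau=0.5)
```

In this sketch the low-variance channel of each filter is dropped, which corresponds to eliminating channels that "do not contribute significantly to the information the filter extracts."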

However, POLYAK does not appear to explicitly teach:
[sorting] the set of channels of the layer based on respective weight values of each channel in the set of channels; 
iteratively pruning [different-sized] portions of the set of channels based on the [sorting] to form iterations of a thinned version of the layer; 
testing [each of] the iterations of the thinned version of the neural network to determine whether [accuracy] of the iteration of the thinned version of the neural network [exceeds a threshold accuracy] value; 
determining that a particular iteration of the thinned version of the layer has a highest percentage of pruned channels amongst iterations of the thinned version of the layer included in iterations of the thinned version of the neural network tested to have an [accuracy in excess of the threshold accuracy] value; and 

Li teaches
sorting the set of channels of the layer based on respective weight values of each channel in the set of channels;
iteratively pruning [different-sized] portions of the set of channels based on the sorting to form iterations of a thinned version of the layer; 
(Li [fig(s) 1-2] [sec(s) 3] “Our method prunes the less useful filters from a well-trained model for computational efficiency while minimizing the accuracy drop. We measure the relative importance of a filter in each layer by calculating the sum of its absolute weights Σ|Fi,j|, i.e., its l1-norm ||Fi,j||1. Since the number of input channels, ni, is the same across filters, Σ|Fi,j| also represents the average magnitude of its kernel weights. This value gives an expectation of the magnitude of the output feature map. Filters with smaller kernel weights tend to produce feature maps with weak activations as compared to the other filters in that layer. Figure 2(a) illustrates the distribution of filters’ absolute weights sum for each convolutional layer in a VGG-16 network trained on the CIFAR-10 dataset, where the distribution varies significantly across layers. We find that pruning the smallest filters works better in comparison with pruning the same number of random or largest filters (Section 4.4). Compared to other criteria for activation-based feature map pruning (Section 4.5), we find l1-norm is a good criterion for data-free filter selection. [image: media_image4.png]”; Note that POLYAK teaches “the set of channels of the layer based on respective weight values of each channel in the set of channels; iteratively pruning [different-sized] portions of the set of channels based on the … to form iterations of a thinned version of the layer”.)
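
As a minimal sketch of the ranking criterion quoted above from Li, the following computes the sum of absolute kernel weights (l1-norm) of each filter and sorts the filters from smallest to largest. The helper names `l1_norm` and `rank_filters_by_l1`, the nested-list weight layout, and the sample weights are hypothetical and serve only to illustrate the idea.

```python
def l1_norm(filter_weights):
    """Sum of absolute kernel weights over all input channels of one filter."""
    return sum(abs(w) for kernel in filter_weights for w in kernel)


def rank_filters_by_l1(layer):
    """Return filter indices sorted from smallest to largest l1-norm."""
    return sorted(range(len(layer)), key=lambda j: l1_norm(layer[j]))


# Hypothetical layer: three filters, each with two flattened kernels.
layer = [
    [[0.5, -0.5], [1.0, 0.0]],   # l1-norm 2.0
    [[0.1, 0.0], [0.0, -0.1]],   # l1-norm 0.2
    [[-2.0, 1.0], [0.5, 0.5]],   # l1-norm 4.0
]
order = rank_filters_by_l1(layer)
```

Under this ordering, the filters at the front of `order` are the "smallest filters" that Li reports as the preferred pruning candidates.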

However, the combination of POLYAK, Li does not appear to explicitly teach:
iteratively pruning [different-sized] portions of the set of channels based on the sorting to form iterations of a thinned version of the layer; 
testing [each of] the iterations of the thinned version of the neural network to determine whether [accuracy] of the iteration of the thinned version of the neural network [exceeds a threshold accuracy] value; 
determining that a particular iteration of the thinned version of the layer has a highest percentage of pruned channels amongst iterations of the thinned version of the layer included in iterations of the thinned version of the neural network tested to have an [accuracy in excess of the threshold accuracy] value; and 

Liu teaches
iteratively pruning different-sized portions of the set of channels based on the sorting to form iterations of a thinned version of the layer; 
(Liu [fig(s) 4-5] “The effect of pruning varying percentages of channels” [sec(s) 5] “Once we obtain a model trained with sparsity regularization, we need to decide what percentage of channels to prune from the model. If we prune too few channels, the resource saving can be very limited. However, it could be destructive to the model if we prune too many channels, and it may not be possible to recover the accuracy by fine-tuning. We train a DenseNet40 model with λ=10^-5 on CIFAR-10 to show the effect of pruning a varying percentage of channels. The results are summarized in Figure 5.”;)

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the neural network pruning of POLYAK, Li with the tensor sorting of Liu. 
Doing so would enable the model, when trained with sparsity, to perform better than the original model even without fine-tuning, owing to the regularization effect of L1 sparsity on channel scaling factors.
(Liu [sec(s) 5] “From Figure 5, it can be concluded that the classification performance of the pruned or fine-tuned models degrade only when the pruning ratio surpasses a threshold. The fine-tuning process can typically compensate the possible accuracy loss caused by pruning. Only when the threshold goes beyond 80%, the test error of fine-tuned model falls behind the baseline model. Notably, when trained with sparsity, even without fine-tuning, the model performs better than the original model. This is possibly due the the regularization effect of L1 sparsity on channel scaling factors.”)
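
The idea of pruning varying percentages of channels, as discussed in the Liu passage above, can be sketched as follows. Given channels already ordered from least to most important, each candidate percentage removes a different-sized leading portion. The function name `thinned_iterations` and the candidate percentages are assumptions for this sketch, not values taken from Liu.

```python
def thinned_iterations(sorted_channels, percentages):
    """Return (percentage, kept_channels) for each candidate pruning percentage.

    sorted_channels is ordered from least to most important, so the leading
    entries are the ones removed in each iteration.
    """
    n = len(sorted_channels)
    results = []
    for pct in percentages:
        n_pruned = n * pct // 100  # integer count of channels to remove
        results.append((pct, sorted_channels[n_pruned:]))
    return results


# Hypothetical layer with ten channels, already sorted by importance.
channels = list(range(10))
iterations = thinned_iterations(channels, [20, 40, 80])
```

Each entry of `iterations` stands for one thinned version of the layer; whether a given percentage is acceptable would still have to be determined by evaluating the resulting network.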

Babaeizadeh teaches
testing each of the iterations of the thinned version of the neural network to determine whether accuracy of the iteration of the thinned version of the neural network exceeds a threshold accuracy value; 
determining that a particular iteration of the thinned version of the layer has a highest percentage of pruned channels amongst iterations of the thinned version of the layer included in iterations of the thinned version of the neural network tested to have an accuracy in excess of the threshold accuracy value; and 
 (Babaeizadeh [fig(s) 1-2] [sec(s) 2] “NoiseOut simply repeats this process to compress the network. The pruning ends when the accuracy of the network drops below some given threshold. Note that the pruning process is happening while training. Algorithm 1 shows the final NoiseOut algorithm. For the sake of readability, this algorithm has been shown for networks with only one hidden layer. But the same algorithm can be applied to networks with more that one hidden layer by performing the same pruning on all the hidden layers independently. It can also be applied to convolutional neural networks that use dense layers, in which we often see over 90% of the network parameters [13].”; e.g., “accuracy of the network” read(s) on “accuracy”. Note that POLYAK teaches “determining that a particular iteration of the thinned version of the layer has a highest percentage of pruned channels amongst iterations of the thinned version of the layer included in iterations of the thinned version of the neural network tested to have an accuracy [in excess of the threshold accuracy value]”.)
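
By way of a hedged sketch, the selection described in the limitation above (the highest percentage of pruned channels among iterations whose accuracy exceeds a threshold) can be written as shown below. This sketch evaluates every candidate, whereas the quoted NoiseOut passage stops pruning once accuracy drops below the threshold. The function name `best_pruning_percentage`, the `evaluate` callable, and the accuracy figures are hypothetical stand-ins, not measured results.

```python
def best_pruning_percentage(candidates, evaluate, threshold):
    """Return the largest pruning percentage whose accuracy exceeds threshold.

    candidates: iterable of pruning percentages to test.
    evaluate:   callable mapping a percentage to the accuracy of the thinned
                network (a hypothetical stand-in for a real test run).
    Returns None when no candidate exceeds the threshold.
    """
    passing = [pct for pct in candidates if evaluate(pct) > threshold]
    return max(passing) if passing else None


# Hypothetical accuracy table standing in for real evaluations.
measured = {10: 0.95, 30: 0.93, 50: 0.90, 70: 0.86, 90: 0.60}
best = best_pruning_percentage(measured, measured.get, threshold=0.85)
```

With these illustrative numbers, the 70% iteration is the most aggressively pruned one that still exceeds the threshold, and the 90% iteration is rejected.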

The combination of POLYAK, Li is combinable with Babaeizadeh for the same rationale as set forth above with respect to claim 1.

Claim(s) 18-19 is/are rejected under 35 U.S.C. 103 as being unpatentable over POLYAK et al. (Channel-Level Acceleration of Deep Face Representations) in view of Li et al. (PRUNING FILTERS FOR EFFICIENT CONVNETS), further in view of Babaeizadeh et al. (NoiseOut: A Simple Way to Prune Neural Networks), and further in view of Collobert et al. (Torch7: A Matlab-like Environment for Machine Learning).

Regarding claim 18
The combination of POLYAK, Li, Babaeizadeh teaches claim 17.

POLYAK further teaches 
an [interface] to provide the pruned version of the neural network to a computing device.
(POLYAK [sec(s) I] “In our implementation, we use the android port of a deep learning framework called Torch7 [2], in which computation is optimized using vectorized convolution code.” [sec(s) VII] “We use the scratch model [12], as depicted in Table 1, as our baseline for the evaluation of our methods. Models are evaluated in two different ways. First, we measure the model accuracy by the classification accuracy on the CASIA dataset which we split to 90% training and 10% test. Second, we measure the score on the LFW benchmark [34] in the unrestricted mode. LFW results are mean and Standard Error estimated over fixed ten cross-validation splits. In addition, the model efficiency is captured by measuring the running time on a Samsung Galaxy S6 device which is our target platform.”)

However, the combination of POLYAK, Li, Babaeizadeh does not appear to explicitly teach:
an [interface] to provide the pruned version of the neural network to a computing device.

Collobert teaches
an interface to provide the pruned version of the neural network to a computing device.
(Collobert [fig(s) 2] “MLP”, “CNN”, “GPU”, “CPU” [sec(s) 4] “Once understood, these concepts were sufficient to allow us to write our own 2D convolutions, which are computed at about 200GFLOP/s on a GTX580, for large enough inputs. For smaller inputs, our OpenMP+SSE implementation remains more efficient. Once built against CUDA, Torch7 provides a new Tensor type: torch.CudaTensor. Once created, such a Tensor lives in the GPU’s DRAM memory. All operators defined on standard Tensors are also defined on CudaTensors, which completely abstracts the use of the graphics processor. Here is a small illustrative example, that demonstrates the simplicity of the interface:

[image: media_image13.png]
 
On top of the Tensors’ main operators, all the matrix-based operators are available, as well as most standard convolution routines.”; Note that POLYAK teaches “an [interface] to provide the pruned version of the neural network to a computing device”.)

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the neural network pruning of POLYAK, Li, Babaeizadeh with the interface of Collobert. 
Doing so would provide efficiency by leveraging SSE when possible and by supporting two ways of parallelization, OpenMP and CUDA, based on the interface.
(Collobert [sec(s) 4] “Torch7 has been designed with efficiency in mind, leveraging SSE when possible and supporting two ways of parallelization: OpenMP and CUDA. The Tensor library (interfaced with the “torch” package in Lua) makes a heavy usage of these techniques. From the user viewpoint, enabling CUDA and OpenMP can lead to great speedups in any “Lua” script, at zero implementation cost (because most packages rely on the Tensor library). Other packages (like the “nn” package) also leverage OpenMP and CUDA for more specific usages not covered by the Tensor library.”)

In the alternative, Zhao can also be interpreted to teach the following limitation:
an interface to provide the pruned version of the neural network to a computing device.
(Zhao [fig(s) 6] [sec(s) 3] “In this experiment, the Xilinx KCU105 evaluation board (Xilinx, San Jose, CA, USA) with a XCKU040-2FFVA1156E FPGA chip (Xilinx, San Jose, CA, USA) is employed to act as the hardware platform. The photograph of the implemented prototype is shown in Figure 6, from which we can see that the evaluation board is connected to a computer by JTAG cable and PCI-e bus. FPGA is programmed and debugged through the JTAG cable, while the data are transferred between the computer and FPGA through the PCI-e bus.” [sec(s) 4] “An implementation of CAE based on FPGA is presented in this paper, which newly introduces a periodic layer-multiplexing framework. The encoder part of the proposed CAE framework, which is similar to the decoder, contains five modules, zero padding, a channel distributor, convolutional encoding, a channel arbitrator, and an output controller.”; Note that POLYAK teaches “an [interface] to provide the pruned version of the neural network to a computing device”.)

The combination of POLYAK, Li, Babaeizadeh is combinable with Zhao for the same rationale as set forth above with respect to claim 14.

Regarding claim 19
The combination of POLYAK, Li, Babaeizadeh teaches claim 18.

POLYAK further teaches 
comprising the computing device, wherein the computing device comprises a resource-constrained computing device.
(POLYAK [sec(s) I] “In our implementation, we use the android port of a deep learning framework called Torch7 [2], in which computation is optimized using vectorized convolution code.” [sec(s) VII] “We use the scratch model [12], as depicted in Table 1, as our baseline for the evaluation of our methods. Models are evaluated in two different ways. First, we measure the model accuracy by the classification accuracy on the CASIA dataset which we split to 90% training and 10% test. Second, we measure the score on the LFW benchmark [34] in the unrestricted mode. LFW results are mean and Standard Error estimated over fixed ten cross-validation splits. In addition, the model efficiency is captured by measuring the running time on a Samsung Galaxy S6 device which is our target platform.”)

Prior Art
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Li et al. (PRUNING FILTERS FOR EFFICIENT CONVNETS) teaches sorting the sum of absolute kernel weights of each filter, and l2-norm-based contribution variance of channels, in sec. 4.
Guo et al. (Dynamic Network Surgery for Efficient DNNs) teaches the same iterations as the iterations of fig 5C in the present application.
Kadav et al. (US10832136 B2) teaches sorting the sum of absolute kernel weights of each filter.
Gupta et al. (Deep Learning with Limited Numerical Precision) teaches rounding operation precisions (e.g., stochastic rounding).
Anwar et al. (COARSE PRUNING OF CONVOLUTIONAL NEURAL NETWORKS WITH RANDOM MASKS) teaches weight pruning based on different granularities.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SEHWAN KIM whose telephone number is (571)270-7409. The examiner can normally be reached Mon - Thu 7:00 AM - 5:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael J Huntley, can be reached at (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/S.K./Examiner, Art Unit 2129
9/19/2022
/MICHAEL J HUNTLEY/Supervisory Patent Examiner, Art Unit 2129