DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Status of Claims
	This Office Action is in response to the communication filed on 01/03/2020.
	Claims 1-22 are being considered on the merits.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 8/16/2022 has been considered. The submission is in compliance with the provisions of 37 CFR 1.97. Accordingly, initialed and dated copies of Applicant's IDS forms 1449 filed 8/16/2022 are attached to the instant Office action.
Drawings
	The drawings filed on 01/03/2020 are accepted. 

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-4, 11-14, and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Molchanov et al. (“Pruning Convolutional Neural Networks for Resource Efficient Inference”, 2017, ICLR; hereinafter “Molchanov”) in view of Ye et al. (“Rethinking the Smaller-Norm-Less-Informative Assumption in Channel Pruning of Convolution Layers”, 2018, ICLR; hereinafter “Ye”).
Regarding Claims 1, 11, and 21, Molchanov teaches a computer-implemented method, a computing system, and a non-transitory computer-readable medium:
comparing the channel sparsity metrics of the channel kernel corresponding to one channel of the plurality of channels against a sparsity inference threshold (Molchanov, Pg. 15, Sec. A.4: “let us denote a set of image feature maps by $z_l \in \mathbb{R}^{H_l \times W_l \times C_l}$ with dimensionality $H_l \times W_l \times C_l$ individual maps (or channels)…fine-tuning with high $l_1$ or $l_2$ regularization causes unimportant connections to be suppressed…The idea of pruning with high regularization can be extended to removing the kernels for an entire feature map if the $l_2$ norm of those kernels is below a predefined threshold”. Examiner notes that the broadest reasonable interpretation of channel sparsity metrics includes data about data including energy levels or the kernel norm as described in Molchanov).
controlling inference operations of the one channel in response to determining that the channel sparsity metrics of the channel kernel corresponding to the one channel is greater than the sparsity inference threshold (Molchanov, Pg. 15, Sec. A.4: “let us denote a set of image feature maps by $z_l \in \mathbb{R}^{H_l \times W_l \times C_l}$ with dimensionality $H_l \times W_l \times C_l$ individual maps (or channels)…fine-tuning with high $l_1$ or $l_2$ regularization causes unimportant connections to be suppressed. Connections with energy lower than some threshold can be removed on the assumption that they do not contribute much to subsequent layers…The idea of pruning with high regularization can be extended to removing the kernels for an entire feature map if the $l_2$ norm of those kernels is below a predefined threshold”. Examiner notes that the broadest reasonable interpretation of “controlling inference operations” includes pruning kernels in a channel such that the pruning controls i.e. affects inference operations).
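For illustration only, the threshold comparison described in the quoted passage of Molchanov can be sketched as follows. This is a minimal sketch under stated assumptions and is neither Molchanov's implementation nor the claimed method: each channel kernel is assumed to be a flat list of weights, and the names `channel_l2_norms` and `prune_by_norm` are hypothetical.

```python
import math

def channel_l2_norms(kernels):
    """Return the l2 norm of each channel kernel (each kernel is a flat list of weights)."""
    return [math.sqrt(sum(w * w for w in kernel)) for kernel in kernels]

def prune_by_norm(kernels, threshold):
    """Remove kernels whose l2 norm is below the threshold; keep the rest."""
    norms = channel_l2_norms(kernels)
    return [k for k, n in zip(kernels, norms) if n >= threshold]

# Hypothetical layer with three channel kernels (assumed values).
kernels = [[3.0, 4.0], [0.1, 0.0], [0.6, 0.8]]
norms = channel_l2_norms(kernels)
kept = prune_by_norm(kernels, threshold=0.5)
```

In this sketch the second kernel has a norm of about 0.1, which is below the assumed threshold of 0.5, so it is the only kernel removed.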
Molchanov fails to explicitly disclose: 
reading metadata of an inference layer of the NN model, the inference layer including a plurality of channels, and the metadata including channel sparsity metrics of a plurality of channel kernels corresponding to the plurality of channels 
However, Ye teaches: 
reading metadata of an inference layer of the NN model, the inference layer including a plurality of channels, and the metadata including channel sparsity metrics of a plurality of channel kernels corresponding to the plurality of channels (Ye, pg. 7-8, sec. 5.1 and Table 1: “We start with a standard 4-layer convolutional neural network whose network attributes are specified in Table 1…The detailed statistics and its pruned channel size are reported in Table 1.” “Table 1: Comparisons between different pruned networks and the base network”. Examiner notes that the broadest reasonable interpretation of “reading metadata” means discerning data about data including discerning sparse penalty and resulting test accuracy of base channels and pruned channels as given in Table 1).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Ye into Molchanov. Molchanov teaches pruning convolutional kernels in neural networks to enable efficient inference by interleaving greedy criteria-based pruning with fine-tuning by backpropagation. Ye teaches a channel pruning technique for accelerating the computations of deep convolutional neural networks (CNNs) that focuses on direct simplification of the channel-to-channel computation graph of a CNN. One of ordinary skill would have been motivated to combine the teachings of Ye into Molchanov in order to examine whether an overparameterized network can be easily converted to a smaller one, which would significantly reduce the number of channels and improve performance (Ye, pg. 7, sec. 5.1).


Regarding claims 2 and 12, Molchanov and Ye teach the method of claims 1 and 11 (above). Molchanov further teaches:
enabling the inference operations in response to determining that the channel sparsity metrics of the channel kernel corresponding to the one channel is greater than the sparsity inference threshold (Molchanov, Pg. 15, Sec. A.4: “let us denote a set of image feature maps by $z_l \in \mathbb{R}^{H_l \times W_l \times C_l}$ with dimensionality $H_l \times W_l \times C_l$ individual maps (or channels)…fine-tuning with high $l_1$ or $l_2$ regularization causes unimportant connections to be suppressed. Connections with energy lower than some threshold can be removed on the assumption that they do not contribute much to subsequent layers…The idea of pruning with high regularization can be extended to removing the kernels for an entire feature map if the $l_2$ norm of those kernels is below a predefined threshold”. Examiner notes that the corollary of removing kernels whose norms fall below a threshold is retaining kernels whose norms are above a threshold. Examiner additionally notes that the broadest reasonable interpretation of “enabling inference operations” includes any step or process that allows the kernel to operate, including allowing the kernel to remain).
disabling the inference operations in response to determining that the channel sparsity metrics of the channel kernel corresponding to the one channel is less than or equal to the sparsity inference threshold (Molchanov, Pg. 15, Sec. A.4: “let us denote a set of image feature maps by $z_l \in \mathbb{R}^{H_l \times W_l \times C_l}$ with dimensionality $H_l \times W_l \times C_l$ individual maps (or channels)…fine-tuning with high $l_1$ or $l_2$ regularization causes unimportant connections to be suppressed. Connections with energy lower than some threshold can be removed on the assumption that they do not contribute much to subsequent layers…The idea of pruning with high regularization can be extended to removing the kernels for an entire feature map if the $l_2$ norm of those kernels is below a predefined threshold”. Examiner additionally notes that the broadest reasonable interpretation of “disabling inference operations” includes any step or process that disallows the kernel, including removal of the kernel).
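For illustration only, the complementary enable/disable conditions recited in claims 2 and 12 (greater than the threshold versus less than or equal to the threshold) can be sketched as a per-channel gate. This is a minimal sketch under stated assumptions, not an implementation from Molchanov, Ye, or the application: the metric is assumed to be a single number per channel, each channel operation is assumed to be a callable, and the names `channel_enabled` and `run_layer` are hypothetical.

```python
def channel_enabled(metric, threshold):
    """A channel is enabled only when its metric is strictly greater than the threshold."""
    return metric > threshold

def run_layer(x, channel_ops, metrics, threshold):
    """Run each enabled channel operation on x; a disabled channel yields None."""
    outputs = []
    for op, metric in zip(channel_ops, metrics):
        if channel_enabled(metric, threshold):
            outputs.append(op(x))   # inference operation enabled
        else:
            outputs.append(None)    # inference operation disabled (skipped)
    return outputs

# Hypothetical channel operations and metrics (assumed values).
channel_ops = [lambda x: x * 2, lambda x: x + 1, lambda x: x - 3]
metrics = [5.0, 0.5, 1.0]
outputs = run_layer(10, channel_ops, metrics, threshold=0.5)
```

Because the two conditions partition every possible metric value, a channel whose metric exactly equals the threshold (the second channel here) is disabled.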

Regarding claims 3 and 13, Molchanov and Ye teach the method of claims 1 and 11 (above). Molchanov further teaches:
the sparsity inference threshold is dynamically adjusted in accordance with a required speed or accuracy of the inference operations (Molchanov, pg. 10, sec. 3.6 and pg. 15, sec. A.4: “During pruning we were measuring reduction in computations by FLOPs” “Connections with energy lower than some threshold can be removed on the assumption that they do not contribute much to subsequent layers. The same work also finds that thresholds must be set separately for each layer depending on its sensitivity to pruning”; Examiner notes that the broadest reasonable interpretation of “sparsity inference threshold” means some predetermined limit regarding sparsity levels for inferences such as an increased threshold to allow for more sparsity in a model with lower energy connections).
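For illustration only, one way a threshold could be set separately for each layer is to derive it from a target fraction of channels to retain. This is a hypothetical sketch and is not described in Molchanov or in the application: the keep-fraction rule, the layer names, and the function `threshold_for_keep_fraction` are all assumptions introduced here.

```python
def threshold_for_keep_fraction(norms, keep_fraction):
    """Pick a threshold so that about keep_fraction of channels have norm >= threshold."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    ranked = sorted(norms, reverse=True)
    n_keep = max(1, round(keep_fraction * len(ranked)))
    return ranked[n_keep - 1]

# Hypothetical per-layer kernel norms and per-layer targets (assumed values).
layer_norms = {"conv1": [5.0, 0.1, 1.0, 2.0], "conv2": [0.3, 0.2, 0.9, 0.05]}
keep_fraction = {"conv1": 0.75, "conv2": 0.5}

thresholds = {
    name: threshold_for_keep_fraction(layer_norms[name], keep_fraction[name])
    for name in layer_norms
}
```

Under these assumed values, a layer treated as more sensitive to pruning (here "conv1") is given a larger keep fraction, so a smaller share of its channels falls below its threshold.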

Regarding claims 4 and 14, Molchanov and Ye teach the method of claims 1 and 11 (above). Molchanov further teaches:
comparing channel sparsity metrics of a channel kernel corresponding to one channel of the additional inference layer against a second sparsity inference threshold (Molchanov, Pg. 15, Sec. A.4: “let us denote a set of image feature maps by $z_l \in \mathbb{R}^{H_l \times W_l \times C_l}$ with dimensionality $H_l \times W_l \times C_l$ individual maps (or channels)…fine-tuning with high $l_1$ or $l_2$ regularization causes unimportant connections to be suppressed…The idea of pruning with high regularization can be extended to removing the kernels for an entire feature map if the $l_2$ norm of those kernels is below a predefined threshold”. Examiner notes that the broadest reasonable interpretation of channel sparsity metrics includes any measurement of channel sparsity, including measures of energy levels as a first inference threshold or measures of kernel norm as a second sparsity inference threshold).
controlling inference operations of the one channel of the additional inference layer in response to determining that the channel sparsity metrics of the channel kernel corresponding to the one channel of the additional inference layer is greater than the second sparsity inference threshold (Molchanov, Pg. 15, Sec. A.4: “let us denote a set of image feature maps by $z_l \in \mathbb{R}^{H_l \times W_l \times C_l}$ with dimensionality $H_l \times W_l \times C_l$ individual maps (or channels)…fine-tuning with high $l_1$ or $l_2$ regularization causes unimportant connections to be suppressed…The idea of pruning with high regularization can be extended to removing the kernels for an entire feature map if the $l_2$ norm of those kernels is below a predefined threshold”. Examiner notes that the broadest reasonable interpretation of channel sparsity metrics includes any measurement of channel sparsity, including measures of energy levels as a first inference threshold or measures of kernel norm as a second sparsity inference threshold. Examiner additionally notes that the broadest reasonable interpretation of “enabling inference operations” includes any step or process that allows the kernel to operate, including allowing the kernel to remain).
Molchanov does not explicitly disclose: 
reading metadata of an additional inference layer of the NN model
However, Ye teaches:
reading metadata of an additional inference layer of the NN model (Ye, pg. 7-8, sec. 5.1 and Table 1: “We start with a standard 4-layer convolutional neural network whose network attributes are specified in Table 1…The detailed statistics and its pruned channel size are reported in Table 1.” “Table 1: Comparisons between different pruned networks and the base network”. Examiner notes that the broadest reasonable interpretation of “reading metadata” means discerning data about data, including discerning the sparse penalty and resulting test accuracy of base channels and pruned channels as given in Table 1. Examiner additionally notes that “an additional layer” means a layer other than the first layer in the model as given in Table 1).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Ye into Molchanov. Molchanov teaches pruning convolutional kernels in neural networks to enable efficient inference by interleaving greedy criteria-based pruning with fine-tuning by backpropagation. Ye teaches a channel pruning technique for accelerating the computations of deep convolutional neural networks (CNNs) that focuses on direct simplification of the channel-to-channel computation graph of a CNN. One of ordinary skill would have been motivated to combine the teachings of Ye into Molchanov in order to test and compare each layer in a model with other models using channel pruning and without channel pruning (Ye, pg. 7, sec. 5.1).
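
For illustration only, the threshold test quoted from Molchanov, Sec. A.4 (removing the kernels of a feature map when their $l_2$ norm is below a predefined threshold) can be sketched as follows. This is a minimal sketch and not code from any cited reference: the function names, the flat-list kernel representation, and the threshold value are all hypothetical.

```python
import math


def kernel_l2_norm(weights):
    """l2 norm of one channel kernel, given as a flat list of weights."""
    return math.sqrt(sum(w * w for w in weights))


def select_channels(channel_kernels, threshold):
    """Split channel indices into kept and pruned sets.

    A channel is pruned when the l2 norm of its kernel is below the
    threshold; otherwise it is kept (its inference operations remain).
    """
    kept, pruned = [], []
    for idx, weights in enumerate(channel_kernels):
        if kernel_l2_norm(weights) < threshold:
            pruned.append(idx)
        else:
            kept.append(idx)
    return kept, pruned


# Hypothetical kernels for three channels of one layer.
kernels = [
    [0.9, -1.2, 0.4],    # large norm
    [0.01, 0.0, -0.02],  # near-zero norm
    [0.5, 0.5, 0.5],     # moderate norm
]
kept, pruned = select_channels(kernels, threshold=0.1)
print("kept:", kept, "pruned:", pruned)
```

With these hypothetical values, only the second channel falls below the threshold, so the first and third channels remain.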


Claims 5-10, 15-20 and 22 are rejected under 35 U.S.C. 103 as being unpatentable over Molchanov in view of Ye and further in view of Kung, et. al. (“Packing Sparse Convolutional Neural Networks for Efficient Systolic Array Implementations: Column Combining Under Joint Optimization”, 2019, ASPLOS; hereinafter “Kung”).

Regarding claims 5 and 15, Molchanov and Ye teach the method of claims 1 and 11 (above). Molchanov further teaches:
training the NN model to generate a plurality of original channel kernels corresponding to a plurality of original channels of the inference layer (Molchanov, pg. 2, section 2 and Figure 1: “The proposed method for pruning consists of the following steps: 1) Fine-tune the network until convergence on the target task…Starting with a full set of parameters W, we iteratively identify and remove the least important parameters, as illustrated in Figure 1”. Examiner notes that Molchanov teaches a method for training a neural network on a target task, which may include generating a plurality of channels such that the channels will then be pruned and fine-tuned. Examiner additionally notes that footnote 1 of Molchanov defines a “parameter” as an individual weight, convolutional kernel, or entire set of kernels that compute a feature map).
ranking the plurality of original channel kernels of the inference layer according to channel sparsity metrics of the plurality of original channel kernels to determine whether the corresponding original channels of the inference layer are a dense channel or a sparse channel (Molchanov, pg. 2, section 2 and Figure 1: “Starting with a full set of parameters W, we iteratively identify and remove the least important parameters, as illustrated in Figure 1”. Examiner notes that footnote 1 of Molchanov defines a “parameter” as an individual weight, convolutional kernel, or entire set of kernels that compute a feature map. Examiner additionally notes that in order to identify the least important parameters, such parameters are ranked—Molchanov proposes by weight—such that the least dense weights are more sparse).
…while keeping fixed the original channel kernels corresponding to all of the dense channels of the inference layer (Molchanov, Pg. 15, Sec. A.4: “The idea of pruning with high regularization can be extended to removing the kernels for an entire feature map if the $l_2$ norm of those kernels is below a predefined threshold…We observe that our approach has higher test accuracy for the same number of remaining unpruned feature maps, when pruning 85% or more of the feature maps”; Examiner notes that pruning sparse kernels that are below a predefined threshold leaves the original dense channels to remain).
Molchanov does not explicitly disclose:
consolidating all of the sparse channels of the inference layer into a consolidated sparse channel to generate a consolidated channel kernel 
retraining the consolidated channel kernel to generate a retrained consolidated channel kernel corresponding to the consolidated sparse channel.
However, Kung discloses:
consolidating all of the sparse channels of the inference layer into a consolidated sparse channel to generate a consolidated channel kernel (Kung, pg. 823, sec. 2.4; pg. 824, sec. 3.2: “This has led to recent work on structured pruning techniques, which add constraints so that the remaining filter matrix after pruning is still dense[16,21,23,37,39,53]. This is generally achieved by removing entire rows (filters) and columns (channels) from the filter matrix, with some reduction to classification accuracy” “we combine the sparse columns in the group into a single combined column by applying column-combine pruning.” Examiner notes that the broadest reasonable interpretation of consolidating includes removing, i.e., pruning, redundant or unnecessary elements from a unit and combining elements to form a unit).
retraining the consolidated channel kernel to generate a retrained consolidated channel kernel corresponding to the consolidated sparse channel…(Kung, pg. 823, sec. 2.4: “For high-density packing, we adopt a dense-column-first combining policy that favors selections of combining columns which result in high-density combined columns, where the density of a column is the percentage of nonzeros in the column. For high-classification accuracy, we then retrain the remaining weights after column-combine pruning.”) 
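
For illustration only, the combining of sparse columns into a single combined column, and the density measure quoted from Kung (the percentage of nonzeros in a column), can be sketched as follows. This is a minimal sketch under stated assumptions: the function names and example weights are hypothetical, and the conflict rule used here (when several columns have a nonzero in the same row, keep the largest-magnitude weight and prune the others) is an assumption, since the quoted passages do not spell it out. The subsequent retraining of the remaining weights is not shown.

```python
def column_density(column):
    """Fraction of nonzero entries in a column."""
    return sum(1 for w in column if w != 0) / len(column)


def combine_columns(columns):
    """Combine a group of sparse columns into one combined column.

    Assumed conflict rule: for each row, keep the largest-magnitude
    weight across the group; all other weights in that row are pruned.
    Returns the combined column and, per row, the index of the source
    column that supplied the weight (None if the whole row is zero).
    """
    n_rows = len(columns[0])
    combined, source = [], []
    for r in range(n_rows):
        best = max(range(len(columns)), key=lambda c: abs(columns[c][r]))
        value = columns[best][r]
        combined.append(value)
        source.append(best if value != 0 else None)
    return combined, source


# Hypothetical group of three sparse columns (channels), four rows each.
group = [
    [0.0, 0.7, 0.0, 0.0],
    [0.3, 0.0, 0.0, -0.2],
    [0.0, -0.1, 0.0, 0.5],
]
combined, source = combine_columns(group)
print("combined:", combined)
print("source columns:", source)
print("density:", column_density(combined))
```

In this hypothetical group, rows 1 and 3 each contain two nonzero weights, so one weight per row is pruned, and the combined column is denser than any of its source columns.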

Regarding claims 6 and 16, Molchanov, Ye, and Kung teach the method of claims 5 and 15 (above). Molchanov further teaches:
generating the plurality of channels of the inference layer as including all of the dense channels (Molchanov, pg. 2, section 2 and Figure 1: “The proposed method for pruning consists of the following steps: 1) Fine-tune the network until convergence on the target task…Starting with a full set of parameters W, we iteratively identify and remove the least important parameters, as illustrated in Figure 1”. Examiner notes that Molchanov provides a method for training a neural network on a target task, which includes generating a plurality of original channels such that the channels will then be pruned and fine-tuned).
…and the channel sparsity metrics of the original channel kernels corresponding to all of the dense channels… (Molchanov, Pg. 15, Sec. A.4: “The idea of pruning with high regularization can be extended to removing the kernels for an entire feature map if the $l_2$ norm of those kernels is below a predefined threshold…We observe that our approach has higher test accuracy for the same number of remaining unpruned feature maps, when pruning 85% or more of the feature maps”; Examiner notes that the broadest reasonable interpretation of sparsity metrics is a measurement such that the $l_2$ norm is a metric about the sparsity of the dense channels. Examiner additionally notes that pruning sparse kernels that are below a predefined threshold leaves the original dense channels to remain).
Molchanov does not explicitly disclose: 
and the consolidated sparse channel of the inference layer 
determining channel sparsity metrics of the retrained consolidated channel kernel
storing the channel sparsity metrics of the retrained consolidated channel kernel corresponding to the consolidated sparse channel…of the inference layer into the metadata.
However, Kung teaches:
and the consolidated sparse channel of the inference layer (Kung, pg. 823, sec. 2.4; pg. 824, sec. 3.2: “This has led to recent work on structured pruning techniques, which add constraints so that the remaining filter matrix after pruning is still dense[16,21,23,37,39,53]. This is generally achieved by removing entire rows (filters) and columns (channels) from the filter matrix, with some reduction to classification accuracy” “we combine the sparse columns in the group into a single combined column by applying column-combine pruning.”)
determining channel sparsity metrics of the retrained consolidated channel kernel (Kung, pg. 829, sec. 5.1: “Training a network with column combining occurs over a series of pruning iterations (Algorithm 1), where, at each pruning stage, unstructured pruning and column combining are performed to decrease the model size. Figure 15a shows the classification accuracy and number of nonzero weights for the VGG-19 (1×1) model on the CIFAR-10 dataset over each training epoch.” Examiner notes that the broadest definition of “sparsity metrics” includes any measurement of sparsity, including the number of nonzero weights after each training.)
storing the channel sparsity metrics of the retrained consolidated channel kernel corresponding to the consolidated sparse channel…of the inference layer into the metadata (Kung, pg. 826, sec. 4.1: “The systolic array system which implements packed filter matrices after column combining is shown in Figure 6. The weights of the packed filter matrices corresponding to each convolutional layer of a CNN are stored in the weight buffer”; Examiner notes that the broadest reasonable interpretation of metadata is a set of data about other data such that the weight of the filters is metadata about the sparsity of the channels).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Kung into Molchanov and Ye. Molchanov teaches pruning convolutional kernels in neural networks to enable efficient inference by interleaving greedy criteria-based pruning with fine-tuning by backpropagation. Ye teaches a channel pruning technique for accelerating the computations of deep convolutional neural networks (CNNs) that focuses on direct simplification of the channel-to-channel computation graph of a CNN. Kung teaches packing sparse convolutional neural networks into a denser format for efficient implementations using systolic arrays. One of ordinary skill would have been motivated to combine the teachings of Kung into Molchanov and Ye for greater utilization efficiency and accuracy (Kung, pg. 822, sec. 1).

Regarding claims 7 and 17, Molchanov, Ye, and Kung teach the method of claims 6 and 16 (above). Molchanov further teaches:
comparing the channel sparsity metrics of the channel kernel corresponding to the one channel of the plurality of channels against a sparsity inference threshold comprises: comparing the channel sparsity metrics…against the sparsity inference threshold. (Molchanov, Pg. 15, Sec. A.4: “let us denote a set of image feature maps by $z_l \in \mathbb{R}^{H_l \times W_l \times C_l}$ with dimensionality $H_l \times W_l \times C_l$ individual maps (or channels)…fine-tuning with high $l_1$ or $l_2$ regularization causes unimportant connections to be suppressed…The idea of pruning with high regularization can be extended to removing the kernels for an entire feature map if the $l_2$ norm of those kernels is below a predefined threshold”. Examiner notes that the broadest reasonable interpretation of channel sparsity metrics includes data about data including energy levels or the kernel norm as described in Molchanov).
Molchanov does not explicitly disclose:
…of the retrained consolidated channel kernel corresponding to the consolidated sparse channel… 
However, Kung teaches:
…of the retrained consolidated channel kernel corresponding to the consolidated sparse channel…(Kung, pg. 832, sec. 6.3: “$E_{mac}$ is the energy consumption for a single MAC and SRAM, $N_{mac}$ is the number of MAC operations in the CNN after column combining, and $N_{mac}^{opt}$ is the optimal number of MAC operations where no multiplications with zeros are performed.” Examiner notes that the broadest definition of “sparsity metrics” includes any measurement of sparsity, including the energy consumption of the consolidated channel).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Kung into Molchanov and Ye. Molchanov teaches pruning convolutional kernels in neural networks to enable efficient inference by interleaving greedy criteria-based pruning with fine-tuning by backpropagation. Ye teaches a channel pruning technique for accelerating the computations of deep convolutional neural networks (CNNs) that focuses on direct simplification of the channel-to-channel computation graph of a CNN. Kung teaches packing sparse convolutional neural networks into a denser format for efficient implementations using systolic arrays. One of ordinary skill would have been motivated to combine the teachings of Kung into Molchanov and Ye for greater utilization efficiency and accuracy (Kung, pg. 822, sec. 1).


Regarding claims 8 and 18, Molchanov, Ye, and Kung teach the method of claims 6 and 16 (above). Molchanov further teaches:
comparing the channel sparsity metrics of the original channel kernel corresponding to one channel of the dense channels of the inference layer against the sparsity inference threshold (Molchanov, Pg. 15, Sec. A.4: “The idea of pruning with high regularization can be extended to removing the kernels for an entire feature map if the $l_2$ norm of those kernels is below a predefined threshold…We observe that our approach has higher test accuracy for the same number of remaining unpruned feature maps, when pruning 85% or more of the feature maps…let us denote a set of image feature maps by $z_l \in \mathbb{R}^{H_l \times W_l \times C_l}$ with dimensionality $H_l \times W_l \times C_l$ individual maps (or channels)…fine-tuning with high $l_1$ or $l_2$ regularization causes unimportant connections to be suppressed…The idea of pruning with high regularization can be extended to removing the kernels for an entire feature map if the $l_2$ norm of those kernels is below a predefined threshold”. Examiner notes that the broadest reasonable interpretation of channel sparsity metrics includes data about data including energy levels or the kernel norm as described in Molchanov. Examiner additionally notes that pruning sparse kernels that are below a predefined threshold leaves the original dense channels to remain).
controlling inference operations of the one channel of the dense channels of the inference layer in response to determining that the channel sparsity metrics of the original channel kernel corresponding to the one channel of the dense channels is greater than the sparsity inference threshold (Molchanov, Pg. 15, Sec. A.4: “The idea of pruning with high regularization can be extended to removing the kernels for an entire feature map if the $l_2$ norm of those kernels is below a predefined threshold…We observe that our approach has higher test accuracy for the same number of remaining unpruned feature maps, when pruning 85% or more of the feature maps…let us denote a set of image feature maps by $z_l \in \mathbb{R}^{H_l \times W_l \times C_l}$ with dimensionality $H_l \times W_l \times C_l$ individual maps (or channels)…fine-tuning with high $l_1$ or $l_2$ regularization causes unimportant connections to be suppressed. Connections with energy lower than some threshold can be removed on the assumption that they do not contribute much to subsequent layers…The idea of pruning with high regularization can be extended to removing the kernels for an entire feature map if the $l_2$ norm of those kernels is below a predefined threshold”. Examiner additionally notes that pruning sparse kernels that fall below a predefined threshold leaves the original dense channels remaining. Examiner additionally notes that the broadest reasonable interpretation of “controlling inference operations” includes pruning kernels in a channel such that whether a kernel is pruned or not affects, i.e. controls, inference operations).
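For illustration only, the threshold test quoted from Molchanov above can be sketched as follows. This is a minimal sketch under stated assumptions, not Molchanov's implementation: the function names, the flat-list representation of a kernel, the sample weights, and the threshold value of 0.5 are all hypothetical.

```python
import math

def l2_norm(kernel):
    """Return the l2 norm of a kernel given as a flat list of weights."""
    return math.sqrt(sum(w * w for w in kernel))

def split_by_threshold(kernels, threshold):
    """Split channel indices into (kept, pruned).

    A channel is pruned when the l2 norm of its kernel is below the
    threshold; all other channels are kept unchanged.
    """
    kept, pruned = [], []
    for idx, kernel in enumerate(kernels):
        if l2_norm(kernel) < threshold:
            pruned.append(idx)
        else:
            kept.append(idx)
    return kept, pruned

# Hypothetical kernels, one per channel (assumed values for illustration).
kernels = [[3.0, 4.0], [0.1, 0.0], [0.0, 0.2], [1.0, 1.0]]
kept, pruned = split_by_threshold(kernels, threshold=0.5)
print("kept:", kept, "pruned:", pruned)
```

In this sketch the channels whose kernel norm is at or above the threshold are left in place, which mirrors the Examiner's note that pruning low-norm kernels leaves the dense channels remaining.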

Regarding claims 9 and 19, Molchanov, Ye, and Kung teach the method of claims 6 and 16 (above). Molchanov further teaches:
ranking the plurality of original channel kernels of the inference layer according to channel sparsity metrics of the plurality of original channel kernels to determine whether the corresponding original channels of the inference layer are a dense channel or a sparse channel comprises: rearranging the plurality of original channels of the inference layer to group separately all of the dense channels and all of the sparse channels (Molchanov, pg. 6, sec. 3.1: “We rank feature maps by their contributions to the loss, where rank 1 indicates the most important feature map—removing it results in the highest increase in loss—and rank 4224 indicates the least important…(3) maximum and minimum ranks show that every layer has some feature maps that are globally important and others that are globally less important”. Examiner notes that Molchanov uses the term “maps” for “channels”. Examiner additionally notes that the broadest reasonable interpretation of “to group” includes enabling grouping, such that ranking feature maps would enable grouping of such feature maps).
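As a hedged illustration of how a ranking can enable the grouping discussed above, the sketch below sorts channels by a per-channel score and splits the ordered list into two groups. The score values, the function name `rank_and_group`, and the `num_dense` cutoff are assumptions made for this example; Molchanov ranks feature maps by their contribution to the loss and does not prescribe this code.

```python
def rank_and_group(scores, num_dense):
    """Rank channel indices by descending score and split them into two groups.

    The first num_dense indices in the ranking are treated as 'dense'
    channels and the remainder as 'sparse' channels.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:num_dense], order[num_dense:]

# Hypothetical per-channel scores (assumed values for illustration).
scores = [0.9, 0.05, 0.4, 0.01]
dense, sparse = rank_and_group(scores, num_dense=2)
print("dense:", dense, "sparse:", sparse)
```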

Regarding claims 10 and 20, Molchanov, Ye, and Kung teach the method of claims 6 and 16 (above). Kung further teaches:
consolidating all of the sparse channels of the inference layer into the consolidated sparse channel to generate the consolidated channel kernel comprises: concatenating the plurality of original channel kernels corresponding to all of the sparse channels of the inference layer to generate the consolidated channel kernel (Kung, pg. 824, sec. 3.2: “Given a sparse filter matrix, we first partition it into column groups by grouping columns that have minimal conflicts. Then, for each column group, we combine the sparse columns in the group into a single combined column by applying column-combine pruning”. Examiner notes that Kung uses the term “columns” to mean “channels”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Kung with those of Molchanov and Ye. Molchanov teaches pruning convolutional kernels in neural networks to enable efficient inference by interleaving greedy criteria-based pruning with fine-tuning by backpropagation. Ye teaches a channel pruning technique for accelerating the computations of deep convolutional neural networks (CNNs) that focuses on direct simplification of the channel-to-channel computation graph of a CNN. Kung teaches packing sparse convolutional neural networks into a denser format for efficient implementations using systolic arrays. One of ordinary skill would have been motivated to combine the teachings of Kung with those of Molchanov and Ye for greater utilization efficiency and accuracy (Kung, pg. 822, sec. 1).
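For illustration only, the combining of sparse columns quoted from Kung above can be sketched as a row-wise merge. This is a simplified sketch and not Kung's algorithm as published: the conflict rule used here (keep the largest-magnitude entry in each row and drop the others) is stated as an assumption, and the function name and sample columns are hypothetical.

```python
def combine_columns(columns):
    """Merge equal-length sparse columns into a single combined column.

    For each row, the entry with the largest magnitude across the columns
    is kept; any other nonzero entry in that row is dropped (assumed rule).
    """
    num_rows = len(columns[0])
    combined = []
    for row in range(num_rows):
        combined.append(max((col[row] for col in columns), key=abs))
    return combined

# Hypothetical sparse columns of one column group (assumed values).
cols = [
    [0.0, 2.0, 0.0, 0.0],
    [-3.0, 0.0, 0.0, 0.5],
    [0.0, 0.0, 1.0, -0.7],
]
combined = combine_columns(cols)
print(combined)
```

In the last row two columns hold a nonzero weight, so the sketch keeps the larger-magnitude one and discards the other, which is where the accuracy trade-off of combining arises.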

Regarding claim 22, Molchanov and Ye teach the method of claim 21 (above). Molchanov further teaches:
training the NN model to generate a plurality of original channel kernels corresponding to a plurality of original channels of the inference layer (Molchanov, pg. 2, section 2 and Figure 1: “The proposed method for pruning consists of the following steps: 1) Fine-tune the network until convergence on the target task…Starting with a full set of parameters W, we iteratively identify and remove the least important parameters, as illustrated in Figure 1”. Examiner notes that Molchanov provides a method for training a neural network on a target task, which includes generating a plurality of original channels such that the channels will then be pruned and fine-tuned. Examiner additionally notes that a footnote defines a “parameter” as an individual weight, convolutional kernel, or entire set of kernels that compute a feature map).
ranking the plurality of original channel kernels of the inference layer according to channel sparsity metrics of the plurality of original channel kernels to determine whether the corresponding original channels of the inference layer are a dense channel or a sparse channel (Molchanov, pg. 6, sec. 3.1: “We rank feature maps by their contributions to the loss, where rank 1 indicates the most important feature map—removing it results in the highest increase in loss—and rank 4224 indicates the least important…(3) maximum and minimum ranks show that every layer has some feature maps that are globally important and others that are globally less important”. Examiner notes that Molchanov uses the term “maps” for “channels”. Examiner additionally notes that the broadest reasonable interpretation of “to determine” includes enabling a determination, such that ranking maps would enable grouping of such maps).
...while keeping fixed the original channel kernels corresponding to all of the dense channels of the inference layer (Molchanov, Pg. 15, Sec. A.4: “The idea of pruning with high regularization can be extended to removing the kernels for an entire feature map if the $l_2$ norm of those kernels is below a predefined threshold…We observe that our approach has higher test accuracy for the same number of remaining unpruned feature maps, when pruning 85% or more of the feature maps”; Examiner notes that pruning sparse kernels that fall below a predefined threshold leaves the original dense channels fixed).
…including all of the dense channels and… (Molchanov, Pg. 15, Sec. A.4: “The idea of pruning with high regularization can be extended to removing the kernels for an entire feature map if the $l_2$ norm of those kernels is below a predefined threshold…We observe that our approach has higher test accuracy for the same number of remaining unpruned feature maps, when pruning 85% or more of the feature maps”; Examiner notes that pruning sparse kernels that fall below a predefined threshold leaves the original dense channels remaining).
…all of the dense channels… (Molchanov, Pg. 15, Sec. A.4: “The idea of pruning with high regularization can be extended to removing the kernels for an entire feature map if the $l_2$ norm of those kernels is below a predefined threshold…We observe that our approach has higher test accuracy for the same number of remaining unpruned feature maps, when pruning 85% or more of the feature maps”; Examiner notes that the broadest reasonable interpretation of metadata is a set of data about other data, such that the $l_2$ norm is metadata about the sparsity of the dense channels. Examiner additionally notes that pruning sparse kernels that fall below a predefined threshold leaves the original dense channels remaining).
Molchanov does not explicitly disclose the following limitations, which are taught by Kung:
consolidating all of the sparse channels of the inference layer into a consolidated sparse channel to generate a consolidated channel kernel (Kung, pg. 823, sec. 2.4; pg. 824, sec. 3.2: “This has led to recent work on structured pruning techniques, which add constraints so that the remaining filter matrix after pruning is still dense[16,21,23,37,39,53]. This is generally achieved by removing entire rows (filters) and columns (channels) from the filter matrix, with some reduction to classification accuracy” “we combine the sparse columns in the group into a single combined column by applying column-combine pruning.” Examiner notes that the broadest reasonable interpretation of consolidating includes removing i.e. pruning, redundant or unnecessary elements from a unit and combining elements to form a unit).
retraining the consolidated channel kernel to generate a retrained consolidated channel kernel corresponding to the consolidated sparse channel… (Kung, pg. 823, sec. 2.4: “For high-density packing, we adopt a dense-column-first combining policy that favors selections of combining columns which result in high-density combined columns, where the density of a column is the percentage of nonzeros in the column. For high-classification accuracy, we then retrain the remaining weights after column-combine pruning.”) 
generating the plurality of channels of the inference layer as…the consolidated sparse channel of the inference layer (Kung, pg. 823, sec. 2.4; pg. 824, sec. 3.2: “This has led to recent work on structured pruning techniques, which add constraints so that the remaining filter matrix after pruning is still dense[16,21,23,37,39,53]. This is generally achieved by removing entire rows (filters) and columns (channels) from the filter matrix, with some reduction to classification accuracy” “we combine the sparse columns in the group into a single combined column by applying column-combine pruning.”) 
determining channel sparsity metrics of the retrained consolidated channel kernel (Kung, pg. 829, sec. 5.1: “Training a network with column combining occurs over a series of pruning iterations (Algorithm 1), where, at each pruning stage, unstructured pruning and column combining are performed to decrease the model size. Figure 15a shows the classification accuracy and number of nonzeros weights for the VGG-19 (1×1) model on the CIFAR-10 dataset over each training epoch.” Examiner notes that the broadest definition of “sparsity metrics” includes any measurement of sparsity, including the number of nonzero weights after each training.)
storing the channel sparsity metrics of the retrained consolidated channel kernel corresponding to the consolidated sparse channel and the channel sparsity metrics of the original channel kernels corresponding to…of the inference layer into the metadata (Kung, pg. 826, sec. 4.1: “The systolic array system which implements packed filter matrices after column combining is shown in Figure 6. The weights of the packed filter matrices corresponding to each convolutional layer of a CNN are stored in the weight buffer”; Examiner notes that the broadest reasonable interpretation of metadata is a set of data about other data such that the weight of the filters is metadata about the sparsity of the channels). 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Kung with those of Molchanov and Ye. Molchanov teaches pruning convolutional kernels in neural networks to enable efficient inference by interleaving greedy criteria-based pruning with fine-tuning by backpropagation. Ye teaches a channel pruning technique for accelerating the computations of deep convolutional neural networks (CNNs) that focuses on direct simplification of the channel-to-channel computation graph of a CNN. Kung teaches packing sparse convolutional neural networks into a denser format for efficient implementations using systolic arrays. One of ordinary skill would have been motivated to combine the teachings of Kung with those of Molchanov and Ye for greater utilization efficiency and accuracy (Kung, pg. 822, sec. 1).
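As a hedged illustration of the interpretation applied above, in which a count of nonzero weights is one possible measure of sparsity that can be recorded as data about a kernel, the sketch below computes a nonzero count and a density for each kernel and stores them in a dictionary. The dictionary layout, the key names, and the sample kernels are assumptions made for this example and are not taken from Kung, Molchanov, or the claims.

```python
def sparsity_metrics(kernel):
    """Return the nonzero count and density (fraction of nonzeros) of a kernel."""
    nonzeros = sum(1 for w in kernel if w != 0.0)
    return {"nonzeros": nonzeros, "density": nonzeros / len(kernel)}

# Hypothetical kernels: one dense channel and one consolidated sparse channel.
kernels = {
    "dense_0": [0.5, -1.2, 0.3, 0.8],
    "consolidated": [-3.0, 0.0, 1.0, 0.0],
}

# Store the per-kernel metrics alongside the kernels (assumed metadata layout).
metadata = {name: sparsity_metrics(kernel) for name, kernel in kernels.items()}
print(metadata)
```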

Prior Art
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
Fan, et. al. (US 20200034645 A1) teaches computer-implemented automatic object detection in data, including flattening the feature vector and inputting the flattened feature vector as a layer in a neural network.
Xu, et. al. (US 20190362235 A1) teaches configuring neural network models for resource constrained computing systems.
Chen, et. al. (US 20190122113 A1) teaches techniques for selectively pruning neurons and kernels of deep convolutional neural networks. 
Yao, et. al. (“Balanced Sparsity for Efficient DNN Inference on GPU”, 12 Dec 2019, arXiv) teaches a fine-grained sparsity approach, Balanced Sparsity, to achieve high model accuracy efficiently on commercial hardware.
Dettmers, et. al. (“Sparse Networks from Scratch: Faster Training without Losing Performance”, 23 Aug 2019, arXiv) teaches accelerated training of deep neural networks that maintain sparse weights throughout training while achieving dense performance levels.
Han, et. al. (“EIE: Efficient Inference Engine on Compressed Deep Neural Network”, 3 May 2016, arXiv) teaches an energy efficient inference engine (EIE) that performs inference on a compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing.
Graham, et. al. (“Spatially-sparse convolutional neural networks”, 22 Sept 2014, arXiv) teaches a CNN for processing spatially-sparse inputs.
Liu, et. al. (“Learning Efficient Convolutional Networks through Network Slimming”, 22 Aug 2017, arXiv) teaches a learning scheme for CNNs to simultaneously reduce the model size, decrease the run-time memory footprint, and lower the number of computing operations, without compromising accuracy.
Huang, et. al. (“Data-Driven Sparse Structure Selection for Deep Neural Networks”, 5 Sept 2018, arXiv) teaches a framework to learn and prune deep models in an end-to-end manner.
Yu, et. al. (“NISP: Pruning Networks using Neuron Importance Score Propagation”, 12 Mar 2018, arXiv) teaches the Neuron Importance Score Propagation (NISP) algorithm to propagate the importance scores of final responses to every neuron in the network.
Li, et. al. (“OICSR: Out-In-Channel Sparsity Regularization for Compact Deep Neural Networks”, 1 Jul 2019, arXiv) teaches an Out-In-Channel Sparsity Regularization (OICSR) that considers correlations between successive layers to further retain predictive power of the compact network.
Hurley, et. al. (“Comparing Measures of Sparsity”, 27 Apr 2009, arXiv) teaches comparisons of several commonly used sparsity measures based on intuitive attributes.
Wen, et. al. (“Learning Structured Sparsity in Deep Neural Networks”, 18 Oct 2016, arXiv) teaches a Structured Sparsity Learning (SSL) method to regularize the structures (i.e., filters, channels, filter shapes, and layer depth) of DNNs.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SALLY T. NGUYEN whose telephone number is (571)272-3406. The examiner can normally be reached Monday - Friday 9:00am - 5:00pm Eastern Time.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Amir Mehrmanesh, can be reached at (571) 270-3351. The fax phone number for the organization where this application or proceeding is assigned is (571) 273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or (571) 272-1000.



/STN/Examiner, Art Unit 4163
/VIKER A LAMARDO/Primary Examiner, Art Unit 2126