DETAILED ACTION
Currently, claims 1-20 are pending in application 16/829205, filed on 25 March 2020. All references in the IDS have been considered. It is noted that a translated version of the priority document (CN201910228917) has not been filed. Should applicant desire to obtain the benefit of foreign priority under 35 U.S.C. 119(a)-(d) prior to declaration of an interference, a certified English translation of the foreign application is required according to 37 CFR 41.154(b) and 41.202(e).

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Objections
Claims 3-6 and 19 are objected to because of the following informalities:
Claim 3 recites “wherein, the determining the relationship between the pruning number of the respective channels of each of the hidden layers and the corresponding pruning loss of each of the hidden layers, comprises:” which should read instead “wherein the determining the relationship between the pruning number of the respective channels of each of the hidden layers and the corresponding pruning loss of each of the hidden layers comprises:”
Claim 4 recites “wherein, the determining the pruning number of each of the hidden layers, based on the relationship between the pruning number of the respective channels of each of the hidden layers and the corresponding pruning loss of each of the hidden layers, comprises:” which should read instead “wherein the determining the pruning number of each of the hidden layers, based on the relationship between the pruning number of the respective channels of each of the hidden layers and the corresponding pruning loss of each of the hidden layers comprises:”
Claim 5 recites “wherein, the determining the pruning number of each of the hidden layers, based on the relationship between the pruning number of the respective channels of each of the hidden layers and the corresponding pruning losses of each of the hidden layers, and the respective weight of each of the hidden layers, comprises:” which should read instead “wherein the determining the pruning number of each of the hidden layers, based on the relationship between the pruning number of the respective channels of each of the hidden layers and the corresponding pruning losses of each of the hidden layers, and the respective weight of each of the hidden layers comprises:”
Claim 6 recites “wherein, the determining the relationship between the pruning number of the respective channels of each hidden layer and the corresponding pruning loss of each of the hidden layers, comprises:” which should read instead “wherein the determining the relationship between the pruning number of the respective channels of each hidden layer and the corresponding pruning loss of each of the hidden layers comprises:”
Claim 19 recites “wherein, the compression parameter comprises a pruning number of each hidden layer, and the least one processor is further configured to” which should read instead “wherein the compression parameter comprises a pruning number of each hidden layer and the least one processor is further configured to”
Appropriate correction is required.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 5, 9, 10, 12-14, and 19 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Claim 5 recites the limitation “the corresponding weighted pruning loss” in line 7. There is insufficient antecedent basis for this limitation in the claim.
Claim 9 recites the limitation “the weight of each of the hidden layers” in line 8. There is insufficient antecedent basis for this limitation in the claim. For analysis purposes, “the weight of each of the hidden layers” is being interpreted as any weight associated with a hidden layer. Claim 10 is also rejected because it depends from claim 9.
Claim 10 recites the limitation “the corresponding weighted quantization loss” in line 7. There is insufficient antecedent basis for this limitation in the claim.
Claim 12 recites the limitation “the corresponding weighted pruning loss” in line 9. There is insufficient antecedent basis for this limitation in the claim. Claims 13 and 14 are also rejected because they depend from claim 12.
Claim 19 recites the limitation “the set hidden layer” in lines 6-7. There is insufficient antecedent basis for this limitation in the claim. For analysis purposes, “the set hidden layer” is being interpreted as “each hidden layer”.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.



Claims 1-20 are rejected under 35 U.S.C. 101 because the claims are directed to an abstract idea, and because the claims as a whole, considering all claim elements both individually and in combination, do not amount to significantly more than the abstract idea. See Alice Corporation Pty. Ltd. v. CLS Bank International, et al., 573 U.S. 208 (2014). In determining whether the claims are subject matter eligible, the Examiner applies the 2019 USPTO Patent Eligibility Guidelines (2019 Revised Patent Subject Matter Eligibility Guidance, 84 Fed. Reg. 50, Jan. 7, 2019).
Step 1: Is the claim to a process, machine, manufacture, or composition of matter? Yes. Claim 1 recites a method, which is a process. Claims 18 and 20 recite a system and a product, respectively.
Step 2A, prong one: Does claim 1 recite an abstract idea, law of nature or natural phenomenon? Yes. The limitations of “determining a pruning number for each of a plurality of channels included in a hidden layer of the … model”, “determining a pruning loss of the hidden layer based on the determined pruning numbers of the hidden layer”, “determining a compression parameter of the hidden layer based on the pruning loss of hidden layer of the … model”, and “compressing the … model based on the determined compression parameter of the hidden layer, wherein the compression parameter is related to a pruning of the … model”, as drafted, are mathematical steps of computing a pruning number, pruning loss, and compression parameter for the channels of a hidden layer of a model and computing a compressed model based on those computed parameters. In addition, and alternatively, “determining a pruning number for each of a plurality of channels included in a hidden layer of the … model” is also a mental step (observation, evaluation) for determining a pruning number for a layer in a model. These limitations therefore fall within the mathematical concepts and mental processes groups.
Step 2A, prong two: Does the claim recite additional elements that integrate the judicial exception into a practical application? No. The judicial exception is not integrated into a practical application. Although the claim recites an “electronic device”, the device is recited at a high level of generality such that it amounts to no more than mere instructions to apply the exception using a generic computer component. Further, the element “machine learning model” is recited at a high level of generality that merely generally links the judicial exception to a particular, respective, technological environment and does not impose a meaningful limitation on the judicial exception.
Step 2B: Does the claim recite additional elements that amount to significantly more than the judicial exception? No. The only limitation on the performance of the described method is that it must be computer implemented, with other limitations reciting a “machine learning model”. These elements are insufficient to transform a judicial exception into a patentable invention because the recited elements are considered insignificant extra-solution activity (generic computer system, processing resources, links the judicial exception to a particular, respective, technological environment). The claim thus recites computing components only at a high level of generality such that it amounts to no more than mere instructions to apply the exception using generic computer components; mere instructions to apply an exception using a generic computer component cannot provide an inventive concept.
Taken alone, the additional elements do not amount to significantly more than the above-identified judicial exception (the abstract idea). Looking at the limitations as an ordered combination adds nothing that is not already present when looking at the elements taken individually. There is no indication that the combination of elements improves the functioning of a computer or improves any other technology. Their collective functions merely provide conventional computer implementation.
For the reasons above, claim 1 is rejected as being directed to non-patentable subject matter under §101. This rejection applies equally to independent claims 18 and 20, which recite additional generic computing components that include a processor (claim 18) and memory with instructions (claims 18 and 20) that are recited at a high level of generality that does not impose a meaningful limitation on the judicial exception. These elements are insufficient to transform a judicial exception to a patentable invention because the recited elements are considered insignificant extra-solution activity (generic computer system, processing resources, links the judicial exception to a particular, respective, technological environment).  The claims thus recite computing components only at a high-level of generality such that it amounts to no more than mere instructions to apply the exception using generic computer components; mere instructions to apply an exception using a generic computer component cannot provide an inventive concept.
Taken alone, the additional elements do not amount to significantly more than the above-identified judicial exception (the abstract idea). Looking at the limitations as an ordered combination adds nothing that is not already present when looking at the elements taken individually. There is no indication that the combination of elements improves the functioning of a computer or improves any other technology. Their collective functions merely provide conventional computer implementation.
As to dependent claims 2-16 and 19, which depend from claims 1 and 18 respectively, additional limitations are recited that fall under Step 2A, prong one, as mathematical steps:
Claim 2: … “the determining the compression parameter of the hidden layer in the … model, comprises: determining a relationship between the pruning number of respective channels of each of the hidden layers and the corresponding pruning loss of each of the hidden layers; determining the pruning number of each of the hidden layers, based on the relationship between the pruning number of the respective channels and the corresponding pruning loss of each of the hidden layers”, “wherein the … model comprises a plurality of hidden layers, including the hidden layer, the compression parameter comprises a pruning number of each of the hidden layers” (mathematical steps, and additional details therein, for determining a relationship between parameters, given parameters descriptive of the model)
Claim 3: … “wherein, the determining the relationship between the pruning number of the respective channels of each of the hidden layers and the corresponding pruning loss of each of the hidden layers, comprises: determining a relationship between the pruning number of the respective channels of a current hidden layer and the corresponding pruning loss of the current hidden layer, based on training data of at least one next hidden layer next to the current hidden layer, wherein the training data comprises a relationship between an output channel of the at least one next hidden layer and each input channel of the at least one next hidden layer determining a plurality of array position data from the input feature map based on the one or more positional characteristics” (mathematical steps of determining relationships between various parameters using (training) data)
Claim 4: “wherein, the determining the pruning number of each of the hidden layers, based on the relationship between the pruning number of the respective channels of each of the hidden layers and the corresponding pruning loss of each of the hidden layers, comprises: determining the pruning number of each of the hidden layers, based on the relationship between the pruning number of the respective channels of each of the hidden layers and the corresponding pruning loss of each of the hidden layers, and a respective weight of each of the hidden layers” (mathematical steps of determining relationships between various parameters including a weight)
Claim 5: “wherein, the determining the pruning number of each of the hidden layers, based on the relationship between the pruning number of the respective channels of each of the hidden layers and the corresponding pruning losses of each of the hidden layers, and the respective weight of each of the hidden layers, comprises: determining a relationship between the pruning number of the respective channels of each of the hidden layers and the corresponding weighted pruning loss of each of the hidden layers, based on the relationship between the pruning number of the respective channels of each of the hidden layers and the corresponding pruning loss of each of the hidden layers, and the weight of each of the hidden layers; and determining the pruning number of each hidden layer, based on the relationship between the pruning number of the respective channels of each of the hidden layers and the corresponding weighted pruning loss of each of the hidden layers” (mathematical steps of determining relationships between various parameters including a weight)
Claim 6: “wherein, the determining the relationship between the pruning number of the respective channels of each hidden layer and the corresponding pruning loss of each of the hidden layers, comprises: calculating, using an incremental manner or a decreasing manner, the pruning loss of each current candidate channel to be pruned respectively, wherein the incremental manner includes any current candidate channel to be pruned comprising a pruning number that includes the pruning numbers for each of the pruned channels determined by a previous channel pruning number and at least one unpruned channel, and the decreasing manner includes any current candidate channel to be pruned comprising a pruning number that corresponds to remaining pruned channels after removing at least one channel from pruned channels determined by the previous channel pruning number; and determining a current candidate channel to be pruned with a minimum pruning loss as the pruned channel corresponding to the current channel determined by the previous channel pruning number, to obtain the relationship between the current channel pruning number and the corresponding pruning loss of each of the hidden layers” (mathematical steps of iteratively determining relationships between various parameters)
Claim 7: “wherein the compression parameter comprises a quantization rate, and the determining the compression parameter of the hidden layer in the … model, comprises: determining a relationship between respective candidate quantization rates of each of the hidden layers in the … model and corresponding quantization loss of each of the hidden layers; and determining a quantization rate of each of the hidden layers based on a relationship between the respective candidate quantization rates of each of the hidden layers and the corresponding quantization loss of each of the hidden layers” (mathematical steps of determining relationships between various parameters)
Claim 8: “wherein the determining the relationship between respective candidate quantization rates of each of the hidden layers and the corresponding quantization loss of each of the hidden layers, comprises: determining the relationship between respective candidate quantization rates of each of the hidden layers and the corresponding quantization loss of each of the hidden layers, based on training data of a current hidden layer, wherein, the training data of the current hidden layer comprises a relationship between an output channel of the current hidden layer and each input channel of the current hidden layer” (mathematical steps of determining relationships between various parameters using (training) data)
Claim 9: “wherein the determining the quantization rate of each of the hidden layers based on the relationship between the respective candidate quantization rates of each of the hidden layer and the corresponding quantization loss of each of the hidden layers, comprises: determining the quantization rate of each of the hidden layers based on the relationship between the respective candidate quantization rates of each of the hidden layers and the corresponding quantization loss of each of the hidden layers, and the weight of each of the hidden layers.” (mathematical steps of determining relationships between various parameters)
Claim 10: “wherein the determining the quantization rate of each of the hidden layers based on the relationship between the respective candidate quantization rates of each of the hidden layers and the corresponding quantization loss of each of the hidden layers, and the weight of each of the hidden layers, comprises: determining a relationship between the respective candidate quantization rates of each of the hidden layers and the corresponding weighted quantization loss, based on the relationship between the respective candidate quantization rates of each of the hidden layers and the corresponding quantization loss, and the weight of each of the hidden layers; and determining the quantization rate of each of the hidden layers, based on the relationship between the respective candidate quantization rates of each of the hidden layers and the corresponding weighted quantization loss of each of the hidden layers.” (mathematical steps of determining relationships between various parameters)
Claim 11: “wherein based on a current hidden layer corresponding one-to-one to a next hidden layer, the weight of the current hidden layer is the same as the weight of the next hidden layer; based on the current hidden layer corresponding to at least two next hidden layers by a multi-out structure, the weight of the current hidden layer is the sum of the weights of the at least two next hidden layers; and based on at least two current hidden layers corresponding to one next hidden layer by a multi-in structure, the weight of each current hidden layer is the weight of the next hidden layer allocated according to a channel proportion of each current hidden layer” (more details of the mathematical steps for determining relationships between various parameters using weights for the hidden layers)
Claim 12: “wherein the determining the compression parameter of the hidden layer in the … model, comprises: determining the compression parameter of the hidden layer, based on a loss relationship and an overall compression target parameter of the … model, wherein, based on the compression parameter being a pruning number, the loss relationship comprises a relationship between the pruning number of the respective channels of each hidden layer in the hidden layer in the … model and the corresponding weighted pruning loss of each of a plurality of hidden layers, based on the compression parameter being a quantization rate, the loss relationship comprises a relationship between respective candidate quantization rates of each of the hidden layers in the … model and a corresponding weighted quantization loss of each of the hidden layers, and based on the compression parameter comprising a pruning number and a quantization rate, then the loss relationship comprises the relationship between the pruning number of respective channels of each of the hidden layers in the … model and the corresponding weighted pruning loss, and the relationship between respective candidate quantization rates of each of the hidden layers in the … model and the corresponding weighted quantization loss of each of the hidden layers” (mathematical steps of determining relationships between various parameters)
Claim 13: “wherein the overall compression target parameter comprises at least one of an overall compression rate of the … model or an overall loss of the … model.” (more details of the mathematical steps of determining relationships between various parameters)
Claim 14: “wherein the determining the compression parameter of the hidden layer, based on the loss relationship and the overall compression target parameter of the … model, comprises any one of the following: calculating a compression parameter of each of the hidden layers that minimizes an overall loss of the … model based on the loss relationship and the overall compression rate of the … model; and calculating a compression parameter of each of the hidden layers that maximizes the overall compression rate of the … model based on the loss relationship and the overall loss of the … model” (mathematical steps of determining relationships between various parameters)
Claim 15: “and fine-tuning the compressed model based on the selected … model to obtain an optimized model” (mathematical steps of mathematically tuning parameters of a model using optimization criteria)
Claim 16: “wherein the fine-tuning the compressed model based on the selected … model to obtain the optimized model, comprises: based on determining that the fine-tuned model does not satisfy a preset condition, repeatedly performing the operation of determining the compression parameter of the hidden layer in the … model to be optimized, the operation of compressing the … model to be optimized based on the compression parameter of the hidden layer, and the operation of fine-tuning the compressed model based on the learning model, until the fine-tuned model satisfies the preset condition, to obtain the optimized model.” (mathematical steps of iteratively determining model parameters until the satisfaction of a mathematical criterion)
Claim 19: “wherein, the compression parameter comprises a pruning number of each hidden layer, and the … is further configured to, … determine a relationship between the pruning number of respective channels of each hidden layer and corresponding pruning loss of each hidden layer in the set hidden layer; determine the pruning number of each hidden layer, based on the relationship between the pruning number of respective channels and corresponding pruning loss of each hidden layer.” (mathematical steps, and details therein, of determining parameters for model components)
In addition, it is noted that claims 15 and 17 recite additional limitations that fall under Step 2A, prong one, as mental steps in the mental processes group:
Claim 15: “selecting at least one of the following … models as the … model: a … model before compression; a … model obtained after compressing at least one hidden layer; a … model after historical fine-tuning” (observation or judgement for performing a selection)
Claim 17: “splitting channels of at least one hidden layer of the optimized model into at least two groups of sub-channels respectively, and determining a network parameter of each group of the at least two groups of sub-channels, wherein each group of the at least two groups of sub-channels comprises corresponding input channels and output channels after grouping and adding a combination layer to obtain a group compressed model, wherein the input of the combination layer is connected to the output channel of each group of the at least two sub-channels.” (mental steps of configuration/re-configuration/re-grouping architecture of model including the insertion of a layer which can be done with pen and paper and observation or judgement for determining a parameter)
In addition, claims 2, 7, 12, 13, 14, 16, and 19 recite additional elements to be addressed at Step 2A, Prong 2 and at Step 2B as follows: 
Claim 19 recites the generic computer components “electronic device”, “processor”, and executable instructions, which are recited at a high level of generality that does not impose a meaningful limitation on the judicial exception. Claims 2, 7, 12, 13, 14, and 16 each recite the element “machine learning model”, which is recited at a high level of generality that merely generally links the judicial exception to a particular, respective, technological environment. These elements are insufficient to transform a judicial exception into a patentable invention because the recited elements are considered insignificant extra-solution activity (generic computer system, processing resources, links the judicial exception to a particular, respective, technological environment).
In summary, as shown in the analysis above, claims 1-20 do not provide any additional elements that, when considered individually or as an ordered combination, amount to significantly more than the abstract idea identified. Therefore, as a whole, claims 1-20 do not recite what the courts have identified as “significantly more”. In particular, there is no indication that the combination of elements improves the functioning of a computer or improves another technology when the claims are considered individually or as an ordered combination.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

 (a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1-6, 15, 16, and 18-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Zhuang et al. (“Discrimination-aware Channel Pruning for Deep Neural Networks”, https://arxiv.org/pdf/1810.11809.pdf, arXiv:1810.11809v3 [cs.CV] 14 Jan 2019, pp. 1-18), hereinafter referred to as Zhuang.

In regards to claim 1, Zhuang teaches a method for compressing a machine learning model by an electronic device, the method comprising: determining a pruning number for each of a plurality of channels included in a hidden layer of the machine learning model; ([Abstract, p. 3, Section 3, Figure 1], Channel pruning is one of the predominant approaches for deep model compression., Given a pre-trained model M, the task of Channel Pruning is to prune those redundant channels in W to save the model size and accelerate the inference speed in Eq. (1). In order to choose channels, we introduce a variant of $\ell_{2,0}$-norm $\|W\|_{2,0} = \sum_{k=1}^{c} \Omega\left(\sum_{j=1}^{n} \|W_{j,k,:,:}\|_F\right)$, where $\Omega(a) = 1$ if $a \neq 0$ and $\Omega(a) = 0$ if $a = 0$, and $\|\cdot\|_F$ represents the Frobenius norm., wherein a method for compressing a deep machine learning model computes a set of pruning numbers $\Omega(a)$, each of which is associated with a respective channel at a given (lth) hidden layer of the deep model (CNN as shown in Figure 1) but wherein this loss corresponds to the model parameters W (which are related to the pruning numbers as shown in equation 2).) determining a pruning loss of the hidden layer based on the determined pruning numbers of the hidden layer; ([p. 4, Section 3.1, Equation 2, Figure 1, equation 6], Given a pre-trained model M, existing methods [14, 28] conduct channel pruning by minimizing the reconstruction error of feature maps between the pre-trained model M and the pruned one.
Formally, the reconstruction error can be measured by the mean squared error (MSE) between feature maps of the baseline network and the pruned one as follows: <equation 3> where $Q = N \cdot n \cdot h_{out} \cdot z_{out}$ and $O^{b}_{i,j,:,:}$ denotes the feature maps of the baseline network., wherein the model compression method computes a (MSE) loss function corresponding to the differences in the output between the unpruned and pruned models such that the pruned model is based on the values of the pruning numbers $\Omega(a)$ (which are related to model parameters as shown in equation 2) and wherein it is noted that equation 6 also depicts a pruning loss for a hidden layer.) determining a compression parameter of the hidden layer based on the pruning loss of hidden layer of the machine learning model; ([p. 3, Section 3, pp. 4-5, Section 3.2, Equation 2, Figure 1], To induce sparsity, we can impose an $\ell_{2,0}$-norm constraint on W: $\|W\|_{2,0} = \sum_{k=1}^{c} \Omega\left(\sum_{j=1}^{n} \|W_{j,k,:,:}\|_F\right) \le \kappa_l$, (2) where $\kappa_l$ denotes the desired number of channels at the layer l. Or equivalently, given a predefined pruning rate $\eta \in (0, 1)$ [1, 27], it follows that $\kappa_l = \lceil \eta c \rceil$., Last, the optimization problem for discrimination-aware channel pruning can be formulated as $\min_{W} L(W)$, s.t. $\|W\|_{2,0} \le \kappa_l$, (7) where $\kappa_l < c$ is the number channels to be selected.
In our method, the sparsity of W can be either determined by a pre-defined pruning rate (See Section 3) or automatically adjusted by the stopping conditions in Section 3.5., wherein the model compression method applies various constraint conditions (e.g., equations 2, 3, 5, 6, and 7) to determine a resultant compression for each layer (a compression parameter for a layer) in the form of the number of channels that are retained (e.g., the summation of $\Omega(a)$, which will be less than $\kappa_l$ for a given layer such that this can be in the form of a pruning amount/rate) such that this compression depends on the pruning loss since the number of pruned or selected channels is optimized with respect to that loss.) and compressing the machine learning model based on the determined compression parameter of the hidden layer, wherein the compression parameter is related to a pruning of the machine learning model. ([p. 6, Section 3.5, p. 9, Section 6, Algorithm 1, Algorithm 2, equations 2, 3, 5, 6, and 7, Table 1], Given a predefined parameter $\kappa_l$ in problem (7), Algorithm 2 will be stopped if $\|W\|_{2,0} > \kappa_l$. However, in practice, the parameter $\kappa_l$ is hard to be determined. Since L is convex, $L(W^{t})$ will monotonically decrease with iteration index t in Algorithm 2. We can therefore adopt the following stopping condition: $|L(W^{t-1}) - L(W^{t})| / L(W^{0}) \le \epsilon$, (10) where $\epsilon$ is a tolerance value. If the above condition is achieved, the algorithm is stopped, and the number of selected channels will be automatically determined, i.e., $\|W^{t}\|_{2,0}$.
An empirical study over the tolerance value epsilon is put in Section 5.3., In particular for MobileNet v2, DCP improves it by reducing 30% of channels on CIFAR-10., wherein the deep model compression method iteratively (Algorithms 1, 2) compresses CNN by pruning iteratively the least significant channels in each layer based on the pruning number for each respective channel in a given hidden layer as well as on the constrained number of channels per layer (i.e., a number less than kappa_layer – a compression parameter) as well as the overall pruning loss/rate (also a compression parameter) from the application of algorithms 1 and 2 and equations 2, 3, 5, 6, and 7  in a convex constraint optimization process (with resultant overall compression shown, for example in Table 1).)
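
For illustration only, the two quantities relied upon in the mapping of claim 1 can be sketched in a few lines of Python. The sketch is not taken from Zhuang or from the application. The function names and the nested-list layout (W[j][k] holding the kernel that links input channel k to output channel j) are assumptions made for this example, and the error is computed as a plain mean of squared differences because the constant factor of the cited equation 3 is not reproduced in this action.

```python
import math


def frobenius_norm(kernel):
    """Frobenius norm of one 2-D kernel given as a list of rows."""
    return math.sqrt(sum(v * v for row in kernel for v in row))


def l20_norm(W):
    """Count the input channels that are still active.

    W[j][k] is the kernel linking input channel k to output channel j
    (hypothetical layout). A channel k counts as active when the sum of
    the Frobenius norms of its kernels is non-zero, i.e. Omega(a) = 1.
    """
    n = len(W)      # number of output channels
    c = len(W[0])   # number of input channels
    active = 0
    for k in range(c):
        a = sum(frobenius_norm(W[j][k]) for j in range(n))
        active += 1 if a != 0 else 0
    return active


def flatten(x):
    """Yield every scalar of an arbitrarily nested list."""
    if isinstance(x, (int, float)):
        yield x
    else:
        for item in x:
            yield from flatten(item)


def reconstruction_mse(O_base, O_pruned):
    """Mean squared difference between baseline and pruned feature maps.

    Assumption: plain mean over all Q entries; any constant factor used
    in the cited equation 3 is not modelled here.
    """
    base = list(flatten(O_base))
    pruned = list(flatten(O_pruned))
    if len(base) != len(pruned):
        raise ValueError("feature maps must have the same number of entries")
    return sum((b - p) ** 2 for b, p in zip(base, pruned)) / len(base)


# Toy layer: 2 output channels, 3 input channels, 1x1 kernels.
# Input channel 1 is entirely zero, so it is treated as pruned.
W_example = [
    [[[1.0]], [[0.0]], [[2.0]]],
    [[[0.5]], [[0.0]], [[0.0]]],
]
retained_channels = l20_norm(W_example)
```

In this toy layer the count of retained channels is the quantity that is compared against kappa for the layer, and the mean squared difference plays the role of the pruning loss discussed above.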

In regards to claim 2, the rejection of claim 1 is incorporated and Zhuang further teaches wherein the machine learning model comprises a plurality of hidden layers, including the hidden layer, the compression parameter comprises a pruning number of each of the hidden layers, and the determining the compression parameter of the hidden layer in the machine learning model, comprises: determining a relationship between the pruning number of respective channels of each of the hidden layers and the corresponding pruning loss of each of the hidden layers; and determining the pruning number of each of the hidden layers, based on the relationship between the pruning number of the respective channels and the corresponding pruning loss of each of the hidden layers.   ([p. 3, Section 3, p. 4, Section 3.1, Algorithm 1, Algorithm 2, equations 2, 3, 5, 6, and 7], Given a pre-trained model M, the task of Channel Pruning is to prune those redundant channels in W to save the model size and accelerate the inference speed in Eq. (1). In order to choose channels, we introduce a variant of ℓ_{2,0}-norm ||W||_{2,0} = Σ_{k=1}^{c} Ω(Σ_{j=1}^{n} ||W_{j,k,:,:}||_F), where Ω(a) = 1 if a ≠ 0 and Ω(a) = 0 if a = 0, and || · ||_F represents the Frobenius norm. To induce sparsity, we can impose an ℓ_{2,0}-norm constraint on W:  <equation 2>, In this paper, we insert P discrimination-aware losses {L_S^p}_{p=1}^{P} evenly into the network, as shown in Figure 1. Let {L_1, ..., L_P, L_{P+1}} be the layers at which we put the losses, with L_{P+1} = L being the final layer. 
For the p-th loss L_S^p, we consider doing channel pruning for layers l ∈ {L_{p−1} + 1, ..., L_p}, where L_{p−1} = 0 if p = 1., wherein the deep model to be compressed includes a plurality of hidden layers (see Figure 1 for example) such that, for a given hidden layer (lth or pth), the number of pruned channels for each layer (a compression parameter) is computed as a summation over the pruning number (Ω(a)) associated with each channel within that given hidden layer, the determination of which depends on the satisfaction of constraints including the pruning loss and wherein this compression parameter depends on the pruning number for the channels not only of the particular hidden layer but also of the others in this optimization scheme which optimizes the overall pruning/compression performance across all layers in the CNN with the number of channels pruned (or retained) at each hidden layer contingent on the channel pruning-number based pruning loss (i.e., on the presence or absence of the pruning of any given channel in the loss function).)

In regards to claim 3, the rejection of claim 2 is incorporated and Zhuang further teaches wherein, the determining the relationship between the pruning number of the respective channels of each of the hidden layers and the corresponding pruning loss of each of the hidden layers, comprises: determining a relationship between the pruning number of the respective channels of a current hidden layer and the corresponding pruning loss of the current hidden layer, based on training data of at least one next hidden layer next to the current hidden layer, wherein the training data comprises a relationship between an output channel of the at least one next hidden layer and each input channel of the at least one next hidden layer. ([p. 3, Section 3, p. 5, Section 3.3, Algorithm 1, Algorithm 2, equations 2, 3, 5, 6, and 7], Let {xi , yi} N i=1 be the training samples, where N indicates the number of samples. Given an L-layer CNN model M, let W ∈ R n×c×hf ×zf be the model parameters w.r.t. the l-th convolutional layer (or block), as shown in Figure 1. Here, hf and zf denote the height and width of filters, respectively; c and n denote the number of input and output channels, respectively. For convenience, hereafter we omit the layer index l. Let X ∈ R N×c×hin×zin and O ∈ R N×n×hout×zout be the input feature maps and the involved output feature maps, respectively… Moreover, let Xi,k,:,: be the feature map of the k-th channel for the i-th sample. Wj,k,:,: denotes the parameters w.r.t. the k-th input channel and j-th output channel. The output feature map of the j-th channel for the i-th sample, denoted by Oi,j,:,: , is computed by <equation 1> where ∗ denotes the convolutional operation., Algorithm 1 is called discrimination aware in the sense that an additional loss and the final loss are considered to fine-tune the model….At each stage of Algorithm 1, for example, in the p-th stage, we first construct the additional loss L p S and put them at layer Lp (See Figure 1). 
After that, we learn the model parameters θ w.r.t. L p S and fine-tune the model M at the same time with both the additional loss L p S and the final loss Lf . In the fine-tuning, all the parameters in M will be updated.4 Here, with the fine-tuning, the parameters regarding the additional loss can be well learned. Besides, fine-tuning is essential to compensate the accuracy loss from the previous pruning to suppress the accumulative error. After fine-tuning with L p S and Lf , the discriminative power of layers l ∈ {Lp−1 + 1, ..., Lp} can be significantly improved. Then, we can perform channel selection for the layers in {Lp−1 + 1, ..., Lp}., wherein the model compression method computes the pruning number (Ω(a)) associated with each channel in a hidden layer and the pruning loss of that layer using training data in the form of the output feature maps (i.e., the pruning loss is based on the MSE of the output feature maps) such that each (channel) output feature map for a given hidden layer is computed according to the (channel) input feature map into that layer and the (pruning-sensitive) model parameters for that layer and such that both the (channel) input feature map into a current layer is the (channel) output from a previous/next layer (see Figure 1) which, therefore, in turn is based on the (channel) input feature map and (channel) output associated with the predecessor/next hidden layer (and accordingly similarly on all hidden layers in the backward next direction) but wherein that relationship is also based on the input/output loss function associated with subsequent (forward next) layers according to the propagation of the final loss back to the current hidden layer (as shown also in Figure 1, even if used in the fine tuning process).)  

In regards to claim 4, the rejection of claim 2 is incorporated and Zhuang further teaches wherein, the determining the pruning number of each of the hidden layers, based on the relationship between the pruning number of the respective channels of each of the hidden layers and the corresponding pruning loss of each of the hidden layers, comprises: determining the pruning number of each of the hidden layers, based on the relationship between the pruning number of the respective channels of each of the hidden layers and the corresponding pruning loss of each of the hidden layers, and a respective weight of each of the hidden layers.  ([p. 4, Section 3.1, p. 4, Section 3.2, Algorithm 1, Algorithm 2, equations 2, 3, 5, 6, and 7], In this paper, we seek to do channel pruning by keeping those channels that really contribute to the discriminative power of the network…. However, it is not practical when the network is very deep. In fact, for deep models, its shallow layers often have little discriminative power due to the long path of propagation. …To increase the discriminative power of intermediate layers, one can introduce additional losses to the intermediate layers of the deep networks [8, 22, 43]. In this paper, we insert P discrimination-aware losses {Lp S } P p=1 evenly into the network, as shown in Figure 1., As shown in Figure 1, each loss uses the output of layer Lp as the input feature maps…. The discrimination-aware loss w.r.t. the p-th loss is formulated as <equation 5> where I{·} is the indicator function, θ ∈ R np×m denotes the classifier weights of the fully connected layer, np denotes the number of input channels of the fully connected layer and m is the number of classes. 
Note that we can use other losses such as angular softmax loss [26] as the additional loss., wherein the model compression method determines the number of channels pruned in each hidden layer (a pruning number for a hidden layer in the form of the summation of Ω(a) as previously pointed out) through a constrained optimization process which includes the summation of the pruning loss and a discrimination aware loss (L_S^P – equation 5), the latter of  which weights the significance of different channels in a given hidden layer in contributing to a final loss (i.e., equation 5 includes a soft max channel-based weighting that varies according to the strength of the weight θ associated with a particular channel) such that this forms a different weighting across different hidden layers.) 

In regards to claim 5, the rejection of claim 4 is incorporated and Zhuang further teaches wherein, the determining the pruning number of each of the hidden layers, based on the relationship between the pruning number of the respective channels of each of the hidden layers and the corresponding pruning losses of each of the hidden layers, and the respective weight of each of the hidden layers, comprises: determining a relationship between the pruning number of the respective channels of each of the hidden layers and the corresponding weighted pruning loss of each of the hidden layers, based on the relationship between the pruning number of the respective channels of each of the hidden layers and the corresponding pruning loss of each of the hidden layers, and the weight of each of the hidden layers; and determining the pruning number of each hidden layer, based on the relationship between the pruning number of the respective channels of each of the hidden layers and the corresponding weighted pruning loss of each of the hidden layers.   ([p. 4, Section 3.1, p. 4, Section 3.2, p. 5, Section 3.3, Algorithm 1, Algorithm 2, equations 2, 3, 5, 6, 7, and 8], In this paper, we seek to do channel pruning by keeping those channels that really contribute to the discriminative power of the network…. However, it is not practical when the network is very deep. In fact, for deep models, its shallow layers often have little discriminative power due to the long path of propagation. …To increase the discriminative power of intermediate layers, one can introduce additional losses to the intermediate layers of the deep networks [8, 22, 43]. In this paper, we insert P discrimination-aware losses {Lp S } P p=1 evenly into the network, as shown in Figure 1., As shown in Figure 1, each loss uses the output of layer Lp as the input feature maps…. The discrimination-aware loss w.r.t. 
the p-th loss is formulated as <equation 5> where I{·} is the indicator function, θ ∈ R np×m denotes the classifier weights of the fully connected layer, np denotes the number of input channels of the fully connected layer and m is the number of classes. Note that we can use other losses such as angular softmax loss [26] as the additional loss., At each stage of Algorithm 1, for example, in the p-th stage, we first construct the additional loss L p S and put them at layer Lp (See Figure 1). After that, we learn the model parameters θ w.r.t. L p S and fine-tune the model M at the same time with both the additional loss L p S and the final loss Lf ., wherein the model compression method determines the pruning number (Ω(a)) associated with respective channels in each hidden layer based on/according to the pruning loss for a given layer and the associated loss-based weighting of importance of that layer (e.g., equations 5 and 6 as previously pointed out) such that this relationship between the channel-specific pruning number (Ω(a)) and the pruning loss of each hidden layer (equation 3, as previously pointed out) as well as the weighted pruning loss term (equation 5, also as previously pointed out) such that the pruning number of each hidden layer (e.g., the summation of  Ω(a) for each hidden layer as well as the specific corresponding contribution to the pruning loss in equation 3) and the layer weighting pruning loss (equation 6) both determine the pruning number (Ω(a)) for each layer since all of the layers are optimized together according to the greedy channel selection process (algorithm 2) (i.e., optimization in any layer is affected by optimization in other layers to minimize the overall/final pruning loss).)

In regards to claim 6, the rejection of claim 2 is incorporated and Zhuang further teaches wherein, the determining the relationship between the pruning number of the respective channels of each hidden layer and the corresponding pruning loss of each of the hidden layers, comprises: calculating, using an incremental manner or a decreasing manner, the pruning loss of each current candidate channel to be pruned respectively, …, and the decreasing manner includes any current candidate channel to be pruned comprising a pruning number that corresponds to remaining pruned channels after removing at least one channel from pruned channels determined by the previous channel pruning number; and determining a current candidate channel to be pruned with a minimum pruning loss as the pruned channel corresponding to the current channel determined by the previous channel pruning number, to obtain the relationship between the current channel pruning number and the corresponding pruning loss of each of the hidden layers.  ([p. 5, Section 3.4, Algorithm 1, Algorithm 2, equations 2, 3, 5, 6, 7, and 8], To be specific, we first remove all the channels and then select those channels that really contribute to the discriminative power of the deep networks. Let A ⊂ {1, . . . , c} be the index set of the selected channels, where A is empty at the beginning. As shown in Algorithm 2, the channel selection method can be implemented in two steps. First, we select the most important channels of input feature maps. At each iteration, we compute the gradients Gj = ∂L/∂Wj , where Wj denotes the parameters for the j-th input channel. We choose the channel k = arg maxj /∈A{||Gj ||F } as an active channel and put k into A. Second, once A is determined, we optimize W w.r.t. 
the selected channels by minimizing the following problem:, wherein the model compression method determines the pruning number (Ω(a)) associated with respective channels in each hidden layer based on/according to the pruning loss for a given layer (equation 3, as previously pointed out) using a greedy channel selection optimization process (see Algorithm 2) which incrementally augments the number of selected (unpruned) channels (thus iteratively decreasing the number of pruned channels) with the best currently non-selected candidate channel (until a stopping criterion is reached) such that this greedy channel selection process at any given iteration includes the determined pruning number (Ω(a)) as being 0 or 1 depending on whether the channel is selected or not for inclusion, with that number changing for any given channel initially part of the pruned set but transitioned into the non-pruned set according to the pruning loss, and wherein it is noted that the claims require either the incremental function or the decreasing function, not both.)
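
For illustration only, the greedy selection described in the quoted passage (start with an empty set, repeatedly add the unselected channel with the largest gradient norm, then re-fit) can be sketched as follows. This is a simplified sketch under stated assumptions and not the implementation of Zhuang: the callables grad_norms and loss_after_fit, the parameter kappa used as a fixed channel budget, and the toy importance values are all hypothetical stand-ins for the gradient computation and weight optimization of Algorithm 2.

```python
def greedy_channel_selection(num_channels, grad_norms, loss_after_fit, kappa):
    """Greedily select up to kappa channels.

    grad_norms(selected)     -> list with one gradient norm per channel
    loss_after_fit(selected) -> loss after re-fitting on the selected set
    Both callables are hypothetical placeholders supplied by the caller.

    Returns the selected channel indices and a history of
    (number of pruned channels, loss) pairs, one per iteration.
    """
    selected = []
    history = []
    while len(selected) < min(kappa, num_channels):
        norms = grad_norms(selected)
        candidates = [j for j in range(num_channels) if j not in selected]
        # Pick the unselected channel with the largest gradient norm.
        k = max(candidates, key=lambda j: norms[j])
        selected.append(k)
        history.append((num_channels - len(selected), loss_after_fit(selected)))
    return selected, history


# Toy problem (assumed): each channel has a fixed importance, the loss is
# the sum of squared importances of the channels that remain pruned.
importance = [0.1, 3.0, 0.5, 2.0]


def toy_loss(selected):
    return sum(v * v for j, v in enumerate(importance) if j not in selected)


def toy_grad_norms(selected):
    return [0.0 if j in selected else 2.0 * abs(v)
            for j, v in enumerate(importance)]


selected, history = greedy_channel_selection(4, toy_grad_norms, toy_loss, kappa=2)
```

The returned history pairs each number of pruned channels with the loss obtained at that point, which is one way to picture a relationship between a channel pruning number and the corresponding pruning loss.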

In regards to claim 15, the rejection of claim 1 is incorporated and Zhuang further teaches further comprising: selecting at least one of the following machine learning models as the machine learning model: a machine learning model before compression; ([p. 3, Section 3], Given a pre-trained model M, the task of Channel Pruning is to prune those redundant channels in W to save the model size and accelerate the inference speed in Eq. (1)., wherein the model compression framework starts with a pre-trained model (with known accuracy and parameters).) a machine learning model obtained after compressing at least one hidden layer; ([Algorithm 1, Algorithm 2], wherein the model parameters W_A are updated after pruning channels in a hidden layer in a fine-tuning process such that the optimization process.) a machine learning model after historical fine-tuning; and fine-tuning the compressed model based on the selected machine learning model to obtain an optimized model. ([p. 5, Section 3.3, Algorithm 1, Algorithm 2], In the fine-tuning, all the parameters in M will be updated., wherein the model parameters of the model are updated/optimized after pruning channels in a hidden layer using fine-tuning (based on loss functions) such that at any given iteration the model parameters will have included also any previous/historical fine-tuning.)
 
In regards to claim 16, the rejection of claim 15 is incorporated and Zhuang further teaches wherein the fine-tuning the compressed model based on the selected machine learning model to obtain the optimized model, comprises: based on determining that the fine-tuned model does not satisfy a preset condition, repeatedly performing the operation of determining the compression parameter of the hidden layer in the machine learning model to be optimized, the operation of compressing the machine learning model to be optimized based on the compression parameter of the hidden layer, and the operation of fine-tuning the compressed model based on the learning model, until the fine-tuned model satisfies the preset condition, to obtain the optimized model.  ([p. 3, Section 3.3, p. 6, Section 3.5, Algorithm 1, Algorithm 2], In fact, at each stage we will consider two losses only, i.e., L_S^p and the final loss L_f., Given a predefined parameter κ_l in problem (7), Algorithm 2 will be stopped if ||W||_{2,0} > κ_l. However, in practice, the parameter κ_l is hard to be determined. Since L is convex, L(W^t) will monotonically decrease with iteration index t in Algorithm 2. We can therefore adopt the following stopping condition: <equation 10> where epsilon is a tolerance value., wherein, as shown in Algorithms 1 and 2, once the fine-tuned model is computed in algorithm 1 at a given iteration, the greedy channel selection algorithm first determines/updates the loss parameter L based on that model and imposes an exit condition based on the satisfaction of stopping conditions (preset conditions), interpreted as corresponding to the application of equation 10 such that if the loss function, computed in response to the determination of the compression parameter/rate for each hidden layer and subsequent fine-tuning, satisfies equation 10, then the exit condition is satisfied resulting in the output of the optimized model.)  
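
For illustration only, the stopping condition referred to above as equation 10 (the change in loss between consecutive iterations, divided by the initial loss, not exceeding a tolerance epsilon) can be sketched as follows. The function names and the sample loss values are assumptions made for this example and do not come from Zhuang.

```python
def relative_loss_change(loss_prev, loss_curr, loss_init):
    """|L(W^{t-1}) - L(W^t)| / L(W^0), the left-hand side of the condition."""
    if loss_init <= 0:
        raise ValueError("initial loss must be positive")
    return abs(loss_prev - loss_curr) / loss_init


def first_stopping_iteration(losses, epsilon):
    """Return the first iteration t >= 1 at which the condition holds.

    losses[t] is the loss after iteration t and losses[0] the initial loss.
    Returns None when the condition is never met, i.e. iteration continues.
    """
    for t in range(1, len(losses)):
        if relative_loss_change(losses[t - 1], losses[t], losses[0]) <= epsilon:
            return t
    return None


# Assumed sample loss trajectory: large early improvements, then a plateau.
sample_losses = [10.0, 6.0, 4.0, 3.9, 3.89]
stop_at = first_stopping_iteration(sample_losses, epsilon=0.05)
```

While the function returns None the preset condition is not satisfied and the select, compress, and fine-tune operations would be repeated. Once it returns an iteration index, the loop exits.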

Claim 18 is also rejected because it is just a system implementation of the same subject matter of claim 1 which can be found in Zhuang. It is noted that claim 18 also recites a processor with memory having instructions, which is also found in Zhuang (e.g., [p. 6, Section 4.1] We implement the proposed method on PyTorch [32]…. The source code of our method can be found at https://github.com/SCUT-AILab/DCP.).

Claim 19/18 is also rejected because it is just a system implementation of the same subject matter of claim 2/1 which can be found in Zhuang.

Claim 20 is also rejected because it is just a computer program product implementation of the same subject matter of claim 1 which can be found in Zhuang. It is noted that claim 20 also recites a computer readable storage medium with program code which is also found in Zhuang (e.g., [p. 6, Section 4.1] We implement the proposed method on PyTorch [32]…. The source code of our method can be found at https://github.com/SCUT-AILab/DCP.).

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 7-10 and 12-14 are rejected under 35 U.S.C. 103 as being unpatentable over Zhuang in view of Ye et al. (“A Unified Framework of DNN Weight Pruning and Weight Clustering/Quantization Using ADMM”, https://arxiv.org/pdf/1811.01907.pdf, arXiv:1811.01907v1 [cs.NE] 5 Nov 2018, pp. 1-9), hereinafter referred to as Ye.

In regards to claim 7, the rejection of claim 1 is incorporated and Zhuang does not further teach wherein the compression parameter comprises a quantization rate, and the determining the compression parameter of the hidden layer in the machine learning model, comprises: determining a relationship between respective candidate quantization rates of each of the hidden layers in the machine learning model and corresponding quantization loss of each of the hidden layers; and determining a quantization rate of each of the hidden layers based on a relationship between the respective candidate quantization rates of each of the hidden layers and the corresponding quantization loss of each of the hidden layers.  Zhuang indicates that he plans to incorporate quantization compression strategies into his deep model compression method as future work (p. 9, Section 6).
However, Ye, in the analogous environment of deep model compression, teaches wherein the compression parameter comprises a quantization rate, and the determining the compression parameter of the hidden layer in the machine learning model, comprises: determining a relationship between respective candidate quantization rates of each of the hidden layers in the machine learning model and corresponding quantization loss of each of the hidden layers; and determining a quantization rate of each of the hidden layers based on a relationship between the respective candidate quantization rates of each of the hidden layers and the corresponding quantization loss of each of the hidden layers.  ([p. 3, “problem formulation”, p. 4, “Simplification for the Proposed Framework”, p. 5, “parameter initialization”, Algorithms 1-3, equations 6-11, 15], Consider an N-layer DNN, the collections of weights and biases of the i-th layer are respectively denoted by W_i and b_i; The loss function of the N-layer DNN is denoted by f({W_i}_{i=1}^{N}, {b_i}_{i=1}^{N}). When we combine DNN weight pruning with clustering or quantization, the overall problem is defined by <equation 3>… When we combine weight pruning with weight quantization, S'_i = {W_i | the weights in W_i only take values from the set {Q_1, Q_2, · · · , Q_{M_i}}}. Here the Q values are quantization levels, and we consider equal-distance quantization (the same distance between quantization levels) to facilitate hardware implementation., After weight pruning, we solve DNN weight clustering or quantization problem. We consider the constraints for weight clustering or quantization on the pruned model (the remaining weights)… . We use the weight pruning ratios α_i’s and clustering/quantization levels M_i’s from prior work (Han et al. 
2015; Han, Mao, and Dally 2016) as starting points, and further increase the pruning ratios and decrease the number of clustering/quantization level., For finding a value q_i, we denote w_i^j as j-th weight in layer i and f(w_i^j) as a quantization function to the closest quantization level. Then the total square error in a single quantization step is given by Σ_j |w_i^j − f(w_i^j)|^2. In order to minimize the total square error, we use binary search method to determine q_i., wherein a deep model compression framework includes a quantization rate in the form of the number of quantization levels M_i associated with a given layer (algorithm 1) such that the overall compression of the model includes both a pruning rate (alpha_i) and the quantization rate/number of quantization levels such that the number of quantization levels (compression parameter) for a given hidden layer is based on a computation of the quantization loss (e.g., a loss function to be minimized in the context of clustering or quantization including f in equations 3, 5, 15) such that, for each hidden layer, the optimal quantization rate (number of quantization levels) is iteratively determined (algorithm 1) based on the candidate number of quantization levels and the computed (quantization) loss (i.e., the solution of the minimization problem in equation 15 along with associated sub-problems expressed in equations 6-11) and wherein it is noted that the MSE quantization error (difference between quantized and unquantized weights) is also a quantization loss to be minimized in the framework.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhuang to incorporate the teachings of Ye for the compression parameter to comprise a quantization rate, and the determining the compression parameter of the hidden layer in the machine learning model, to comprise: determining a relationship between respective candidate quantization rates of each of the hidden layers in the machine learning model and corresponding quantization loss of each of the hidden layers; and determining a quantization rate of each of the hidden layers based on a relationship between the respective candidate quantization rates of each of the hidden layers and the corresponding quantization loss of each of the hidden layers. The modification would have been obvious because one of ordinary skill would have been motivated to achieve improved compression rates with nearly no accuracy degradation by implementing a unified framework for pruning and quantizing in which both pruning and quantization are performed in a coupled constrained optimization process to minimize common loss functions (Ye, [Abstract, pp. 8-9, “Conclusion”, Tables 1, 2]).
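
For illustration only, the total square error between weights and their nearest quantization levels, evaluated for several candidate numbers of equal-distance levels, can be sketched as follows. This is not the method of Ye. The placement of the levels evenly between the smallest and largest weight is an assumption made for this example, and a direct evaluation of each candidate is substituted for the binary search mentioned in the quoted passage. All function names are hypothetical.

```python
def equal_distance_levels(lo, hi, num_levels):
    """Evenly spaced quantization levels from lo to hi (assumed placement)."""
    if num_levels < 2:
        raise ValueError("need at least two levels")
    step = (hi - lo) / (num_levels - 1)
    return [lo + i * step for i in range(num_levels)]


def quantize(w, levels):
    """Map a weight to its closest quantization level."""
    return min(levels, key=lambda q: abs(w - q))


def total_square_error(weights, levels):
    """Sum over weights of |w - f(w)|^2, with f the nearest-level mapping."""
    return sum((w - quantize(w, levels)) ** 2 for w in weights)


def error_by_level_count(weights, candidate_counts):
    """Quantization error for each candidate number of levels.

    Returns a dict mapping each candidate count to its total square error.
    """
    lo, hi = min(weights), max(weights)
    return {
        m: total_square_error(weights, equal_distance_levels(lo, hi, m))
        for m in candidate_counts
    }


# Toy layer weights (assumed) and two candidate level counts.
layer_weights = [0.0, 0.5, 1.0]
errors = error_by_level_count(layer_weights, [2, 3])
```

The resulting dictionary pairs each candidate number of quantization levels with a quantization error for the layer, which is one way to picture a relationship between candidate quantization rates and the corresponding quantization loss.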

In regards to claim 8, the rejection of claim 7 is incorporated and Zhuang further teaches wherein the determining the relationship between respective candidate … of each of the hidden layers and the corresponding … loss of each of the hidden layers, comprises: determining the relationship between respective candidate … of each of the hidden layers and the corresponding … loss of each of the hidden layers, based on training data of a current hidden layer, wherein, the training data of the current hidden layer comprises a relationship between an output channel of the current hidden layer and each input channel of the current hidden layer.  ([p. 3, Section 3, Algorithm 1, Algorithm 2], Let {x_i, y_i}_{i=1}^{N} be the training samples, where N indicates the number of samples. Given an L-layer CNN model M, let W ∈ R^{n×c×h_f×z_f} be the model parameters w.r.t. the l-th convolutional layer (or block), as shown in Figure 1. Here, h_f and z_f denote the height and width of filters, respectively; c and n denote the number of input and output channels, respectively. For convenience, hereafter we omit the layer index l. Let X ∈ R^{N×c×h_in×z_in} and O ∈ R^{N×n×h_out×z_out} be the input feature maps and the involved output feature maps, respectively… Moreover, let X_{i,k,:,:} be the feature map of the k-th channel for the i-th sample. W_{j,k,:,:} denotes the parameters w.r.t. the k-th input channel and j-th output channel. 
The output feature map of the j-th channel for the i-th sample, denoted by O_{i,j,:,:}, is computed by <equation 1> where ∗ denotes the convolutional operation., wherein the model compression method computes the compression parameters (pruning number Ω(a) for a channel, number of pruned channels per layer) using an optimization (across candidate compression parameters such as specific channels) based on an MSE loss function of that layer using training data in the form of the feature maps (which comprise the training data) such that the output feature map for a given hidden layer (in the CNN) is computed according to the input feature map into that layer.)
However, Zhuang does not explicitly teach wherein the determining the relationship between respective candidate quantization rates of each of the hidden layers and the corresponding quantization loss of each of the hidden layers, comprises: determining the relationship between respective candidate quantization rates of each of the hidden layers and the corresponding quantization loss of each of the hidden layers, based on training data of a current hidden layer, … Although Zhuang teaches an MSE-based optimization framework, as noted above, Zhuang does not address compression via quantization (even though he acknowledges an intention for incorporating it).
However, Ye, in the analogous environment of deep model compression, teaches wherein the determining the relationship between respective candidate quantization rates of each of the hidden layers and the corresponding quantization loss of each of the hidden layers, comprises: determining the relationship between respective candidate quantization rates of each of the hidden layers and the corresponding quantization loss of each of the hidden layers, based on training data of a current hidden layer….  ([p. 3, “problem formulation”, p. 4, “Simplification for the Proposed Framework”, p. 5, “parameter initialization”, p. 9, “Iterative Weight Quantization and Retraining”, Algorithms 1-3, equations 6-11, 15], For finding a value q_i, we denote w_i^j as j-th weight in layer i and f(w_i^j) as a quantization function to the closest quantization level. Then the total square error in a single quantization step is given by Σ_j |w_i^j − f(w_i^j)|^2. In order to minimize the total square error, we use binary search method to determine q_i., Consider an N-layer DNN, the collections of weights and biases of the i-th layer are respectively denoted by W_i and b_i; The loss function of the N-layer DNN is denoted by f({W_i}_{i=1}^{N}, {b_i}_{i=1}^{N}). When we combine DNN weight pruning with clustering or quantization, the overall problem is defined by <equation 3>… When we combine weight pruning with weight quantization, S'_i = {W_i | the weights in W_i only take values from the set {Q_1, Q_2, · · · , Q_{M_i}}}. Here the Q values are quantization levels, and we consider equal-distance quantization (the same distance between quantization levels) to facilitate hardware implementation., After weight pruning, we solve DNN weight clustering or quantization problem. We consider the constraints for weight clustering or quantization on the pruned model (the remaining weights)… . We use the weight pruning ratios α_i’s and clustering/quantization levels M_i’s from prior work (Han et al. 
2015; Han, Mao, and Dally 2016) as starting points, and further increase the pruning ratios and decrease the number of clustering/quantization level., To address this degradation, we present an iterative weight quantization method. In our method, we iteratively project a portion of weights to the nearby quantization levels, fix these values (i.e., we quantize these weights), and retrain the rest of them. More specifically, we quantize α% of weights closest to every quantization level after the ADMM procedure and then retrain the rest of weights. After we quantize the weights, we observe accuracy degradation of the DNN, while the retraining step can retrieve the accuracy. After the retraining step, we again quantize α% of weights closest to every quantization level and implement another retraining step., wherein a deep model compression framework determines distinct quantization rates (number of quantization levels) for the weights in each layer based on the quantization loss (e.g., a loss function to be minimized in the context of clustering or quantization including f in equations 3, 5, 15) for each layer (i.e., the layer-specific updates are dependent upon weights and biases optimized for that layer based on a loss function that includes the loss for that layer such as seen in equations 7, 9, 10, and 11 based on the loss function f) such that this loss is based on the training data that includes that for a given layer since the weights for each layer are trained and retrained after each ADMM iteration for each hidden layer, and wherein it is noted that the MSE quantization error (difference between quantized and unquantized weights) is also a quantization loss to be minimized on a layer-by-layer sense in the framework based also on the weights being trained/retrained after each ADMM iteration)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhuang to incorporate the teachings of Ye for the determining the relationship between respective candidate quantization rates of each of the hidden layers and the corresponding quantization loss of each of the hidden layers, to comprise: determining the relationship between respective candidate quantization rates of each of the hidden layers and the corresponding quantization loss of each of the hidden layers, based on training data of a current hidden layer, wherein, the training data of the current hidden layer comprises a relationship between an output channel of the current hidden layer and each input channel of the current hidden layer. The modification would have been obvious because one of ordinary skill would have been motivated to achieve improved compression rates with nearly no accuracy degradation by implementing a unified framework for pruning and quantizing in which both pruning and quantization are performed in a coupled constrained optimization process to minimize common loss functions (Ye, [Abstract, pp. 8-9, “Conclusion”, Tables 1, 2]).
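For illustration only, the relationship between candidate quantization rates and a per-layer quantization loss can be sketched as below, using the squared-error measure quoted from Ye. This is a minimal sketch under stated assumptions, not Ye's ADMM procedure and not the claimed method: the function names are hypothetical, the quantization levels are assumed to be equally spaced between the smallest and largest weight of the layer (Ye instead determines the interval qi by binary search), and the weights are synthetic.

```python
import numpy as np

def quantize_equal_distance(weights, num_levels):
    """Project weights onto num_levels equally spaced levels (assumed to span [min, max])."""
    weights = np.asarray(weights, dtype=float)
    lo, hi = float(weights.min()), float(weights.max())
    if num_levels < 2 or hi == lo:
        # Degenerate case: a single level placed at the midpoint.
        return np.full_like(weights, (lo + hi) / 2.0)
    step = (hi - lo) / (num_levels - 1)
    return lo + np.round((weights - lo) / step) * step

def quantization_loss_curve(weights, candidate_levels):
    """Map each candidate number of levels to the total squared quantization error."""
    weights = np.asarray(weights, dtype=float)
    return {
        m: float(np.sum((weights - quantize_equal_distance(weights, m)) ** 2))
        for m in candidate_levels
    }

# Synthetic stand-in for the weights of one hidden layer (hypothetical data).
rng = np.random.default_rng(0)
layer_weights = rng.normal(size=1000)
curve = quantization_loss_curve(layer_weights, [2, 4, 8, 16])
for levels, loss in curve.items():
    print(levels, loss)
```

The resulting dictionary is one concrete form of a "relationship" between candidate rates and loss for a single layer; repeating it per layer gives the layer-by-layer view discussed above.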

In regards to claim 9, the rejection of claim 7 is incorporated and Zhuang does not further teach wherein the determining the quantization rate of each of the hidden layers based on the relationship between the respective candidate quantization rates of each of the hidden layer and the corresponding quantization loss of each of the hidden layers, comprises: determining the quantization rate of each of the hidden layers based on the relationship between the respective candidate quantization rates of each of the hidden layers and the corresponding quantization loss of each of the hidden layers, and the weight of each of the hidden layers.  As noted above, Zhuang does not address compression via quantization (even though he acknowledges an intention for incorporating it).
However, Ye, in the analogous environment of deep model compression, teaches wherein the determining the quantization rate of each of the hidden layers based on the relationship between the respective candidate quantization rates of each of the hidden layer and the corresponding quantization loss of each of the hidden layers, comprises: determining the quantization rate of each of the hidden layers based on the relationship between the respective candidate quantization rates of each of the hidden layers and the corresponding quantization loss of each of the hidden layers, and the weight of each of the hidden layers.  ([p. 3, “problem formulation”, p. 4, “Simplification for the Proposed Framework”, p. 5, “parameter initialization”, p. 9, “Iterative Weight Quantization and Retraining”, Algorithms 1-3, equations 6-11, 15], For finding a value qi, we denote w_i^j as j-th weight in layer i and f(w_i^j) as a quantization function to the closest quantization level. Then the total square error in a single quantization step is given by Σ_j |w_i^j − f(w_i^j)|^2. In order to minimize the total square error, we use binary search method to determine qi., Consider an N-layer DNN, the collections of weights and biases of the i-th layer are respectively denoted by Wi and bi; The loss function of the N-layer DNN is denoted by f({Wi}_{i=1}^N, {bi}_{i=1}^N). When we combine DNN weight pruning with clustering or quantization, the overall problem is defined by <equation 3>… When we combine weight pruning with weight quantization, S'i = {Wi | the weights in Wi only take values from the set {Q1, Q2, · · · , QMi}}. Here the Q values are quantization levels, and we consider equal-distance quantization (the same distance between quantization levels) to facilitate hardware implementation., After weight pruning, we solve DNN weight clustering or quantization problem. We consider the constraints for weight clustering or quantization on the pruned model (the remaining weights)… . 
We use the weight pruning ratios αi’s and clustering/quantization levels Mi’s from prior work (Han et al. 2015; Han, Mao, and Dally 2016) as starting points, and further increase the pruning ratios and decrease the number of clustering/quantization level., To address this degradation, we present an iterative weight quantization method. In our method, we iteratively project a portion of weights to the nearby quantization levels, fix these values (i.e., we quantize these weights), and retrain the rest of them. More specifically, we quantize α% of weights closest to every quantization level after the ADMM procedure and then retrain the rest of weights. After we quantize the weights, we observe accuracy degradation of the DNN, while the retraining step can retrieve the accuracy. After the retraining step, we again quantize α% of weights closest to every quantization level and implement another retraining step., wherein a deep model compression framework determines distinct quantization rates (number of quantization levels) for the weights in each respective layer based on the quantization loss (e.g., a loss function to be minimized in the context of clustering or quantization including f in equations 3, 5, 15 but also the quantization loss corresponding to the difference between quantized and unquantized weights as previously pointed out) for each layer such that the quantization rate (number of quantization levels) is based on the minimization of the (various) quantization losses for each hidden layer (and across the set of hidden layers) and a weight of each hidden layer in the form of the weight clusters but also in the form of a loss function (equation 15) which includes the weight rho_i for each layer.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhuang to incorporate the teachings of Ye for the determining the quantization rate of each of the hidden layers based on the relationship between the respective candidate quantization rates of each of the hidden layer and the corresponding quantization loss of each of the hidden layers, to comprise: determining the quantization rate of each of the hidden layers based on the relationship between the respective candidate quantization rates of each of the hidden layers and the corresponding quantization loss of each of the hidden layers, and the weight of each of the hidden layers. The modification would have been obvious because one of ordinary skill would have been motivated to achieve improved compression rates with nearly no accuracy degradation by implementing a unified framework for pruning and quantizing in which both pruning and quantization are performed in a coupled constrained optimization process to minimize common weighted or weight-based loss functions (Ye, [Abstract, pp. 8-9, “Conclusion”, Equation 15, Tables 1, 2]).

In regards to claim 10, the rejection of claim 9 is incorporated and Zhuang does not further teach wherein the determining the quantization rate of each of the hidden layers based on the relationship between the respective candidate quantization rates of each of the hidden layers and the corresponding quantization loss of each of the hidden layers, and the weight of each of the hidden layers, comprises: determining a relationship between the respective candidate quantization rates of each of the hidden layers and the corresponding weighted quantization loss, based on the relationship between the respective candidate quantization rates of each of the hidden layers and the corresponding quantization loss, and the weight of each of the hidden layers; and determining the quantization rate of each of the hidden layers, based on the relationship between the respective candidate quantization rates of each of the hidden layers and the corresponding weighted quantization loss of each of the hidden layers.  As noted above, Zhuang does not address compression via quantization (even though he acknowledges an intention for incorporating it).
However, Ye, in the analogous environment of deep model compression, teaches wherein the determining the quantization rate of each of the hidden layers based on the relationship between the respective candidate quantization rates of each of the hidden layers and the corresponding quantization loss of each of the hidden layers, and the weight of each of the hidden layers, comprises: determining a relationship between the respective candidate quantization rates of each of the hidden layers and the corresponding weighted quantization loss, based on the relationship between the respective candidate quantization rates of each of the hidden layers and the corresponding quantization loss, and the weight of each of the hidden layers; and determining the quantization rate of each of the hidden layers, based on the relationship between the respective candidate quantization rates of each of the hidden layers and the corresponding weighted quantization loss of each of the hidden layers. ([p. 3, “problem formulation”, p. 4, “Simplification for the Proposed Framework”, p. 5, “parameter initialization”, p. 9, “Iterative Weight Quantization and Retraining”, Algorithms 1-3, equations 6-11, 15], For finding a value qi, we denote w_i^j as j-th weight in layer i and f(w_i^j) as a quantization function to the closest quantization level. Then the total square error in a single quantization step is given by Σ_j |w_i^j − f(w_i^j)|^2. In order to minimize the total square error, we use binary search method to determine qi., Consider an N-layer DNN, the collections of weights and biases of the i-th layer are respectively denoted by Wi and bi; The loss function of the N-layer DNN is denoted by f({Wi}_{i=1}^N, {bi}_{i=1}^N). When we combine DNN weight pruning with clustering or quantization, the overall problem is defined by <equation 3>… When we combine weight pruning with weight quantization, S'i = {Wi | the weights in Wi only take values from the set {Q1, Q2, · · · , QMi}}. 
Here the Q values are quantization levels, and we consider equal-distance quantization (the same distance between quantization levels) to facilitate hardware implementation., After weight pruning, we solve DNN weight clustering or quantization problem. We consider the constraints for weight clustering or quantization on the pruned model (the remaining weights)… . We use the weight pruning ratios αi’s and clustering/quantization levels Mi’s from prior work (Han et al. 2015; Han, Mao, and Dally 2016) as starting points, and further increase the pruning ratios and decrease the number of clustering/quantization level., To address this degradation, we present an iterative weight quantization method. In our method, we iteratively project a portion of weights to the nearby quantization levels, fix these values (i.e., we quantize these weights), and retrain the rest of them. More specifically, we quantize α% of weights closest to every quantization level after the ADMM procedure and then retrain the rest of weights. After we quantize the weights, we observe accuracy degradation of the DNN, while the retraining step can retrieve the accuracy. 
After the retraining step, we again quantize α% of weights closest to every quantization level and implement another retraining step., wherein a deep model compression framework determines distinct quantization rates (number of quantization levels according to discrete candidate numbers of quantization levels evaluated over each ADMM iteration) for each respective layer based on a weighted quantization loss (e.g., see equation 15 which is being interpreted as a loss function with the second term specifically including a layer-specific weight rho_i such that the minimization is used to determine the quantized weights for w_i for any given layer i) such that this constrained optimization framework determines a relationship between the number of quantization levels and the weighted (quantization) loss based on distinct quantization levels for each hidden layer and their respective (quantization) loss as well as (alternatively) the weight associated with each hidden layer corresponding to the weight clusters for each hidden layer; wherein this iterative optimization scheme thereby determines the optimal number of quantization levels (quantization rate – including the alpha’s) for each hidden layer based on the (weighted) loss function response for the candidate number of quantization levels for each layer.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhuang to incorporate the teachings of Ye for the determining the quantization rate of each of the hidden layers based on the relationship between the respective candidate quantization rates of each of the hidden layers and the corresponding quantization loss of each of the hidden layers, and the weight of each of the hidden layers, to comprise: determining a relationship between the respective candidate quantization rates of each of the hidden layers and the corresponding weighted quantization loss, based on the relationship between the respective candidate quantization rates of each of the hidden layers and the corresponding quantization loss, and the weight of each of the hidden layers; and determining the quantization rate of each of the hidden layers, based on the relationship between the respective candidate quantization rates of each of the hidden layers and the corresponding weighted quantization loss of each of the hidden layers. The modification would have been obvious because one of ordinary skill would have been motivated to achieve improved compression rates with nearly no accuracy degradation by implementing a unified framework for pruning and quantizing in which both pruning and quantization are performed in a coupled constrained optimization process to minimize common weighted or weight-based loss functions (Ye, [Abstract, pp. 8-9, “Conclusion”, Equation 15, Tables 1, 2]).

In regards to claim 12, the rejection of claim 1 is incorporated and Zhuang further teaches wherein the determining the compression parameter of the hidden layer in the machine learning model, comprises: determining the compression parameter of the hidden layer, based on a loss relationship and an overall compression target parameter of the machine learning model, wherein, based on the compression parameter being a pruning number, the loss relationship comprises a relationship between the pruning number of the respective channels of each hidden layer in the hidden layer in the machine learning model and the corresponding weighted pruning loss of each of a plurality of hidden layers, … based on the compression parameter comprising a pruning number …, then the loss relationship comprises the relationship between the pruning number of respective channels of each of the hidden layers in the machine learning model and the corresponding weighted pruning loss, ….  ([p. 3, Section 3, p. 3, Section 3.3, pp. 4-5, Section 3.2, p. 6, Section 3.5, Algorithm 1, Algorithm 3, Equation 2, Figure 1], To induce sparsity, we can impose an ℓ2,0-norm constraint on W: ||W||2,0 = Σ_{k=1}^{c} Ω(Σ_{j=1}^{n} ||Wj,k,:,:||F) ≤ κl, (2) where κl denotes the desired number of channels at the layer l. Or equivalently, given a predefined pruning rate η ∈ (0, 1) [1, 27], it follows that κl = ⌈ηc⌉., In fact, at each stage we will consider two losses only, i.e., L_S^p and the final loss Lf., Last, the optimization problem for discrimination-aware channel pruning can be formulated as min_W L(W), s.t. ||W||2,0 ≤ κl, (7) where κl < c is the number channels to be selected. In our method, the sparsity of W can be either determined by a pre-defined pruning rate (See Section 3) or automatically adjusted by the stopping conditions in Section 3.5., Given a predefined parameter κl in problem (7), Algorithm 2 will be stopped if ||W||2,0>κl. However, in practice, the parameter κl is hard to be determined. 
Since L is convex, L(Wt) will monotonically decrease with iteration index t in Algorithm 2. We can therefore adopt the following stopping condition: <equation 10> where epsilon is a tolerance value., wherein the model compression method applies various constraint conditions (e.g., equations 2, 3, 5, 6, and 7) to determine a resultant compression for each layer (a compression parameter for a layer that corresponds to a pruning number) in the form of the number of channels that are retained (e.g., the summation of Ω(a), which will be less than kappa, but this can also be in the form of a pruning rate) such that this compression depends on the weighted pruning loss since the number of pruned or selected channels is optimized with respect to that loss, with equations 5 and 6 expressing the hidden layer-specific weighted pruning loss (as previously pointed out), wherein this (pruning) compression parameter/number/rate for each channel is based on an overall compression parameter κl, on a tolerance based on the pruning loss, and on the optimization of the overall final loss L_f as shown in Algorithms 1 and 3 (with respect to which gradient based optimization is performed), and wherein it is noted that the claims as presented only require one of the “based on” conditions to be satisfied.)
However, Zhuang does not explicitly teach … based on the compression parameter being a quantization rate, the loss relationship comprises a relationship between respective candidate quantization rates of each of the hidden layers in the machine learning model and a corresponding weighted quantization loss of each of the hidden layers, and …and a quantization rate, … and the relationship between respective candidate quantization rates of each of the hidden layers in the machine learning model and the corresponding weighted quantization loss of each of the hidden layers. As noted above, Zhuang does not address compression via quantization (even though he acknowledges an intention for incorporating it).
However, Ye, in the analogous environment of deep model compression, teaches based on the compression parameter being a quantization rate, the loss relationship comprises a relationship between respective candidate quantization rates of each of the hidden layers in the machine learning model and a corresponding weighted quantization loss of each of the hidden layers, ([p. 4, “Simplification for the Proposed Framework”, Algorithms 1-3, equations 6-11, 15], After weight pruning, we solve DNN weight clustering or quantization problem. We consider the constraints for weight clustering or quantization on the pruned model (the remaining weights). To solve this problem, we update Wi and bi according to <equation 15>, wherein the compression parameter (i.e., an objective dimension of compression) includes a compression based on quantization (quantization rate) such that the optimization of the constrained loss function (equation 15) associated with the quantization determines a relationship between layer-specific candidate numbers of quantization levels (quantization rate, algorithm 2) and the corresponding values of the loss function (equation 15), which as previously noted has layer-specific weights in the form of the parameter rho_i but also in the form of layer-specific iteration dependent weight clustering values.) and based on the compression parameter comprising a pruning number and a quantization rate, then the loss relationship comprises the relationship between the pruning number of respective … of each of the hidden layers in the machine learning model and the corresponding weighted pruning loss, and the relationship between respective candidate quantization rates of each of the hidden layers in the machine learning model and the corresponding weighted quantization loss of each of the hidden layers.  ([p. 3, “Background of ADMM”, p. 
4, “Simplification for the Proposed Framework”, Algorithms 1-3, equations 6-11, 14, 15], When we combine DNN weight pruning with clustering or quantization, the overall problem is defined by minimize {Wi},{bi} f {Wi} N i=1, {bi} N i=1 , subject to Wi ∈ Si , Wi ∈ S 0 i , i = 1, . . . , N. (3) The set Si reflects the constraint for the weight pruning problem, i.e., Si = {the number of nonzero elements is less than or equal to αi}, where αi is the desired number of weights after pruning in the i-th layer….Both constraints Si and S 0 i need to be satisfied simultaneously in the joint problem of DNN weight pruning and weight clustering/quantization. In this way we can make sure that most of the DNN weights are pruned (set to zero), while the remaining weights are clustered/quantize., In the first step, we only account for the constraints for DNN weight pruning. We update Wi and bi according to  <equation 14>… After weight pruning, we solve DNN weight clustering or quantization problem. We consider the constraints for weight clustering or quantization on the pruned model (the remaining weights). 
To solve this problem, we update Wi and bi according to <equation 15>, wherein the compression parameter (i.e., an objective dimension of compression) includes both a quantization amount (e.g., alpha percent) and a pruning number (e.g., alpha_i number for the ith layer) such that the optimization of the joint constrained loss functions for the respective pruning loss problem (equation 14) and the quantization loss problem (equation 15) determines respective relationships between layer-specific pruning amounts (pruning number – number of weights regularized to 0) and the corresponding values of the pruning loss function (equation 14) and between the candidate numbers of quantization levels (quantization rate, algorithm 2) and the corresponding values of the quantization loss function (equation 15), each of which, as previously noted has layer-specific weights in the form of the parameter rho_i but also in the form of layer-specific iteration dependent weight clustering values.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhuang to incorporate the teachings of Ye for the determining the compression parameter of the hidden layer in the machine learning model, to comprise: determining the compression parameter of the hidden layer, based on a loss relationship and an overall compression target parameter of the machine learning model, wherein, based on the compression parameter being a quantization rate, the loss relationship comprises a relationship between respective candidate quantization rates of each of the hidden layers in the machine learning model and a corresponding weighted quantization loss of each of the hidden layers, and wherein, based on the compression parameter comprising a pruning number and a quantization rate, then the loss relationship comprises the relationship between the pruning number of respective channels of each of the hidden layers in the machine learning model and the corresponding weighted pruning loss, and the relationship between respective candidate quantization rates of each of the hidden layers in the machine learning model and the corresponding weighted quantization loss of each of the hidden layers.  The modification would have been obvious because one of ordinary skill would have been motivated to achieve improved compression rates with nearly no accuracy degradation by implementing a unified framework for both pruning and quantizing in which both pruning and quantization are performed in a coupled constrained optimization process to minimize common weighted or weight-based loss functions (Ye, [Abstract, pp. 8-9, “Conclusion”, Equation 15, Tables 1, 2]).

In regards to claim 13, the rejection of claim 12 is incorporated and Zhuang further teaches wherein the overall compression target parameter comprises at least one of an overall compression rate of the machine learning model or an overall loss of the machine learning model.  ([p. 3, Section 3.3, p. 6, Section 3.5, Algorithm 1, Algorithm 3], In fact, at each stage we will consider two losses only, i.e., L_S^p and the final loss Lf., Given a predefined parameter κl in problem (7), Algorithm 2 will be stopped if ||W||2,0>κl. However, in practice, the parameter κl is hard to be determined. Since L is convex, L(Wt) will monotonically decrease with iteration index t in Algorithm 2. We can therefore adopt the following stopping condition: <equation 10> where epsilon is a tolerance value., wherein the (pruning) compression parameter/number/rate for each channel is based on an overall compression parameter κl (overall compression rate) but also based on an overall pruning loss stopping criterion as shown in equation 10 as well as the overall loss function L_f which is optimized by gradient descent (Algorithm 3).)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhuang to incorporate the teachings of Ye for the same reasons as pointed out for claim 12.

In regards to claim 14, the rejection of claim 13 is incorporated and Zhuang further teaches wherein the determining the compression parameter of the hidden layer, based on the loss relationship and the overall compression target parameter of the machine learning model, comprises any one of the following: calculating a compression parameter of each of the hidden layers that minimizes an overall loss of the machine learning model based on the loss relationship and the overall compression rate of the machine learning model;  and calculating a compression parameter of each of the hidden layers that maximizes the overall compression rate of the machine learning model based on the loss relationship and the overall loss of the machine learning model.  ([p. 3, Section 3.3, p. 6, Section 3.5, Algorithm 1, Algorithm 3, Table 1, Table 2], In fact, at each stage we will consider two losses only, i.e., L_S^p and the final loss Lf., Given a predefined parameter κl in problem (7), Algorithm 2 will be stopped if ||W||2,0>κl. However, in practice, the parameter κl is hard to be determined. Since L is convex, L(Wt) will monotonically decrease with iteration index t in Algorithm 2. 
We can therefore adopt the following stopping condition: <equation 10> where epsilon is a tolerance value., wherein the (pruning) compression parameter/number/rate for each channel is based on an overall (target) compression parameter κl (overall compression rate) as well as the optimization of the (overall over successive iterations) pruning loss for each layer according to equation 10 but also based on the overall loss function L_f which is optimized by gradient descent (Algorithm 3) as well as on the relationship between the weighted pruning loss (equations 5, 6) and the number of pruned channels (candidate or otherwise in the iterative greedy channel selection process) and wherein this optimization framework also maximizes the overall compression rate (i.e., the number of pruned channels is optimized while preserving accuracy) based on both the loss relationship and the overall losses (L_f, L).)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhuang to incorporate the teachings of Ye for the same reasons as pointed out for claim 12.
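For illustration only, the first alternative recited in claim 14 (choosing a per-layer compression parameter that minimizes an overall loss subject to an overall compression target) can be sketched as a small exhaustive search. This is a sketch under stated assumptions and is neither Zhuang's greedy channel selection nor the applicant's method: the function name, the per-layer loss tables, the layer weights, and the target are hypothetical, and the overall loss is assumed to be the sum of per-layer weighted pruning losses.

```python
from itertools import product

def allocate_pruning(loss_tables, layer_weights, target_pruned):
    """Pick a pruning number per layer minimizing the summed weighted loss.

    loss_tables[i][k] is the (hypothetical) pruning loss of layer i when
    k channels are pruned; target_pruned is the minimum total number of
    channels that must be pruned across all layers.
    """
    best_choice, best_loss = None, float("inf")
    # Enumerate every combination of per-layer pruning numbers.
    for choice in product(*(range(len(table)) for table in loss_tables)):
        if sum(choice) < target_pruned:
            continue  # does not meet the overall compression target
        total = sum(w * table[k]
                    for w, table, k in zip(layer_weights, loss_tables, choice))
        if total < best_loss:
            best_choice, best_loss = choice, total
    return best_choice, best_loss

# Two layers, each allowing 0, 1, or 2 pruned channels (made-up numbers).
tables = [[0.0, 0.1, 0.5], [0.0, 0.3, 0.4]]
weights = [1.0, 2.0]
choice, loss = allocate_pruning(tables, weights, target_pruned=2)
print(choice, loss)
```

Exhaustive enumeration is used here only because it makes the objective and the constraint explicit; it scales exponentially with the number of layers and would be replaced by a greedy or dynamic-programming search in practice.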

Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Zhuang in view of Lu et al. (US2019/0251441, filed 13 February 2018), hereinafter referred to as Lu, in view of Lu et al.  (“Rethinking the Smaller-Norm-Less Informative Assumption in Channel Pruning of Convolution Layers”, https://arxiv.org/pdf/1802.00124.pdf, arXiv:1802.00124v2 [cs.LG] 2 Feb 2018, pp. 1-11), hereinafter referred to as Lu2, and in further view of  Wen et al. (“Learning Structured Sparsity in Deep Neural Networks”, https://arxiv.org/pdf/1608.03665.pdf, arXiv:1608.03665v4 [cs.NE] 18 Oct 2016, pp. 1-10), hereinafter referred to as Wen.

In regards to claim 11, the rejection of claim 4 is incorporated and Zhuang does not further teach wherein based on a current hidden layer corresponding one-to-one to a next hidden layer, the weight of the current hidden layer is the same as the weight of the next hidden layer; based on the current hidden layer corresponding to at least two next hidden layers by a multi-out structure, the weight of the current hidden layer is the sum of the weights of the at least two next hidden layers; and based on at least two current hidden layers corresponding to one next hidden layer by a multi-in structure, the weight of each current hidden layer is the weight of the next hidden layer allocated according to a channel proportion of each current hidden layer.  Zhuang discloses a hidden layer weighting that varies from hidden layer-to-hidden layer for a CNN implementation (i.e., a 1-1 next layer configuration). Although Zhuang applies his technique to ResNets (multi-out/branch structure), he does not clearly disclose how the weighting scheme varies with that architecture.
However, Lu, in the analogous environment of pruning deep neural networks, teaches based on the current hidden layer corresponding to at least two next hidden layers by a multi-out structure, the weight of the current hidden layer is the sum of the weights of the at least two next hidden layers; and based on at least two current hidden layers corresponding to one next hidden layer by a multi-in structure, the weight of each current hidden layer is the weight of the next hidden layer allocated according to a channel proportion of each current hidden layer.  ([Abstract, 0033, Figure 6]
The cost metric indicates a resource cost per channel for the channels of the layer. Training the neural network includes, for channels of the layer, updating a channel-scaling coefficient based on the cost metric., The cost metric for a particular layer indicates a computational resource (such as but not limited to memory) cost per channel for the channels included in particular layer …. The cost metric for each convolution layer may be determined via various expressions that indicate a ratio of a measure of the computational cost associated with a particular layer to the number of channels in the layer, prior to pruning of the channels. One exemplary, but non-limiting embodiment is as follows. For the l-th layer, the cost metric (lambda^l) may be determined as follows, where l serves as a layer index for CNN 240:

[Equation image (media_image1.png): cost metric lambda^l for the l-th layer, Lu [0033]]

where … k_w^l · k_h^l is the size of the convolution kernel for the l-th layer; k_w^l’ · k_h^l’ is the size of the convolution kernel of the follow-up, downstream, or subsequent layer (l’); and tau(l) represents the set of follow-up, downstream, or adjacent layers, relative to the l-th layer, in CNN 240 …. Similarly, c^l’ denotes the channel size of follow-up, subsequent, or downstream layers. I_w^l · I_h^l represents the image size of the feature map of the l-th layer., wherein the salience weighting (lambda^l, but in general a computational cost) associated with a given/current layer in the model is based on both multi-in and multi-out branch elements in the neural network architecture (e.g., see Figure 6) such that, for a multi-out branch component, the penalty lambda^l (a cost function/weight/salience associated with the current layer) is a sum over the weights (processing costs) associated with subsequent adjacent layers (indexed by tau(l), as seen in the summation on the RHS of the above equation), but also such that, for a multi-in branch component, the penalty lambda^l (a cost function/weight/salience associated with the current layer, even for a layer which is adjacent to another layer) is also determined according to the computational cost (weight) associated with a subsequent layer (i.e., a layer into which any two current layers feed, such as in the architecture of Figure 6) such that the contribution of the next hidden layer includes a contribution from each of the adjacent layers based on their respective cost metric (i.e., proportional to the cost metric), and wherein a “follow-up” relationship between channels is also being interpreted to correspond to a multi-in architecture in which each adjacent layer contributes proportionately to its respective cost metric.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhuang to incorporate the teachings of Lu for, based on the current hidden layer corresponding to at least two next hidden layers by a multi-out structure, the weight of the current hidden layer to be the sum of the weights of the at least two next hidden layers; and for, based on at least two current hidden layers corresponding to one next hidden layer by a multi-in structure, the weight of each current hidden layer to be the weight of the next hidden layer allocated according to a … proportion of each current hidden layer. The modification would have been obvious because one of ordinary skill would have been motivated to achieve improved compression rates and efficient deployment for deep neural networks while maintaining performance accuracy by optimizing the sparsity/architecture using a layer-specific processing cost penalty weight for each layer which characterizes the computational (e.g., resource, memory) cost of that layer, including the contribution from related adjacent, subsequent, or follow-up layers (Lu, [Abstract, 0016, 0023, 0033]).
However, Zhuang and Lu do not explicitly teach wherein based on a current hidden layer corresponding one-to-one to a next hidden layer, the weight of the current hidden layer is the same as the weight of the next hidden layer; … channel …. In other words, Lu does not clearly disclose that a 1-1 layer-to-layer architecture would have equal weights assigned to each layer (i.e., the cost metrics would need to be the same). In addition, although Lu teaches the multi-in weighting characteristics based on cost metric proportionality and, in the context of the cost metric equation of [0033], indicates an additional parameter c^l’ which corresponds to the channel size of the follow-up, subsequent, or downstream layers, this equation does not indicate how the cost metric contribution for those layers (specifically in tau) depends upon the channel size.

However, Lu2, in the analogous environment of pruning deep neural networks, teaches and based on at least two current hidden layers corresponding to one next hidden layer by a multi-in structure, the weight of each current hidden layer is the weight of the next hidden layer allocated according to a channel proportion of each current hidden layer.  ([p. 4, Section 3, p. 6, Section 4.2, Figure 1] Not every layer contributes equally in a neural net. It is expected that some layers act critically for the performance but only use a small computation and memory budget, while some other layers help marginally for the performance but consume a lot resources., Given a training loss l, a convolutional neural net N , and hyper-parameters ρ, α, µ0, our method proceeds as follows:


[Equation image (media_image2.png): Equation 1 of Lu2 defining lambda^l]

k_w^l · k_h^l is the kernel size of the convolution at layer l. Likewise, k_w^l’ · k_h^l’ is the kernel size of subsequent convolution at layer l’. • T(l) represents the set of the subsequent convolutional layers of layer l • c^(l−1) denotes the channel size of the previous layer, which the l-th convolution operates over; and c^l’ denotes the channel size of one subsequent layer l’.,
wherein the salience weighting (lambda^l in Equation 1, but in general a computational/memory cost) associated with a given/current layer in the model is based on both multi-in and multi-out branch elements in the neural network architecture (e.g., see Figure 1) such that the penalty lambda^l (a cost function/weight/salience associated with the current layer, even for a layer which is adjacent to another layer) is determined according to the computational/memory cost (weight) associated with any layer that contributes to the evaluation of the computational cost at a given layer, such that the contribution to this weighting/computational cost is directly proportional to the channel size.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhuang and Lu to incorporate the teachings of Lu2 for, based on the current hidden layer corresponding to at least two next hidden layers by a multi-out structure, the weight of the current hidden layer to be the sum of the weights of the at least two next hidden layers; and for, based on at least two current hidden layers corresponding to one next hidden layer by a multi-in structure, the weight of each current hidden layer to be the weight of the next hidden layer allocated according to a channel proportion of each current hidden layer. The modification would have been obvious because one of ordinary skill would have been motivated to achieve improved efficiency while maintaining good accuracy in the compression of deep neural networks by optimizing the sparsity/architecture using a layer-specific processing channel size-dependent cost penalty weight for each layer which characterizes the computational (e.g., resource, memory) cost of that layer (Lu2, [Abstract, p. 7, Section 5.1, Table 1, Table 3]).
However, Zhuang, Lu, and Lu2 do not explicitly disclose wherein based on a current hidden layer corresponding one-to-one to a next hidden layer, the weight of the current hidden layer is the same as the weight of the next hidden layer. In other words, Lu2 does not clearly disclose or suggest that a 1-1 layer-to-layer architecture would have equal weights assigned to each layer (i.e., the cost metrics would need to be the same).
However, Wen, in the analogous environment of pruning deep neural networks, teaches wherein based on a current hidden layer corresponding one-to-one to a next hidden layer, the weight of the current hidden layer is the same as the weight of the next hidden layer; ([p. 3, Section 3.1, Equation 2] Then the proposed generic optimization target of a DNN with structured sparsity regularization can be formulated as: <equation 1> Here W represents the collection of all weights in the DNN; E_D(W) is the loss on data; R(·) is non-structured regularization applying on every weight, e.g., ℓ2-norm; and R_g(·) is the structured sparsity regularization on each layer. Because Group Lasso can effectively zero out all weights in some groups [14][15], we adopt it in our SSL. The regularization of group Lasso on a set of weights w can be represented as R_g(w) = Σ_{g=1}^{G} ||w^(g)||_g, where w^(g) is a group of partial weights in w and G is the total number of groups., wherein for a generic structure (including one with a 1-1 layer-to-layer structure – i.e., an architecture generally without multi-branch components), the loss term associated with the pruning of channels for a given (current) layer is based on a weighted regularization component for that layer (including the summation over the norm of weights for each layer) such that this component is weighted by the weight lambda_g (in general, as shown in equation 1) and lambda_c (in particular, as shown in equation 2) – in other words, all of the layers have the weight/penalty term lambda.)
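For illustration only, the group Lasso term quoted above from Wen, R_g(w) = Σ_{g=1}^{G} ||w^(g)||_g, can be sketched in Python as follows. The sketch assumes that each group norm is the ordinary Euclidean (ℓ2) norm and that the groups are supplied as plain lists of numbers; the function name group_lasso and the example values are hypothetical and do not appear in Wen.

```python
import math


def group_lasso(groups):
    """Sum of the Euclidean norms of each weight group (assumed l2 group norm)."""
    return sum(math.sqrt(sum(x * x for x in group)) for group in groups)


# Hypothetical example: three groups, e.g. the weights of three channels.
# A group whose weights are all zero contributes nothing to the penalty,
# which is why minimizing this term tends to zero out whole groups.
example_groups = [[3.0, 4.0], [0.0, 0.0], [5.0, 12.0]]
penalty = group_lasso(example_groups)
print(penalty)
```

In this sketch a single common multiplier (lambda_g in the examiner's reading of Wen) would scale the penalty of every layer equally; that scaling is left out of the function for brevity.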
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhuang, Lu, and Lu2 to incorporate the teachings of Wen for, based on a current hidden layer corresponding one-to-one to a next hidden layer, the weight of the current hidden layer to be the same as the weight of the next hidden layer; for, based on the current hidden layer corresponding to at least two next hidden layers by a multi-out structure, the weight of the current hidden layer to be the sum of the weights of the at least two next hidden layers; and for, based on at least two current hidden layers corresponding to one next hidden layer by a multi-in structure, the weight of each current hidden layer to be the weight of the next hidden layer allocated according to a channel proportion of each current hidden layer. The modification would have been obvious because one of ordinary skill would have been motivated to achieve improved compression rates without accuracy degradation by implementing a structure regularization framework that compresses the model simultaneously in both channel and kernel filter dimensions with a regularization penalty term weighted in common for each layer (Wen, [Abstract, pp. 4-5, Section 3.2, p. 9, Section 5, Tables 1, 3, Figure 7]).
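For illustration only, the three weight-allocation cases recited in claim 11 (one-to-one, multi-out, multi-in) can be sketched in Python as follows. This is a minimal sketch under stated assumptions and is not taken from the application or from any cited reference: the network is assumed to be a directed acyclic graph, the weights of the terminal layers are assumed to be given, and the three cases are assumed to compose by summing, over every next layer, that layer's weight multiplied by the current layer's channel share among all layers feeding it. The names allocate_layer_weights, successors, channels, and terminal_weights are hypothetical.

```python
from collections import defaultdict


def allocate_layer_weights(successors, channels, terminal_weights):
    """Propagate layer weights backwards from the terminal layers.

    successors: dict mapping each non-terminal layer to its next layers.
    channels: dict mapping each layer to its channel count.
    terminal_weights: dict of known weights for layers with no next layer.
    """
    # Reverse map: for each layer, the layers that feed into it.
    predecessors = defaultdict(list)
    for layer, nexts in successors.items():
        for nxt in nexts:
            predecessors[nxt].append(layer)

    weights = dict(terminal_weights)

    def weight_of(layer):
        if layer in weights:
            return weights[layer]
        total = 0.0
        for nxt in successors[layer]:  # multi-out: sum over next layers
            feeders = predecessors[nxt]
            # One-to-one: a single feeder, so the share is 1 (same weight).
            # Multi-in: share is proportional to this layer's channel count.
            share = channels[layer] / sum(channels[p] for p in feeders)
            total += weight_of(nxt) * share
        weights[layer] = total
        return total

    for layer in successors:
        weight_of(layer)
    return weights


# Hypothetical network: a feeds b and c (multi-out); b and c feed d (multi-in);
# d feeds e (one-to-one); e is the terminal layer with weight 1.0.
successors = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"]}
channels = {"a": 16, "b": 8, "c": 24, "d": 32, "e": 32}
weights = allocate_layer_weights(successors, channels, {"e": 1.0})
print(weights["d"], weights["b"], weights["c"], weights["a"])
```

In this hypothetical network, d takes the same weight as e (one-to-one), b and c split the weight of d in the channel ratio 8:24 (multi-in), and a receives the sum of the weights of b and c (multi-out).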

Claim 17 is rejected under 35 U.S.C. 103 as being unpatentable over Zhuang in view of Ioannou et al. (“Deep Roots: Improving CNN Efficiency with Hierarchical Filter Groups”, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1231-1240), hereinafter referred to as Ioannou.

In regards to claim 17, the rejection of claim 1 is incorporated and Zhuang does not further teach wherein, after obtaining the optimized model, the method further comprises: splitting channels of at least one hidden layer of the optimized model into at least two groups of sub-channels respectively, and determining a network parameter of each group of the at least two groups of sub-channels, wherein each group of the at least two groups of sub-channels comprises corresponding input channels and output channels after grouping; and adding a combination layer to obtain a group compressed model, wherein the input of the combination layer is connected to the output channel of each group of the at least two sub-channels.  Zhuang does not disclose the splitting of channels into sub-channels.
However, Ioannou, in the analogous environment of deep model compression, teaches wherein, after obtaining the optimized model, the method further comprises: splitting channels of at least one hidden layer of the optimized model into at least two groups of sub-channels respectively, and determining a network parameter of each group of the at least two groups of sub-channels, wherein each group of the at least two groups of sub-channels comprises corresponding input channels and output channels after grouping; ([Figure 1, p. 1232, Section 2, p. 1233, Section 3, Table 6], Figure 1 However, (b) with filter grouping, g independent groups of c2/g filters operate on a fraction c1/g of the input feature map channels, reducing filter dimensions from h×w×c1 to h×w×c1/g. This change does not affect the dimensions of the input and output feature maps but significantly reduces computational complexity and the number of model parameters…. A root module has a given number of filter groups, the more filter groups, the fewer the number of connections to the previous layer’s outputs., However, we will show that even such an efficient and optimized network architecture benefits from our method., We use filter groups (see Fig. 1) to force the network to learn filters with only limited dependence on previous layers. Each of the filters in the filter groups is smaller in the channel extent, since it operates on only a subset of the channels of the input feature map., wherein a deep model compression framework divides/splits the channel filters at a layer in the (CNN) neural network (including an optimized model such as GoogLeNet) into a set of 2 or more groups (to form a root module as shown in Figures 1 and 4), with the training of that network learning the model parameters (network parameters) such as weights based on this connectivity between the channel filter groups and (corresponding) input channels, which are also split into g groups (e.g., as shown in Figure 1b).)
and adding a combination layer to obtain a group compressed model, wherein the input of the combination layer is connected to the output channel of each group of the at least two sub-channels.  ([p. 1233, Section 3, Figures 1, 4, 7], Each spatial convolutional layer is followed by a low-dimensional embedding (1×1 convolution). Like in [9], this configuration learns a linear combination of the basis filters (filter groups), implicitly representing a filter of full channel depth, but with limited filter dependence., wherein, as shown in Figures 4 and 7, a 1x1 convolutional layer performs an embedding of the set of group filter responses (group of split channel responses) to combine those responses such that this results in learning a linear combination of respective (channel) group basis filters that compresses the neural network model by virtue of reducing channel connections.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhuang to incorporate the teachings of Ioannou for, after obtaining the optimized model, the method further to comprise: splitting channels of at least one hidden layer of the optimized model into at least two groups of sub-channels respectively, and determining a network parameter of each group of the at least two groups of sub-channels, wherein each group of the at least two groups of sub-channels comprises corresponding input channels and output channels after grouping; and adding a combination layer to obtain a group compressed model, wherein the input of the combination layer is connected to the output channel of each group of the at least two sub-channels. The modification would have been obvious because one of ordinary skill would have been motivated to achieve improved optimized compression rates with reduced computational costs by implementing a compression/parameter pruning framework by imposing filter groups on the model connectivity architecture, particularly when the number of filter channels is large  (Ioannou, [Abstract, p. 1231, Section 2,  p. 1238, Section 7, Equation 15, Tables 4-6]).
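For illustration only, the parameter reduction described in the passage quoted from Ioannou (filter dimensions reduced from h×w×c1 to h×w×c1/g for each of the c2 filters) can be worked through with a short Python calculation. This is a sketch under stated assumptions: bias terms are ignored, the channel counts are assumed to divide evenly by the number of groups, and the 1×1 combination layer is assumed to keep the channel count unchanged. The function name conv_params and the example layer sizes are hypothetical and are not taken from Ioannou or from the application.

```python
def conv_params(h, w, c_in, c_out, groups=1):
    """Weight count of a convolution whose filters each see c_in/groups channels."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    return h * w * (c_in // groups) * c_out


# Hypothetical layer: 3x3 kernels, 64 input channels, 128 output channels.
standard = conv_params(3, 3, 64, 128)            # no grouping
grouped = conv_params(3, 3, 64, 128, groups=4)   # four groups of sub-channels
combine = conv_params(1, 1, 128, 128)            # assumed 1x1 combination layer

print(standard, grouped, grouped + combine)
```

With these hypothetical sizes, the grouped layer holds one quarter of the weights of the ungrouped layer, and the grouped layer plus the 1×1 combination layer together still hold fewer weights than the ungrouped layer alone.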


Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Luo et al. (“ThiNet: A Filter Level Pruning Method for Deep Neural Network Compression”, https://arxiv.org/pdf/1707.06342.pdf, arXiv:1707.06342v1 [cs.CV] 20 Jul 2017, pp. 1-9) teach an incremental process for greedily selecting the set of pruned channels by iteratively increasing/incrementing the number of pruned channels based on a pruning loss function.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to ROBERT LEWIS KULP whose telephone number is (571)272-7983. The examiner can normally be reached M, Th, F 8-5:30; Tu 8-3.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang, can be reached on 571-270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/ROBERT LEWIS KULP/Examiner, Art Unit 2124
/MIRANDA M HUANG/Supervisory Patent Examiner, Art Unit 2124