DETAILED ACTION
This Office action is in response to the Application No. 15895795 filed on 08/29/2022. Claims 1-20 are presented for examination and are currently pending. Applicant’s arguments have been carefully and respectfully considered.

Response to Arguments
2.	The claim amendments filed on 08/29/2022 have overcome the 35 U.S.C. 101 rejections of 07/07/2022, and therefore the rejections are withdrawn.
	The claim amendments filed on 08/29/2022 have overcome the 35 U.S.C. 112(f) rejections of 07/07/2022, and therefore the rejections are withdrawn.
	Applicant’s arguments are moot in view of the new grounds of rejection. The Examiner is withdrawing the rejections of the previous Office action of 07/07/2022 because Applicant’s amendments necessitated the new grounds of rejection presented in this Office action. Accordingly, this action is made final.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.



3.	Claims 1-3, 8-10, 15 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Yan et al. (US20200234130 filed on 8/18/2017) in view of Howard et al. (US20180137406 filed on 09/18/2017).

	Regarding claim 1, Yan teaches a computer storage medium having instructions stored thereon for providing a neural network, which, when executed by a processor of a computing device cause the computing device to perform actions (In some embodiments, the one or more processors 102 each include one or more processor cores 107 to process instructions which, when executed, perform operations for system and user software. In some embodiments, each of the one or more processor cores 107 is configured to process a specific instruction set 109 [0043]; FIG. 20 illustrates a computing device 2000 hosting a neural network slimming mechanism (“slimming mechanism”) 2010 according to one embodiment [0204]) comprising:
	training the neural network including updating a channel-scaling coefficient for each channel of the plurality of channels based on the computation value for the first layer, (In one embodiment, pruning logic 2109 may then be used to prune the channels for each layer with a scaling factor near or at zero to obtain a pruned narrow CNN. Further, in one embodiment, training/fine-tuning logic 2111 may then be triggered to train or fine-tune the narrow CNN and, if necessitated, continue to repeat one or more the above operations [0232])
	wherein the channel-scaling coefficient for each channel linearly scales an output of each of the plurality of channels; (In one embodiment, as illustrated, scaling factors associated with channels Ci2 2208 and Ci4 2209 in conv-layer 1A 2203 are computed to be 0.001 and 0.003, respectively, which are regarded as near zero and thus, as further described with reference to FIG. 21, any channels nearing a zero scaling factor are removed from the convolution layer, such as channels Ci2 2208 and Ci4 2209 are removed from conv-layer 1A 2203 that then results in a more compact and slim conv-layer 1B 2213, while producing conv-layer 2B 2217 that is the same as conv-layer 2A 2207 [0246]; Further, channel scaling factors A 2205, B 2215 may be computed, assigned, referenced, and/or used in terms of batch normalization, such as … [0247], Fig. 22; The output from the convolution stage 1616 defines a set of linear activations that are processed by successive stages of the convolutional layer 1614 [0182])
	identifying a constant channel of the plurality of channels based on the updated channel-scaling coefficient for the constant channel; (In one embodiment, block 2355, based on the results or learned data of block 2353, addition/computation logic 2103 of FIG. 21 is triggered to compute and, in some embodiments, even predict a channel scaling factor for each channel based on channel sparsity of the channels as revealed from or identified in the results/learned data of block 2353. At block 2357, as facilitated by pruning logic 2109 of FIG. 21, any channels having associated a low channel scaling factor, such as zero or near zero or any other predetermined number, may be regarded as of low importance or significance to the wide network and/or the machine/deep learning procedures [0257])
	updating the trained neural network by removing the constant channel from the first layer, such that the updated neural network is a channel-pruned neural network. (In one embodiment, as illustrated, scaling factors associated with channels Ci2 2208 and Ci4 2209 in conv-layer 1A 2203 are computed to be 0.001 and 0.003, respectively, which are regarded as near zero and thus, as further described with reference to FIG. 21, any channels nearing a zero scaling factor are removed from the convolution layer, such as channels Ci2 2208 and Ci4 2209 are removed from conv-layer 1A 2203 that then results in a more compact and slim conv-layer 1B 2213, while producing conv-layer 2B 2217 that is the same as conv-layer 2A 2207 [0246], Fig. 22A)
	Yan does not explicitly teach determining a computation value for a first layer of the neural network, wherein the determination of the computation value for the first layer includes quantifying a computational resource cost for each channel included in the first layer;
	Howard teaches determining a computation value for a first layer of the neural network, (The computational cost for the core layers of an example network can be expressed as depthwise separable convolutions [0111])
	wherein the determination of the computation value for the first layer includes quantifying a computational resource cost for each channel included in the first layer; (Depthwise convolutions can be used to apply a single filter per each input channel [0091]; Depthwise convolution has a computational cost of: D_K · D_K · M · D_F · D_F (Equation 2) [0093])
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Yan to incorporate the teachings of Howard for the benefit of reducing computational costs associated with convolutional neural networks (Howard, abstract).
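For illustration only, the cost expression cited from Howard [0093] can be sketched as below. The sketch assumes the conventional reading of the symbols in Equation 2 (D_K as the kernel width, M as the number of input channels, D_F as the feature-map width), which this action does not itself define; the function names and the example layer dimensions are hypothetical.

```python
def per_channel_cost(kernel_size: int, feature_size: int) -> int:
    """Cost of applying one depthwise filter to one input channel: D_K * D_K * D_F * D_F."""
    return kernel_size * kernel_size * feature_size * feature_size


def depthwise_conv_cost(kernel_size: int, in_channels: int, feature_size: int) -> int:
    """Layer cost D_K * D_K * M * D_F * D_F, i.e. the per-channel cost summed over M channels."""
    return in_channels * per_channel_cost(kernel_size, feature_size)


# Hypothetical layer: 3x3 kernel, 32 input channels, 112x112 feature map.
channel_cost = per_channel_cost(3, 112)
layer_cost = depthwise_conv_cost(3, 32, 112)
print(channel_cost, layer_cost)
```

Because a depthwise convolution applies a single filter per input channel, the layer cost in this sketch factors into M equal per-channel terms.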

	Regarding claim 2, Modified Yan teaches the computer storage medium of claim 1, Yan teaches the actions further comprising: scaling the channel-scaling coefficient for each of the plurality of channels of the first layer based on a hyper-parameter; (Upon imposing sparse constraint to each channel scaling factor, s, the loss function may be re-defined as:
		
    [Equation image: re-defined loss function with sparsity function g(s) and tradeoff parameter λ]

where g ( ) refers to a function encourage scaling factor, s, close to zero, where a sparse function may be L-norm, such as g(s)=|s|, where λ controls the tradeoff between empirical loss and sparsity of s [0230-0231], “Examiner notes: λ is the hyper-parameter.”; As illustrated, here in neural network 2250, convolution layer 1 (conv-layer1) 2251 is shown as having a number of channels, such as C11 2261, C12 2263, C13 2265, C14 2267, corresponding to scale layer 2253 having a number of corresponding channel scaling factors, such as S11 2271, S12 2273, S13 2275, S14 2277, respectively, and a resulting layer, such as convolution layer 2 (conv-layer2) 2255 having the i-th channel 2281. As further described above, particularly with reference to FIGS. 21-22A, channel scaling factors 2271-2277 of scale layer 2253 are added to or associated with channels 2261-2267 of conv-layer 1 2251 to measure and indicate the importance of each channel 2261-2267 and that whether during training or fine-tuning procedures, sparse constraints can be imposed on certain channels 2261-2267 based on their assigned channel scaling factors 2271-2277 [0250])
	scaling each of a plurality of model weights associated with a second layer of the neural network that is subsequent to the first layer based on the hyper-parameter; (In one embodiment, addition/computation logic 2103 may then be triggered to add one or more scale-parameters to each output channel (such as in terms of scale layer), and sparsely impose these scalar values [0226] “Examiner notes: parameter is interpreted as weights”)
	training the neural network based on the scaled channel-scaling coefficients of the first layer and the scaled models weights of the second layer; (The learning and analysis of the sparse scalar values by learning/analyzing logic 2105 allows for training/fine-tuning logic 2111 to perform one or more training procedures to produce sparse scale-values for each of the output channels, while pruning logic 2109 is used to remove any or all of the channels having scale-value near or at zero and subsequently, obtain narrowed neural networks for those such channels [0226])
	re-scaling the channel-scaling coefficient for each of the plurality of channels of the first layer (In one embodiment, this wide network 2201 hosts a long first column of convolution layers 1A (“conv-layer 1A”) 2203 having correspondingly assigned channel scaling factors A 2205 as part of a scale layer, resulting in a shorter second column of convolution layers 2A (“conv-layer 2A”) 2207. [0242]; Further, in one embodiment, training/fine-tuning logic 2111 may then be triggered to train or fine-tune the narrow CNN and, if necessitated, continue to repeat one or more the above operations [0232])
	based on the hyper-parameter; (Upon imposing sparse constraint to each channel scaling factor, s, the loss function may be re-defined as:

    [Equation image: re-defined loss function with sparsity function g(s) and tradeoff parameter λ]

where g () refers to a function encourage scaling factor, s, close to zero, where a sparse function may be L-norm, such as g(s)=|s|, where λ controls the tradeoff between empirical loss and sparsity of s [0230-0231], “Examiner notes: λ is the hyper-parameter.”) and 
	re-scaling each of the plurality of model weights associated with the second layer based on the hyper-parameter (For example, as illustrated, slim network 2211 contains a far shorter first column of convolution layers 1B (“conv-layer 1B”) 2213, corresponding to conv-layer 1A 2203, through and using channel scaling factors B 2215, resulting a second column of convolution layers 2B (“conv-layer 2B”) 2217 that directly corresponds to conv-layer 2A 2207 [0243]; Further, in one embodiment, training/fine-tuning logic 2111 may then be triggered to train or fine-tune the narrow CNN and, if necessitated, continue to repeat one or more the above operations [0232])
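As a hedged illustration of the role λ plays in the passages quoted from Yan [0230-0231], the sketch below assumes the re-defined loss is the empirical loss plus λ times the sum of g(s) over the channel scaling factors, with g(s)=|s|. The equation itself appears only as an image in this action, so that form is an assumption; the function names and numeric values are hypothetical.

```python
def sparsity_penalty(scaling_factors):
    """Sum of g(s) = |s| over all channel scaling factors."""
    return sum(abs(s) for s in scaling_factors)


def regularized_loss(empirical_loss, scaling_factors, lam):
    """Assumed form: empirical loss plus lam times the sparsity penalty."""
    return empirical_loss + lam * sparsity_penalty(scaling_factors)


# Hypothetical values: four scaling factors and a tradeoff parameter of 0.1.
factors = [0.9, 0.001, -0.4, 0.003]
loss = regularized_loss(0.5, factors, lam=0.1)
print(loss)
```

A larger λ in this sketch weights the penalty more heavily, which pushes more scaling factors toward zero at the expense of the empirical loss.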

	Regarding claim 3, Modified Yan teaches the computer storage medium of claim 1, Yan teaches wherein the first layer is a batched-normalized convolution layer of the neural network (For example, in one embodiment, addition/computation logic 2103 may be used to input a wide network structure, such as a wide CNN network structure, and adding a scale layer (such as in terms of batch normalization) to input the wide CNN. In one embodiment, learning/analyzing logic 2105 may then be used to learn the wide CNN with sparse channel constraint loss as determined from the scale layer. In one embodiment, pruning logic 2109 may then be used to prune the channels for each layer with a scaling factor near or at zero to obtain a pruned narrow CNN [0232])
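To illustrate how a batch-normalization scale can serve as a per-channel scaling factor, the sketch below applies the standard batch-normalization transform to the values of a single channel. It is not taken from Yan; the function name, the parameter names gamma and beta, and the sample values are assumptions made for illustration.

```python
import math


def batch_norm_channel(values, gamma, beta, eps=1e-5):
    """Normalize one channel's values, then scale linearly by gamma and shift by beta."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


samples = [1.0, 2.0, 3.0]
unit_scale = batch_norm_channel(samples, gamma=1.0, beta=0.0)
double_scale = batch_norm_channel(samples, gamma=2.0, beta=0.0)
zero_scale = batch_norm_channel(samples, gamma=0.0, beta=0.25)
print(unit_scale, double_scale, zero_scale)
```

In this sketch the output varies linearly with gamma, and a gamma of zero leaves the channel emitting the same value (beta) for every input.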

	Regarding claim 8, Yan teaches a method for providing a neural network, (FIG. 20 illustrates a computing device 2000 hosting a neural network slimming mechanism (“slimming mechanism”) 2010 according to one embodiment [0204]) comprising:
	training the neural network, wherein the steps for training the neural network includes updating a channel-scaling coefficient for each channel of the plurality of channels; (In one embodiment, pruning logic 2109 may then be used to prune the channels for each layer with a scaling factor near or at zero to obtain a pruned narrow CNN. Further, in one embodiment, training/fine-tuning logic 2111 may then be triggered to train or fine-tune the narrow CNN and, if necessitated, continue to repeat one or more the above operations [0232])
	identifying a constant channel of the plurality of channels; (In one embodiment, block 2355, based on the results or learned data of block 2353, addition/computation logic 2103 of FIG. 21 is triggered to compute and, in some embodiments, even predict a channel scaling factor for each channel based on channel sparsity of the channels as revealed from or identified in the results/learned data of block 2353. At block 2357, as facilitated by pruning logic 2109 of FIG. 21, any channels having associated a low channel scaling factor, such as zero or near zero or any other predetermined number, may be regarded as of low importance or significance to the wide network and/or the machine/deep learning procedures [0257]) and
	updating the trained neural network by removing the constant channel from the first layer. (In one embodiment, as illustrated, scaling factors associated with channels Ci2 2208 and Ci4 2209 in conv-layer 1A 2203 are computed to be 0.001 and 0.003, respectively, which are regarded as near zero and thus, as further described with reference to FIG. 21, any channels nearing a zero scaling factor are removed from the convolution layer, such as channels Ci2 2208 and Ci4 2209 are removed from conv-layer 1A 2203 that then results in a more compact and slim conv-layer 1B 2213, while producing conv-layer 2B 2217 that is the same as conv-layer 2A 2207 [0246], Fig. 22A)
	Yan does not explicitly teach determining a computational cost for a first layer of the neural network; wherein the determination of the computation cost for the first layer includes quantifying a computational resource cost for each channel included in the first layer;
	Howard teaches determining a computational cost for a first layer of the neural network, (The computational cost for the core layers of an example network can be expressed as depthwise separable convolutions [0111])
	wherein the determination of the computation cost for the first layer includes quantifying a computational resource cost for each channel included in the first layer; (Depthwise convolutions can be used to apply a single filter per each input channel [0091]; Depthwise convolution has a computational cost of: D_K · D_K · M · D_F · D_F (Equation 2) [0093])
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Yan to incorporate the teachings of Howard for the benefit of reducing computational costs associated with convolutional neural networks (Howard, abstract).
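The channel removal described in the passage cited from Yan [0246] can be sketched as follows. This is a minimal illustration rather than Yan's implementation: the threshold of 0.01, the function names, the filter values, and the two larger scaling factors are all hypothetical, while 0.001 and 0.003 are the near-zero factors quoted above.

```python
def select_channels(scaling_factors, threshold=0.01):
    """Return indices of channels whose scaling factor is not near zero."""
    return [i for i, s in enumerate(scaling_factors) if abs(s) > threshold]


def prune_layer(filters, scaling_factors, threshold=0.01):
    """Drop the filters (one per output channel) whose scaling factor is near zero."""
    kept = select_channels(scaling_factors, threshold)
    return [filters[i] for i in kept], [scaling_factors[i] for i in kept]


# Four channels; the second and fourth carry the near-zero factors 0.001 and 0.003.
factors = [0.82, 0.001, 0.47, 0.003]
filters = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
pruned_filters, pruned_factors = prune_layer(filters, factors)
print(pruned_filters, pruned_factors)
```

The pruned layer keeps only the first and third channels, which corresponds to the narrower layer the cited passage describes.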

	Regarding claim 9, Modified Yan teaches the method of claim 8, Yan teaches the method further comprising: scaling the channel-scaling coefficient for each of the plurality of channels of the first layer based on a hyper-parameter; (Upon imposing sparse constraint to each channel scaling factor, s, the loss function may be re-defined as:

    [Equation image: re-defined loss function with sparsity function g(s) and tradeoff parameter λ]

where g ( ) refers to a function encourage scaling factor, s, close to zero, where a sparse function may be L-norm, such as g(s)=|s|, where λ controls the tradeoff between empirical loss and sparsity of s [0230-0231], “Examiner notes: λ is the hyper-parameter.”; As illustrated, here in neural network 2250, convolution layer 1 (conv-layer1) 2251 is shown as having a number of channels, such as C11 2261, C12 2263, C13 2265, C14 2267, corresponding to scale layer 2253 having a number of corresponding channel scaling factors, such as S11 2271, S12 2273, S13 2275, S14 2277, respectively, and a resulting layer, such as convolution layer 2 (conv-layer2) 2255 having the i-th channel 2281. As further described above, particularly with reference to FIGS. 21-22A, channel scaling factors 2271-2277 of scale layer 2253 are added to or associated with channels 2261-2267 of conv-layer 1 2251 to measure and indicate the importance of each channel 2261-2267 and that whether during training or fine-tuning procedures, sparse constraints can be imposed on certain channels 2261-2267 based on their assigned channel scaling factors 2271-2277 [0250])
	scaling each of a plurality of model weights associated with a second layer of the neural network that is subsequent to the first layer based on the hyper-parameter; (In one embodiment, addition/computation logic 2103 may then be triggered to add one or more scale-parameters to each output channel (such as in terms of scale layer), and sparsely impose these scalar values [0226] “Examiner notes: parameter is interpreted as weights”)
	training the neural network based on the scaled channel-scaling coefficients of the first layer and the scaled models weights of the second layer; (The learning and analysis of the sparse scalar values by learning/analyzing logic 2105 allows for training/fine-tuning logic 2111 to perform one or more training procedures to produce sparse scale-values for each of the output channels, while pruning logic 2109 is used to remove any or all of the channels having scale-value near or at zero and subsequently, obtain narrowed neural networks for those such channels [0226])
	re-scaling the channel-scaling coefficient for each of the plurality of channels of the first layer (In one embodiment, this wide network 2201 hosts a long first column of convolution layers 1A (“conv-layer 1A”) 2203 having correspondingly assigned channel scaling factors A 2205 as part of a scale layer, resulting in a shorter second column of convolution layers 2A (“conv-layer 2A”) 2207. [0242]; Further, in one embodiment, training/fine-tuning logic 2111 may then be triggered to train or fine-tune the narrow CNN and, if necessitated, continue to repeat one or more the above operations [0232])
	based on the hyper-parameter; (Upon imposing sparse constraint to each channel scaling factor, s, the loss function may be re-defined as:

    [Equation image: re-defined loss function with sparsity function g(s) and tradeoff parameter λ]

where g ( ) refers to a function encourage scaling factor, s, close to zero, where a sparse function may be L-norm, such as g(s)=|s|, where λ controls the tradeoff between empirical loss and sparsity of s [0230-0231], “Examiner notes: λ is the hyper-parameter.”) and
	re-scaling each of the plurality of model weights associated with the second layer based on the hyper-parameter. (For example, as illustrated, slim network 2211 contains a far shorter first column of convolution layers 1B (“conv-layer 1B”) 2213, corresponding to conv-layer 1A 2203, through and using channel scaling factors B 2215, resulting a second column of convolution layers 2B (“conv-layer 2B”) 2217 that directly corresponds to conv-layer 2A 2207 [0243]; Further, in one embodiment, training/fine-tuning logic 2111 may then be triggered to train or fine-tune the narrow CNN and, if necessitated, continue to repeat one or more the above operations [0232])

	Regarding claim 10, Modified Yan teaches the method of claim 8, Yan teaches wherein the first layer is a batched-normalized convolution layer of the neural network. (For example, in one embodiment, addition/computation logic 2103 may be used to input a wide network structure, such as a wide CNN network structure, and adding a scale layer (such as in terms of batch normalization) to input the wide CNN. In one embodiment, learning/analyzing logic 2105 may then be used to learn the wide CNN with sparse channel constraint loss as determined from the scale layer. In one embodiment, pruning logic 2109 may then be used to prune the channels for each layer with a scaling factor near or at zero to obtain a pruned narrow CNN [0232])

	Regarding claim 15, Yan teaches a computing system, comprising: a processor device; and a computer-readable storage medium, coupled with the processor device, having instructions stored thereon, which, when executed by the processor device, provide the system with a training engine configured to train a neural network by performing actions (In some embodiments, the one or more processors 102 each include one or more processor cores 107 to process instructions which, when executed, perform operations for system and user software. In some embodiments, each of the one or more processor cores 107 is configured to process a specific instruction set 109 [0043]; FIG. 20 illustrates a computing device 2000 hosting a neural network slimming mechanism (“slimming mechanism”) 2010 according to one embodiment [0204]) comprising:
	training the neural network, wherein training the neural network includes updating a channel-scaling coefficient for each channel of the plurality of channels; (In one embodiment, pruning logic 2109 may then be used to prune the channels for each layer with a scaling factor near or at zero to obtain a pruned narrow CNN. Further, in one embodiment, training/fine-tuning logic 2111 may then be triggered to train or fine-tune the narrow CNN and, if necessitated, continue to repeat one or more the above operations [0232])
	identifying a constant channel of the plurality of channels based on the updated channel-scaling coefficient for the constant channel; (In one embodiment, block 2355, based on the results or learned data of block 2353, addition/computation logic 2103 of FIG. 21 is triggered to compute and, in some embodiments, even predict a channel scaling factor for each channel based on channel sparsity of the channels as revealed from or identified in the results/learned data of block 2353. At block 2357, as facilitated by pruning logic 2109 of FIG. 21, any channels having associated a low channel scaling factor, such as zero or near zero or any other predetermined number, may be regarded as of low importance or significance to the wide network and/or the machine/deep learning procedures [0257]) and
	updating the trained neural network by removing the constant channel from the first layer, such that the updated neural network is a channel-pruned neural network. (In one embodiment, as illustrated, scaling factors associated with channels Ci2 2208 and Ci4 2209 in conv-layer 1A 2203 are computed to be 0.001 and 0.003, respectively, which are regarded as near zero and thus, as further described with reference to FIG. 21, any channels nearing a zero scaling factor are removed from the convolution layer, such as channels Ci2 2208 and Ci4 2209 are removed from conv-layer 1A 2203 that then results in a more compact and slim conv-layer 1B 2213, while producing conv-layer 2B 2217 that is the same as conv-layer 2A 2207 [0246], Fig. 22A)
	Yan does not explicitly teach determining a cost metric for a first layer of the neural network; wherein the determination of the cost metric for the first layer includes quantifying a computational resource cost for each channel included in the first layer; based on the cost metric of the first layer;
	Howard teaches determining a cost metric for a first layer of the neural network; (The computational cost for the core layers of an example network can be expressed as depthwise separable convolutions [0111])
	wherein the determination of the cost metric for the first layer includes quantifying a computational resource cost for each channel included in the first layer; (Depthwise convolutions can be used to apply a single filter per each input channel [0091]; Depthwise convolution has a computational cost of: D_K · D_K · M · D_F · D_F (Equation 2) [0093])
	based on the cost metric of the first layer; (Depthwise convolutions can be used to apply a single filter per each input channel [0091]; Depthwise convolution has a computational cost of: D_K · D_K · M · D_F · D_F (Equation 2) [0093])
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Yan to incorporate the teachings of Howard for the benefit of reducing computational costs associated with convolutional neural networks (Howard, abstract).

	Regarding claim 16, Modified Yan teaches the computing system of claim 15, Yan teaches the actions further comprising: scaling the channel-scaling coefficient for each of the plurality of channels of the first layer based on a hyper-parameter; (Upon imposing sparse constraint to each channel scaling factor, s, the loss function may be re-defined as:

    [Equation image: re-defined loss function with sparsity function g(s) and tradeoff parameter λ]

where g ( ) refers to a function encourage scaling factor, s, close to zero, where a sparse function may be L-norm, such as g(s)=|s|, where λ controls the tradeoff between empirical loss and sparsity of s [0230-0231], “Examiner notes: λ is the hyper-parameter.”; As illustrated, here in neural network 2250, convolution layer 1 (conv-layer1) 2251 is shown as having a number of channels, such as C11 2261, C12 2263, C13 2265, C14 2267, corresponding to scale layer 2253 having a number of corresponding channel scaling factors, such as S11 2271, S12 2273, S13 2275, S14 2277, respectively, and a resulting layer, such as convolution layer 2 (conv-layer2) 2255 having the i-th channel 2281. As further described above, particularly with reference to FIGS. 21-22A, channel scaling factors 2271-2277 of scale layer 2253 are added to or associated with channels 2261-2267 of conv-layer 1 2251 to measure and indicate the importance of each channel 2261-2267 and that whether during training or fine-tuning procedures, sparse constraints can be imposed on certain channels 2261-2267 based on their assigned channel scaling factors 2271-2277 [0250])
	scaling each of a plurality of model weights associated with a second layer of the neural network that is subsequent to the first layer based on the hyper-parameter; (In one embodiment, addition/computation logic 2103 may then be triggered to add one or more scale-parameters to each output channel (such as in terms of scale layer), and sparsely impose these scalar values [0226] “Examiner notes: parameter is interpreted as weights”)
	training the neural network based on the scaled channel-scaling coefficients of the first layer and the scaled models weights of the second layer; (The learning and analysis of the sparse scalar values by learning/analyzing logic 2105 allows for training/fine-tuning logic 2111 to perform one or more training procedures to produce sparse scale-values for each of the output channels, while pruning logic 2109 is used to remove any or all of the channels having scale-value near or at zero and subsequently, obtain narrowed neural networks for those such channels [0226])
	re-scaling the channel-scaling coefficient for each of the plurality of channels of the first layer (In one embodiment, this wide network 2201 hosts a long first column of convolution layers 1A (“conv-layer 1A”) 2203 having correspondingly assigned channel scaling factors A 2205 as part of a scale layer, resulting in a shorter second column of convolution layers 2A (“conv-layer 2A”) 2207. [0242]; Further, in one embodiment, training/fine-tuning logic 2111 may then be triggered to train or fine-tune the narrow CNN and, if necessitated, continue to repeat one or more the above operations [0232])
	based on the hyper-parameter; (Upon imposing sparse constraint to each channel scaling factor, s, the loss function may be re-defined as:

    [Equation image: re-defined loss function with sparsity function g(s) and tradeoff parameter λ]

where g ( ) refers to a function encourage scaling factor, s, close to zero, where a sparse function may be L-norm, such as g(s)=|s|, where λ controls the tradeoff between empirical loss and sparsity of s [0230-0231], “Examiner notes: λ is the hyper-parameter.”) and 
	re-scaling each of the plurality of model weights associated with the second layer based on the hyper-parameter (For example, as illustrated, slim network 2211 contains a far shorter first column of convolution layers 1B (“conv-layer 1B”) 2213, corresponding to conv-layer 1A 2203, through and using channel scaling factors B 2215, resulting a second column of convolution layers 2B (“conv-layer 2B”) 2217 that directly corresponds to conv-layer 2A 2207 [0243]; Further, in one embodiment, training/fine-tuning logic 2111 may then be triggered to train or fine-tune the narrow CNN and, if necessitated, continue to repeat one or more the above operations [0232])

4.	Claims 4, 6, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Yan et al. (US20200234130 filed on 8/18/2017) in view of Howard et al. (US20180137406 filed on 09/18/2017) and further in view of Martin (US10185891 filed on 7/8/2016).

	Regarding claim 4, Modified Yan teaches the computer storage medium of claim 1, Yan teaches wherein the actions further comprise: absorbing the constant channel of the first layer into a second layer of the neural network that is a convolution layer and subsequent to the first layer, (computation/addition logic to compute a plurality of scaling factors to be associated with the plurality of channels such that each channel is assigned a scaling factor, wherein each scaling factor to indicate relevance of a corresponding channel within the first neural network; and pruning logic to prune the first neural network into a second neural network by removing one or more channels of the plurality of channels having low relevance as indicated by one or more scaling factors of the plurality of scaling factors assigned to the one or more channels [0264])
	Modified Yan does not explicitly teach wherein absorbing the constant channel is based on whether the second layer is batch normalized.
	Martin teaches wherein absorbing the constant channel is based on whether the second layer is batch normalized (Second intermediate normalization layer may normalize one or more sets of concatenated feature maps. For example, inter-A norm layer B 417 may normalize one or more sets of concatenated feature maps from inter-A concat layer 416 to produce one or more normalized sets of concatenated feature maps. Inter-A norm layer B 417 may perform normalization using one or more batch normalizing transforms and/or other transforms, col 8, lines 60-67)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Yan to incorporate the teachings of Martin for the benefit of producing one or more normalized sets of concatenated feature maps (Martin, col 8, lines 64-65).

	Regarding claim 6, Modified Yan teaches the computer storage medium of claim 1, Yan teaches wherein the updated neural network is a channel-pruned neural network (In one embodiment, as illustrated, scaling factors associated with channels Ci2 2208 and Ci4 2209 in conv-layer 1A 2203 are computed to be 0.001 and 0.003, respectively, which are regarded as near zero and thus, as further described with reference to FIG. 21, any channels nearing a zero scaling factor are removed from the convolution layer, such as channels Ci2 2208 and Ci4 2209 are removed from conv-layer 1A 2203 that then results in a more compact and slim conv-layer 1B 2213, while producing conv-layer 2B 2217 that is the same as conv-layer 2A 2207 [0246], Fig. 22A; At block 2305, in one embodiment, using the results of channel sparsity regularization process of block 2303, any channels in a convolution layer of the wide network that are small in scaling factor are detected and then removed or pruned from the wide network [0254])
	re-training the channel-pruned neural network based on stochastic gradient descent (SGD) of a training loss function (The network can then learn from those errors using an algorithm, such as the stochastic gradient descent algorithm, to update the weights of the neural network [0178])
	Modified Yan does not explicitly teach in response to padding in the first layer.
	Martin teaches in response to padding in the first layer (pre-padding layer 403 may increase the dimensionality of one or more image maps (e.g., from 96 height and 96 width to 102 height and 102 width, etc.) by padding the borders of the image maps with zero values. Padding the borders of the image maps with zero values may allow for the compact convolutional neural network to control the dimensions of outputs of convolution operations (e.g., feature maps, etc.), col 6, lines 25-32; The difference between the predicted identity of the sampled face and the classification may be back propagated through the compact convolutional neural network to update the weights of the filters, col 12, lines 25-28)
	The same motivation to combine as dependent claim 4 applies here.
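	For illustration only, the channel removal quoted from Yan [0246] may be sketched as follows. This is not taken from Yan: the function name prune_channels, the cutoff of 0.01, and the two non-zero factors 0.8 and 0.5 are assumptions chosen for the example. Only the factors 0.001 and 0.003 come from the cited passage.

```python
def prune_channels(scaling_factors, threshold=0.01):
    """Return the indices of channels whose scaling factor is not near zero.

    The threshold is a hypothetical cutoff; the cited passage only says that
    factors such as 0.001 and 0.003 are regarded as near zero.
    """
    return [i for i, s in enumerate(scaling_factors) if abs(s) >= threshold]


# Four channels: the second and fourth carry the factors quoted in [0246];
# the first and third are made-up values for channels that survive.
factors = [0.8, 0.001, 0.5, 0.003]
kept = prune_channels(factors)
print(kept)
```

	Under these assumptions the second and fourth channels are dropped, which mirrors the removal of channels Ci2 and Ci4 described in the cited paragraph.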

	Regarding claim 17, Modified Yan teaches the computing system of claim 15, Yan teaches the actions further comprising: absorbing the constant channel of the first layer into a second layer of the neural network that is a convolution layer and subsequent to the first layer, (computation/addition logic to compute a plurality of scaling factors to be associated with the plurality of channels such that each channel is assigned a scaling factor, wherein each scaling factor to indicate relevance of a corresponding channel within the first neural network; and pruning logic to prune the first neural network into a second neural network by removing one or more channels of the plurality of channels having low relevance as indicated by one or more scaling factors of the plurality of scaling factors assigned to the one or more channels [0264])
	Modified Yan does not explicitly teach wherein absorbing the constant channel is based on whether the second layer is batch normalized.
	Martin teaches wherein absorbing the constant channel is based on whether the second layer is batch normalized (Second intermediate normalization layer may normalize one or more sets of concatenated feature maps. For example, inter-A norm layer B 417 may normalize one or more sets of concatenated feature maps from inter-A concat layer 416 to produce one or more normalized sets of concatenated feature maps. Inter-A norm layer B 417 may perform normalization using one or more batch normalizing transforms and/or other transforms, col 8, lines 60-67).
	The same motivation to combine as dependent claim 4 applies here.

5.	Claims 5, 12 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Yan et al (US20200234130 filed on 8/18/2017) in view of Howard et al (US20180137406 filed on 09/18/2017) and further in view of Liu et al. ("Learning efficient convolutional networks through network slimming." Proceedings of the IEEE international conference on computer vision. 2017.)

	Regarding claim 5, Modified Yan teaches the one or more computer storage media of claim 1, wherein training the neural network includes: updating model weights of the neural network based on a stochastic gradient descent (SGD) of a training loss function; (The network can then learn from those errors using an algorithm, such as the stochastic gradient descent algorithm, to update the weights of the neural network [0178]) and
	updating the channel-scaling coefficient for each channel of the plurality of channels via an iterative-thresholding algorithm (ISTA) (wherein the plurality of scaling factors comprises one or more numbers to indicate one or more relevance levels of the plurality of channels within the first neural network such that a minimum relevance level is indicated by a minimum threshold number [0268])
	Modified Yan does not explicitly teach that penalizes a batch normalization loss function based on the computation value for the first layer and a norm of the channel-scaling coefficient.
	Liu teaches that penalizes a batch normalization loss function based on the computation value for the first layer and a norm of the channel-scaling coefficient. (The way BN normalizes the activations motivates us to design a simple and efficient method to incorporate the channel-wise scaling factors. Particularly, BN layer normalizes the internal activations using mini-batch statistics, pg. 2739, left col, third para.; Our idea is introducing a scaling factor γ for each channel, which is multiplied to the output of that channel. Then we jointly train the network weights and these scaling factors, with sparsity regularization imposed on the latter. Finally we prune those channels with small factors, and fine-tune the pruned network. Specifically, the training objective of our approach is given by

$$L = \sum_{(x,y)} l(f(x, W), y) + \lambda \sum_{\gamma \in \Gamma} g(\gamma)$$

where (x, y) denote the train input and target, W denotes the trainable weights, the first sum-term corresponds to the normal training loss of a CNN, g(·) is a sparsity-induced penalty on the scaling factors, and λ balances the two terms. In our experiment, we choose g(s) = |s|, which is known as L1-norm and widely used to achieve sparsity. Subgradient descent is adopted as the optimization method for the nonsmooth L1 penalty term, pg. 2738-2739, left col, Scaling Factors and Sparsity-induced Penalty)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Yan to incorporate the teachings of Liu for the benefit of directly obtaining a narrow network without resorting to any special sparse computation packages, because pruning a channel essentially corresponds to removing all the incoming and outgoing connections of that channel (Liu, pg. 2739, left col, first para.).
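	For illustration only, the sparsity-penalized objective quoted from Liu, with g(s) = |s| and subgradient descent on the scaling factors, may be sketched as follows. The function names, the learning rate, and the numeric values are assumptions made for the example and do not appear in Liu. The sketch uses 0 as the subgradient of |s| at s = 0, which is one valid choice.

```python
def l1_penalty(gammas, lam):
    """Penalty term lam * sum(|gamma|) over the channel scaling factors."""
    return lam * sum(abs(g) for g in gammas)


def sign(value):
    """Subgradient of |value|; 0 is chosen at value == 0."""
    return (value > 0) - (value < 0)


def subgradient_step(gammas, loss_grads, lam, lr):
    """One subgradient-descent update of the scaling factors.

    loss_grads holds the gradient of the ordinary training loss with respect
    to each scaling factor; lam * sign(gamma) is the penalty's subgradient.
    """
    return [g - lr * (dg + lam * sign(g)) for g, dg in zip(gammas, loss_grads)]


# Hypothetical values chosen only to exercise the update.
gammas = [0.5, -0.2, 0.0]
updated = subgradient_step(gammas, loss_grads=[0.1, 0.0, 0.0], lam=0.01, lr=0.1)
print(updated, l1_penalty(gammas, lam=0.01))
```

	The penalty pulls each non-zero factor toward zero by a fixed amount per step, which is why small factors can later be identified and their channels pruned.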

	Regarding claim 12, Modified Yan teaches the method of claim 8, further comprising: updating model weights of the neural network based on a stochastic gradient descent (SGD) of a training loss function; (The network can then learn from those errors using an algorithm, such as the stochastic gradient descent algorithm, to update the weights of the neural network [0178]) and
	updating the channel-scaling coefficient for each channel of the plurality of channels via an iterative-thresholding algorithm (ISTA) (wherein the plurality of scaling factors comprises one or more numbers to indicate one or more relevance levels of the plurality of channels within the first neural network such that a minimum relevance level is indicated by a minimum threshold number [0268]; Further, in one embodiment, training/fine-tuning logic 2111 may then be triggered to train or fine-tune the narrow CNN and, if necessitated, continue to repeat one or more the above operations [0232])
	Modified Yan does not explicitly teach that penalizes a batch normalization loss function based on the computation value for the first layer and a norm of the channel-scaling coefficient.
	Liu teaches that penalizes a batch normalization loss function based on the computation value for the first layer and a norm of the channel-scaling coefficient. (The way BN normalizes the activations motivates us to design a simple and efficient method to incorporate the channel-wise scaling factors. Particularly, BN layer normalizes the internal activations using mini-batch statistics, pg. 2739, left col, third para.; Our idea is introducing a scaling factor γ for each channel, which is multiplied to the output of that channel. Then we jointly train the network weights and these scaling factors, with sparsity regularization imposed on the latter. Finally we prune those channels with small factors, and fine-tune the pruned network. Specifically, the training objective of our approach is given by


$$L = \sum_{(x,y)} l(f(x, W), y) + \lambda \sum_{\gamma \in \Gamma} g(\gamma)$$

where (x, y) denote the train input and target, W denotes the trainable weights, the first sum-term corresponds to the normal training loss of a CNN, g(·) is a sparsity-induced penalty on the scaling factors, and λ balances the two terms. In our experiment, we choose g(s) = |s|, which is known as L1-norm and widely used to achieve sparsity. Subgradient descent is adopted as the optimization method for the nonsmooth L1 penalty term, pg. 2738-2739, left col, Scaling Factors and Sparsity-induced Penalty)
	The same motivation to combine as dependent claim 5 applies here.

	Regarding claim 18, Modified Yan teaches the computing system of claim 15, the actions further comprising: updating model weights of the neural network based on a stochastic gradient descent (SGD) of a training loss function; (The network can then learn from those errors using an algorithm, such as the stochastic gradient descent algorithm, to update the weights of the neural network [0178]) and
	updating the channel-scaling coefficient for each channel of the plurality of channels via an iterative-thresholding algorithm (ISTA) (wherein the plurality of scaling factors comprises one or more numbers to indicate one or more relevance levels of the plurality of channels within the first neural network such that a minimum relevance level is indicated by a minimum threshold number [0268]; Further, in one embodiment, training/fine-tuning logic 2111 may then be triggered to train or fine-tune the narrow CNN and, if necessitated, continue to repeat one or more the above operations [0232])
	Modified Yan does not explicitly teach that penalizes a batch normalization loss function based on the computation value for the first layer and a norm of the channel-scaling coefficient.
	Liu teaches that penalizes a batch normalization loss function based on the computation value for the first layer and a norm of the channel-scaling coefficient. (The way BN normalizes the activations motivates us to design a simple and efficient method to incorporates the channel-wise scaling factors. Particularly, BN layer normalizes the internal activations using mini-batch statistics, pg. 2739, left col, third para.; Our idea is introducing a scaling factor γ for each channel, which is multiplied to the output of that channel. Then we jointly train the network weights and these scaling factors, with sparsity regularization imposed on the latter. Finally, we prune those channels with small factors, and fine-tune the pruned network. Specifically, the training objective of our approach is given by


$$L = \sum_{(x,y)} l(f(x, W), y) + \lambda \sum_{\gamma \in \Gamma} g(\gamma)$$

where (x, y) denote the train input and target, W denotes the trainable weights, the first sum-term corresponds to the normal training loss of a CNN, g(·) is a sparsity-induced penalty on the scaling factors, and λ balances the two terms. In our experiment, we choose g(s) = |s|, which is known as L1-norm and widely used to achieve sparsity. Subgradient descent is adopted as the optimization method for the nonsmooth L1 penalty term, pg. 2738-2739, left col, Scaling Factors and Sparsity-induced Penalty)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Yan to incorporate the teachings of Liu for the benefit of directly obtaining a narrow network without resorting to any special sparse computation packages, because pruning a channel essentially corresponds to removing all the incoming and outgoing connections of that channel (Liu, pg. 2739, left col, first para.).

	

6.	Claims 7, 14 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Yan et al (US20200234130 filed on 8/18/2017) in view of Howard et al (US20180137406 filed on 09/18/2017) and further in view of Yu et al (US20180253647)

	Regarding claim 7, Modified Yan teaches the one or more computer storage media of claim 1. Modified Yan does not explicitly teach wherein the actions further comprise: in response to the first layer not being a batch norm layer, generating a batch norm layer by computing the scaling coefficient and batch norm bias;
	Yu teaches wherein the actions further comprise: in response to the first layer not being a batch norm layer, generating a batch norm layer by computing the scaling coefficient and batch norm bias; (Referring to FIGS. 7A-7B, for example, accelerated convolutional/deconvolutional layer generator 716 may generate a weight and a bias representative of a single accelerated layer of the form: y = α0x + β0;
where x is the input to the accelerated layer, y is the output of the accelerated layer, α0 is the weight, and β0 is the bias of the accelerated layer. The accelerated layer may represent a group of layers of an original CNN model (e.g., original convolutional/deconvolutional layer, batch-norm layer, and scale layer) [0065])
	determining the scaling coefficient for each channel of the plurality of coefficients based on a variance of a convolution of training data for each channel of the plurality of channels; (Referring back to FIG. 5, in one embodiment, because the batch-norm and scaling layers of a group of layers are linear transformations, vector calculating module 123B may calculate corresponding vectors for these layers. Vector calculating module 123B may calculate an overall scale vector and an overall shift vector for a batch-norm layer or a batch-norm layer and a scaling layer. For example, a scale vector and a shift vector representing a batch-norm layer are of the form:

[Equation image: scale vector and shift vector representing a batch-norm layer]

where mean(y1) is a mean of output of y1, and std(y1) is a standard deviation of y1. A scale vector and a shift vector representing a batch-norm layer and a scaling layer are of the form:

[Equation image: scale vector and shift vector representing a batch-norm layer and a scaling layer]

where mean(y1) is a mean of output of y1, std(y1) is a standard deviation of y1, and α2 is a scaling factor and β2 is a shift factor of the scaling layer. Layer generating module 123C may combine a convolutional/deconvolutional layer with a corresponding scale vector and a shift vector of the group of layers to form an accelerated layer corresponding to a group of layers [0062]) and
	determining a batch normalization bias coefficient for each channel of the plurality of coefficients based on a mean of the convolution of training data for each channel of the plurality of channels (In one embodiment, accelerated convolutional layer 718 (or accelerated deconvolutional layer 728) is combined from convolutional layer 710 (or deconvolutional layer 720), scale vector 712 (or scale vector 722), and shift vector 714 (or shift vector 724) in the form of:

[Equation image: weight α0 and bias β0 of the accelerated layer]

where α0 is a weight, and β0 is a bias of accelerated convolutional layer 718 (or accelerated deconvolutional layer 728), y1 is the output, α1 is a weight, and β1 is a bias of convolutional layer 710 (or deconvolutional layer 720), mean(y1) is a mean of output of y1, std(y1) is a standard deviation of y1, and α2 is a scaling factor and β2 is a shift factor of a scaling layer [0067])
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Yan to incorporate the teachings of Yu for the benefit of reducing the training time and increasing the convergence rate (Yu, [0038]).
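	For illustration only, the combination of a convolutional layer, a batch-norm layer, and a scaling layer into a single layer of the form y = α0x + β0 may be sketched as follows. The equations in Yu are reproduced only as images in this action, so the algebra below is an assumption derived from the symbol definitions quoted above (α1, β1, mean(y1), std(y1), α2, β2) and is not Yu's formula as published. The function names are hypothetical, each channel is treated as a scalar, and no stabilizing epsilon is added to std(y1).

```python
def fold_batch_norm(alpha1, beta1, mean_y1, std_y1, alpha2=1.0, beta2=0.0):
    """Fold y1 = alpha1*x + beta1, batch-norm, and scaling into (alpha0, beta0).

    Assumed chain: y2 = (y1 - mean_y1) / std_y1, then y = alpha2*y2 + beta2.
    Substituting y1 gives a single affine map y = alpha0*x + beta0.
    """
    alpha0 = alpha1 * alpha2 / std_y1
    beta0 = (beta1 - mean_y1) * alpha2 / std_y1 + beta2
    return alpha0, beta0


def layer_by_layer(x, alpha1, beta1, mean_y1, std_y1, alpha2, beta2):
    """Apply the three layers one after another, for comparison."""
    y1 = alpha1 * x + beta1
    normalized = (y1 - mean_y1) / std_y1
    return alpha2 * normalized + beta2


# Hypothetical per-channel values.
params = dict(alpha1=2.0, beta1=1.0, mean_y1=3.0, std_y1=4.0, alpha2=0.5, beta2=-1.0)
alpha0, beta0 = fold_batch_norm(**params)
folded_output = alpha0 * 5.0 + beta0
reference_output = layer_by_layer(5.0, **params)
print(alpha0, beta0, folded_output, reference_output)
```

	Under these assumptions the folded layer and the three separate layers give the same output, which is the sense in which the quoted passage calls the batch-norm and scaling layers linear transformations.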

	Regarding claim 14, Modified Yan teaches the method of claim 8. Modified Yan does not explicitly teach the method further comprising: in response to the first layer not being a batch norm layer, generating a batch norm layer by computing the scaling coefficient and batch norm bias; determining the scaling coefficient for each channel of the plurality of coefficients based on a variance of a convolution of training data for each channel of the plurality of channels; and determining a batch normalization bias coefficient for each channel of the plurality of coefficients based on a mean of the convolution of training data for each channel of the plurality of channels.
	Yu teaches in response to the first layer not being a batch norm layer, generating a batch norm layer by computing the scaling coefficient and batch norm bias; (Referring to FIGS. 7A-7B, for example, accelerated convolutional/deconvolutional layer generator 716 may generate a weight and a bias representative of a single accelerated layer of the form: y = α0x + β0;
where x is the input to the accelerated layer, y is the output of the accelerated layer, α0 is the weight, and β0 is the bias of the accelerated layer. The accelerated layer may represent a group of layers of an original CNN model (e.g., original convolutional/deconvolutional layer, batch-norm layer, and scale layer) [0065])
	determining the scaling coefficient for each channel of the plurality of coefficients based on a variance of a convolution of training data for each channel of the plurality of channels; (Referring back to FIG. 5, in one embodiment, because the batch-norm and scaling layers of a group of layers are linear transformations, vector calculating module 123B may calculate corresponding vectors for these layers. Vector calculating module 123B may calculate an overall scale vector and an overall shift vector for a batch-norm layer or a batch-norm layer and a scaling layer. For example, a scale vector and a shift vector representing a batch-norm layer are of the form:

[Equation image: scale vector and shift vector representing a batch-norm layer]

where mean(y1) is a mean of output of y1, and std(y1) is a standard deviation of y1. A scale vector and a shift vector representing a batch-norm layer and a scaling layer are of the form:

[Equation image: scale vector and shift vector representing a batch-norm layer and a scaling layer]

where mean(y1) is a mean of output of y1, std(y1) is a standard deviation of y1, and α2 is a scaling factor and β2 is a shift factor of the scaling layer. Layer generating module 123C may combine a convolutional/deconvolutional layer with a corresponding scale vector and a shift vector of the group of layers to form an accelerated layer corresponding to a group of layers [0062]) and
	determining a batch normalization bias coefficient for each channel of the plurality of coefficients based on a mean of the convolution of training data for each channel of the plurality of channels. (In one embodiment, accelerated convolutional layer 718 (or accelerated deconvolutional layer 728) is combined from convolutional layer 710 (or deconvolutional layer 720), scale vector 712 (or scale vector 722), and shift vector 714 (or shift vector 724) in the form of:

[Equation image: weight α0 and bias β0 of the accelerated layer]


where α0 is a weight, and β0 is a bias of accelerated convolutional layer 718 (or accelerated deconvolutional layer 728), y1 is the output, α1 is a weight, and β1 is a bias of convolutional layer 710 (or deconvolutional layer 720), mean(y1) is a mean of output of y1, std(y1) is a standard deviation of y1, and α2 is a scaling factor and β2 is a shift factor of a scaling layer [0067])
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Yan to incorporate the teachings of Yu for the benefit of reducing the training time and increasing the convergence rate (Yu, [0038]).

	Regarding claim 20, Modified Yan teaches the computing system of claim 15. Modified Yan does not explicitly teach the actions further comprising: in response to the first layer not being a batch norm layer, generating a batch norm layer by computing the scaling coefficient and batch norm bias; determining the scaling coefficient for each channel of the plurality of coefficients based on a variance of a convolution of training data for each channel of the plurality of channels; and determining a batch normalization bias coefficient for each channel of the plurality of coefficients based on a mean of the convolution of training data for each channel of the plurality of channels.
	Yu teaches in response to the first layer not being a batch norm layer, generating a batch norm layer by computing the scaling coefficient and batch norm bias; (Referring to FIGS. 7A-7B, for example, accelerated convolutional/deconvolutional layer generator 716 may generate a weight and a bias representative of a single accelerated layer of the form: y = α0x + β0;
where x is the input to the accelerated layer, y is the output of the accelerated layer, α0 is the weight, and β0 is the bias of the accelerated layer. The accelerated layer may represent a group of layers of an original CNN model (e.g., original convolutional/deconvolutional layer, batch-norm layer, and scale layer) [0065])
	determining the scaling coefficient for each channel of the plurality of coefficients based on a variance of a convolution of training data for each channel of the plurality of channels; (Referring back to FIG. 5, in one embodiment, because the batch-norm and scaling layers of a group of layers are linear transformations, vector calculating module 123B may calculate corresponding vectors for these layers. Vector calculating module 123B may calculate an overall scale vector and an overall shift vector for a batch-norm layer or a batch-norm layer and a scaling layer. For example, a scale vector and a shift vector representing a batch-norm layer are of the form:

[Equation image: scale vector and shift vector representing a batch-norm layer]

where mean(y1) is a mean of output of y1, and std(y1) is a standard deviation of y1. A scale vector and a shift vector representing a batch-norm layer and a scaling layer are of the form:

[Equation image: scale vector and shift vector representing a batch-norm layer and a scaling layer]


where mean(y1) is a mean of output of y1, std(y1) is a standard deviation of y1, and α2 is a scaling factor and β2 is a shift factor of the scaling layer. Layer generating module 123C may combine a convolutional/deconvolutional layer with a corresponding scale vector and a shift vector of the group of layers to form an accelerated layer corresponding to a group of layers [0062]) and
	determining a batch normalization bias coefficient for each channel of the plurality of coefficients based on a mean of the convolution of training data for each channel of the plurality of channels (In one embodiment, accelerated convolutional layer 718 (or accelerated deconvolutional layer 728) is combined from convolutional layer 710 (or deconvolutional layer 720), scale vector 712 (or scale vector 722), and shift vector 714 (or shift vector 724) in the form of:

[Equation image: weight α0 and bias β0 of the accelerated layer]

where α0 is a weight, and β0 is a bias of accelerated convolutional layer 718 (or accelerated deconvolutional layer 728), y1 is the output, α1 is a weight, and β1 is a bias of convolutional layer 710 (or deconvolutional layer 720), mean(y1) is a mean of output of y1, std(y1) is a standard deviation of y1, and α2 is a scaling factor and β2 is a shift factor of a scaling layer [0067])
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Yan to incorporate the teachings of Yu for the benefit of reducing the training time and increasing the convergence rate (Yu, [0038]).

7.	Claims 11 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Yan et al (US20200234130 filed on 8/18/2017) in view of Howard et al (US20180137406 filed on 09/18/2017) and further in view of Martin (US10185891 filed on 7/8/2016)

	Regarding claim 11, Modified Yan teaches the method of claim 8, further comprising: absorbing the constant channel of the first layer into a second layer of the neural network that is a convolution layer and subsequent to the first layer, (computation/addition logic to compute a plurality of scaling factors to be associated with the plurality of channels such that each channel is assigned a scaling factor, wherein each scaling factor to indicate relevance of a corresponding channel within the first neural network; and pruning logic to prune the first neural network into a second neural network by removing one or more channels of the plurality of channels having low relevance as indicated by one or more scaling factors of the plurality of scaling factors assigned to the one or more channels [0264])
	Modified Yan does not explicitly teach wherein absorbing the constant channel is based on whether the second layer is batch normalized.
	Martin teaches wherein absorbing the constant channel is based on whether the second layer is batch normalized (Second intermediate normalization layer may normalize one or more sets of concatenated feature maps. For example, inter-A norm layer B 417 may normalize one or more sets of concatenated feature maps from inter-A concat layer 416 to produce one or more normalized sets of concatenated feature maps. Inter-A norm layer B 417 may perform normalization using one or more batch normalizing transforms and/or other transforms, col 8, lines 60-67).
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Yan to incorporate the teachings of Martin for the benefit of producing one or more normalized sets of concatenated feature maps (Martin, col 8, lines 64-65).

	Regarding claim 13, Modified Yan teaches the method of claim 8, further comprising: re-training the channel-pruned neural network based on stochastic gradient descent (SGD) of a training loss function (The network can then learn from those errors using an algorithm, such as the stochastic gradient descent algorithm, to update the weights of the neural network [0178])
	Modified Yan does not explicitly teach in response to padding in the first layer.
	Martin teaches in response to padding in the first layer (pre-padding layer 403 may increase the dimensionality of one or more image maps (e.g., from 96 height and 96 width to 102 height and 102 width, etc.) by padding the borders of the image maps with zero values. Padding the borders of the image maps with zero values may allow for the compact convolutional neural network to control the dimensions of outputs of convolution operations (e.g., feature maps, etc.), col 6, lines 25-32; The difference between the predicted identity of the sampled face and the classification may be back propagated through the compact convolutional neural network to update the weights of the filters, col 12, lines 25-28)
	The same motivation to combine as dependent claim 11 applies here.
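	For illustration only, the zero padding quoted from Martin (96 by 96 image maps enlarged to 102 by 102) may be sketched as follows. The cited passage gives only the dimensions, so the symmetric border of 3 values on each side is an assumption, as are the function name zero_pad and the all-ones test image.

```python
def zero_pad(image, border):
    """Surround a 2-D list of values with `border` rows and columns of zeros."""
    padded_width = len(image[0]) + 2 * border
    top_bottom = [[0] * padded_width for _ in range(border)]
    middle = [[0] * border + list(row) + [0] * border for row in image]
    # A fresh copy of the zero rows is used at the bottom so rows are not shared.
    return top_bottom + middle + [list(row) for row in top_bottom]


# A 96 x 96 image map filled with ones, padded by an assumed border of 3.
image = [[1] * 96 for _ in range(96)]
padded = zero_pad(image, 3)
print(len(padded), len(padded[0]))
```

	With a 3-value border the output is 102 by 102, matching the dimensions stated in the cited passage.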

8.	Claim 19 is rejected under 35 U.S.C. 103 as being unpatentable over Yan et al (US20200234130 filed on 8/18/2017) in view of Howard et al (US20180137406 filed on 09/18/2017) in view of Liu et al. ("Learning efficient convolutional networks through network slimming." Proceedings of the IEEE international conference on computer vision. 2017.) and further in view of Martin (US10185891 filed on 7/8/2016)

	Regarding claim 19, Modified Yan teaches the computing system of claim 18, the actions further comprising: re-training the channel-pruned neural network based on stochastic gradient descent (SGD) of a training loss function (The network can then learn from those errors using an algorithm, such as the stochastic gradient descent algorithm, to update the weights of the neural network [0178])
	Modified Yan does not explicitly teach in response to padding in the first layer.
	Martin teaches in response to padding in the first layer (pre-padding layer 403 may increase the dimensionality of one or more image maps (e.g., from 96 height and 96 width to 102 height and 102 width, etc.) by padding the borders of the image maps with zero values. Padding the borders of the image maps with zero values may allow for the compact convolutional neural network to control the dimensions of outputs of convolution operations (e.g., feature maps, etc.), col 6, lines 25-32; The difference between the predicted identity of the sampled face and the classification may be back propagated through the compact convolutional neural network to update the weights of the filters, col 12, lines 25-28)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Yan to incorporate the teachings of Martin for the benefit of producing one or more normalized sets of concatenated feature maps (Martin, col 8, lines 64-65)
	

Conclusion
	Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
	Any inquiry concerning this communication or earlier communications from the examiner should be directed to MORIAM MOSUNMOLA GODO whose telephone number is (571)272-8670. The examiner can normally be reached Monday-Friday 7:30am-5:30pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li B. Zhen can be reached on (571)272-3768. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.


/M.G./Examiner, Art Unit 2121                                  


/Li B. Zhen/Supervisory Patent Examiner, Art Unit 2121