Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 101
101 Rejection
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1-20 are rejected under 35 USC § 101 because the claimed invention is directed to non-statutory subject matter.

Regarding Claim 1:  Claim 1 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 1 recites a method, which is a process, one of the statutory categories.
Step 2A Prong One Analysis:  Claim 1 recites a computer implemented method of processing neural networks, which, under its broadest reasonable interpretation is a series of mental processes and mathematical calculations.  For example, but for the generic computer components language, the above limitations in the context of this claim encompass neural network processing, including the following: 
updating each of a plurality of weight coefficients included in the neural network so that an objective function obtained by adding a basic loss function and an L2 regularization term multiplied by a regularization strength is minimized (mathematical calculation),
specifying an inactive node and an inactive channel among a plurality of nodes and a plurality of channels included in the neural network (observation, evaluation, and judgement based on mathematical calculations).
Therefore, claim 1 recites an abstract idea which is a judicial exception.
Step 2A Prong Two Analysis:  Claim 1 does not recite any additional elements to integrate the judicial exception into a practical application.  Therefore, claim 1 is directed to a judicial exception.
Step 2B Analysis:  Claim 1 does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the lack of integration of the abstract idea into a practical application, the additional elements recited in claim 1 amount to no more than mere instructions to apply the judicial exception using a generic computer component.
For the reasons above, claim 1 is rejected as being directed to non-patentable subject matter under §101. This rejection applies equally to independent claim 17, which recites a device, as well as to dependent claims 2-16 and 18-19. The additional limitations of the dependent claims are addressed briefly below:
Dependent claim 2 recites additional mathematical calculations “wherein, for each of the plurality of weight coefficients, in the updating, a gradient is calculated based on the objective function, a step width is calculated based on the gradient and a corresponding past gradient, and the plurality of weight coefficients are updated based on the calculated step width so that the objective function is decreased.”
Dependent claim 3 recites additional mathematical calculations “wherein an activation function including an interval of an input value at which a differential function becomes 0 or an interval of an input value at which the differential function is asymptotic to 0 is set in the neural network.”
Dependent claim 4 recites additional mathematical calculations “wherein, in the differential function of the activation function, an interval of an input value on a positive side further than a predetermined input value is larger than 0, and an interval of an input value on a negative side further than the predetermined input value is 0 or asymptotic to 0.”
Dependent claim 5 recites additional mathematical relationships “wherein the activation functions set in all nodes and channels included in all intermediate layers of the neural network are identical to one another.”
Dependent claim 6 recites additional observation, evaluation, and judgement based on the result of a mathematical calculation “wherein, in the specifying, a node and a channel for which norms of weight vectors are a predetermined threshold value or less are specified as the inactive node and the inactive channel”
Dependent claim 7 recites an additional mathematical calculation “deleting the inactive node and the inactive channel from the neural network”
Dependent claim 8 recites additional insignificant extra-solution activity “acquiring a plurality of pieces of training information including an input vector and a target vector serving as a target of an output vector” which amounts to gathering data (See Mayo, 566 U.S. at 79, 101 USPQ2d at 1968; OIP Techs., Inc. v. Amazon.com, Inc., 788 F.3d 1359, 1363, 115 USPQ2d 1090, 1092-93 (Fed. Cir. 2015)) as well as additional mathematical calculations “generating an error vector based on the output vector and the target vector”, and “in the updating, each of the plurality of weight coefficients is updated each time a set of the forward direction process and the reverse direction process is executed” as well as additional elements “executing a forward direction process of assigning the input vector to the input layer of the neural network, causing operation data to be propagated in a forward direction, and causing the output vector to be output from an output layer and a reverse direction process of assigning the error vector to the output layer of the neural network and causing error data to be propagated in a reverse direction” which is well-understood, routine, and conventional (See Stork [Col. 3 l. 59-65] "A method for pruning and adjusting the weights of a feed-forward ANN is described that is based upon a Taylor expansion of the saliency function of the network" [Col. 4 l. 15-30] "The error metric, E, based on the difference between the desired response and the observed response, is then used to adjust the individual weights of the weight vector, w. Typically, a gradient descent form of algorithm is used such as the well-known generalized delta rule or backpropagation rule").  
Dependent claim 9 recites additional elements “wherein, after the weight coefficient is updated predetermined number of times or more, in the deleting, the inactive node and the inactive channel are deleted from the neural network,” which are well-understood, routine, and conventional (See MPEP 2106.05(d) Performing repetitive calculations, Flook, 437 U.S. at 594, 198 USPQ2d at 199 (recomputing or readjusting alarm limit values); Bancorp Services v. Sun Life, 687 F.3d 1266, 1278, 103 USPQ2d 1425, 1433 (Fed. Cir. 2012) ("The computer required by some of Bancorp’s claims is employed only for its most basic function, the performance of repetitive calculations, and as such does not impose meaningful limits on the scope of those claims.")).
Dependent claim 10 recites additional observation, evaluation, and judgement based on the result of a mathematical calculation (“determining whether or not a size of the neural network from which the inactive node and the inactive channel have been deleted is a target size or less after the inactive node and the inactive channel are deleted, causing each of the plurality of weight coefficients to be updated again in the neural network from which the inactive node and the inactive channel have been deleted when the size of the neural network is not the target size or less, and causing the inactive node and the inactive channel to be deleted”)
Dependent claim 11 recites additional mathematical calculations “changing the regularization strength in accordance with a target deletion ratio wherein, in the changing, the regularization strength is changed so that the regularization strength increases as the target deletion ratio increases”
Dependent claim 12 recites additional mathematical relationships “wherein the activation function is ReLU.”
Dependent claim 13 recites additional mathematical relationships “wherein the activation function is ELU.”
Dependent claim 14 recites additional mathematical relationships “wherein the activation function is hyperbolic tangent.”
Dependent claim 15 recites additional mathematical relationships “wherein, in the updating, the weight coefficient is updated by an algorithm of Adam.”
Dependent claim 16 recites additional mathematical relationships “wherein, in the updating, the weight coefficient is updated by an algorithm of RMSprop.”
Dependent claim 18 recites additional observation, evaluation, and judgement based on the result of a mathematical calculation “the one or more processors is further configured to delete the inactive node and the inactive channel from the neural network” and “the one or more processors specifies a node and a channel for which norms of weight vectors are a predetermined threshold value or less as the inactive node and the inactive channel”, as well as additional mathematical calculations “wherein activation functions set in all nodes and channels included in all intermediate layers of the neural network include an interval of an input value at which a differential function becomes 0 or an interval of an input value at which the differential function is asymptotic to 0”, “for each of the plurality of weight coefficients, the one or more processors calculates a gradient based on the objective function, calculates a step width based on the gradient and a corresponding past gradient, and updates the plurality of weight coefficients based on the calculated step width so that the objective function is decreased”
Dependent claim 19 recites additional mathematical relationships “wherein the activation function is ReLU” and “the one or more processors updates the plurality of weight coefficients in accordance with an optimization algorithm of Adam”

Therefore, the elements, considered separately and in combination, do not add significantly more to the inventive concept. Accordingly, claims 1-20 are rejected under 35 U.S.C. § 101.



Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1-7, 12, 15, and 17-19 are rejected under 35 U.S.C. 103 as being unpatentable over Bansal (“Minnorm training: an algorithm for training over-parameterized deep neural networks”, 2018) in view of Gorokhov (US20210027166A1).

	Regarding claim 1, Bansal teaches  A learning method of optimizing a neural network, comprising: ([Abstract] "In this work, we propose a new training method for finding minimum weight norm solutions in over-parameterized neural networks (NNs). This method seeks to improve training speed and generalization performance by framing NN training as a constrained optimization problem")
	updating each of a plurality of weight coefficients included in the neural network so that an objective function obtained by adding a basic loss function and an L2 regularization term multiplied by a regularization strength is minimized; and ([p. 2 §1] "deep NNs are typically trained to minimize an unconstrained objective" [p. 6 §3.2.1] "In the scalar two neuron chain with a single example, the traditional unconstrained training loss function with an L2 penalty (weight decay) takes the form [See Eqn 8]" Gamma interpreted as synonymous with regularization strength.)
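For illustration only, the objective function recited in this limitation can be sketched as a basic loss plus an L2 term scaled by a regularization strength. The sketch below is not Bansal's Eqn 8 and is not the applicant's implementation: the function names, the single-sample squared-error loss, and the strength value 0.01 are hypothetical choices made for the example.

```python
def objective(weights, basic_loss, strength):
    """Basic loss plus an L2 term scaled by a regularization strength."""
    l2_term = sum(w * w for w in weights)
    return basic_loss(weights) + strength * l2_term


def squared_error(weights):
    # Hypothetical basic loss: a scalar two-weight chain, one sample with
    # input 1.0 and target 1.0.
    prediction = weights[0] * weights[1] * 1.0
    return (prediction - 1.0) ** 2


# The prediction already equals the target, so only the L2 term contributes.
value = objective([2.0, 0.5], squared_error, strength=0.01)
```

Increasing the strength argument makes the L2 term dominate the objective, which pushes the minimizing weights toward zero.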
	While Bansal teaches that the optimization occurs at a sample and channel basis ([p. 15 §A] "this will lead to Lagrangian parameters for each sample and output channel (indexed as αµi)"), Bansal does not explicitly teach specifying an inactive node and an inactive channel among a plurality of nodes and a plurality of channels included in the neural network.  

Gorokhov, in the same field of endeavor, teaches specifying an inactive node and an inactive channel among a plurality of nodes and a plurality of channels included in the neural network. ([¶0018] " the context information includes channel values (e.g., red channel values, green channel values, blue channel values) associated with the first network layer 22. Moreover, the context aggregation component 26 may be a downsample (e.g., pooling) layer that averages channel values in the first network layer 22. Additionally, the illustrated branch path 20 includes a plurality of FC layers 30 (30 a, 30 b) that conduct an importance classification of the aggregated context information and selectively exclude one or more channels in the first network layer 22 from consideration by the second network layer 24 based on the importance classification" [¶0019] " if the first network layer 22 has 256 output neurons, the context aggregation component 26 might provide a “blob” of 256 values to a first FC layer 30 a, where the first FC layer 30 a generates a high-level feature vector having 32 elements/output neurons (e.g., with the value of each output neuron indicating the likelihood of that neuron being activated). Additionally, a second FC layer 30 b may generate an importance score vector based on the high-level feature vector, where the importance score vector has 256 output neurons. The second FC layer 30 b may generally make higher level classifications than the first FC layer 30 a. The importance score vector may contain zero values for neurons in less important channels" Gorokhov explicitly teaches selectively specifying inactive neurons (nodes) and channels.). 

	Bansal and Gorokhov are both directed towards pruning neural networks.  Therefore, Bansal and Gorokhov are analogous art in the same field of endeavor.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of Bansal with the teachings of Gorokhov by pruning network channels.  Gorokhov teaches as a motivation for combination ([¶0074] “building block technology described herein may create lighter neural networks that solve computer vision and/or natural language processing problems in real-time on embedded processors. The technology may be used for hardware evaluation and design. Moreover, the execution time of particular network layers may be reduced considerably (e.g., 33% or more). The technology is also advantageous over knowledge distillation solutions that transfer knowledge from a large network to a smaller one, but fail to address redundancy or dynamic channel importance.”).  This motivation for combination also applies to the remaining claims depending on this combination.

	Regarding claim 2, the combination of Bansal and Gorokhov teaches The method according to claim 1, wherein, for each of the plurality of weight coefficients, in the updating, a gradient is calculated based on the objective function, a step width is calculated based on the gradient and a corresponding past gradient, and the plurality of weight coefficients are updated based on the calculated step width so that the objective function is decreased. (Bansal [p. 4 §2] "We perform a pair of iterative updates similar to the dual-ascent algorithm to optimize for the network weights...where η is the step size for weight updates. The above equation corresponds to gradient descent on the Lagrangian and can be applied to arbitrary deep networks via standard automatic differentiation. This weight update step can be replaced by multiple gradient steps or alternative neural network training approaches such as momentum [28] or Adam [15]. The second step is the update of the Lagrange multipliers with step size s" Step width interpreted as synonymous with step size.). 
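For illustration only, one common way to compute a step width from a gradient and a corresponding past gradient is an RMSprop-style update, in which a running average of squared gradients carries the history. The sketch below assumes that reading; it is not taken from Bansal or from the application, and the objective, learning rate, decay, and iteration count are hypothetical.

```python
import math


def rmsprop_step(weight, grad, avg_sq, lr=0.01, decay=0.9, eps=1e-8):
    # The running average of squared gradients carries the past-gradient history.
    avg_sq = decay * avg_sq + (1.0 - decay) * grad * grad
    # The step width depends on the current gradient and on that history.
    step = lr * grad / (math.sqrt(avg_sq) + eps)
    return weight - step, avg_sq


def obj(w, strength=0.1):
    # Hypothetical objective: squared-error loss plus an L2 term.
    return (w - 3.0) ** 2 + strength * w * w


def grad_fn(w, strength=0.1):
    # Analytic gradient of obj with respect to w.
    return 2.0 * (w - 3.0) + 2.0 * strength * w


w, avg_sq = 0.0, 0.0
start = obj(w)
for _ in range(200):
    w, avg_sq = rmsprop_step(w, grad_fn(w), avg_sq)
end = obj(w)
```

Adam follows the same pattern but additionally keeps a running average of the gradient itself.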

	Regarding claim 3, the combination of Bansal and Gorokhov teaches The method according to claim 2, wherein an activation function including an interval of an input value at which a differential function becomes 0 or an interval of an input value at which the differential function is asymptotic to 0 is set in the neural network. (Bansal [p. 8 §3.3.1] "We consider a shallow neural network with weight vector W = [w1w2] and sigmoidal output...where y˜µ = yµ/2 + 1/2 is the output label {−1, 1} recoded to {0, 1}. Differentiating with respect to the weights yields the gradient descent update" Bansal explicitly teaches that the derivative of the sigmoid activation function is mapped from 0 to 1.). 

	Regarding claim 4, the combination of Bansal and Gorokhov teaches The method according to claim 3, wherein, in the differential function of the activation function, an interval of an input value on a positive side further than a predetermined input value is larger than 0, and an interval of an input value on a negative side further than the predetermined input value is 0 or asymptotic to 0. (Bansal [p. 8 §3.3.1] "We consider a shallow neural network with weight vector W = [w1w2] and sigmoidal output...where y˜µ = yµ/2 + 1/2 is the output label {−1, 1} recoded to {0, 1}. Differentiating with respect to the weights yields the gradient descent update" Bansal explicitly teaches that the derivative of the sigmoid activation function is mapped from 0 to 1.). 
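For illustration only, the intervals referred to in claims 3 and 4 can be inspected by evaluating the derivatives of common activation functions at a few input values. The sketch below uses the standard textbook derivatives of ReLU, sigmoid, and hyperbolic tangent; the sample points are arbitrary and nothing here is taken from Bansal. The sigmoid derivative s(1 - s) peaks at 0.25 at an input of 0 and tends to 0 as the input grows large in either direction, while the ReLU derivative is exactly 0 for negative inputs and 1 for positive inputs.

```python
import math


def d_relu(x):
    # Derivative of ReLU: 0 on the negative side, 1 on the positive side.
    return 1.0 if x > 0 else 0.0


def d_sigmoid(x):
    # Derivative of the logistic sigmoid, s * (1 - s).
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def d_tanh(x):
    # Derivative of the hyperbolic tangent, 1 - tanh(x)^2.
    return 1.0 - math.tanh(x) ** 2


for x in (-20.0, -1.0, 0.0, 1.0, 20.0):
    print(x, d_relu(x), d_sigmoid(x), d_tanh(x))
```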

	Regarding claim 5, the combination of Bansal and Gorokhov teaches The method according to claim 3, wherein the activation functions set in all nodes and channels included in all intermediate layers of the neural network are identical to one another. (Bansal [p. 11 §4] "We train a fully connected network with 2 hidden layers and 800 hidden units with ReLU activations on 50K examples for each algorithm"). 

	Regarding claim 6, the combination of Bansal and Gorokhov teaches The method according to claim 2, wherein, in the specifying, a node and a channel for which norms of weight vectors are a predetermined threshold value or less are specified as the inactive node and the inactive channel. (Gorokhov [¶0023] "The layer width loss 70 may be determined based on the pruning ratio constraint. In one example, the layer width loss is determined by calculating the mean across all elements (e.g., output neurons) of the vector of multipliers and then computing the Euclidean norm (e.g., distance) between the mean and the pruning ratio constraint. Accordingly, the calculated loss may be considered to be a penalty for layer width...during the training process, there may be an adversarial situation where compliance with the constraint imposed by the accuracy loss 74 minimizes the error of the network, but the layer width loss 70 minimizes the number of channels and results in a penalty if the number of channels does not comply with the pruning ratio constraint" Euclidean norm of output neurons interpreted as synonymous with norms of weight vectors.  Pruning ratio constraint interpreted as synonymous with threshold value.). 
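For illustration only, the claim language "norms of weight vectors are a predetermined threshold value or less" can be read as a per-node comparison of an L2 norm against a fixed threshold. The sketch below follows that reading of the claim; it does not reproduce Gorokhov's layer width loss, and the function name, weight values, and threshold are hypothetical.

```python
import math


def inactive_nodes(weight_rows, threshold):
    """Return indices of nodes whose weight-vector L2 norm is <= threshold."""
    flagged = []
    for index, row in enumerate(weight_rows):
        norm = math.sqrt(sum(w * w for w in row))
        if norm <= threshold:
            flagged.append(index)
    return flagged


# One row of incoming weights per node (hypothetical values).
layer = [
    [0.8, -0.5, 0.3],
    [1e-4, -2e-4, 0.0],
    [0.0, 0.0, 0.0],
]
flagged = inactive_nodes(layer, threshold=1e-3)
```

The same comparison could be applied per channel by treating each convolution filter's flattened weights as one row.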

	Regarding claim 7, the combination of Bansal and Gorokhov teaches The method according to claim 2, further comprising, deleting the inactive node and the inactive channel from the neural network. (Gorokhov [¶0018] " the context information includes channel values (e.g., red channel values, green channel values, blue channel values) associated with the first network layer 22. Moreover, the context aggregation component 26 may be a downsample (e.g., pooling) layer that averages channel values in the first network layer 22. Additionally, the illustrated branch path 20 includes a plurality of FC layers 30 (30 a, 30 b) that conduct an importance classification of the aggregated context information and selectively exclude one or more channels in the first network layer 22 from consideration by the second network layer 24 based on the importance classification" [¶0019] " if the first network layer 22 has 256 output neurons, the context aggregation component 26 might provide a “blob” of 256 values to a first FC layer 30 a, where the first FC layer 30 a generates a high-level feature vector having 32 elements/output neurons (e.g., with the value of each output neuron indicating the likelihood of that neuron being activated). Additionally, a second FC layer 30 b may generate an importance score vector based on the high-level feature vector, where the importance score vector has 256 output neurons. The second FC layer 30 b may generally make higher level classifications than the first FC layer 30 a. The importance score vector may contain zero values for neurons in less important channels" Gorokhov explicitly teaches selectively specifying inactive neurons (nodes) and channels.  Pruning interpreted as synonymous with deleting.). 
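For illustration only, deleting a node from a fully connected layer can be sketched as removing that node's row of incoming weights and the matching column of the next layer's weights. This is a structural-deletion sketch under assumed list-of-lists weight storage; it is not Gorokhov's importance-score mechanism, and the function name and values are hypothetical.

```python
def delete_nodes(layer_weights, next_layer_weights, inactive):
    """Drop inactive nodes from one layer and their inputs to the next layer."""
    drop = set(inactive)
    kept = [i for i in range(len(layer_weights)) if i not in drop]
    # Remove each deleted node's row of incoming weights.
    pruned_layer = [layer_weights[i] for i in kept]
    # Remove the matching column from every row of the next layer.
    pruned_next = [[row[i] for i in kept] for row in next_layer_weights]
    return pruned_layer, pruned_next


hidden = [[0.8, -0.5], [0.0, 0.0], [0.4, 0.9]]  # three hidden nodes
output = [[1.0, 2.0, 3.0]]                      # one output node
pruned_hidden, pruned_output = delete_nodes(hidden, output, inactive=[1])
```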

	Regarding claim 12, the combination of Bansal and Gorokhov teaches The method according to claim 4, wherein the activation function is ReLU. (Bansal [p. 11 §4] "We train a fully connected network with 2 hidden layers and 800 hidden units with ReLU activations on 50K examples for each algorithm"). 

Regarding claim 15, the combination of Bansal and Gorokhov teaches The method according to claim 2, wherein, in the updating, the weight coefficient is updated by an algorithm of Adam. (Bansal [p. 4 §2] "This weight update step can be replaced by multiple gradient steps or alternative neural network training approaches such as momentum [28] or Adam [15]."). 

Regarding claim 17, claim 17 is substantially similar to claim 1.  Therefore, the rejection applied to claim 1 also applies to claim 17.

	Regarding claim 18, the combination of Bansal and Gorokhov teaches The device according to claim 17, wherein the one or more processors is further configured to delete the inactive node and the inactive channel from the neural network, (Gorokhov [¶0018] " the context information includes channel values (e.g., red channel values, green channel values, blue channel values) associated with the first network layer 22. Moreover, the context aggregation component 26 may be a downsample (e.g., pooling) layer that averages channel values in the first network layer 22. Additionally, the illustrated branch path 20 includes a plurality of FC layers 30 (30 a, 30 b) that conduct an importance classification of the aggregated context information and selectively exclude one or more channels in the first network layer 22 from consideration by the second network layer 24 based on the importance classification" [¶0019] " if the first network layer 22 has 256 output neurons, the context aggregation component 26 might provide a “blob” of 256 values to a first FC layer 30 a, where the first FC layer 30 a generates a high-level feature vector having 32 elements/output neurons (e.g., with the value of each output neuron indicating the likelihood of that neuron being activated). Additionally, a second FC layer 30 b may generate an importance score vector based on the high-level feature vector, where the importance score vector has 256 output neurons. The second FC layer 30 b may generally make higher level classifications than the first FC layer 30 a. The importance score vector may contain zero values for neurons in less important channels" Gorokhov explicitly teaches selectively specifying inactive neurons (nodes) and channels.  Pruning interpreted as synonymous with deleting.)
	wherein activation functions set in all nodes and channels included in all intermediate layers of the neural network include an interval of an input value at which a differential function becomes 0 or an interval of an input value at which the differential function is asymptotic to 0, (Bansal [p. 8 §3.3.1] "We consider a shallow neural network with weight vector W = [w1w2] and sigmoidal output...where y˜µ = yµ/2 + 1/2 is the output label {−1, 1} recoded to {0, 1}. Differentiating with respect to the weights yields the gradient descent update" Bansal explicitly teaches that the derivative of the sigmoid activation function is mapped from 0 to 1.)
	for each of the plurality of weight coefficients, the one or more processors calculates a gradient based on the objective function, calculates a step width based on the gradient and a corresponding past gradient, and updates the plurality of weight coefficients based on the calculated step width so that the objective function is decreased, and (Bansal [p. 4 §2] "We perform a pair of iterative updates similar to the dual-ascent algorithm to optimize for the network weights...where η is the step size for weight updates. The above equation corresponds to gradient descent on the Lagrangian and can be applied to arbitrary deep networks via standard automatic differentiation. This weight update step can be replaced by multiple gradient steps or alternative neural network training approaches such as momentum [28] or Adam [15]. The second step is the update of the Lagrange multipliers with step size s" Step width interpreted as synonymous with step size.)
	the one or more processors specifies a node and a channel for which norms of weight vectors are a predetermined threshold value or less as the inactive node and the inactive channel. (Gorokhov [¶0023] "The layer width loss 70 may be determined based on the pruning ratio constraint. In one example, the layer width loss is determined by calculating the mean across all elements (e.g., output neurons) of the vector of multipliers and then computing the Euclidean norm (e.g., distance) between the mean and the pruning ratio constraint. Accordingly, the calculated loss may be considered to be a penalty for layer width...during the training process, there may be an adversarial situation where compliance with the constraint imposed by the accuracy loss 74 minimizes the error of the network, but the layer width loss 70 minimizes the number of channels and results in a penalty if the number of channels does not comply with the pruning ratio constraint" Euclidean norm of output neurons interpreted as synonymous with norms of weight vectors.  Pruning ratio constraint interpreted as synonymous with threshold value.). 

	Regarding claim 19, the combination of Bansal and Gorokhov teaches The device according to claim 18, wherein the activation function is ReLU, and (Bansal [p. 11 §4] "We train a fully connected network with 2 hidden layers and 800 hidden units with ReLU activations on 50K examples for each algorithm")
	the one or more processors updates the plurality of weight coefficients in accordance with an optimization algorithm of Adam. (Bansal [p. 4 §2] "This weight update step can be replaced by multiple gradient steps or alternative neural network training approaches such as momentum [28] or Adam [15]."). 

	Claims 8-10 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Bansal and Gorokhov, and further in view of Stork (US5636326A).

	Regarding claim 8, the combination of Bansal and Gorokhov teaches The method according to claim 7.
	However, the combination of Bansal and Gorokhov does not explicitly teach acquiring a plurality of pieces of training information including an input vector and a target vector serving as a target of an output vector;
	generating an error vector based on the output vector and the target vector; and 
	executing a forward direction process of assigning the input vector to the input layer of the neural network, causing operation data to be propagated in a forward direction, 
	and causing the output vector to be output from an output layer and a reverse direction process of assigning the error vector to the output layer of the neural network and causing error data to be propagated in a reverse direction, 
	wherein, in the updating, each of the plurality of weight coefficients is updated each time a set of the forward direction process and the reverse direction process is executed.  
Stork, in the same field of endeavor, teaches acquiring a plurality of pieces of training information including an input vector and a target vector serving as a target of an output vector; ([Col. 13 l. 0-15] "In the training mode, training unit 240 and neural net model 220 are initialized from memory 250 via bus 202 providing initial weights, topological, and network circuit characteristics (activation functions). Training input vectors, x.sup.[k], are provided from memory 250 via bus 202 together with their exemplar responses, t.sup.[k]. Neural net model 220 simulates the network response to the input vector, x.sup.[k], providing an output 0.sup.[k] which is compared to t.sup.[k] by differencing unit 256 which forms at its output an error vector t.sup.[k] -0.sup.[k] for each input/output exemplar pairs (x.sup.[k], 0.sup.[k])" Exemplar response interpreted as synonymous with a target vector.)
	generating an error vector based on the output vector and the target vector; and ([Col. 13 l. 0-15] "In the training mode, training unit 240 and neural net model 220 are initialized from memory 250 via bus 202 providing initial weights, topological, and network circuit characteristics (activation functions). Training input vectors, x.sup.[k], are provided from memory 250 via bus 202 together with their exemplar responses, t.sup.[k]. Neural net model 220 simulates the network response to the input vector, x.sup.[k], providing an output 0.sup.[k] which is compared to t.sup.[k] by differencing unit 256 which forms at its output an error vector t.sup.[k] -0.sup.[k] for each input/output exemplar pairs (x.sup.[k], 0.sup.[k])")
	executing a forward direction process of assigning the input vector to the input layer of the neural network, causing operation data to be propagated in a forward direction, ([Col. 3 l. 59-65] "A method for pruning and adjusting the weights of a feed-forward ANN is described that is based upon a Taylor expansion of the saliency function of the network" Feed-forward interpreted as synonymous with propagating in a forward direction.)
	and causing the output vector to be output from an output layer and a reverse direction process of assigning the error vector to the output layer of the neural network and causing error data to be propagated in a reverse direction, ([Col. 4 l. 15-30] "The error metric, E, based on the difference between the desired response and the observed response, is then used to adjust the individual weights of the weight vector, w. Typically, a gradient descent form of algorithm is used such as the well-known generalized delta rule or backpropagation rule" Backpropagation interpreted as synonymous with causing error data to be propagated in a reverse direction.)
	wherein, in the updating, each of the plurality of weight coefficients is updated each time a set of the forward direction process and the reverse direction process is executed. ([Col. 4 l. 5-30] "In the supervised learning mode, a neural network is trained by applying a series of input data vectors to the input terminals of a network with some initial set of weights, represented by the vector, w. The output vector response of the network is then compared to a set of exemplar vector responses that correspond to the desired network response for each input vector. The error metric, E, based on the difference between the desired response and the observed response, is then used to adjust the individual weights of the weight vector, w. Typically, a gradient descent form of algorithm is used such as the well-known generalized delta rule or backpropagation rule"). 
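For illustration only, the sequence recited in claim 8 (forward direction process, error vector, reverse direction process, weight update after each pair of passes) can be sketched for a single linear layer trained with the delta rule. This is a simplified sketch and is not Stork's apparatus: with one layer the reverse pass reduces to applying the output error directly, whereas a multi-layer network would propagate the error backward through each layer. The function name, learning rate, and training samples are hypothetical.

```python
def train_step(weights, x, target, lr=0.1):
    # Forward direction: propagate the input vector to the output layer.
    output = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    # Error vector from the output vector and the target vector.
    error = [o - t for o, t in zip(output, target)]
    # Reverse direction: the output-layer error scales each weight's gradient.
    new_weights = [
        [w - lr * e * xi for w, xi in zip(row, x)]
        for row, e in zip(weights, error)
    ]
    return new_weights, error


weights = [[0.0, 0.0], [0.0, 0.0]]
# Each training sample pairs an input vector with a target vector.
samples = [([1.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 1.0])]
for _ in range(50):
    for x, t in samples:
        # The weights are updated once per forward/reverse pair.
        weights, error = train_step(weights, x, t)
```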

	Bansal, Gorokhov, and Stork are all directed towards pruning neural networks.  Therefore, Bansal, Gorokhov, and Stork are analogous art in the same field of endeavor.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of Bansal and Gorokhov with the teachings of Stork by using an input, output, and error vector in a feed-forward neural network with backpropagation.  While it would be obvious to one of ordinary skill in the art that the convolutional neural network in Gorokhov is feed-forward, and that the input vector for the network would be received, Stork reinforces the obviousness of these neural network features.  Stork teaches that forward and backward propagation were well-known in the art as of 1992 ([Col. 4 l. 15-30] “The error metric, E, based on the difference between the desired response and the observed response, is then used to adjust the individual weights of the weight vector, w. Typically, a gradient descent form of algorithm is used such as the well-known generalized delta rule or backpropagation rule [Rumelhart, D. E., Hinton, G. E., and Williams, R. J., Learning Internal Representations by Error propagation, Chapt. 8, parallel Distributed Processing, Vol. 1, 1986, Cambridge, MIT Press]”). This motivation for combination also applies to the remaining claims depending on this combination.  
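
For illustration only, the forward direction process, the error vector, and the reverse direction process discussed above can be sketched for a single linear layer trained with the delta rule. This is a minimal sketch under stated assumptions and is not taken from Stork, Bansal, or Gorokhov: the function names `forward` and `train_step`, the learning rate, and the toy input and target values are all hypothetical.

```python
def forward(weights, x):
    """Forward direction: propagate the input vector to the output layer."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in weights]


def train_step(weights, x, target, lr=0.1):
    """Run one forward pass and one reverse pass, then update every weight."""
    output = forward(weights, x)
    # Error vector formed as target minus output (t - o).
    error = [t - o for t, o in zip(target, output)]
    # Reverse direction for a single linear layer: the gradient of
    # 0.5 * ||t - o||^2 with respect to w_ij is -error_i * x_j.
    new_weights = [
        [w_ij + lr * e_i * x_j for w_ij, x_j in zip(row, x)]
        for row, e_i in zip(weights, error)
    ]
    return new_weights, error


# Hypothetical toy data: one output node, two input nodes.
weights = [[0.0, 0.0]]
x, target = [1.0, 2.0], [1.0]
history = []
for _ in range(50):
    # The weights are updated each time a forward/reverse pair is executed.
    weights, error = train_step(weights, x, target)
    history.append(abs(error[0]))
```

In this sketch the absolute error recorded in `history` shrinks on every pass, since each update moves the output toward the target.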

	Regarding claim 9, the combination of Bansal, Gorokhov, and Stork teaches The method according to claim 8, wherein, after the weight coefficient is updated predetermined number of times or more, (Bansal [p. 4 Algorithm 1 & p. 5 Algorithm 2] "for number of epochs do")
	in the deleting, the inactive node and the inactive channel are deleted from the neural network. (Gorokhov [¶0018] " the context information includes channel values (e.g., red channel values, green channel values, blue channel values) associated with the first network layer 22. Moreover, the context aggregation component 26 may be a downsample (e.g., pooling) layer that averages channel values in the first network layer 22. Additionally, the illustrated branch path 20 includes a plurality of FC layers 30 (30 a, 30 b) that conduct an importance classification of the aggregated context information and selectively exclude one or more channels in the first network layer 22 from consideration by the second network layer 24 based on the importance classification" [¶0019] " if the first network layer 22 has 256 output neurons, the context aggregation component 26 might provide a “blob” of 256 values to a first FC layer 30 a, where the first FC layer 30 a generates a high-level feature vector having 32 elements/output neurons (e.g., with the value of each output neuron indicating the likelihood of that neuron being activated). Additionally, a second FC layer 30 b may generate an importance score vector based on the high-level feature vector, where the importance score vector has 256 output neurons. The second FC layer 30 b may generally make higher level classifications than the first FC layer 30 a. The importance score vector may contain zero values for neurons in less important channels" Gorokhov explicitly teaches selectively specifying inactive neurons (nodes) and channels.  Pruning interpreted as synonymous with deleting.). 

	Regarding claim 10, the combination of Bansal, Gorokhov, and Stork teaches The method according to claim 9, further comprising, determining whether or not a size of the neural network from which the inactive node and the inactive channel have been deleted is a target size or less after the inactive node and the inactive channel are deleted, (Gorokhov [¶0020] " the branch path 20 may be considered a regularization technique. As will be discussed in greater detail, the post-training pruning may use either a fixed pruning ratio constraint or an “adversarial” balance between a layer width loss and an accuracy constraint." [¶0023] "the calculated loss may be considered to be a penalty for layer width...the layer width loss 70 minimizes the number of channels and results in a penalty if the number of channels does not comply with the pruning ratio constraint." Minimizing number of channels interpreted as synonymous with minimizing the size of a neural network.)
	causing each of the plurality of weight coefficients to be updated again in the neural network from which the inactive node and the inactive channel have been deleted when the size of the neural network is not the target size or less, (Stork [Col. 7 l. 23-40] "In summary, the method described has the following major steps: (1) Train the network to a minimum error using well established training procedures. (2) Compute H-1 from the known ANN topology, known activating functions (output nonlinearities) and synaptic weights. (3) Find the "q" that gives the smallest error increase as smaller than the total error, E, the qth weight should be deleted. Otherwise go to step 5. (4) Use the value of q from step 3 to update all weights by δw=-(w.sub.q /[H.sup.-1 ].sub.qq) H.sup.-1 ·e.sub.q (15) and return to step 2. (5) Because no more weights may be deleted without a large increase in E, the process ends or the overall cycle can be repeated by retraining the network (going to step 1)." Stork explicitly teaches updating weights after pruning network weights.  E is interpreted as synonymous with threshold for neural network target size.)
	and causing the inactive node and the inactive channel to be deleted. (Gorokhov [¶0018] " the context information includes channel values (e.g., red channel values, green channel values, blue channel values) associated with the first network layer 22. Moreover, the context aggregation component 26 may be a downsample (e.g., pooling) layer that averages channel values in the first network layer 22. Additionally, the illustrated branch path 20 includes a plurality of FC layers 30 (30 a, 30 b) that conduct an importance classification of the aggregated context information and selectively exclude one or more channels in the first network layer 22 from consideration by the second network layer 24 based on the importance classification" [¶0019] " if the first network layer 22 has 256 output neurons, the context aggregation component 26 might provide a “blob” of 256 values to a first FC layer 30 a, where the first FC layer 30 a generates a high-level feature vector having 32 elements/output neurons (e.g., with the value of each output neuron indicating the likelihood of that neuron being activated). Additionally, a second FC layer 30 b may generate an importance score vector based on the high-level feature vector, where the importance score vector has 256 output neurons. The second FC layer 30 b may generally make higher level classifications than the first FC layer 30 a. The importance score vector may contain zero values for neurons in less important channels" Gorokhov explicitly teaches selectively specifying inactive neurons (nodes) and channels.  Pruning interpreted as synonymous with deleting.). 
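
For illustration only, one prune-and-update cycle of the kind quoted from Stork (steps 3 and 4) can be sketched as follows. The weight update follows equation (15) as quoted above. The saliency expression w_q^2 / (2 [H^-1]_qq) is an assumption taken from the general Optimal Brain Surgeon literature, because the quoted passage does not reproduce it. The function name `obs_prune_step` and the example values are hypothetical.

```python
def obs_prune_step(weights, h_inv):
    """Select the weight with the smallest saliency and update the others.

    weights: list of weight values w.
    h_inv:   inverse Hessian H^-1 as a list of rows (assumed precomputed).
    """
    n = len(weights)
    # Assumed saliency: L_q = w_q^2 / (2 * [H^-1]_qq).
    saliencies = [weights[q] ** 2 / (2.0 * h_inv[q][q]) for q in range(n)]
    q = min(range(n), key=lambda i: saliencies[i])
    # Equation (15): delta_w = -(w_q / [H^-1]_qq) * H^-1 * e_q,
    # where H^-1 * e_q is simply column q of H^-1.
    scale = weights[q] / h_inv[q][q]
    updated = [weights[i] - scale * h_inv[i][q] for i in range(n)]
    updated[q] = 0.0  # the deleted weight is driven exactly to zero
    return q, saliencies[q], updated


# Hypothetical three-weight example with a symmetric inverse Hessian.
w = [1.0, 0.1, -0.5]
h_inv = [
    [2.0, 0.5, 0.0],
    [0.5, 1.0, 0.2],
    [0.0, 0.2, 4.0],
]
q, saliency, w_after = obs_prune_step(w, h_inv)
```

Here the second weight (index 1) has the smallest saliency, so it is deleted and the remaining two weights are adjusted to compensate.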

	Claims 11, 14, and 16 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Bansal and Gorokhov and in further view of Theodorakopoulos (US20180137417A1).

	Regarding claim 11, the combination of Bansal and Gorokhov teaches The method according to claim 2.
	However, the combination of Bansal and Gorokhov does not explicitly teach changing the regularization strength in accordance with a target deletion ratio, wherein, in the changing, the regularization strength is changed so that the regularization strength increases as the target deletion ratio increases.  

Theodorakopoulos, in the same field of endeavor, teaches The method according to claim 2, further comprising, changing the regularization strength in accordance with a target deletion ratio, wherein, in the changing, the regularization strength is changed so that the regularization strength increases as the target deletion ratio increases. ([¶0079] “In order to impose the desirable channel-wise sparsity, the primary loss function used during back-propagation it is augmented with a new term, which penalizes the use of convolutional kernels by adding an extra regularization term proportional to the number of kernels that are engaged in each forward propagation step” Target deletion ratio interpreted as synonymous with desirable channel-wise sparsity.  Theodorakopoulos explicitly teaches that the regularization term strength is proportional to the number of kernels in the network.  Adding an extra regularization term interpreted as synonymous with changing the regularization strength.). 

Bansal, Gorokhov, and Theodorakopoulos are all directed towards pruning neural networks.  Therefore, Bansal, Gorokhov, and Theodorakopoulos are analogous art in the same field of endeavor.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of Bansal and Gorokhov with the teachings of Theodorakopoulos by using a regularization term that is proportional to the target network sparsity.  Theodorakopoulos teaches as a motivation for combination ([¶0013] “An exemplary aspect is proposed in which the amount of computational resources used within a CNN is adapted to the input data, and where the CNN is able to learn to always use the minimum amount of computational resources. In addition, the amount of computational resources to be used can in this method be adapted to the system, by trading-off some of the recognition accuracy.”).

	Regarding claim 14, the combination of Bansal and Gorokhov teaches The method according to claim 3.
	However, the combination of Bansal and Gorokhov does not explicitly teach the activation function is hyperbolic tangent.  

Theodorakopoulos, in the same field of endeavor, teaches the activation function is hyperbolic tangent. ([¶0040] "Directly after the convolutions an additive bias and nonlinearity (sigmoidal, hyperbolic tangent etc.) or a rectified linear unit (RELU, leaky RELU etc.) is applied to each feature map (34, 39 in FIG. 1)."). 

Bansal, Gorokhov, and Theodorakopoulos are all directed towards pruning neural networks.  Therefore, Bansal, Gorokhov, and Theodorakopoulos are analogous art in the same field of endeavor.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of Bansal and Gorokhov with the teachings of Theodorakopoulos by using a hyperbolic tangent function as an activation function.  It would be obvious to one of ordinary skill in the art that a hyperbolic tangent function could be used as an activation function in a neural network.  This is further reinforced by Theodorakopoulos.  Theodorakopoulos further teaches as a motivation for combination ([¶0013] “An exemplary aspect is proposed in which the amount of computational resources used within a CNN is adapted to the input data, and where the CNN is able to learn to always use the minimum amount of computational resources. In addition, the amount of computational resources to be used can in this method be adapted to the system, by trading-off some of the recognition accuracy.”).

	Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Bansal and Gorokhov and in further view of Theodorakopoulos.

	Regarding claim 16, the combination of Bansal and Gorokhov teaches The method according to claim 2.
	However, the combination of Bansal and Gorokhov does not explicitly teach in the updating, the weight coefficient is updated by an algorithm of RMSprop.  

Theodorakopoulos, in the same field of endeavor, teaches in the updating, the weight coefficient is updated by an algorithm of RMSprop. ([¶0077] "using in one embodiment a back-propagation algorithm (e.g Stochastic Gradient Descend, AdaDelta, Adaptive Gradient, Adam, Nesterov's Accelerated Gradient, RMSprop etc.)"). 

Bansal, Gorokhov, and Theodorakopoulos are all directed towards pruning neural networks.  Therefore, Bansal, Gorokhov, and Theodorakopoulos are analogous art in the same field of endeavor.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of Bansal and Gorokhov with the teachings of Theodorakopoulos by using RMSprop to update the weights.  It would be obvious to one of ordinary skill in the art that RMSprop is a known algorithm that can be used to update weights in a neural network.  This is reinforced by Theodorakopoulos who teaches as an additional motivation for combination ([¶0013] “An exemplary aspect is proposed in which the amount of computational resources used within a CNN is adapted to the input data, and where the CNN is able to learn to always use the minimum amount of computational resources. In addition, the amount of computational resources to be used can in this method be adapted to the system, by trading-off some of the recognition accuracy.”).
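
For illustration only, a weight update by RMSprop as commonly described can be sketched as below. This is the generic form of the algorithm and is not taken from Theodorakopoulos or from the application: the function name `rmsprop_step`, the hyperparameter values, and the toy quadratic loss are all hypothetical.

```python
import math


def rmsprop_step(weights, grads, cache, lr=0.01, decay=0.9, eps=1e-8):
    """Apply one RMSprop update and return the new weights and cache."""
    # Running average of squared gradients, one entry per weight.
    new_cache = [decay * c + (1.0 - decay) * g * g
                 for c, g in zip(cache, grads)]
    # Each step is scaled by the root of that running average.
    new_weights = [w - lr * g / (math.sqrt(c) + eps)
                   for w, g, c in zip(weights, grads, new_cache)]
    return new_weights, new_cache


# Hypothetical toy loss f(w) = w0^2 + w1^2, whose gradient is 2 * w.
weights = [1.0, -2.0]
cache = [0.0, 0.0]
for _ in range(500):
    grads = [2.0 * w for w in weights]
    weights, cache = rmsprop_step(weights, grads, cache)
```

Because the step is normalized per weight, both weights in this sketch approach the minimum at a similar pace despite their different starting magnitudes.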

	Claim 13 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Bansal and Gorokhov and in further view of Lai (“PRUNE: Preserving Proximity and Global Ranking for Network Embedding”, 2017).  

	Regarding claim 13, the combination of Bansal and Gorokhov teaches The method according to claim 4.
	However, the combination of Bansal and Gorokhov does not explicitly teach wherein the activation function is ELU.  

Lai, in the same field of endeavor, teaches The method according to claim 4, wherein the activation function is ELU. ([p. 7 §4.1] "Model Setup. For all experiments, our model fixes node embedding and hidden layers to be 128-dimensional, proximity representation to be 64-dimensional. Exponential Linear Unit (ELU) [4] activation is adopted in hidden layers for faster learning"). 

	Bansal, Gorokhov, and Lai are all directed towards pruning neural networks.  Therefore, Bansal, Gorokhov, and Lai are analogous art in the same field of endeavor.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of Bansal and Gorokhov with the teachings of Lai by using an exponential linear unit.  Lai provides as a motivation for combination ([p. 7 §4.1] " Exponential Linear Unit (ELU) [4] activation is adopted in hidden layers for faster learning").
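
For illustration only, the Exponential Linear Unit referred to above is commonly defined as the identity for positive inputs and alpha * (exp(x) - 1) otherwise. The sketch below uses that common definition. The function name `elu` and the default `alpha=1.0` are assumptions and are not taken from Lai or from the application.

```python
import math


def elu(x, alpha=1.0):
    """Exponential Linear Unit under its common definition."""
    if x > 0:
        return x  # identity for positive inputs
    # Smoothly saturates toward -alpha for large negative inputs.
    return alpha * (math.exp(x) - 1.0)


# Hypothetical sample inputs spanning both branches.
samples = [-5.0, -1.0, 0.0, 0.5, 2.0]
activations = [elu(v) for v in samples]
```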

	Claim 20 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Gorokhov and Voss (US20210117948A1) and in further view of Bansal.

	Regarding claim 20, Gorokhov teaches An image recognition system, comprising: ([¶0014] "the input to the first network layer 22 holds raw pixel values of an image, where the first network layer 22 is a convolutional layer that extracts features (e.g., edges, curves, colors) from the image")
	an image acquiring unit that acquires an image; ([¶0014] "to extract certain features may be done during a training procedure in which known input images are fed to the neural network")
	the one or more processors is configured to execute: ([¶0030] " In the illustrated example, the system 100 includes one or more processors 102 (e.g., host processor(s), central processing unit(s)/CPU(s), vision processing units/VPU(s)) having one or more cores 104"  [¶0031] " The processor(s) 102 may execute instructions 120 (e.g., a specialized kernel inside a Math Kernel Library for Deep Learning Networks/MKL-DNN) retrieved from the system memory 108 and/or the mass storage 118 to perform one or more aspects of the method 80 (FIG. 4) and/or the method 90 (FIG. 5), already discussed.")
	specifying an inactive node and an inactive channel among a plurality of nodes and a plurality of channels included in the neural network, and ([¶0018] " the context information includes channel values (e.g., red channel values, green channel values, blue channel values) associated with the first network layer 22. Moreover, the context aggregation component 26 may be a downsample (e.g., pooling) layer that averages channel values in the first network layer 22. Additionally, the illustrated branch path 20 includes a plurality of FC layers 30 (30 a, 30 b) that conduct an importance classification of the aggregated context information and selectively exclude one or more channels in the first network layer 22 from consideration by the second network layer 24 based on the importance classification" [¶0019] " if the first network layer 22 has 256 output neurons, the context aggregation component 26 might provide a “blob” of 256 values to a first FC layer 30 a, where the first FC layer 30 a generates a high-level feature vector having 32 elements/output neurons (e.g., with the value of each output neuron indicating the likelihood of that neuron being activated). Additionally, a second FC layer 30 b may generate an importance score vector based on the high-level feature vector, where the importance score vector has 256 output neurons. The second FC layer 30 b may generally make higher level classifications than the first FC layer 30 a. The importance score vector may contain zero values for neurons in less important channels" Gorokhov explicitly teaches selectively specifying inactive neurons (nodes) and channels.)
	deleting the inactive node and the inactive channel from the neural network. ([¶0018] " the context information includes channel values (e.g., red channel values, green channel values, blue channel values) associated with the first network layer 22. Moreover, the context aggregation component 26 may be a downsample (e.g., pooling) layer that averages channel values in the first network layer 22. Additionally, the illustrated branch path 20 includes a plurality of FC layers 30 (30 a, 30 b) that conduct an importance classification of the aggregated context information and selectively exclude one or more channels in the first network layer 22 from consideration by the second network layer 24 based on the importance classification" [¶0019] " if the first network layer 22 has 256 output neurons, the context aggregation component 26 might provide a “blob” of 256 values to a first FC layer 30 a, where the first FC layer 30 a generates a high-level feature vector having 32 elements/output neurons (e.g., with the value of each output neuron indicating the likelihood of that neuron being activated). Additionally, a second FC layer 30 b may generate an importance score vector based on the high-level feature vector, where the importance score vector has 256 output neurons. The second FC layer 30 b may generally make higher level classifications than the first FC layer 30 a. The importance score vector may contain zero values for neurons in less important channels" Gorokhov explicitly teaches selectively specifying inactive neurons (nodes) and channels.  Pruning interpreted as synonymous with deleting.).
	However, Gorokhov does not explicitly teach a neural network that recognizes an object based on the acquired image; and 
	a control unit that executes a control process based on a recognition result output from the neural network, 
	wherein the neural network is optimized by a learning process with one or more processors, and updating each of a plurality of weight coefficients included in the neural network so that an objective function obtained by adding a basic loss function and an L2 regularization term multiplied by a regularization strength is minimized.

Voss, in the same field of endeavor, teaches a neural network that recognizes an object based on the acquired image; and ([Abstract] "computer-readable media performs automated product-recognition processes based on input image frames received from a photographic element").
	a control unit that executes a control process based on a recognition result output from the neural network, ([¶0023] "an object recognition device in communication with the controller and the imaging device" [¶0067] "The object recognition module may pass a plurality of classes (i.e., candidates or potentially matching products) to the transaction management module as output of the recognition process").
	wherein the neural network is optimized by a learning process with one or more processors, and ([¶0147] "model compression may be performed in the training pipeline for improved performance of the CNN 100 on low-power mobile device networks. More particularly, the CNN 100 may be compressed using Huffman Coding, by performing fine-tuned pruning (removing unnecessary connections from the computational graph) or a number of other steps"). 

	Gorokhov and Voss are both directed towards pruning neural networks used for images.  Therefore, Gorokhov and Voss are analogous art in the same field of endeavor.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of Gorokhov with the teachings of Voss by performing object recognition on the acquired image and executing a process based on the recognition. It would be obvious to one of ordinary skill in the art that a convolutional neural network can be used for object recognition and that a processor and control unit are commonly employed to perform the neural network processing.  Voss provides as a motivation for combination ([¶0104] “it should be noted that the preferred CNN 100 of FIG. 7 utilizes a plurality of hidden layer representations—corresponding to respective tasks of g1 to g4—to provide additional performance benefits when executed in low-power mobile electronic devices 20”).  

	However, the combination of Gorokhov and Voss does not explicitly teach updating each of a plurality of weight coefficients included in the neural network so that an objective function obtained by adding a basic loss function and an L2 regularization term multiplied by a regularization strength is minimized.

Bansal, in the same field of endeavor, teaches updating each of a plurality of weight coefficients included in the neural network so that an objective function obtained by adding a basic loss function and an L2 regularization term multiplied by a regularization strength is minimized, ([p. 2 §1] "deep NNs are typically trained to minimize an unconstrained objective" [p. 6 §3.2.1] "In the scalar two neuron chain with a single example, the traditional unconstrained training loss function with an L2 penalty (weight decay) takes the form [See Eqn 8]" Gamma interpreted as synonymous with regularization strength.). 

	Gorokhov, Bansal, and Voss are all directed towards pruning neural networks.  Therefore, Gorokhov, Bansal, and Voss are analogous art in the same field of endeavor.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of Bansal with the teachings of Gorokhov and Voss by pruning network channels.  The combination of Gorokhov and Voss teaches as a motivation for combination (Gorokhov [¶0074] “building block technology described herein may create lighter neural networks that solve computer vision and/or natural language processing problems in real-time on embedded processors. The technology may be used for hardware evaluation and design. Moreover, the execution time of particular network layers may be reduced considerably (e.g., 33% or more). The technology is also advantageous over knowledge distillation solutions that transfer knowledge from a large network to a smaller one, but fail to address redundancy or dynamic channel importance.”).  This motivation for combination also applies to the remaining claims depending on this combination.
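
For illustration only, an objective function formed by adding a basic loss function and an L2 regularization term multiplied by a regularization strength, together with a gradient-descent weight update that reduces it, can be sketched as follows. This is a minimal sketch under stated assumptions and does not reproduce Bansal's Eqn 8: the function names, the toy basic loss, the learning rate, and the strength value are all hypothetical.

```python
def l2_term(weights):
    """L2 regularization term: sum of squared weight coefficients."""
    return sum(w * w for w in weights)


def objective(basic_loss_value, weights, strength):
    """Basic loss plus the L2 term multiplied by the regularization strength."""
    return basic_loss_value + strength * l2_term(weights)


def update(weights, loss_grads, strength, lr=0.1):
    """One gradient-descent step on the regularized objective."""
    # The gradient of strength * w^2 with respect to w is 2 * strength * w.
    return [w - lr * (g + 2.0 * strength * w)
            for w, g in zip(weights, loss_grads)]


# Hypothetical basic loss (w0 - 1)^2; the second weight does not affect it.
def basic_loss(weights):
    return (weights[0] - 1.0) ** 2


def basic_loss_grads(weights):
    return [2.0 * (weights[0] - 1.0), 0.0]


weights = [0.0, 0.8]
strength = 0.1
for _ in range(200):
    weights = update(weights, basic_loss_grads(weights), strength)
final_objective = objective(basic_loss(weights), weights, strength)
```

In this sketch the weight that the basic loss does not depend on decays toward zero under the L2 term alone, while the other weight settles slightly below 1.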

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Lin (“Runtime Neural Pruning”, 2017) discloses channel-wise neural network pruning.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SIDNEY VINCENT BOSTWICK whose telephone number is (571)272-4720.  The examiner can normally be reached on M-F 7:30am-5:00pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang, can be reached on (571)270-7092.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/SB/Examiner, Art Unit 2124

/LUIS A SITIRICHE/Primary Examiner, Art Unit 2126