Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Election/Restriction
REQUIREMENT FOR UNITY OF INVENTION
As provided in 37 CFR 1.475(a), a national stage application shall relate to one invention only or to a group of inventions so linked as to form a single general inventive concept (“requirement of unity of invention”). Where a group of inventions is claimed in a national stage application, the requirement of unity of invention shall be fulfilled only when there is a technical relationship among those inventions involving one or more of the same or corresponding special technical features. The expression “special technical features” shall mean those technical features that define a contribution which each of the claimed inventions, considered as a whole, makes over the prior art.
The determination whether a group of inventions is so linked as to form a single general inventive concept shall be made without regard to whether the inventions are claimed in separate claims or as alternatives within a single claim. See 37 CFR 1.475(e).
When Claims Are Directed to Multiple Categories of Inventions:
As provided in 37 CFR 1.475 (b), a national stage application containing claims to different categories of invention will be considered to have unity of invention if the claims are drawn only to one of the following combinations of categories:
(1) A product and a process specially adapted for the manufacture of said product; or
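(2) A product and a process of use of said product; or
(3) A product, a process specially adapted for the manufacture of the said product, and a use of the said product; or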
(4) A process and an apparatus or means specifically designed for carrying out the said process; or
(5) A product, a process specially adapted for the manufacture of the said product, and an apparatus or means specifically designed for carrying out the said process.
Otherwise, unity of invention might not be present. See 37 CFR 1.475 (c).
Restriction is required under 35 U.S.C. 121 and 372.
This application contains the following inventions or groups of inventions which are not so linked as to form a single general inventive concept under PCT Rule 13.1. 
In accordance with 37 CFR 1.499, applicant is required, in reply to this action, to elect a single invention to which the claims must be restricted.
Group I, Claims 1-25, 445, and 446, drawn to a computer[-implemented method of/system for] restricting learning for a neural network … comprising: training … the neural network on a training data set; and adding … a relaxation term to a back-propagated derivative of an objective function with respect to a [computed value (Claim 1)/learned parameter (Claim 445)/connection weight (Claim 7)] … the relaxation term adding a penalty to a cost function …according to whether the [computed value (Claim 1)/learned parameter (Claim 445)/connection weight (Claim 7)] for the first and second nodes diverge from each other.
Group II, Claims 26-45, drawn to a computer[-implemented method of/system for] restricting learning for a neural network … comprising: training … the neural network on a training data set; and adding … a relaxation term to a back-propagated derivative of an objective function with respect to an [activation value] … the relaxation term adding a penalty to a cost function …according to whether the [activation value] for the first and second nodes diverge from each other.

The groups of inventions listed above do not relate to a single general inventive concept under PCT Rule 13.1 because, under PCT Rule 13.2, they lack the same or corresponding special technical features for the following reasons:
Groups I and II lack unity of invention because even though the inventions of these groups require the technical feature of a computer[-implemented method of/system for] restricting learning for a neural network … comprising: training … the neural network on a training data set; and adding … a relaxation term to a back-propagated derivative of an objective function with respect to a [computed value] … the relaxation term adding a penalty to a cost function …according to whether the [computed value] for the first and second nodes diverge from each other, this technical feature is not a special technical feature as it does not make a contribution over the prior art in view of Wang et al., “Training Compressed Fully-Connected Networks with a Density-Diversity Penalty.”  Specifically, Wang teaches Claim 1, see the rejection of Claim 1 in the 35 U.S.C. 102 section below.  
Group I contains the special technical feature wherein the penalty is based on divergence of computed values which are specified as connection weights/learned parameters, (see Claims 11 and 445) which is a different special technical feature than Group II, wherein the penalty is based on whether activation values of nodes diverge from each other.  The difference between a penalty function based on divergence of weights versus divergence of activations is .
During a telephone conversation with Mark Knedeisen on March 5th, 2021, a provisional election was made without traverse to prosecute the invention of Group I, Claims 1-25, 445, and 446.  Affirmation of this election must be made by applicant in replying to this Office action.  Claims 26-45 are withdrawn from further consideration by the examiner, 37 CFR 1.142(b), as being drawn to a non-elected invention.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1-6, 8-18, 20-24, 445, and 446 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Wang, “Training Compressed Fully-Connected Networks with a Density-Diversity Penalty.”  Chen et al., “MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems” is relied upon to demonstrate inherency in Wang.
Regarding Claim 1, Wang teaches a computer-implemented method (Wang, pg. 5, 2nd-to-last paragraph, “For our implementation, we start with the mxnet package” with Chen, Abstract, “MXNet is a multi-language machine learning (ML) library” i.e. a software package) of restricting learning by a neural network (Wang, Abstract, “We proposed a new ‘density-diversity penalty’ regularizer that can be applied to fully-connected layers of neural networks during training.  We show that using this regularizer results in significantly fewer parameters” where “results in significantly fewer parameters” denotes restricted), wherein the neural network comprises a first node (Wang, pg. 2, last paragraph, “where $W_j$ denotes the weight matrix of layer $j$” where layers in a neural network are made of nodes; i.e. the first node is any node in some layer), the method comprising:  training, by a computer system, the neural network on a training data set (Wang, title, “Training compressed fully-connected networks” & pg. 5, 2nd-to-last paragraph, “We apply the density-diversity penalty to the fully connected layers of the models on both the MNIST (computer vision) and TIMIT (speech recognition) datasets”) and adding, by the computer system during training, a relaxation term to a back-propagated derivative of an objective function with respect to a computed value of each of the first node of the neural network and a second node (Wang, pg. 4, Eq. (3), $\frac{\partial DP(W_j)}{\partial W_j(a,b)}$ is the relaxation term that is added to the derivative/gradient of the objective function $L(\hat{y}, y)$, i.e. the gradient of the cost function on pg. 3, Eq. (2) includes the derivative of the objective function and the derivative of the penalty function $DP(W_j)$ and is taken with respect to the weights $W_j(a,b)$/a computed value of each node in the layer; the gradient is a back-propagated derivative because Wang states, pg. 5, 2nd-to-last paragraph, “we modified [the mxnet package] by changing the weight update code to include our density-diversity penalty” where the “weight-update code” of Chen/mxnet performs weight-update by back-propagating derivatives, see Chen, pg. 3, 3rd paragraph, “given a symbolic neural network and the weight updating function … we can implement the gradient descent by … net.forward_backward()” with, 1st paragraph, “symbolic differentiation (‘backward’)”), the relaxation term adding a penalty to a cost function of each of the computed value of the first node and the computed value of the second node according to whether the computed values for the first and second nodes diverge from each other (Wang, pg. 3, Eq. (2), the density-diversity penalty is added to the loss function according to whether weights/the computed value of different nodes of the network diverge, i.e. the difference $W_j(a,b) - W_j(a',b')$ for every pair of edges in the layer, where $\frac{\partial DP(W_j)}{\partial W_j(a,b)}$ is computed for each edge $(a,b)$, where edges are each associated with a node “a” in the layer, thus whether the weights/computed values for the first and second nodes diverge from each other).
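The mapping relied upon in the rejection of Claim 1 can be illustrated with a minimal Python sketch of how a penalty on pairwise weight differences contributes an extra term to a back-propagated gradient. The sketch assumes the penalty is simply the sum of absolute differences over unordered pairs of weights in one layer, which covers only the pairwise-difference portion of the penalty discussed above; the exact form of Wang's Eq. (2) and Eq. (3) may differ. The names `pairwise_penalty`, `pairwise_penalty_grad`, and `add_relaxation_term`, and the flat-list representation of the layer weights, are illustrative assumptions and are not taken from Wang or from mxnet.

```python
from itertools import combinations


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def pairwise_penalty(weights):
    """Sum of |w_i - w_k| over all unordered pairs of weights (assumed form)."""
    return sum(abs(wi - wk) for wi, wk in combinations(weights, 2))


def pairwise_penalty_grad(weights):
    """Subgradient of pairwise_penalty with respect to each weight.

    For weight w_i this is the sum over k of sign(w_i - w_k); equal weights
    contribute zero, so a weight is only pushed by weights it diverges from.
    """
    return [sum(sign(wi - wk) for wk in weights) for wi in weights]


def add_relaxation_term(loss_grad, weights):
    """Add the penalty subgradient to an already back-propagated loss gradient."""
    return [g + p for g, p in zip(loss_grad, pairwise_penalty_grad(weights))]


# Hypothetical example: two weights already agree, one diverges.
example_weights = [1.0, 1.0, 3.0]
example_loss_grad = [0.5, 0.5, 0.5]
combined = add_relaxation_term(example_loss_grad, example_weights)
```

In this sketch the penalty is zero only when all weights are equal, which mirrors the reading that the term penalizes weights according to whether they diverge from each other.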
Regarding Claim 2, Wang teaches the method of Claim 1 (and thus the rejection of Claim 1 is incorporated).  Wang further teaches controlling, by the computer system, a weight of the relaxation term via a hyperparameter (Wang, pg. 3, Eq. (2), where $\lambda_j$ is a weight hyperparameter for the penalty term).
Regarding Claim 3, Wang teaches the method of Claim 2 (and thus the rejection of Claim 2 is incorporated).  By inspection of Wang pg. 3, Eq. (2), $\lambda_j$ is a multiplicative scale factor applied to the relaxation term.
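The role of the hyperparameter in Claims 2 and 3 can be sketched in a few lines of Python. The sketch assumes the training cost has the form loss plus lambda times penalty, so that the gradient of the cost is the loss gradient plus lambda times the penalty gradient. The function name `cost_gradient`, the parameter `lam`, and the numeric values are hypothetical and chosen only for illustration.

```python
def cost_gradient(loss_grad, penalty_grad, lam):
    """Gradient of (loss + lam * penalty), given the two component gradients."""
    return [g + lam * p for g, p in zip(loss_grad, penalty_grad)]


# Hypothetical component gradients for a two-weight layer.
loss_grad = [0.5, -0.25]
penalty_grad = [1.0, -1.0]

# lam = 0 switches the penalty off; a larger lam gives it more weight.
no_penalty = cost_gradient(loss_grad, penalty_grad, lam=0.0)
with_penalty = cost_gradient(loss_grad, penalty_grad, lam=0.5)
```

Because `lam` multiplies the penalty gradient directly, it acts both as the weight of the relaxation term and as a multiplicative scale factor on it.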
Regarding Claim 4, Wang teaches the method of Claim 1 (and thus the rejection of Claim 1 is incorporated).  As described in the rejection of Claim 1, the first and second node are any two nodes in a particular layer of the neural network, thus the neural network comprises the second node.
Regarding Claim 5, Wang teaches the method of Claim 1 (and thus the rejection of Claim 1 is incorporated).  As described in the rejection of Claim 1, the first and second node are any two nodes in a particular layer of the neural network, thus the neural network comprises a first neural network.  Since any subset of nodes of the first neural network (say, the first j j) is also a neural network, it is true that a second neural network comprises the second node.
Regarding Claim 6, Wang teaches the method of Claim 1 (and thus the rejection of Claim 1 is incorporated).  Wang further teaches adding, by the computer system during training, a second relaxation term to a back-propagated derivative of an objective function with respect to a computed value of each of the first node and a third node, the relaxation term adding a penalty to a cost function of each of the first node and the third node according to whether the computed values for the first and third nodes diverge from each other (Wang, pg. 3, Eq. (2), the difference between weights $W_j(a,b) - W_j(a',b')$ is represented and summed over every pair of edges in the layer, see the summation over a, b, a’, and b’, thus for each of the first node, the second node, and a third node).
Regarding Claim 8, Wang teaches the method of Claim 1 (and thus the rejection of Claim 1 is incorporated).  Wang further teaches wherein the relaxation term is added to the back-propagated derivative of the objective function with respect to the computed value of each of the first node and the second node for a subset of data examples in the training data set (Wang, pg. 4, 3rd paragraph, “In practice, to further reduce computational cost, for every mini-batch, we only apply the density-diversity penalty with a certain small probability”).
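The passage of Wang quoted in the rejection of Claim 8 describes applying the penalty only to some mini-batches. A minimal Python sketch of that idea follows, under the assumption that the decision is an independent random draw per mini-batch; the names `penalty_schedule` and `training_step`, the seed, and the update rule (plain gradient descent) are illustrative assumptions, not Wang's or mxnet's code.

```python
import random


def penalty_schedule(num_batches, apply_prob, seed=0):
    """Decide, per mini-batch, whether the penalty gradient is applied."""
    rng = random.Random(seed)
    return [rng.random() < apply_prob for _ in range(num_batches)]


def training_step(weights, loss_grad, penalty_grad, lr, lam, apply_penalty):
    """One gradient-descent update; the penalty term is optional per batch."""
    updated = []
    for w, g, p in zip(weights, loss_grad, penalty_grad):
        total = g + (lam * p if apply_penalty else 0.0)
        updated.append(w - lr * total)
    return updated


# Hypothetical run: the penalty is applied to roughly 10% of mini-batches.
schedule = penalty_schedule(num_batches=100, apply_prob=0.1)
```

Mini-batches for which the schedule entry is true form the subset of data examples that receive the relaxation term; all other mini-batches are trained on the loss gradient alone.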
Regarding Claim 9, Wang teaches the method of Claim 8 (and thus the rejection of Claim 8 is incorporated).  Wang further teaches a classification category into which the training data set has been divided (Wang, pg. 6, 2nd paragraph, “The MNIST dataset consists of hand-written digits … and there are 10 classes of labels”) and the subset of data examples, in any particular category, that have been randomly selected, is a subset of data examples in the training dataset which corresponds to a classification category into which the training data set has been divided (note, this is a different subset than that used in the rejection of Claim 8, but which still fulfills all the requirements of Claim 8).
Regarding Claim 10, Wang teaches the method of Claim 8 (and thus the rejection of Claim 8 is incorporated).  Wang further teaches a data cluster into which the training data set has been divided (Wang, pg. 6, 2nd paragraph, “The MNIST dataset consists of hand-written digits containing 60000 training data points and 10000 test data points”) and the subset of data examples, out of the 60000 points, that have been randomly selected, is a subset of data examples in the training dataset which corresponds to a data cluster into which the training data set has been divided by a machine learning system according to cluster assignment values (where the training data set is all 70000 data points, divided into two sets/clusters by the machine learning system/invention of Wang according to whether the data points have been assigned to the testing/training set, i.e. cluster assignment values, note, this is a different subset than that used in the rejection of Claim 8, but which still fulfills all the requirements of Claim 8).
Regarding Claim 11, Wang teaches the method of Claim 1 (and thus the rejection of Claim 1 is incorporated).  The rejection of Claim 1 has already identified the computed value as a connection weight of each of the first node and the second node (Wang, pg. 3, Eq. (2), the density-diversity penalty is added to the loss function according to whether weights/the computed value of different nodes of the network diverge).
Regarding Claim 12, Wang teaches the method of Claim 1 (and thus the rejection of Claim 1 is incorporated).  Wang further teaches wherein the relaxation term requires that the connection weights of the first and second node be equal to avoid a penalty (Wang, pg. 3, Eq. (2), a penalty is applied unless $W_j(a,b) - W_j(a',b')$ is zero, i.e. the connection weights of the first and second node are equal).
Regarding Claim 445, Wang teaches the method of Claim 1 (and thus the rejection of Claim 1 is incorporated).  The rejection of Claim 1 has already identified the computed value as a connection weight (Wang, pg. 3, Eq. (2), the density-diversity penalty is added to the loss function according to whether weights/the computed value of different nodes of the network diverge), wherein the weights are learned parameters (Wang, Abstract, “the trained weight matrices” where “trained” denotes learned).

Claims 13-18, 20-24, and 446 recite a computer system … comprising: one or more processor cores; [and] one or more memories coupled to the one or more processor cores, the one or more memories storing … instructions that, when executed by the one or more processor cores, cause the system to execute the methods of Claims 1-6, 8-12, and 445, respectively.  As Wang executes their method on a computer (Wang, pg. 5, 2nd-to-last paragraph, “For our implementation, we start with the mxnet package” with Chen, Abstract, “MXNet is a multi-language machine learning (ML) library” i.e. a software package), in which these features are inherent (and the neural network and training data are stored in a memory), Claims 13-18, 20-24, and 446 are rejected for reasons set forth in the rejections of Claims 1-6, 8-12, and 445, respectively.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 7 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Wang, “Training Compressed Fully-Connected Networks with a Density-Diversity Penalty.”
Regarding Claim 7, Wang teaches the method of Claim 1 (and thus the rejection of Claim 1 is incorporated).  The exact invention of Wang adds the penalty term to only a subset of the training examples (Wang, pg. 4, 3rd paragraph, “In practice, to further reduce computational cost, for every mini-batch, we only apply the density-diversity penalty with a certain small probability”) but this section of Wang implies that, if desired, one could train a neural network wherein the relaxation term is added to the back-propagated derivative of the objective function with respect to the computed value of the first node and the second node for each data example in the training data set (Wang, pg. 4, 3rd paragraph, “the sorting trick is efficient for calculating the gradient … in practice, to further reduce computational cost” implies that one would think to apply the penalty to every training example, that doing so is not inventive).  The citation “This [the approach actually implemented by Wang] still effectively forces the values of the weight matrix to collapse” (Wang, pg. 4, 3rd paragraph) implies that applying the penalty to every training example is the ideal, that one with sufficient computational resources would apply if they desired to get the best results, thus one of ordinary skill in the art, before the effective filing date of the claimed invention, would be motivated to train on every example in order to achieve the best weight compression.

Claim 19 recites a computer system … comprising: one or more processor cores; [and] one or more memories coupled to the one or more processor cores, the one or more memories storing … instructions that, when executed by the one or more processor cores, cause the system to execute the method of Claim 7.  As Wang executes their method on a computer (Wang, pg. 5, 2nd-to-last paragraph, “For our implementation, we start with the mxnet package” with Chen, Abstract, “MXNet is a multi-language machine learning (ML) library” i.e. a software package), in which these features are inherent (and the neural network and training data are stored in a memory), Claim 19 is rejected for reasons set forth in the rejection of Claim 7.

Claim 25 is rejected under 35 U.S.C. 103 as being unpatentable over Wang, in view of Kadav, US PG Pub 2017/0091668.
Regarding Claim 25, Wang teaches the computer system of Claim 13 (and thus the rejection of Claim 13 is incorporated).  Wang does not teach one or more processors and one or more memories distributed across a plurality of computer nodes interconnected via connections having varying data bandwidths, nor transmitting data between the computer nodes according to the bandwidth associated with respective connections between the computer nodes.  However, Kadav teaches a distributed system for training a neural network (Kadav, [0023], “MALT provides fault tolerance, network efficiency, and speedup to … neural networks”) using a plurality of computer nodes interconnected having varying bandwidths wherein the one or more processor cores and the one or more memories are distributed across the computer nodes (Kadav, Fig. 7) wherein the memory of each of the plurality of computer nodes stores instructions, that, when executed by the one or more processor cores, cause the computer nodes to transmit data between the computer nodes according to the data bandwidth associated with respective connections between the computer nodes (Kadav, Fig. 7 & [0013], “an exemplary system that communicates more frequently with machines connected via high bandwidths and occasionally with machines connected over low bandwidths”).  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to train a neural network using the procedure and cost function of Wang on the distributed machine learning system of Kadav.  The motivation to do so is that “MALT provides fault tolerance, network efficiency, and speedup to … neural networks” (Kadav, [0023]).
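The bandwidth-dependent communication cited from Kadav [0013] can be pictured with a short Python sketch. It assumes a simple rule in which peers on fast links are contacted at every training step and peers on slow links only at every few steps. The function `peers_to_sync`, the threshold, the interval, and the node names are all hypothetical; none of this is MALT's actual interface.

```python
def peers_to_sync(step, peer_bandwidth, threshold, slow_interval):
    """Return the peers to exchange updates with at this training step.

    Peers on links at or above `threshold` are contacted every step; peers
    on slower links only when `step` is a multiple of `slow_interval`.
    """
    selected = []
    for peer, bandwidth in sorted(peer_bandwidth.items()):
        if bandwidth >= threshold or step % slow_interval == 0:
            selected.append(peer)
    return selected


# Hypothetical cluster: link bandwidths in Gbit/s.
links = {"node_a": 40.0, "node_b": 40.0, "node_c": 1.0}

frequent = peers_to_sync(step=1, peer_bandwidth=links, threshold=10.0, slow_interval=5)
occasional = peers_to_sync(step=5, peer_bandwidth=links, threshold=10.0, slow_interval=5)
```

Under these assumptions, `node_c` on the slow link is included only at every fifth step, while the two fast peers are included at every step.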
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure:  Barber, “Deep Learning: Autodiff, Parameter Tying and Backprop Through Time” teaches training a neural network by persuading weights to not diverge from each other.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to BRIAN M SMITH whose telephone number is (469)295-9104.  The examiner can normally be reached on Monday - Friday, 8:30am - 5pm Central.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki can be reached on (571) 272-3719.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/BRIAN M SMITH/Examiner, Art Unit 2122