Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Objections
Claims 1, 2, 7, and 12 are objected to because of the following informalities:  
In claims 1, 7, and 12, “the computed new parameter matrix” should be “the calculated new parameter matrix” because the antecedent expression is “calculating”. 
In claim 2, “the object function” should be “the objective function”. For purposes of examination, this objected-to claim term has been interpreted as having the meaning of the suggested revision.
In claim 7, second-to-last line, “reconstruct” should be “reconstructing”.
Appropriate correction is required.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1-12 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
In claims 1, 7, and 12, the limitation of “the group” (which is recited in the last paragraph of claim 1 and corresponding parts of the other claims) lacks antecedent basis. For purposes of examination, “the group” has been interpreted as “a group.”
In claims 1, 7, and 12, the limitation of “the computed split parameters” (which is recited in the last paragraph of claim 1 and corresponding parts of the other claims) lacks antecedent basis. For purposes of examination, “the computed split parameters” has been interpreted as “computed split parameters.”
In claim 4, the term “excessive” in the expression “that prevents a size of one group from being excessive” is a relative term which renders the claim indefinite. The term “excessive” is not defined by the claim, the specification does not provide a standard for ascertaining the requisite degree, and one of ordinary skill in the art would not be reasonably apprised of the scope of the invention. For purposes of examination, the entire expression of “that prevents a size of one group from being excessive” has been interpreted to have the meaning of “that regularizes against a difference between a size of one group and a size of another group” based on paragraphs 150-151 of the specification and the understanding that the instant expression is describing a term used in a regularization scheme that imposes a penalty when group size is unbalanced. In other words, for purposes of examination, “excessive” has been interpreted to cover an amount that is large compared to the size of another group in the context of regularization. This part of the rejection can be overcome by amending the above expression in the manner recited above.
Dependent claims 2-6 and 8-11 are also rejected for the same reasons given for parent claims 1 and 7, since these dependent claims incorporate the indefinite recitations of their parent claims without curing the deficiencies thereof.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

1.	Claims 1, 3, 5-7, 9, and 11-12 are rejected under 35 U.S.C. 103 as being unpatentable over Mao et al., “MoDNN: Local distributed mobile computing system for Deep Neural Network,” Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017, March 27-31, 2017, pp. 1396-1401 (“Mao”) in view of Alvarez-Icaza Rivera et al. (US 2015/0324684 A1) (“Alvarez-Icaza Rivera”) and Scardapane et al., “Group Sparse Regularization for Deep Neural Networks,” arXiv:1607.00485v1 [stat.ML] 2 Jul 2016. (“Scardapane”).
As to claim 1, Mao teaches a method for optimizing a trained model, comprising:
initializing a parameter matrix [Page 1399, Algorithm 2, input definition: “Weight matrix A”, which is partitioned to “$W_{[i]}\; i = 0, 1, \ldots k$” (see output definition of the algorithm).] and a plurality of split variables [Page 1399, left column, first full paragraph: “In the weight partition scheme of FLs in MoDNN, a clustering algorithm is leveraged to group the nonzero weights into several clusters and minimize the number of the nonzero weights outside the clusters.” These clusters are represented as C[1] to C[k] in Algorithm 2, line 5. The variables corresponding to the input or output neurons of the weight matrix A, which are clustered and split based on clusters as described in the sections described below, or the cluster groups of these variables are considered to read on the instant limitation.] of a trained model configured of a plurality of layers; [Abstract: “MoDNN can partition already trained DNN models onto several mobile devices to accelerate DNN computations by alleviating device-level computing cost and memory usage.” Since the model is a DNN, i.e., a deep neural network, the model has a plurality of layers. See also § III.E (page 1400), which teaches that “given a trained DNN, the model processor scans each layer and identify their type” and § V (page 1401), which teaches plural “fully connected layers.”]
calculating a new parameter matrix […] for the plurality of split variables and the trained model [Page 1399, left column, first full paragraph: “In the weight partition scheme of FLs in MoDNN, a clustering algorithm is leveraged to group the nonzero weights into several clusters and minimize the number of the nonzero weights outside the clusters.” This operation creates a new parameter matrix with clustered weights, as shown in FIG. 5(b)] […]; and
vertically splitting the plurality of layers according to the group based on the computed split parameters [Page 1399, left column, second full paragraph, last sentence: “k dense clusters are generated and the corresponding input neurons are transmitted to the assigned worker nodes for parallel executions.” The separate clusters that are determined in the modified spectral co-cluster (MSCC) algorithm are regarded as “split parameters.” See Algorithm 2, line 5, which teaches clusters C[1] to C[k]. With respect to the limitation of splitting a plurality of layers, § III.E (page 1400) teaches that “Given a trained DNN, the model processor scans each layer and identify their type” and § V (page 1401) teaches plural “fully connected layers.” With respect to the limitation of “the group,” the aforementioned clustering forms groups.] and reconstructing the trained model using the computed new parameter matrix as parameters of the vertically split layers. [§ IIIV.B: “each worker node is mapped with a part of the layer inputs and the outputs are reduced back to the GO, which generates the inputs of the new layer in the following map procedure.” The Examiner notes that applicant’s specification describes the act of “reconstructing” as using the neural network after distribution to the separate nodes. See paragraph 59 of applicant’s specification. Since the instant reference teaches that “outputs are reduced back to the GO,” the reference teaches that the neural network layers that have been divided among the worker nodes are organized to function collectively, thereby teaching the instant limitation of “reconstructing.”]
Mao does not specifically teach the following:
(1)	the new parameter matrix has “a block-diagonal matrix” [The Examiner notes that FIG. 5(b) of Mao teaches that the weights are clustered to form a diagonal pattern. However, since the weight matrix includes non-zero elements outside of the diagonal, it does not fully amount to a “block-diagonal matrix.”]
(2)	“to minimize a loss function for the trained model” and calculation of “a weight decay regularization term, and an objective function including a split regularization term defined by the parameter matrix and the plurality of split variables.”
Alvarez-Icaza Rivera, in an analogous art, teaches limitation (1) listed above. Alvarez-Icaza Rivera teaches “neuromorphic hardware for neuronal computation” (see title) that includes “a distributed and parallel set of neurosynaptic core circuits” (see claim 5 of Alvarez-Icaza Rivera). Therefore, Alvarez-Icaza Rivera is in the same field of endeavor as the claimed invention, namely machine learning. 
In particular, Alvarez-Icaza Rivera teaches a parameter matrix having “a block-diagonal matrix” [[0095]: “FIG. 9A illustrates an example synaptic weight matrix S, in accordance with an embodiment of the invention. The synaptic weight matrix S for a computing system is an Nx×Nn block diagonal matrix comprising multiple block diagonal submatrices Ssub positioned along a diagonal 450 of the synaptic weight matrix S. Each submatrix Ssub is implemented using a corresponding core circuit 10. Therefore, each submatrix Ssub is an Ax×An block diagonal matrix. If the computing system comprises C core circuits 10, the synaptic weight matrix S comprises C submatrices Ssub.”]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Mao with the teachings of Alvarez-Icaza Rivera by modifying the new parameter matrix to have a block-diagonal matrix. The motivation would have been to compute a weight matrix in a format that can be implemented in a plurality of different circuits, as suggested by Alvarez-Icaza Rivera, paragraph [0095] (“Each submatrix Ssub is implemented using a corresponding core circuit 10”).
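For illustration only, the distinction drawn above between a clustered weight matrix that retains nonzero entries outside its clusters (Mao, FIG. 5(b)) and a block-diagonal matrix (Alvarez-Icaza Rivera, [0095]) can be sketched in Python as follows. The helper name `is_block_diagonal`, the group-label arrays, and the example matrices are hypothetical and do not appear in either reference; this is a minimal sketch under those assumptions, not an implementation of either reference.

```python
import numpy as np

def is_block_diagonal(weights, row_groups, col_groups):
    """Hypothetical helper: True if every nonzero entry of `weights`
    connects a row and a column that carry the same group label."""
    weights = np.asarray(weights)
    row_groups = np.asarray(row_groups)
    col_groups = np.asarray(col_groups)
    # same_group[i, j] is True when row i and column j share a group label.
    same_group = row_groups[:, None] == col_groups[None, :]
    # Any nonzero weight outside a shared group breaks block-diagonality.
    return not np.any((weights != 0) & ~same_group)

# Assumed group labels: two groups of two rows and two columns each.
row_groups = [0, 0, 1, 1]
col_groups = [0, 0, 1, 1]

# All nonzero weights lie inside the two diagonal blocks.
block_diag = np.array([[1.0, 2.0, 0.0, 0.0],
                       [3.0, 4.0, 0.0, 0.0],
                       [0.0, 0.0, 5.0, 6.0],
                       [0.0, 0.0, 7.0, 8.0]])

# Same matrix with one nonzero "outlier" weight outside the blocks.
clustered_with_outlier = block_diag.copy()
clustered_with_outlier[0, 3] = 0.5

print(is_block_diagonal(block_diag, row_groups, col_groups))
print(is_block_diagonal(clustered_with_outlier, row_groups, col_groups))
```

In this sketch, a single nonzero entry outside the diagonal blocks is enough for the check to fail, which mirrors the reason given above for why the clustered matrix of Mao does not fully amount to a block-diagonal matrix.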
Scardapane, in an analogous art, teaches the remaining limitations (2) listed above. Scardapane generally teaches “group sparse regularization for deep neural networks” and is therefore in the same field of endeavor as the claimed invention, namely machine learning.
In particular, Scardapane teaches “to minimize a loss function for the trained model” [§ II, paragraph 1: “The network is trained by minimizing a standard regularized cost function…[see equation (2) in original text]…where L(⋅,⋅) is a proper cost function, R(⋅) is used to impose regularization.” § IV paragraph 1: “the networks are trained using the popular Adam algorithm, a derivation of stochastic gradient descent with both adaptive step sizes and momentum. Specifically, we minimize the loss function in (2) with the standard cross-entropy loss…and multiple choices for the regularization penalty.”] and calculation of “a weight decay regularization term” [§ II, paragraph 3: “the second most common approach to regularize the network, inspired by the Lasso algorithm, is to penalize the absolute magnitude of the weights: [see equation (4) in original text defining $R_{\ell_1}(w)$].” The Examiner notes that while § III, paragraph 2 uses the term “weight decay” for the $\ell_2$ norm, the $\ell_1$ norm of equation (4) can also be regarded as a “weight decay regularization term,” since it penalizes higher weights similar to the $\ell_2$ norm.] “and an objective function including a split regularization term defined by the parameter matrix and the plurality of split variables.” [Page 4, left column, third full paragraph (§ III.A) teaches an objective function in the form of equation (7) for $R_{\mathrm{SGL}}(w)$. This function includes a split regularization term as described in page 4, left column, second full paragraph (in § III.A): “Group sparse regularization can be written as … $R_{\ell_{2,1}}(w)$” where $R_{\ell_{2,1}}(w)$ is a regularization term that is defined by the weight matrix w and therefore also defined by the variables of this matrix. Note that this regularization term is used in conjunction with $R_{\ell_1}(w)$ as shown in equation (7), described as the “composite sparse group Lasso” in the paragraph above equation (7). With respect to the claim language of “split regularization,” neither the claim nor the specification requires a particular definition of this term. Furthermore, the claim does not require the “split regularization term” to have any specific functionality or be used to create any particular result. Accordingly, “split” has been interpreted to refer to the use of the regularization term in a splitting operation. Therefore, since the context of splitting is already taught by Mao, this $R_{\ell_{2,1}}(w)$ term is considered to be a split regularization term when incorporated into the method of Mao, especially given that $R_{\ell_{2,1}}(w)$ is used to enforce sparsity and sparsity is used for layer partitioning in primary reference Mao.]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Mao and Alvarez-Icaza Rivera with the teachings of Scardapane by modifying the calculation of the new parameter matrix to be for the plurality of split variables and the trained model “to minimize a loss function for the trained model” and by performing the further operation of calculating “a weight decay regularization term, and an objective function including a split regularization term defined by the parameter matrix and the plurality of split variables.” The motivation would have been to “perform pruning and feature selection while optimizing the weights of a neural network” (Scardapane, § VI, paragraph 1).
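For illustration only, the ℓ1, ℓ2, and ℓ2,1 penalties discussed in the rejection of claim 1, and their combination into a sparse group Lasso term, can be sketched in Python as follows. The function names, the choice of treating each row of the weight matrix as one group, and the square-root group-size scaling are assumptions made for this sketch (the scaling is a common convention for group Lasso penalties); the sketch is not asserted to reproduce equations (4) or (7) of Scardapane exactly.

```python
import numpy as np

def l1_penalty(w):
    """Sum of absolute weight values (Lasso-style penalty)."""
    return float(np.sum(np.abs(w)))

def l2_penalty(w):
    """Sum of squared weight values (the penalty commonly called weight decay)."""
    return float(np.sum(np.square(w)))

def group_sparse_penalty(w):
    """l2,1-style penalty. Assumption: each row of `w` is one group, and each
    group's l2 norm is scaled by the square root of the group size."""
    group_size = w.shape[1]
    row_norms = np.linalg.norm(w, axis=1)
    return float(np.sqrt(group_size) * np.sum(row_norms))

def sparse_group_lasso_penalty(w):
    """Composite penalty: group sparse term plus l1 term."""
    return group_sparse_penalty(w) + l1_penalty(w)

# Hypothetical 2x2 weight matrix: one active group and one all-zero group.
w = np.array([[3.0, 4.0],
              [0.0, 0.0]])

print(l1_penalty(w))
print(l2_penalty(w))
print(group_sparse_penalty(w))
print(sparse_group_lasso_penalty(w))
```

In this sketch, the all-zero row contributes nothing to any penalty, which illustrates how a group-level term rewards removing an entire group of weights at once rather than individual weights.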

As to claim 3, the combination of Mao, Alvarez-Icaza Rivera, and Scardapane teaches the method as claimed in claim 1, wherein in the computing, a stochastic gradient descent method is used so that the object function is minimized. [Scardapane, § IV paragraph 1: “the networks are trained using the popular Adam algorithm, a derivation of stochastic gradient descent with both adaptive step sizes and momentum. Specifically, we minimize the loss function in (2) with the standard cross-entropy loss…and multiple choices for the regularization penalty.”] 

As to claim 5, the combination of Mao, Alvarez-Icaza Rivera, and Scardapane teaches the method as claimed in claim 1, further comprising:
computing a second-order new parameter matrix for the reconstructed trained model [Mao, § III.E (page 1400) teaches that “Given a trained DNN, the model processor scans each layer and identify their type…If a sparse fully-connected layer is detected, MSCC and FGCP will be applied in sequence to assign the workloads to the worker nodes in clusters and the workloads for outliers” and § V (page 1401) teaches plural “fully connected layers.” Therefore, the same method used for a particular fully connected layer is also used for a subsequent fully connected layer, in which case the computed weight matrix with the clustered weights corresponds to a “second-order new parameter matrix”] to minimize the loss function for the trained model and a second objective function including only the weight decay regularization term, [Scardapane, § II, paragraph 1 teaches a loss function (“The network is trained by minimizing a standard regularized cost function”) as described in the rejection of claim 1, above. Furthermore, the formulation in (2) applies to the whole network. With respect to the limitation of “a second objective function including only the weight decay regularization term,” the weight decay term $R_{\ell_1}(w)$ in Scardapane, as discussed in the rejection of claim 1, corresponds to a “second objective function” as well as the “weight decay regularization term.”]
optimizing the trained model using the computed second-order new parameter matrix as parameters of the vertically split layers. [Mao, § III.E: “If a sparse fully-connected layer is detected, MSCC and FGCP will be applied in sequence to assign the workloads to the worker nodes in clusters and the workloads for outliers, respectively, in order to achieve the minimum total execution time.” The minimization of execution time constitutes optimizing the trained model. As noted above, the general process in Mao is applied to each fully connected layer. Therefore, the optimization uses the computed weight matrix with the clustered weights for a second layer.]

As to claim 6, the combination of Mao, Alvarez-Icaza Rivera, and Scardapane teaches the method as claimed in claim 5, further comprising:
parallelizing each of the vertically split layers within the optimized trained model using different processors. [Mao, Page 1399, left column, second full paragraph, last sentence: “k dense clusters are generated and the corresponding input neurons are transmitted to the assigned worker nodes for parallel executions.” With respect to the workers, Mao § I, paragraph 3 teaches: “The mobile device that carries the testing data (e.g., image) acts as the Group Owner (GO) and the other devices act as the worker nodes.” Note that Mao, § I, paragraph 2 teaches “a local distributed mobile computing system…” That is, the mobile devices each have a processor.]
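For illustration only, the following Python sketch shows why a layer whose weight matrix is block-diagonal can be evaluated as independent per-group products whose outputs are then gathered. The block values, the input vector, and the `worker` function are hypothetical, and a thread pool merely stands in for separate worker nodes or processors; the sketch does not reproduce the MoDNN system of Mao.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Hypothetical diagonal blocks of one layer's weight matrix (one per group).
blocks = [np.array([[1.0, 2.0],
                    [3.0, 4.0]]),
          np.array([[5.0]])]
x = np.array([1.0, 1.0, 2.0])

# Split the input so that each group receives only its own input neurons.
col_sizes = [b.shape[1] for b in blocks]
x_parts = np.split(x, np.cumsum(col_sizes)[:-1])

def worker(task):
    """Compute one group's output from its block and its input slice."""
    block, x_part = task
    return block @ x_part

# Each group's product is computed independently (threads stand in for nodes).
with ThreadPoolExecutor(max_workers=len(blocks)) as pool:
    partial_outputs = list(pool.map(worker, zip(blocks, x_parts)))

# Gather the per-group outputs back into a single layer output.
y_split = np.concatenate(partial_outputs)

# Reference result: assemble the full block-diagonal matrix and multiply once.
full = np.zeros((3, 3))
full[:2, :2] = blocks[0]
full[2:, 2:] = blocks[1]
y_full = full @ x

print(np.allclose(y_split, y_full))
```

Because no weight connects one group to another in this sketch, no group needs another group's inputs, and the gathered result matches the single full-matrix product.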

As to claims 7, 9 and 11, these claims are directed to an electronic apparatus for performing operations that are the same or substantially the same as those recited in claims 1, 3, and 5, respectively. Therefore, the rejections made to claims 1, 3, and 5 are applied to claims 7, 9, and 11, respectively.
Additionally, Mao teaches an electronic apparatus, [§ I, paragraph 3: “The mobile device that carries the testing data (e.g., image) acts as the Group Owner (GO)”] comprising: a memory storing a trained model configured of a plurality of layers; [Since the method of Mao, which includes algorithm 2 on page 1399, is a computer-implemented method, the limitation of “memory” is implicitly disclosed. Furthermore, the limitation of “trained model configured of a plurality of layers” is taught for the reasons stated in claim 1.] and a processor [Since Mao teaches that its method, which includes algorithm 2 on page 1399, is a computer-implemented method and is performed by a “mobile device that…acts as the Group Owner” (§ I, paragraph 3) as part of “a local distributed mobile computing system…” (§ I, paragraph 2), the limitation of “processor” is implicitly disclosed.]

As to claim 12, this claim is directed to a computer readable medium for performing operations that are the same or substantially the same as those recited in claim 1. Therefore, the rejection made to claim 1 is applied to claim 12.
Additionally, Mao teaches “a non-transitory computer readable recording medium including a program for executing a method…” [Since Mao teaches that its method, which includes algorithm 2 on page 1399, is a computer-implemented method and is performed by a “mobile device that…acts as the Group Owner” (§ I, paragraph 3) as part of “a local distributed mobile computing system…” (§ I, paragraph 2), the instant limitation of a non-transitory computer readable recording medium including a program is implicitly disclosed.]

2.	Claims 2 and 8 are rejected under 35 U.S.C. 103 as being unpatentable over Mao in view of Alvarez-Icaza Rivera and Scardapane, and further in view of Applegate et al. (US 2017/0185585 A1) (“Applegate”).
As to claim 2, the combination of Mao, Alvarez-Icaza Rivera, and Scardapane teaches the method as claimed in claim 1, wherein in the initializing, […] the plurality of split variables are initialized not to be uniform to each other, [As noted in the rejection of claim 1, Mao teaches clustering the input and output neurons into groups. As shown in FIG. 5(b) of Mao, these groups are non-uniform to each other as to the number of input or output neurons, since their formation is based on clustering. Therefore, Mao teaches that at some initial stage of the clustering, the group assignments are non-uniform.]
The thus-far combination of references does not teach that in the initializing “the parameter matrix is initialized randomly.”
Applegate, in an analogous art, teaches the above limitation. Applegate teaches learning vector representations for codes (see title), and inputting them into a “semantic neural network model” (abstract). Therefore, Applegate is in the same field of endeavor as the claimed invention, namely machine learning.
In particular, Applegate teaches “the parameter matrix is initialized randomly.” [[0046]: “In some embodiments, the weight matrices W and W′ may be initialized to random weights such that these weight matrices are not easily saturated.”]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Mao, Alvarez-Icaza Rivera, and Scardapane with the teachings of Applegate by modifying the initializing such that “the parameter matrix is initialized randomly.” The motivation would have been to initialize the matrix such that it is not easily saturated, as suggested by Applegate ([0046], part quoted above).

As to claim 8, the further limitations of this claim are the same or substantially the same as those recited in claim 2. Therefore, the rejection made to claim 2 is applied to claim 8.


Allowable Subject Matter
Claims 4 and 10 contain allowable subject matter if amended to overcome the § 112(b) rejections in a manner that is within the scope of the interpretations of the indefinite terms identified in the § 112(b) rejection. 
The prior art of record does not teach or suggest the limitations of “wherein the split regularization term includes a group weight regularization term that suppresses an inter-group connection and activates only an intra-group connection, a disjoint group assignment that makes each group be orthogonal to each other, and a balanced group assignment that prevents a size of one group from being excessive” recited in claim 4 and the corresponding limitations in claim 10.
The closest prior art of record is discussed as follows.
Mao et al. does not teach a split regularization term, as noted in the rejection of claim 1, above.
Scardapane et al. teaches a split regularization term, as noted in the rejection of claim 1, but does not teach a split regularization term that includes the elements recited in the above-quoted part of claim 4.
Nowak et al. (cited in the IDS filed on July 14, 2020) teaches a regularization term for splitting inputs in § 4.3, equation 6. However, Nowak does not teach a split regularization term that includes the elements recited in the above-quoted part of claim 4.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Zhang et al. (US 2020/0026992 A1) teaches splitting a weight matrix of a neural network into submatrices that are processed in parallel (see FIG. 4).
Rouhani et al. (US 2021/0295166 A1) teaches splitting neural networks in general so that split-apart portions can be run on different machines (see FIGS. 1B-1C).

Any inquiry concerning this communication or earlier communications from the examiner should be directed to YAO DAVID HUANG whose telephone number is (571)270-1764. The examiner can normally be reached Monday - Friday 8:30 am - 5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang, can be reached on (571) 270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/Y.D.H./Examiner, Art Unit 2124

/MIRANDA M HUANG/Supervisory Patent Examiner, Art Unit 2124