DETAILED ACTION
The applicant’s request for continued examination regarding application number 15/945,888, filed April 5, 2018 has been entered.

Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA . 
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on March 4, 2022 has been entered.

Response to Amendments
The amendment filed March 10, 2022 has been entered. Examiner acknowledges receipt of Amendments to Application 15/945,888, which include: Amendments to the Claims, and Remarks containing Applicant’s amendments. 
Regarding Applicant’s Remarks and Amendments to the Claims, Examiner acknowledges Claims 1-2, 4-5, 9-10, 12, 14, 18-19, and 21-23 have been amended, with Claims 3 and 24 newly cancelled, and Claims 8, 11, 15-16, and 20 previously cancelled. Claims 1-2, 4-7, 9-10, 12-14, 17-19, 21-23, and 25 remain pending in the application. 
Regarding Applicant’s Remarks and Amendments to the Claims, Examiner acknowledges Applicant’s Amendments to the Claims have resolved the objections identified in Claims 21 and 23, and therefore the respective claim objections previously set forth in the Final Office Action mailed November 10, 2021 are withdrawn. 
Regarding Applicant’s Remarks and Amendments to the Claims, Examiner acknowledges applicant’s Amendments to the Claims have cancelled Claim 3, and therefore the corresponding §112(a) rejection previously set forth in the Final Office Action mailed November 10, 2021 for Claim 3 is withdrawn. 

Response to Arguments
Examiner acknowledges receipt of Arguments to Application 15/945,888, which include: Remarks containing Applicant’s arguments. 
Regarding Applicant's Remarks for Claims 1-3, 7, 9, 14, 18, and 21-25 under 35 U.S.C. 102(a)(1) as being anticipated by Cao et al., A Deep and Stable Extreme Learning Approach for Classification and Regression, Proceedings of ELM-2014 Volume 1, Springer International Publishing Switzerland 2015, pp.141-150 [hereafter referred as Cao]; for Claim 4 under 35 U.S.C. 103 as being unpatentable over Cao in view of Akusok et al., High-Performance Extreme Learning Machines: A Complete Toolbox for Big Data Applications, July 17 2015, IEEE Access Volume 3, 2015, pp.1011-1025 [hereafter referred as Akusok]; for Claims 5-6, 10, 12-13, and 19 under 35 U.S.C. 103 as being unpatentable over Cao in view of Akusok, in further view of Huang et al., Discriminative clustering via extreme learning machine, June 19 2015 [hereafter referred as Huang]; and for Claim 17 under 35 U.S.C. 103 as being unpatentable over Cao in view of Huang, Applicant’s arguments with respect to the above claim(s) have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Examiner notes that Applicant has amended the claims to such an extent that the scope of the claims has changed, which necessitates further examination and re-evaluation of the amended and original claims. The updated rejections and associated claim mappings for Applicant’s amended claims are provided in the sections below.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 14, 17-19, and 25 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention.
Regarding amended Claim 14,
This claim has been amended to recite the following limitations:
“A system comprising:
an artificial neural network that is to be trained for a first classification task, the artificial
neural network comprising:
… a hidden level of nodes that receives the first set of modified values and applies an intermediate non-linear function to the first set of modified values to obtain a first set of intermediate modified values in a feature space of the first artificial neural network, wherein the hidden level of nodes have parameters assigned thereto that were retrieved from a second artificial neural network that has a same structure as the artificial neural network, wherein the second artificial neural network is trained to perform a second classification task that differs from the first classification task; …
	The terms “the first artificial neural network” and “a second artificial neural network that has a same structure as the artificial neural network” recited in the above limitations were added to this claim by amendment and render the claim indefinite, because it is unclear whether Applicant intends to recite two or three artificial neural networks in this independent claim. Examiner also notes that Applicant’s specification is silent on the number of artificial neural networks supported in the claimed invention. Applicant is asked to clarify and make the appropriate changes to resolve this issue. For the purposes of examination, based on the earlier amended independent Claims 1 and 9, it is assumed that there are only two artificial neural networks. Accordingly, the term “an artificial neural network” will be interpreted as “a first artificial neural network”, and the second term will correspondingly be interpreted as “a second artificial neural network that has a same structure as the first artificial neural network”.
	Claims 17-19, and 25 are dependent claims tracing back to parent independent Claim 14, and thus inherit the same indefiniteness issue found in Claim 14. Hence, Claims 17-19, and 25 are also rejected as being indefinite by virtue of dependency.

The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

The following is a quotation of the first paragraph of pre-AIA  35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.


Claims 1-2, 4-10, 12-14, 17-19, 21-23, and 25 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA ), first paragraph, as failing to comply with the written description requirement. 
 The claims contain subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA  35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention.
Regarding amended Claim 1,
This claim has been amended to recite the following limitation: “… assigning the parameters of the first DNN to a second DNN, where the second DNN has the structure, and further wherein the second DNN is to be trained to perform a second classification task that is different from the first classification task”, but the specification fails to disclose a set of steps where a second DNN is trained to perform a second classification task that is different from the first classification task. Examiner notes that Applicant’s specification paragraphs [0101]-[0107] and accompanying Figure 6 describe a method for fine-tuning a DNN model (with block 650 indicating the fine-tuning process), but none of the paragraphs explicitly recite a DNN model performing a different classification task. Examiner further notes that Figure 6 block 620 further points to Figure 5 for initializing parameters for a last layer of a DNN model, but Figure 5A/5B and the accompanying paragraphs [0077]-[0083] do not mention performing a second classification task that is different from a first classification task. While Applicant’s specification paragraph [0005] does mention briefly that fine-tuning involves having parameters from lower level layers of a DNN model to be the same value as a pre-trained model, none of the methods shown in Figure 5 or Figure 6 discuss performing different classification tasks. The specification must describe and support the claims such that the public is informed of the boundaries of what constitutes infringement of the patent, as well as determining whether the claimed invention meets all the criteria for patentability by distinctly claiming the subject matter which the inventor regards as the invention. See MPEP 2163. Given that there is no support of this amended claim limitation present in the specification, this amended claim limitation fails to comply with the written description requirement. 
For the purposes of examination, this limitation will be interpreted as broadly reciting a second DNN performing a second classification task.
	Claims 2, 4-7, and 21-22 are dependent claims tracing back to parent independent Claim 1, and thus inherit the lack of written description issue found in Claim 1. Hence, Claims 2, 4-7, and 21-22 also fail to comply with the written description requirement by virtue of dependency.
Regarding amended Claim 9,
Similarly, this claim has been amended to recite the following limitation: “… assigning the parameters of the second hidden layer of the second DNN to the first hidden layer of the first DNN, wherein the first DNN is to be trained to perform one or more classification tasks that are different from the first classification task”, but the specification fails to disclose a set of steps where the first DNN is trained to perform one or more classification tasks that are different from the first classification task. Examiner notes that Applicant’s specification paragraphs [0101]-[0107] and accompanying Figure 6 describe a method for fine-tuning a DNN model (with block 650 indicating the fine-tuning process), but none of the paragraphs explicitly recite a DNN model performing one or more different classification tasks. Examiner further notes that Figure 6 block 620 further points to Figure 5 for initializing parameters for a last layer of a DNN model, but Figure 5A/5B and the accompanying paragraphs [0077]-[0083] do not mention performing one or more different classification tasks. While Applicant’s specification paragraph [0005] does mention briefly that fine-tuning involves having parameters from lower level layers of a DNN model to be the same value as a pre-trained model, none of the methods shown in Figure 5 or Figure 6 discuss performing different classification tasks. The specification must describe and support the claims such that the public is informed of the boundaries of what constitutes infringement of the patent, as well as determining whether the claimed invention meets all the criteria for patentability by distinctly claiming the subject matter which the inventor regards as the invention. See MPEP 2163. Given that there is no support of this amended claim limitation present in the specification, this amended claim limitation fails to comply with the written description requirement. 
For the purposes of examination, this limitation will be interpreted as broadly reciting a second DNN performing a second classification task.
	Claims 10, 12-13, and 23 are dependent claims tracing back to parent independent Claim 9, and thus inherit the lack of written description issue found in Claim 9. Hence, Claims 10, 12-13, and 23 also fail to comply with the written description requirement by virtue of dependency.
Regarding amended Claim 14,
Similarly, this claim has been amended to recite the following limitation: “… wherein the hidden level of nodes have parameters assigned thereto that were retrieved from a second artificial neural network that has a same structure as the artificial neural network, wherein the second artificial neural network is trained to perform a second classification task that differs from the first classification task”, but the specification fails to disclose a set of steps where the first DNN is trained to perform one or more classification tasks that are different from the first classification task. Examiner notes that Applicant’s specification paragraphs [0101]-[0107] and accompanying Figure 6 describe a method for fine-tuning a DNN model (with block 650 indicating the fine-tuning process), but none of the paragraphs explicitly recite a DNN model performing one or more different classification tasks. Examiner further notes that Figure 6 block 620 further points to Figure 5 for initializing parameters for a last layer of a DNN model, but Figure 5A/5B and the accompanying paragraphs [0077]-[0083] do not mention performing one or more different classification tasks. While Applicant’s specification paragraph [0005] does mention briefly that fine-tuning involves having parameters from lower level layers of a DNN model to be the same value as a pre-trained model, none of the methods shown in Figure 5 or Figure 6 discuss performing different classification tasks. The specification must describe and support the claims such that the public is informed of the boundaries of what constitutes infringement of the patent, as well as determining whether the claimed invention meets all the criteria for patentability by distinctly claiming the subject matter which the inventor regards as the invention. See MPEP 2163. 
Given that there is no support of this amended claim limitation present in the specification, this amended claim limitation fails to comply with the written description requirement. For the purposes of examination, this limitation will be interpreted as broadly reciting a second artificial neural network performing a second classification task.
	Claims 17-19, and 25 are dependent claims tracing back to parent independent Claim 14, and thus inherit the lack of written description issue found in Claim 14. Hence, Claims 17-19, and 25 also fail to comply with the written description requirement by virtue of dependency.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 14, 17-19, and 25 are rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter. The claims do not fall within at least one of the four categories of patent eligible subject matter because the system recited in independent Claim 14 (and inherited in the associated dependent Claims 17-19 and 25) is directed to software per se, which is not one of the four categories of patent eligible subject matter recited in 35 U.S.C. 101 (process, machine, article of manufacture, or composition of matter). Applicant’s specification paragraphs [0028] and [0033] indicate that “… The various components shown in the figures can be implemented in any matter, for example, by software, hardware (e.g., discrete logic components, etc.), firmware, and so on, or any combination of these implementations.” and “The term “logic” is both contemplated and to be understood to encompass any functionality for performing a task. … An operation can be performed using, for instance, software, hardware (e.g., discrete logic components, etc.), firmware, etc., or any combination thereof.”, where only software can be used to implement the various components and logic recited in independent Claim 14 (e.g., the first and second artificial neural networks and their respective input level of nodes, the hidden level of nodes, and the output level of nodes, and their associated level initializing logic), and hence directing independent Claim 14 to a software per se implementation. To allow eligibility of independent Claim 14 and its associated dependent Claims 17-19 and 25, Applicant is advised to positively recite hardware elements (i.e., a computer processor, memory/non-transitory computer-readable medium) as part of this system recited in independent Claim 14 in order to resolve the 101 non-statutory subject matter rejection. 

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.


This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1-2, 9, 14, and 21-23 are rejected under 35 U.S.C. 103 as being unpatentable over Yosinski et al., How transferable are features in deep neural networks?, November 6 2014 [hereafter referred as Yosinski] in view of Krahenbuhl et al., Data-Dependent Initializations of Convolutional Neural Networks, September 22 2016 [hereafter referred as Krahenbuhl].
Regarding amended Claim 1, 
Yosinski teaches
(Currently amended) A method of training a deep neural network, comprising:
receiving parameters of a first deep neural network (DNN), where the first DNN has a structure, and further wherein the first DNN has been previously trained to perform a first classification task (Examiner’s note: Under its broadest reasonable interpretation in light of Applicant’s specification paragraphs [0106] and [0005], this limitation will be interpreted as broadly reciting pre-training a first neural network (first DNN) to perform a first classification and applying the learned features from the first neural network, where, under its broadest reasonable interpretation, the limitation “the first DNN has a structure” broadly recites a structure associated with the first DNN. Yosinski teaches training a base network on a base dataset on a base classification task and copying the first n layers from a base network to a target network to learn a target classification task, where both the base and target networks have a deep convolutional neural network structure consisting of an input layer, an output layer, and one or more hidden layers, and where the base and target classification tasks are based on a subset of ImageNet data containing similar classes. 
Furthermore, this copying of n layers from a base network to a target network corresponds to applying the learned features (and associated weight vectors) from a base network to train a target network on a target task (and thus corresponds to “receiving parameters of a first deep neural network (DNN) …”) (Yosinski p.2 2nd-4th paragraphs: “… we first train a base network on a base dataset and task, and then we repurpose the learned features, or transfer them, to a second target network to be trained on a target dataset and task … The usual transfer learning approach is to train a base network and then copy its first n layers to the first n layers of a target network …”; p.3 1st-4th paragraphs: “… we define the degree of generality of a set of features learned on task A as the extent to which the features can be used for another task B. … We create pairs of classification tasks A and B by constructing pairs of non-overlapping subsets of the ImageNet dataset … To create tasks A and B, we randomly split the 1000 ImageNet classes into two groups each containing 500 classes and approximately half the data … We train one eight-layer convolutional network on A and another on B. … We then choose a layer n from $\{1, 2, \ldots, 7\}$ and train several new networks … We repeated this process for all n in $\{1, 2, \ldots, 7\}^{2}$ and in both directions … To create base and target datasets that are similar to each other, we randomly assign half of the 1000 ImageNet classes to A and half to B … Thus A and B are similar when created by randomly assigning classes to each, and we expect that transferred features will perform better than when A and B are less similar.”; and p.4 Figure 1, including caption: “… The labeled rectangles (e.g., $W_{A1}$) represent the weight vector learned of that layer … The vertical, ellipsoidal bars between weight vectors represent the activations of the network at each layer.”).);
assigning the parameters of the first DNN to a second DNN, where the second DNN has the structure, and further wherein the second DNN is to be trained to perform a second classification task that is different from the first classification task (Examiner’s note: As indicated earlier, this limitation exhibits a 112(a) lack of written description such that for purposes of examination, this limitation will be interpreted as broadly reciting training a second neural network (second DNN) to perform a second classification task based on the learned features from the first neural network, where, under its broadest reasonable interpretation, the limitation “the second DNN has the structure” broadly recites the second DNN having a similar deep convolutional neural network structure used in the first neural network (first DNN). As indicated earlier, Yosinski teaches training a base network on a base dataset on a base classification task and copying the first n layers from a base network to a target network to learn a target classification task, where both the base and target networks have a neural network structure consisting of an input layer, an output layer, and one or more hidden layers, and where the base and target classification tasks are based on a subset of ImageNet data containing similar classes, such that each base and target classification task will classify similar classes (thus corresponding to “a first classification task” and “a second classification task”). 
Yosinski further teaches applying the copied first n layers (from the base network) to a target network, where this copying of n layers corresponds to applying the learned features (and associated weight vectors) from a base network to train a target network on a target task (and thus corresponds to “assigning the parameters of the first DNN to a second DNN …”) (Yosinski p.2 2nd-4th paragraphs; p.3 1st-4th paragraphs; and p.4 Figure 1, including caption: “… The transfer network experimental treatment is the same as the selffer treatment, except that the first n layers are copied from a network trained on one dataset (e.g., A) and then the entire network is trained on the other dataset (e.g., B).”).);
inputting training data into the second DNN, wherein the second DNN comprises multiple layers, the multiple layers including: an input layer that receives the training data; an output layer; and at least one hidden layer that is interconnected with the input layer and the output layer, the at least one hidden layer receives output from the input layer and outputs transformed data to a feature space between the at least one hidden layer and the output layer, wherein the at least one hidden layer has the parameters assigned thereto (Examiner’s note: Under its broadest reasonable interpretation, this limitation broadly recites the structure of a DNN. As indicated earlier, Yosinski teaches copying n layers from a base network to a target network and training the target network with a subset of image data from the ImageNet dataset to perform a second classification task (Yosinski p.2 2nd-4th paragraphs; p.3 1st-4th paragraphs). Yosinski p.4 Figure 1 further teaches both base and target networks have a deep convolutional neural network structure that contains an input layer, an output layer, and one or more hidden layers between the input and output layer, with interconnections between the input and the first hidden layer, and the last hidden layer and the output layer, where the first n layers that were copied (Yosinski p.4 Figure 1 shows the case for n=3) contains the learned features (and associated weight vectors) from the base network (thus corresponding to “wherein the at least one hidden layer has the parameters assigned thereto”), and where after each hidden layer a set of features (after applying the activations) are learned, where these activations to produce these features represent a data transformation, with the last set of activations between the last hidden layer and the output layer corresponding to “a feature space between the at least one hidden layer and the output layer” (Yosinski p.3 2nd paragraph: “… We then choose a layer n from $\{1, 2, \ldots, 7\}$ and train several new networks … here we copy the first 3 layers from a network … and then learn higher layer features on top of them to classify a new target dataset … We repeated this process for all n in $\{1, 2, \ldots, 7\}^{2}$ and in both directions …”; p.4 Figure 1, including caption: “… The labeled rectangles (e.g., $W_{A1}$) represent the weight vector learned of that layer … The vertical, ellipsoidal bars between weight vectors represent the activations of the network at each layer.”).); …
… initializing, non-randomly, parameters (Examiner’s note: Under its broadest reasonable interpretation, this limitation broadly recites initializing the parameters of a deep neural network using a process other than assigning random initialization of weights in a layer. As indicated earlier, Yosinski teaches performing initialization of a network with transferred features from any number of hidden layers (n = 1..7), where this initialization involves applying n layers (and their learned features and associated weights) from a base network to a target network, where this initialization of parameters through applying n layers from a base network to a target network represents a non-random initialization of a deep neural network (Yosinski p.2 item 5: “… we find that initializing a network with transferred features from almost any number of layers can produce a boost to generalization performance …”; p.3 3rd paragraph: “… We repeated this process for all n in $\{1, 2, \ldots, 7\}^{2}$ and in both directions …”; and p.4 Figure 1, including caption: “… Fourth row: The transfer network experimental treatment is the same as the selffer treatment, except that the first n layers are copied from a network trained on one dataset (e.g., A) and then the entire network is trained on the other dataset (e.g., B).”).) …
… wherein the second DNN is trained to perform the second classification task based on the parameters of the output layer (Examiner’s note: As indicated earlier, Yosinski teaches training a target network by copying the first n layers from a base network to a target network, and using a target dataset (based on a subset of ImageNet data) to perform a target classification task. A person having ordinary skill in the art would understand that the features at the feature space (represented by the feature activations between the last hidden layer and the output layer, and based on weights from the last hidden layer) are used to generate the predicted classification labels for a classification task, and as such, this teaching corresponds to “wherein the second DNN is trained to perform the second classification task based on the parameters of the output layer” (Yosinski p.2 2nd-4th paragraphs; p.3 1st-4th paragraphs; and p.4 Figure 1).).
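For illustration only, the layer-copying initialization attributed to Yosinski above can be sketched as follows. This is a minimal sketch and not code from Yosinski: the function name `init_target_from_base`, the layer shapes, and the small Gaussian used for the non-transferred layers are all assumptions made for the example.

```python
import numpy as np

def init_target_from_base(base_layers, n, rng):
    """Build target-network weights from a trained base network.

    Hypothetical helper: the first n layers are copied (non-random
    initialization); the remaining layers are drawn at random.
    """
    target = []
    for idx, w in enumerate(base_layers):
        if idx < n:
            # Transferred layer: copy the learned base weights.
            target.append(w.copy())
        else:
            # Remaining layer: assumed small random Gaussian initialization.
            target.append(rng.normal(0.0, 0.01, size=w.shape))
    return target

rng = np.random.default_rng(0)
layer_shapes = [(8, 16), (16, 16), (16, 4)]  # assumed toy architecture
base = [rng.normal(size=s) for s in layer_shapes]  # stands in for trained weights
target = init_target_from_base(base, n=2, rng=rng)
```

After this step the whole target network would be trained on the target dataset, which is the fine-tuning treatment described in the quoted Figure 1 caption.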
While Yosinski teaches measuring and comparing classification performances for a target network that has applied n layers (and associated weight parameters) from a base network (Yosinski p.4 2nd paragraph; p.4 Figure 1; p.5 Figure 2), Yosinski does not explicitly teach
… evaluating a distribution of the data in the feature space, wherein the distribution of the data in the feature space is based upon the parameters assigned to the at least one hidden layer …
… initializing … parameters of the output layer based on the evaluated distribution of the data in the feature space …
Krahenbuhl teaches
… evaluating a distribution of the data in the feature space, wherein the distribution of the data in the feature space is based upon the parameters assigned to the at least one hidden layer (Examiner’s note: Under its broadest reasonable interpretation, this limitation broadly recites determining a data distribution for a feature space between the last hidden layer and the output layer, based on an initial set of parameter values assigned to the hidden layers. Krahenbuhl teaches performing a within-layer normalization procedure for affine layers in a convolutional neural network, where this normalization procedure involves computing the per-channel sample mean and variance for a set of samples (representing a distribution of data) in order to rescale the weights in the affine layers. A person having ordinary skill in the art would understand that affine layers in a deep convolutional neural network refer to the last hidden layers in a deep convolutional network before the output layer, such that this normalization process to rescale the weights in the affine layers represents an evaluation of a distribution of the data in a feature space (Krahenbuhl p.4 Algorithm 1 and p.4 Section 3.1 1st-2nd paragraphs: “We aim to ensure that each channel that a layer k+1 receives a similarly distributed input. It is straightforward to initialize weights in affine layers such that the units have outputs following similar distributions. … We normalize the network activations using empirical estimates of activation statistics obtained from actual data samples $z_0 \sim \mathcal{D}$. … In particular, for each affine layer k … we compute the empirical mean and standard deviations for all outgoing activations and normalize the weights $W_k$ such that all activations have unit variance and mean β. This procedure is summarized in Algorithm 1.”).) …
… initializing … parameters of the output layer based on the evaluated distribution of the data in the feature space (Examiner’s note: Under its broadest reasonable interpretation, this limitation broadly recites performing a procedure involving initializing parameters for an output layer, where this initialization is based on a data distribution in a feature space from a DNN. Krahenbuhl teaches performing a data dependent initialization procedure for the weights in a neural network layer by estimating the expected norm of the gradient with respect to weights $\tilde{C}_{k,j}^{2}$ through an approximation involving computing an estimate of the columns of the weight matrix $W_k$ to enforce weight learning at the same rate (where this estimation of the columns of the weight matrix represents a linear approximation). Krahenbuhl further teaches first performing a within-layer initialization at the affine layers to normalize those weights at the affine layers (where the affine layers represent the fully-connected layers and thus the last hidden layers of a neural network), and then performing a between-layer normalization iterative procedure to compute the scale correction to correct the weights and biases for all layers using the estimation of the columns of the weight matrix to obtain roughly constant weight parameter change rates $C_{k,i}^{2}$ across all layers in a neural network. Krahenbuhl further teaches that although random weights were initially used, the described procedure is updated to support PCA-based initialization or k-means based weight initialization. Hence, this data dependent initialization process involving an estimation of weights across all layers (based on the evaluated within-layer initialization of normalized weights at the affine layers representing a data distribution in a feature space) corresponds to a process for “initializing … parameters of the output layer based on the evaluated distribution of the data in the feature space” (Krahenbuhl p.3 Section 3 1st-3rd paragraphs: “Given an N-layer neural network with loss function $\ell(z_N)$, we first define $C_{i,j,k}^{2}$ to be the expected norm of the gradient with respect to weights $W_k(i,j)$ in layer k … <see equation 1> … where D is a set of input images … To not rely on any labels during initialization, we use a random linear loss function $\ell(z_N) = \eta^{T} z_N$, where η∼𝒩(0,I) is sampled from a unit Gaussian distribution. … In order for all parameters to learn at the same “rate”, we require the change in eq.1 to be proportional to the magnitude of the weights … rather than enforce that the individual weight all learn at the same rate, we enforce that the columns of weight matrix $W_k$ do so, i.e.,: $\tilde{C}_{k,j}^{2}$ … <see equation 3> … should be approximately constant …”; p.4 Algorithm 1 and p.4 Section 3.1 1st-2nd paragraphs; and p.5 Algorithm 2 and p.5 2nd paragraph: “… We use an iterative procedure to obtain roughly constant parameter change rates $C_{k,i}^{2}$ across all layers k … given previously initialized weights …”; and p.5 Section 3.3 Weight Initializations 1st paragraph: “Until now, we used a random Gaussian initialization of the weights, but our procedure does not require this. Hence, we explored two data-driven initializations: a PCA-based initialization and a k-means based initialization. …”).) …
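For illustration only, the within-layer normalization quoted from Krahenbuhl Section 3.1 (rescaling an affine layer so that its outgoing activations have unit variance and mean β on sampled data) can be sketched for a single fully connected layer. This is a simplified sketch under stated assumptions, not a reproduction of Krahenbuhl's Algorithm 1: the function name `normalize_affine_layer`, the array shapes, the zero initial bias, and the `eps` guard are all hypothetical choices made for the example.

```python
import numpy as np

def normalize_affine_layer(W, X, beta=0.0, eps=1e-8):
    """Rescale one affine layer using empirical activation statistics.

    W: (out_dim, in_dim) weights; X: (num_samples, in_dim) layer inputs.
    Assumes the layer bias starts at zero (an assumption of this sketch).
    """
    y = X @ W.T                      # outgoing activations on the data sample
    mu = y.mean(axis=0)              # per-channel empirical mean
    sigma = y.std(axis=0) + eps      # per-channel empirical standard deviation
    W_new = W / sigma[:, None]       # rescale each output channel's weights
    b_new = beta - mu / sigma        # shift so each channel has mean beta
    return W_new, b_new

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=3.0, size=(256, 10))  # stand-in data sample
W = rng.normal(size=(5, 10))                        # stand-in initial weights
W_new, b_new = normalize_affine_layer(W, X, beta=0.5)
out = X @ W_new.T + b_new            # activations after normalization
```

In this sketch the statistics are computed from the data actually passed through the layer, so the resulting weights depend on the distribution of that data at the layer's input.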
Both Yosinski and Krahenbuhl are analogous art since they both teach methods for initializing weight parameters in a deep neural network.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the weight initialization parameter method taught in Yosinski and enhance it with the data-dependent weight initialization method taught in Krahenbuhl as a way to calibrate the weights based on the input data and improve fine-tuning performance. The motivation to combine is taught in Krahenbuhl, since this re-parameterization allows for layer-by-layer calibration to improve fine-tuning performance, and also provides a weight initialization method in which all weights learn equally fast, which makes the system more computationally efficient and yields improved classification performance when compared with other weight initialization methods (Krahenbuhl p.2 1st paragraph: “… this sort of re-parameterization gives us a tool we can use to calibrate layer-by-layer learning to improve fine-tuning performance …”; p.3 2nd paragraph: “Given an arbitrary neural network, we next aim for a good parameterization … We initialize our network such that all weights in all layers learn equally fast …”; p.8 Section 4.2 Weight Initialization 1st-3rd paragraph: “Next we compare our Gaussian, PCA and k-means based weights … We compare all methods on both classification and detection performance in Table 2 … Our initialization on the other hand has no trouble with those additional layers and substantially improves on the random Gaussian initialization.”; and p.9 Table 2).
Regarding amended Claim 2, 
Yosinski in view of Krahenbuhl teaches
(Currently amended) The method of claim 1, wherein initializing the parameters of the output layer comprises estimating parameter values of the output layer by finding an approximate solution to the second classification task (Examiner’s note: As indicated earlier, Krahenbuhl teaches performing a data dependent initialization of the weights by estimating the expected norm of the gradient with respect to weights through an approximation of this gradient $\tilde{C}_{k,j}^{2}$ by computing an estimate in which the columns of the weight matrix $W_k$ learn at the same rate, where this estimation of the columns of the weight matrix learning at the same rate represents a linear approximation (and thus represents providing an estimate of parameter values of the output layer by finding an approximate solution). Krahenbuhl further teaches using this data dependent initialization process for image classification and object detection (and thus represents using this data dependent initialization process on image-based classification tasks) (Krahenbuhl p.3 Section 3 1st-3rd paragraphs; p.4 Algorithm 1 and p.4 Section 3.1 1st-2nd paragraphs; p.5 Algorithm 2 and p.5 2nd paragraph; and p.6 Figure 1 and p.6 2nd paragraph Image classification, 3rd paragraph Object detection).).
Regarding amended Claim 9, 
Yosinski teaches
(Currently amended) A method of initializing parameters of a first deep neural network (DNN) having a first hidden layer, the method comprising: 
receiving parameters assigned to a second hidden layer of a second DNN that has a same structure as the first DNN, wherein the second DNN is configured to perform a first classification task (Examiner’s note: Under its broadest reasonable interpretation in light of Applicant’s specification paragraphs [0106] and [0005], this limitation will be interpreted as broadly reciting training a second neural network (second DNN) to perform a first classification and applying the learned features from the second neural network, where, under its broadest reasonable interpretation, the limitation “the second DNN that has the same structure as the first DNN” broadly recites that both first and second DNN have a same structure. Furthermore, under its broadest reasonable interpretation, the term “second hidden layer” is interpreted as broadly reciting any hidden layer associated with a second DNN (versus identifying a specific hidden layer associated with a second DNN). As indicated earlier, Yosinski teaches training a base network on a base dataset on a base classification task and copying the first n layers from a base network to a target network to learn a target classification task, where both the base and target networks have a deep convolutional neural network structure consisting of an input layer, an output layer, and one or more hidden layers, and where the base and target classification tasks are based on a subset of ImageNet data containing similar classes. 
Furthermore, the copying of n layers from a base network to a target network corresponds to applying the learned features (and associated weight vectors) of the hidden layers from a base network to train a target network on a target task (and thus corresponds to “receiving parameters assigned to a second hidden layer of a second DNN that has a same structure as the first DNN, wherein the second DNN is configured to perform a first classification task …”) (Yosinski p.2 2nd-4th paragraphs; p.3 1st-4th paragraphs; and p.4 Figure 1, including caption).);
assigning the parameters of the second hidden layer of the second DNN to the first hidden layer of the first DNN, wherein the first DNN is to be trained to perform one or more classification tasks that are different from the first classification task (Examiner’s note: As indicated earlier, this limitation exhibits a 112(a) lack of written description such that for purposes of examination, this limitation will be interpreted as broadly reciting training a second neural network (second DNN) to perform a second classification based on the learned features from the first neural network. Furthermore, under its broadest reasonable interpretation, the terms “first hidden layer” and “second hidden layer” are interpreted as broadly reciting any corresponding hidden layer associated with the first and second DNN, respectively (versus identifying and associating a specific hidden layer associated with a first DNN with a specific hidden layer associated with a second DNN). As indicated earlier, Yosinski teaches training a base network on a base dataset on a base classification task and applying the first n layers from a base network to a target network to learn a target classification task, where the base and target classification tasks are based on a subset of ImageNet data containing similar classes, such that each base and target classification task will classify similar classes (thus corresponding to “a first classification task” and “a second classification task”). 
Yosinski further teaches applying the copied first n layers (from the base network) to a target network, where this copying of n layers corresponds to applying the learned features (and associated weight vectors) from a base network to train a target network on a target task (and thus corresponds to “assigning the parameters of the second hidden layer of the second DNN to the first hidden layer of the first DNN …”) (Yosinski p.2 2nd-4th paragraphs; p.3 1st-4th paragraphs; and p.4 Figure 1, including caption).); …
While Yosinski teaches measuring and comparing classification performances for a target network that has applied n layers (and associated weight parameters) from a base network (Yosinski p.4 2nd paragraph; p.4 Figure 1; p.5 Figure 2), Yosinski does not explicitly teach
… estimating initializing values for parameters of an output layer of the first DNN by finding an approximate solution to each of the one or more classification tasks, wherein the approximate solutions are based on a data distribution in the feature space of the first DNN, and further wherein the data distribution is based upon the parameters assigned to the first hidden layer of the first DNN.
Krahenbuhl teaches
… estimating initializing values for parameters of an output layer of the first DNN by finding an approximate solution to each of the one or more classification tasks, wherein the approximate solutions are based on a data distribution in the feature space of the first DNN, and further wherein the data distribution is based upon the parameters assigned to the first hidden layer of the first DNN (Examiner’s note: Under its broadest reasonable interpretation, this limitation broadly recites performing an estimation procedure involving estimating parameters for an output layer, where this estimation is based on an approximation based on a data distribution in a feature space from a hidden layer of a DNN. As indicated earlier, Krahenbuhl teaches performing a within-layer normalization procedure for affine layers in a convolutional neural network (where the affine layers represent the fully-connected layers and thus the last hidden layers of a neural network), where this normalization procedure involves computing the per-channel sample mean and variance for a set of samples (representing a distribution of data) in order to rescale the weights in the affine layers, such that this normalization process to rescale the weights in the affine layers represents an evaluation of a distribution of the data in a feature space (Krahenbuhl p.4 Algorithm 1 and p.4 Section 3.1 1st-2nd paragraphs). As indicated earlier, Krahenbuhl teaches performing a data dependent initialization procedure for the weights in a neural network layer by estimating the expected norm of the gradient with respect to weights $\tilde{C}_{k,j}^{2}$ through an approximation involving computing an estimate of the columns of the weight matrix $W_k$ to enforce weight learning at the same rate (where this estimation of the columns of the weight matrix represents a linear approximation). Krahenbuhl further teaches first performing a within-layer initialization at the affine layers to normalize those weights at the affine layers, and then performing a between-layer normalization iterative procedure to compute the scale correction to correct the weights and biases for all layers using the estimation of the columns of the weight matrix to obtain roughly constant weight parameter change rates $C_{k,i}^{2}$ across all layers in a neural network. Krahenbuhl further teaches that although random weights were initially used, the described procedure is updated to support PCA-based initialization or k-means based weight initialization. Hence, this data dependent initialization process involving an estimation of weights across all layers (based on the evaluated within-layer initialization of normalized weights at the affine layers representing a data distribution in a feature space) corresponds to a process for “estimating initializing values for parameters of an output layer of the first DNN by finding an approximate solution to each of the one or more classification tasks, wherein the approximate solutions are based on a data distribution in the feature space of the first DNN” (Krahenbuhl p.3 Section 3 1st-3rd paragraphs; p.5 Algorithm 2 and p.5 2nd paragraph; and p.5 Section 3.3 Weight Initializations 1st paragraph).).
Both Yosinski and Krahenbuhl are analogous art since they both teach methods for initializing weight parameters in a deep neural network.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the weight initialization parameter method taught in Yosinski and enhance it with the data-dependent weight initialization method taught in Krahenbuhl as a way to calibrate the weights based on the input data and improve fine-tuning performance. The motivation to combine is taught in Krahenbuhl, as provided in the prior art mapping from Claim 1.
Regarding amended Claim 14, 
Yosinski teaches
(Currently amended) A system comprising:
an artificial neural network that is to be trained for a first classification task (Examiner’s note: Under its broadest reasonable interpretation in light of Applicant’s specification paragraphs [0106] and [0005], this limitation will be interpreted as broadly reciting training a first artificial neural network to perform a first classification. Under its broadest reasonable interpretation in light of Applicant’s specification [0042], a deep neural network (DNN) is a type of an artificial neural network (ANN) with multiple hidden layers. As indicated earlier, Yosinski teaches training a base network on a base dataset on a base classification task and copying the first n layers from a base network to a target network to learn a target classification task, where both the base and target networks have a deep convolutional neural network structure consisting of an input layer, an output layer, and one or more hidden layers, and where the base and target classification tasks are based on a subset of ImageNet data containing similar classes. In the context of this limitation, the base network and the base classification task correspond to the first artificial neural network and the first classification task, respectively (Yosinski p.2 2nd-4th paragraphs; p.3 1st-4th paragraphs; and p.4 Figure 1, including caption).),
the artificial neural network comprising: 
an input level of nodes that receives a set of features and applies a first non-linear function to the set of features to output a first set of modified values (Examiner’s note: Under its broadest reasonable interpretation, the term “an input level of nodes that receives a set of features and applies a first non-linear function” broadly recites a first hidden layer (after the input layer) in an artificial neural network. As indicated earlier, Yosinski teaches copying the first n layers from a base network to a target network to learn a target classification task, where both the base and target networks have a deep convolutional neural network structure consisting of an input layer, an output layer, and one or more hidden layers, and where the base and target classification tasks are based on a subset of ImageNet data containing similar classes (with these classification tasks containing a subset of ImageNet data representing a set of features provided as input for training both the base and target deep convolutional neural networks) (Yosinski p.2 2nd-4th paragraphs; p.3 1st-4th paragraphs; and p.4 Figure 1, including caption). Yosinski further teaches applying the non-linear relu function on their hidden layers  (Yosinski p.7 Section 4.3 2nd paragraph: “…we use a different nonlinearity (relu(x) instead of abs(tanh(x)), …”). A person having ordinary skill in the art will understand the flow of data in a neural network that receives the ImageNet data as input data from the input layer, where after forwarding the input feature data to the first hidden layer for processing, the non-linear relu function at the first hidden layer will produce the activations as taught in Yosinski p.4 Figure 1 caption (as shown by the vertical ellipsoidal bars), where these activations represent a set of modified values in a feature space.); 
a hidden level of nodes that receives the first set of modified values and applies an intermediate non-linear function to the first set of modified values to obtain a first set of intermediate modified values in a feature space of the first artificial neural network (Examiner’s note: Under its broadest reasonable interpretation, the term “a hidden level of nodes that receives the first set of modified values and applies an intermediate non-linear function” broadly recites a second hidden layer (after the first hidden layer) in an artificial neural network. As indicated earlier, Yosinski teaches copying the first n layers from a base network to a target network to learn a target classification task, where both the base and target networks have a deep convolutional neural network structure consisting of an input layer, an output layer, and one or more hidden layers, and where the base and target classification tasks are based on a subset of ImageNet data containing similar classes (with these classification tasks containing a subset of ImageNet data representing a set of features provided as input for training both the base and target deep convolutional neural networks) (Yosinski p.2 2nd-4th paragraphs; p.3 1st-4th paragraphs; and p.4 Figure 1, including caption). Yosinski further teaches applying the non-linear relu function on their hidden layers  (Yosinski p.7 Section 4.3 2nd paragraph: “…we use a different nonlinearity (relu(x) instead of abs(tanh(x)), …”). 
A person having ordinary skill in the art will understand the flow of data in a neural network that receives the ImageNet data as input data from the input layer, where after forwarding the input feature data to the first hidden layer for processing, the flow continues to the second hidden layer for processing, where the non-linear relu function at the second hidden layer will produce the activations as taught in Yosinski p.4 Figure 1 caption (as shown by the vertical ellipsoidal bars), where these activations represent a set of intermediate modified values in a feature space.), 
wherein the hidden level of nodes have parameters assigned thereto that were retrieved from a second artificial neural network that has a same structure as the artificial neural network, wherein the second artificial neural network is trained to perform a second classification task that differs from the first classification task (Examiner’s note: As indicated earlier, this limitation exhibits both 112(b) indefiniteness and 112(a) lack of written description issues such that for purposes of examination, this limitation will be interpreted as broadly reciting training a second artificial neural network to perform a second classification based on the learned features from the first artificial neural network. As indicated earlier, Yosinski teaches copying the first n layers from a base network to a target network to learn a target classification task, where both the base and target networks have a neural network structure consisting of an input layer, an output layer, and one or more hidden layers, and where the base and target classification tasks are based on a subset of ImageNet data containing similar classes. Yosinski further teaches applying the copied first n layers (from the base network) to a target network, where this copying of n layers corresponds to applying the learned features (and associated weight vectors) from a base network to train a target network on a target task (and thus corresponds to “wherein the hidden level of nodes have parameters assigned thereto that were retrieved from a second artificial neural network …”) (Yosinski p.2 2nd-4th paragraphs; p.3 1st-4th paragraphs; and p.4 Figure 1, including caption).); 
an output level of nodes that receives the first set of intermediate modified values and generates a set of output values, the output values being indicative of a pattern relating to the first classification task (Examiner’s note: Under its broadest reasonable interpretation, the term “an output level of nodes that … generates a set of output values, the output values being indicative of a pattern relating to the first classification task” broadly recites the last hidden layer in the second artificial neural network that was applied from the first artificial neural network. As indicated earlier, Yosinski teaches copying the first n layers from a base network to a target network to learn a target classification task, where both the base and target networks have a neural network structure consisting of an input layer, an output layer, and one or more hidden layers, and where the base and target classification tasks are based on a subset of ImageNet data containing similar classes. Furthermore, this copying of n layers from a base network to a target network corresponds to applying the learned features (and associated weight vectors) from a base network to train a target network on a target task (and thus corresponds to “wherein the hidden level of nodes have parameters assigned thereto that were retrieved from a second artificial neural network …”) (Yosinski p.2 2nd-4th paragraphs; p.3 1st-4th paragraphs; and p.4 Figure 1, including caption). Using the fourth row of Yosinski p.4 Figure 1 as an example, Yosinski teaches applying the first three hidden layers of a base network to a target network, where these applied hidden layers (including their learned features and their corresponding weights) from the base network are frozen/locked, such that the output from the third hidden layer ($W_{A3}$) corresponds to output values being indicative of the learned features and patterns relating to the first classification task.); and 
level initializing logic that non-randomly initializes parameters of the output level (Examiner’s note: Under its broadest reasonable interpretation, this limitation broadly recites an algorithm or a series of steps (“level initializing logic”) for initializing the parameters of a deep neural network using a process other than assigning random initialization of weights in a layer, where the term “output level” broadly recites the last hidden layer in the second artificial neural network that was applied from the first artificial neural network (as indicated in an earlier limitation). As indicated earlier, Yosinski teaches performing initialization of a network with transferred features from any number of hidden layers (n = 1..7), where this initialization involves applying n layers (and their learned features and associated weights) to a target network, where this initialization represents a non-random initialization of a deep neural network (Yosinski p.2 item 5: “… we find that initializing a network with transferred features from almost any number of layers can produce a boost to generalization performance …”; p.3 3rd paragraph: “… We repeated this process for all n in $\{1, 2, \ldots, 7\}^{2}$ and in both directions …”; and p.4 Figure 1, including caption: “… Fourth row: The transfer network experimental treatment is the same as the selffer treatment, except that the first n layers are copied from a network trained on one dataset (e.g., A) and then the entire network is trained on the other dataset (e.g., B).”).) …
While Yosinski teaches measuring and comparing classification performances for a target network that has applied n layers (and associated weight parameters) from a base network (Yosinski p.4 2nd paragraph; p.4 Figure 1; p.5 Figure 2), Yosinski does not explicitly teach
… by resolving approximate solutions to the output level, wherein the approximate solutions to the output level are based on data distribution in the feature space, and further wherein the data distribution in the feature space is based upon the parameters retrieved from the second artificial neural network assigned to the hidden level of nodes (Examiner’s note: Under its broadest reasonable interpretation, this limitation broadly recites performing an estimation procedure involving estimating parameters for an output layer, where this estimation is based on an approximation based on a data distribution in a feature space from a hidden layer of an artificial neural network. As indicated earlier, Krahenbuhl teaches performing a within-layer normalization procedure for affine layers in a convolutional neural network (where the affine layers represent the fully-connected layers and thus the last hidden layers of a neural network), where this normalization procedure involves computing the per-channel sample mean and variance for a set of samples (representing a distribution of data) in order to rescale the weights in the affine layers, such that this normalization process to rescale the weights in the affine layers represents an evaluation of a distribution of the data in a feature space (Krahenbuhl p.4 Algorithm 1 and p.4 Section 3.1 1st-2nd paragraphs). As indicated earlier, Krahenbuhl teaches performing a data dependent initialization procedure for the weights in a neural network layer by estimating the expected norm of the gradient with respect to weights $\tilde{C}_{k,j}^{2}$ through an approximation involving computing an estimate of the columns of the weight matrix $W_k$ to enforce weight learning at the same rate (where this estimation of the columns of the weight matrix represents a linear approximation). Krahenbuhl further teaches first performing a within-layer initialization at the affine layers to normalize those weights at the affine layers, and then performing a between-layer normalization iterative procedure to compute the scale correction to correct the weights and biases for all layers using the estimation of the columns of the weight matrix to obtain roughly constant weight parameter change rates $C_{k,i}^{2}$ across all layers in a neural network. Krahenbuhl further teaches that although random weights were initially used, the described procedure is updated to support PCA-based initialization or k-means based weight initialization. Hence, this data dependent initialization process involving an estimation of weights across all layers (based on the evaluated within-layer initialization of normalized weights at the affine layers representing a data distribution in a feature space) corresponds to a process for “estimating initializing values for parameters of an output layer of the first DNN by finding an approximate solution to each of the one or more classification tasks, wherein the approximate solutions are based on a data distribution in the feature space of the first DNN” (Krahenbuhl p.3 Section 3 1st-3rd paragraphs; p.5 Algorithm 2 and p.5 2nd paragraph; and p.5 Section 3.3 Weight Initializations 1st paragraph).).
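For illustration only, the within-layer normalization discussed above (computing a per-channel sample mean and variance over a set of samples and rescaling the affine layer accordingly) can be sketched as follows. This is a minimal sketch under stated assumptions and is not a reproduction of Krahenbuhl Algorithm 1: the function name within_layer_normalize, the restriction to a single fully-connected layer, the zero target mean, and the bias update are all assumptions introduced for the example, and every channel is assumed to have non-zero variance on the sample set.

```python
import numpy as np


def within_layer_normalize(W, b, X):
    """Rescale one affine layer using statistics of a sample batch.

    W: (out_channels, in_features) weight matrix (hypothetical layout).
    b: (out_channels,) bias vector.
    X: (n_samples, in_features) samples drawn from the data distribution.

    Returns weights and biases for which every output channel has zero
    sample mean and unit sample variance on X (assumes no channel is constant).
    """
    Z = X @ W.T + b              # pre-activations of the layer on the samples
    mean = Z.mean(axis=0)        # per-channel sample mean
    std = Z.std(axis=0)          # per-channel sample standard deviation
    W_new = W / std[:, None]     # rescale each row (channel) of the weights
    b_new = (b - mean) / std     # shift so that the channel mean becomes zero
    return W_new, b_new
```

The rescaled layer computes (Z - mean) / std on the sample set, so the initialization depends on the data passed in rather than on the random draw alone.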
Both Yosinski and Krahenbuhl are analogous art since they both teach methods for initializing weight parameters in a deep neural network.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the weight parameter initialization method taught in Yosinski and enhance it with the data-dependent weight initialization method taught in Krahenbuhl as a way to calibrate the weights based on the input data and improve fine-tuning performance. The motivation to combine is taught in Krahenbuhl, as provided in the prior art mapping from Claim 1.
Regarding amended Claim 21, 
Yosinski in view of Krahenbuhl teaches
(Currently amended) The method of claim 1, wherein the at least one hidden layer comprises multiple hidden layers, and further wherein the multiple hidden layers are parameterized by the parameters of the first DNN (Examiner’s note: Under its broadest reasonable interpretation, this limitation broadly recites multiple hidden layers in a neural network, where parameters from multiple hidden layers from a first DNN are applied to corresponding multiple hidden layers from a second DNN. As indicated earlier, Yosinski teaches copying the first n layers from a base network to a target network, where these first n layers correspond to multiple hidden layers from a first DNN, and applying them to a second DNN to train a target network on a target task (Yosinski p.2 2nd-4th paragraphs; p.3 1st-4th paragraphs; and p.4 Figure 1).).
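For illustration only, the layer transfer discussed above (copying the first n layers of a base network into a target network of the same structure, with the remaining layers keeping their own initialization) can be sketched as follows. This is a minimal sketch under stated assumptions, not code from Yosinski: the dictionary representation of a layer and the names make_layers and transfer_first_n_layers are hypothetical, and training and freezing of the copied layers are omitted.

```python
import numpy as np


def make_layers(sizes, rng):
    """Build randomly initialized fully-connected layers (hypothetical helper)."""
    return [
        {"W": rng.normal(size=(n_out, n_in)), "b": np.zeros(n_out)}
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def transfer_first_n_layers(base_layers, target_layers, n):
    """Return target layers whose first n entries are copies of the base layers."""
    if not 0 <= n <= min(len(base_layers), len(target_layers)):
        raise ValueError("n must not exceed the depth of either network")
    for base, target in zip(base_layers[:n], target_layers[:n]):
        if base["W"].shape != target["W"].shape:
            raise ValueError("base and target must share the same structure")
    # Copied (non-randomly initialized) layers taken from the base network.
    copied = [{"W": l["W"].copy(), "b": l["b"].copy()} for l in base_layers[:n]]
    # Remaining layers keep the target network's own initialization.
    kept = [{"W": l["W"].copy(), "b": l["b"].copy()} for l in target_layers[n:]]
    return copied + kept
```

With n = 3, for example, the first three layers of the returned network carry the base network's parameters, which loosely mirrors the transfer treatment described in the Figure 1 caption quoted earlier.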
Regarding amended Claim 22, 
Yosinski in view of Krahenbuhl teaches
(Currently amended) The method of claim 1, wherein the parameters of the output layer of the second DNN are initialized based upon the second DNN being trained to perform object recognition (Examiner’s note: Under its broadest reasonable interpretation, this limitation broadly recites training the second DNN on a classification task based on object recognition. As indicated earlier, Yosinski teaches training a target network on a subset of ImageNet datasets to perform image classification, where these ImageNet datasets include images of dogs and cats and their corresponding classifications, and as such these image classification tasks represent an object recognition task (Yosinski p.2 2nd-4th paragraphs; p.3 1st-4th paragraphs; and p.4 Figure 1).).  
Regarding amended Claim 23, 
Yosinski in view of Krahenbuhl teaches
(Currently amended) The method of claim 9, wherein multiple hidden layers of the first DNN are parameterized with parameters of the corresponding hidden layers of the second DNN (Examiner’s note: Under its broadest reasonable interpretation, this limitation broadly recites applying the associated parameters from the multiple hidden layers of the first DNN to the second DNN. As indicated earlier, Yosinski teaches applying the first n layers from the base network to a target network, where the learned features (and associated weight parameters) from multiple hidden layers are applied to the target network (Yosinski p.2 2nd-4th paragraphs; p.3 1st-4th paragraphs; and p.4 Figure 1).).
Claims 4-6, 10, 12, 17-19, and 25 are rejected under 35 U.S.C. 103 as being unpatentable over 
Yosinski et al., How transferable are features in deep neural networks?, November 6 2014 [hereafter referred to as Yosinski] in view of Krahenbuhl et al., Data-Dependent Initializations of Convolutional Neural Networks, September 22 2016 [hereafter referred to as Krahenbuhl] as applied to Claims 1, 9, and 14; in further view of Alberti et al., Historical Document Image Segmentation with LDA-Initialized Deep Neural Networks, October 19 2017 [hereafter referred to as Alberti].
Regarding amended Claim 4, 
Yosinski in view of Krahenbuhl as applied to Claim 1 teaches
(Currently amended) The method of claim 1, wherein initializing the parameters comprises:
approximating a distribution of features for the second classification task (Examiner’s note: Under its broadest reasonable interpretation, this limitation broadly recites performing an estimation procedure involving estimating parameters for an output layer, where this estimation is based on an approximation based on a data distribution in a feature space from a hidden layer of a DNN. As indicated earlier, Krahenbuhl teaches performing a within-layer normalization procedure for affine layers in a convolutional neural network (where the affine layers represent the fully-connected layers and thus the last hidden layers of a neural network), where this normalization procedure involves computing the per-channel sample mean and variance for a set of samples (representing a distribution of data) in order to rescale the weights in the affine layers, such that this normalization process to rescale the weights in the affine layers represents an evaluation of a distribution of the data in a feature space (Krahenbuhl p.4 Algorithm 1 and p.4 Section 3.1 1st-2nd paragraphs). As indicated earlier, Krahenbuhl teaches performing a data dependent initialization procedure for the weights in a neural network layer by estimating the expected norm of the gradient with respect to weights $\tilde{C}_{k,j}^{2}$ through an approximation involving computing an estimate of the columns of the weight matrix $W_k$ to enforce weight learning at the same rate (where this estimation of the columns of the weight matrix represents a linear approximation). Krahenbuhl further teaches first performing a within-layer initialization at the affine layers to normalize those weights at the affine layers, and then performing a between-layer normalization iterative procedure to compute the scale correction to correct the weights and biases for all layers using the estimation of the columns of the weight matrix to obtain roughly constant weight parameter change rates $C_{k,i}^{2}$ across all layers in a neural network. Krahenbuhl further teaches that although random weights were initially used, the described procedure is updated to support PCA-based initialization or k-means based weight initialization. Hence, this data dependent initialization process involving an estimation of weights across all layers (based on the evaluated within-layer initialization of normalized weights at the affine layers representing a data distribution in a feature space) corresponds to a process for “estimating initializing values for parameters of an output layer of the first DNN by finding an approximate solution to each of the one or more classification tasks, wherein the approximate solutions are based on a data distribution in the feature space of the first DNN” (Krahenbuhl p.3 Section 3 1st-3rd paragraphs; p.5 Algorithm 2 and p.5 2nd paragraph; and p.5 Section 3.3 Weight Initializations 1st paragraph).) …
While Yosinski in view of Krahenbuhl teaches a data-dependent weight initialization procedure that results in approximating the weights in all layers learning at the same rate (where this approximation to learn the weights at the same rate corresponds to a linear expression), Yosinski in view of Krahenbuhl does not explicitly teach
… deriving an optimal linear classifier based upon results of the approximating, the optimal linear classifier being usable to update the parameters of the output layer of the second DNN.
Alberti teaches
… deriving an optimal linear classifier based upon results of the approximating, the optimal linear classifier being usable to update the parameters of the output layer of the second DNN (Examiner’s note: Under its broadest reasonable interpretation in light of Applicant’s specification paragraph [0090] and associated equation (6), this limitation broadly recites the determination of an optimal linear classifier, where the optimal linear classifier is represented by a bias calculation based on a weight matrix and a corresponding class mean. Alberti teaches using linear discriminant analysis to initialize the weights of a CNN layer-wise with data-based values, where linear discriminant analysis is further used to estimate a linear classifier (represented by a linear matrix equation) using within-class mean                         
                            
                                
                                    μ
                                
                                
                                    c
                                
                            
                        
                     (representing class centroid statistics) and pooled covariance matrix                         
                            
                                
                                    Σ
                                
                                
                                    c
                                
                            
                        
                     (representing shared covariance matrix statistics).  Alberti p.3 Equations 7,                         
                            c
                            =
                            
                                
                                    arg
                                
                                ⁡
                                
                                    
                                        
                                            max
                                        
                                        ⁡
                                        
                                            
                                                
                                                    δ
                                                
                                                
                                                    c
                                                
                                            
                                            (
                                            x
                                            )
                                        
                                    
                                
                            
                        
                    , and Equations 11 further teach the determination of an optimal linear classifier involving a bias calculation based on a weight matrix and a corresponding class mean (Alberti p.1 col.2 last paragraph: “…we present a novel initialization method based on LDA which allows us to quickly initialize the weights of a CNN layer-wise with data-based values.”; p.3 col.1 4th paragraph: “… Let                         
                            
                                
                                    μ
                                
                                
                                    c
                                
                            
                        
                     denote the within-class mean of class c …”; and p.3 Section 3 LDA as Classifier: “Even though LDA is most used for dimensionality reduction, it can be used to directly perform data classification … one must compute the discriminant functions                         
                            
                                
                                    δ
                                
                                
                                    c
                                
                            
                        
                     for each class c:                         
                            
                                
                                    δ
                                
                                
                                    c
                                
                            
                            =
                             
                            
                                
                                    x
                                
                                
                                    T
                                
                            
                            
                                
                                    Σ
                                
                                
                                    c
                                
                                
                                    -
                                    1
                                
                            
                            
                                
                                    μ
                                
                                
                                    c
                                
                            
                            -
                            
                                
                                    1
                                
                                
                                    2
                                
                            
                            
                                
                                    μ
                                
                                
                                    c
                                
                                
                                    T
                                
                            
                            
                                
                                    Σ
                                
                                
                                    c
                                
                                
                                    -
                                    1
                                
                            
                            
                                
                                    μ
                                
                                
                                    c
                                
                            
                            +
                            
                                
                                    log
                                
                                ⁡
                                
                                    
                                        
                                            
                                                
                                                    π
                                                
                                                
                                                    c
                                                
                                            
                                        
                                    
                                
                            
                             
                            (
                            7
                            )
                        
                     where                         
                            
                                
                                    π
                                
                                
                                    c
                                
                            
                        
                     and                         
                            
                                
                                    Σ
                                
                                
                                    c
                                
                            
                        
                     are the prior probability [9] and the pooled covariance matrix, for the class c. … An observation x will then be classified into class c as:                         
                            c
                            =
                            
                                
                                    arg
                                
                                ⁡
                                
                                    
                                        
                                            max
                                        
                                        ⁡
                                        
                                            
                                                
                                                    δ
                                                
                                                
                                                    c
                                                
                                            
                                            (
                                            x
                                            )
                                        
                                    
                                
                            
                        
… The entire vector 𝛅 can be computed in a matrix form (for all classes) given an input vector x: $\delta = W \cdot x + b$ … To initialize a neural layer to compute it we set the initial values of the bias $b_c$ to the constant part of Equation 7: $b_c = -\frac{1}{2}\mu_c^T \Sigma_c^{-1} \mu_c + \log \pi_c$ (11) and the rows of the weight matrix W to be the linear part of Equation 7, such that at the row c we have $\Sigma_c^{-1} \mu_c$.”). Alberti further teaches that this LDA initialization procedure is used in an iterative fashion to calculate the weights from the first layer to the last layer of a deep neural network, with this calculation for the last layer corresponding to updating the parameters of the output layer (Alberti p.4 Section 3.3 Experimental Setup 3rd paragraph: “… we start by computing LDA on k raw input patches and use the transformation matrix to initialize the first layer. We then proceed to apply a forward pass with the first layer to all k raw input patches and we use the output to compute again LDA such that we can use the new transformation matrix to initialize the second layer. This procedure is then repeated until the last layer is initialized. As this point, we add a classification layer that we will initialize in the same fashion as the others, but with the linear discriminant matrix rather than with the transformation matrix….”).).
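For illustration only, the initialization quoted above (bias set to the constant part of Equation 7, row c of W set to the linear part) can be sketched as follows. This is a minimal sketch and not code from Alberti: the function name `lda_layer_init`, the argument layout, and the use of a single covariance matrix shared by all classes are assumptions made for the example.

```python
import numpy as np

def lda_layer_init(means, pooled_cov, priors):
    """Return (W, b) such that W @ x + b stacks the discriminants of Equation 7.

    means:      (C, D) array, row c is the class mean mu_c
    pooled_cov: (D, D) pooled covariance matrix (assumed shared by all classes)
    priors:     (C,)   prior probabilities pi_c
    """
    # Row c of W is Sigma^{-1} mu_c; solve a linear system instead of inverting.
    W = np.linalg.solve(pooled_cov, means.T).T
    # b_c = -1/2 * mu_c^T Sigma^{-1} mu_c + log(pi_c)
    b = -0.5 * np.einsum("cd,cd->c", means, W) + np.log(priors)
    return W, b

# Hypothetical two-class example with identity covariance and equal priors.
means = np.array([[0.0, 0.0], [2.0, 2.0]])
W, b = lda_layer_init(means, np.eye(2), np.array([0.5, 0.5]))
x = np.array([2.0, 2.0])
predicted_class = int(np.argmax(W @ x + b))  # class with the largest discriminant
```

With these inputs the layer output `W @ x + b` is the vector of discriminants, and taking the arg max over its entries gives the classification rule quoted above.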
Both Yosinski in view of Krahenbuhl and Alberti are analogous art since they both teach performing layer-wise weight initialization for deep neural networks.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the data-dependent weight initialization method taught in Yosinski in view of Krahenbuhl and enhance it to incorporate the linear discriminant analysis-based weight initialization method taught in Alberti as a way to initialize the weights for a neural network layer (including the output layer). The motivation to combine is taught in Alberti, as applying the LDA weight initialization provides faster convergence, is more stable, and exhibits better performance when compared to random weight initialization, thus improving the overall performance and robustness of a system that uses this weight initialization procedure (Alberti p.4 col.1 Section 3.3 2nd-3rd paragraphs; p.4 col.2 Section 5 1st paragraph; p.5 col.1 1st paragraph; and p.5 col.1 Section 6 1st paragraph: “… we have investigated a new approach for initializing DNN using LDA. We show that such initialization is more stable, converge faster and to better performances than the random weights initialization.”).
Regarding amended Claim 5, 
Yosinski in view of Krahenbuhl, in further view of Alberti teaches
(Currently amended) The method of claim 4, wherein the distribution is Gaussian (Examiner’s note: Under its broadest reasonable interpretation, this limitation broadly recites a normalized distribution based on mean and variance. As indicated earlier, Krahenbuhl teaches performing a within-layer normalization procedure for affine layers in a convolutional neural network (where the affine layers represent the fully-connected layers and thus the last hidden layers of a neural network), where this normalization procedure involves computing the per-channel sample mean and variance for a set of samples (representing a distribution of data) in order to rescale the weights in the affine layers. This normalization process to rescale the weights in the affine layers represents an evaluation of a distribution of the data in a feature space, where this evaluation involves normalizing the parameters according to a sample mean and variance, which makes the distribution a Gaussian distribution (Krahenbuhl p.4 Algorithm 1 and p.4 Section 3.1 1st-2nd paragraphs).).
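For illustration only, the within-layer normalization described in the note above (computing a per-channel sample mean and variance over a set of samples and rescaling the weights of an affine layer) can be sketched as follows. This is a minimal sketch and not Krahenbuhl's Algorithm 1: the function name, the recentring of the bias, and the target of zero mean and unit variance are assumptions made for the example.

```python
import numpy as np

def normalize_affine_layer(W, b, samples, eps=1e-8):
    """Rescale an affine layer y = W @ x + b using per-channel sample statistics.

    W: (C, D) weights, b: (C,) biases, samples: (N, D) inputs to the layer.
    """
    outputs = samples @ W.T + b        # (N, C) layer outputs for the sample set
    mean = outputs.mean(axis=0)        # per-channel sample mean
    std = outputs.std(axis=0) + eps    # per-channel sample standard deviation
    W_new = W / std[:, None]           # rescale each row (channel) of W
    b_new = (b - mean) / std           # assumption: also recentre the bias
    return W_new, b_new

# Hypothetical usage on random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 5))
W0, b0 = rng.normal(size=(3, 5)), rng.normal(size=3)
W1, b1 = normalize_affine_layer(W0, b0, X)
normalized = X @ W1.T + b1  # each channel now has roughly zero mean, unit variance
```

The sketch only standardizes the first two moments of each channel over the sample set; it makes no claim about the shape of the resulting distribution.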
Regarding original Claim 6, 
Yosinski in view of Krahenbuhl, in further view of Alberti teaches
(Original) The method of claim 4, wherein the approximating is based on at least one of class centroid statistics and shared covariance matrix statistics (Examiner’s note: As indicated earlier, Alberti teaches using linear discriminant analysis to initialize the weights of a CNN layer-wise with data-based values, where linear discriminant analysis is further used to estimate a linear classifier (represented by a linear matrix equation) using within-class mean $\mu_c$ (representing class centroid statistics) and pooled covariance matrix $\Sigma_c$ (representing shared covariance matrix statistics) (Alberti p.1 col.2 last paragraph; p.3 col.1 4th paragraph: “… Let $\mu_c$ denote the within-class mean of class c …”; and p.3 Section 3 LDA as Classifier: “ … one must compute the discriminant functions $\delta_c$ for each class c: $\delta_c = x^T \Sigma_c^{-1} \mu_c - \frac{1}{2}\mu_c^T \Sigma_c^{-1} \mu_c + \log \pi_c$ (7) where $\pi_c$ and $\Sigma_c$ are the prior probability [9] and the pooled covariance matrix, for the class c. … An observation x will then be classified into class c as: $c = \arg\max \delta_c(x)$ … The entire vector 𝛅 can be computed in a matrix form (for all classes) given an input vector x: $\delta = W \cdot x + b$ … To initialize a neural layer to compute it we set the initial values of the bias $b_c$ to the constant part of Equation 7: $b_c = -\frac{1}{2}\mu_c^T \Sigma_c^{-1} \mu_c + \log \pi_c$ (11) and the rows of the weight matrix W to be the linear part of Equation 7, such that at the row c we have $\Sigma_c^{-1} \mu_c$.”).).
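For illustration only, the class centroid and shared covariance statistics referred to above can be estimated from labelled feature vectors as sketched below. This is a minimal sketch and not code from Alberti: the function name `class_statistics` and the N − C normalization of the pooled covariance (one common convention) are assumptions made for the example.

```python
import numpy as np

def class_statistics(features, labels):
    """Estimate class centroids, pooled covariance and priors.

    features: (N, D) array of feature vectors, labels: (N,) integer class ids.
    """
    classes = np.unique(labels)  # sorted distinct class ids
    n = features.shape[0]
    # Class centroid: mean feature vector of each class.
    means = np.stack([features[labels == c].mean(axis=0) for c in classes])
    # Prior probability: fraction of samples belonging to each class.
    priors = np.array([(labels == c).mean() for c in classes])
    # Pooled covariance: within-class scatter summed over all classes,
    # divided by N - C (assumed normalization).
    centered = features - means[np.searchsorted(classes, labels)]
    pooled_cov = centered.T @ centered / (n - len(classes))
    return means, pooled_cov, priors

# Hypothetical toy data: two classes, two samples each.
feats = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 10.0]])
labs = np.array([0, 0, 1, 1])
means, pooled_cov, priors = class_statistics(feats, labs)
```

These three quantities are the inputs that Equation 7 requires for each class: the centroid, the pooled covariance, and the prior.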
Regarding amended Claim 10, 
Yosinski in view of Krahenbuhl as applied to Claim 9 teaches
(Currently amended) The method of claim 9, wherein the estimating the initializing values includes:
… approximating a distribution of the features for each class of data (Examiner’s note: Under its broadest reasonable interpretation, this limitation broadly recites performing an estimation procedure involving estimating parameters for an output layer, where this estimation is based on an approximation based on a data distribution in a feature space from a hidden layer of a DNN. As indicated earlier, Krahenbuhl teaches performing a within-layer normalization procedure for affine layers in a convolutional neural network (where the affine layers represent the fully-connected layers and thus the last hidden layers of a neural network), where this normalization procedure involves computing the per-channel sample mean and variance for a set of samples (representing a distribution of data) in order to rescale the weights in the affine layers, such that this normalization process to rescale the weights in the affine layers represents an evaluation of a distribution of the data in a feature space (Krahenbuhl p.4 Algorithm 1 and p.4 Section 3.1 1st-2nd paragraphs). As indicated earlier, Krahenbuhl teaches performing a data dependent initialization procedure for the weights in a neural network layer by estimating the expected norm of the gradient with respect to weights $\tilde{C}_{k,j}^{2}$ through an approximation involving computing an estimate of the columns of the weight matrix $W_k$ to enforce weight learning at the same rate (where this estimation of the columns of the weight matrix represents a linear approximation). Krahenbuhl further teaches first performing a within-layer initialization at the affine layers to normalize those weights at the affine layers, and then performing a between-layer normalization iterative procedure to compute the scale correction to correct the weights and biases for all layers using the estimation of the columns of the weight matrix to obtain roughly constant weight parameter change rates $C_{k,i}^{2}$ across all layers in a neural network. Krahenbuhl further teaches that although random weights were initially used, the described procedure is updated to support PCA-based initialization or k-means based weight initialization. Hence, this data dependent initialization process involving an estimation of weights across all layers (based on the evaluated within-layer initialization of normalized weights at the affine layers representing a data distribution in a feature space) corresponds to a process for “estimating initializing values for parameters of an output layer of the first DNN by finding an approximate solution to each of the one or more classification tasks, wherein the approximate solutions are based on a data distribution in the feature space of the first DNN” (Krahenbuhl p.3 Section 3 1st-3rd paragraphs; p.5 Algorithm 2 and p.5 2nd paragraph; and p.5 Section 3.3 Weight Initializations 1st paragraph).) …
… the distributions having Gaussian distributions (Examiner’s note: Under its broadest reasonable interpretation, this limitation broadly recites a normalized distribution based on mean and variance. As indicated earlier, Krahenbuhl teaches performing a within-layer normalization procedure for affine layers in a convolutional neural network (where the affine layers represent the fully-connected layers and thus the last hidden layers of a neural network), where this normalization procedure involves computing the per-channel sample mean and variance for a set of samples (representing a distribution of data) in order to rescale the weights in the affine layers. This normalization process to rescale the weights in the affine layers represents an evaluation of a distribution of the data in a feature space, where this evaluation involves normalizing the parameters according to a sample mean and variance, which makes the distribution a Gaussian distribution (Krahenbuhl p.4 Algorithm 1 and p.4 Section 3.1 1st-2nd paragraphs).) …
While Yosinski in view of Krahenbuhl teaches a data-dependent weight initialization procedure that results in approximating the weights in all layers learning at the same rate (where this approximation to learn the weights at the same rate corresponds to a linear expression), Yosinski in view of Krahenbuhl does not explicitly teach
… a shared covariance …
… deriving a linear classifier based on the distribution …
… calculating the initializing values of the output layer using the derived linear classifier.
Alberti teaches
… a shared covariance (Examiner’s note: As indicated earlier, Alberti teaches using linear discriminant analysis to initialize the weights of a CNN layer-wise with data-based values, where linear discriminant analysis is further used to estimate a linear classifier using within-class mean $\mu_c$ and pooled covariance matrix $\Sigma_c$ (representing shared covariance matrix statistics) (Alberti p.1 col.2 last paragraph: “…we present a novel initialization method based on LDA which allows us to quickly initialize the weights of a CNN layer-wise with data-based values.”; p.3 col.1 4th paragraph; and p.3 Section 3 LDA as Classifier: “… one must compute the discriminant functions $\delta_c$ for each class c: $\delta_c = x^T \Sigma_c^{-1} \mu_c - \frac{1}{2}\mu_c^T \Sigma_c^{-1} \mu_c + \log \pi_c$ (7) where $\pi_c$ and $\Sigma_c$ are the prior probability [9] and the pooled covariance matrix, for the class c. … An observation x will then be classified into class c as: $c = \arg\max \delta_c(x)$ … The entire vector 𝛅 can be computed in a matrix form (for all classes) given an input vector x: $\delta = W \cdot x + b$ … To initialize a neural layer to compute it we set the initial values of the bias $b_c$ to the constant part of Equation 7:
                            
                                
                                    b
                                
                                
                                    c
                                
                            
                            =
                            -
                            
                                
                                    1
                                
                                
                                    2
                                
                            
                            
                                
                                    μ
                                
                                
                                    c
                                
                                
                                    T
                                
                            
                            
                                
                                    Σ
                                
                                
                                    c
                                
                                
                                    -
                                    1
                                
                            
                            
                                
                                    μ
                                
                                
                                    c
                                
                            
                            +
                            
                                
                                    log
                                
                                ⁡
                                
                                    
                                        
                                            
                                                
                                                    π
                                                
                                                
                                                    c
                                                
                                            
                                        
                                    
                                     
                                     
                                    
                                        
                                            11
                                        
                                    
                                
                            
                        
                     and the rows of the weight matrix W to be the linear part of Equation 7, such that at the row c we have                         
                            
                                
                                    Σ
                                
                                
                                    c
                                
                                
                                    -
                                    1
                                
                            
                            
                                
                                    μ
                                
                                
                                    c
                                
                            
                        
                    .”).) …
… deriving a linear classifier based on the distribution (Examiner’s note: As indicated earlier, Alberti teaches using linear discriminant analysis to initialize the weights of a CNN layer-wise with data-based values, where linear discriminant analysis is further used to estimate a linear classifier (represented by a linear matrix equation) using within-class mean $\mu_c$ (representing class centroid statistics) and pooled covariance matrix $\Sigma_c$ (representing shared covariance matrix statistics) (Alberti p.1 col.2 last paragraph: “…we present a novel initialization method based on LDA which allows us to quickly initialize the weights of a CNN layer-wise with data-based values.”; p.3 col.1 4th paragraph: “… Let $\mu_c$ denote the within-class mean of class c …”; and p.3 Section 3 LDA as Classifier: “Even though LDA is most used for dimensionality reduction, it can be used to directly perform data classification … one must compute the discriminant functions $\delta_c$ for each class c: $\delta_c = x^T \Sigma_c^{-1} \mu_c - \frac{1}{2} \mu_c^T \Sigma_c^{-1} \mu_c + \log(\pi_c)$ (7) where $\pi_c$ and $\Sigma_c$ are the prior probability [9] and the pooled covariance matrix, for the class c. … An observation x will then be classified into class c as: $c = \arg\max \delta_c(x)$ … The entire vector 𝛅 can be computed in a matrix form (for all classes) given an input vector x: $\delta = W \cdot x + b$ … To initialize a neural layer to compute it we set the initial values of the bias $b_c$ to the constant part of Equation 7: $b_c = -\frac{1}{2} \mu_c^T \Sigma_c^{-1} \mu_c + \log(\pi_c)$ (11) and the rows of the weight matrix W to be the linear part of Equation 7, such that at the row c we have $\Sigma_c^{-1} \mu_c$.”).) …
… calculating the initializing values of the output layer using the derived linear classifier (Examiner’s note: As indicated earlier, Alberti further teaches that this LDA initialization procedure is used in an iterative fashion to calculate the weights from the first layer to the last layer of a deep neural network, where this calculation, which includes the last layer, represents calculating the initializing values of the output layer using the derived linear classifier (Alberti p.4 Section 3.3 Experimental Setup 3rd paragraph).).
Both Yosinski in view of Krahenbuhl and Alberti are analogous art since they both teach performing layer-wise weight initialization for deep neural networks.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the data-dependent weight initialization method taught in Yosinski in view of Krahenbuhl and enhance it to incorporate the linear discriminant analysis-based weight initialization method taught in Alberti as a way to initialize the weights for a neural network layer (including the output layer). The motivation to combine is taught in Alberti, as provided in the prior art claim mapping from Claim 4.
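For illustration only, the LDA-based initialization described in the Alberti passages cited above (Equations 7 and 11) can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions, not code from Alberti or from Applicant's specification: the function name `lda_init`, the small `ridge` term added for numerical stability, and the synthetic two-class data are all hypothetical, and a single pooled covariance matrix shared by every class is assumed.

```python
import numpy as np

def lda_init(features, labels, num_classes, ridge=1e-6):
    """Return (W, b) such that W @ x + b yields the LDA discriminants.

    Hypothetical helper: row c of W is Sigma^{-1} mu_c (linear part of
    Equation 7) and b[c] is the constant part (Equation 11).
    """
    n, d = features.shape
    means = np.zeros((num_classes, d))
    priors = np.zeros(num_classes)
    pooled = np.zeros((d, d))
    for c in range(num_classes):
        xc = features[labels == c]
        means[c] = xc.mean(axis=0)            # within-class mean mu_c
        priors[c] = len(xc) / n               # prior probability pi_c
        centered = xc - means[c]
        pooled += centered.T @ centered       # accumulate within-class scatter
    # Pooled covariance; the ridge term is an assumption, not part of Equation 7.
    pooled = pooled / (n - num_classes) + ridge * np.eye(d)
    # Solve Sigma @ w_c = mu_c for every class instead of inverting Sigma.
    W = np.linalg.solve(pooled, means.T).T
    b = -0.5 * np.sum(means * W, axis=1) + np.log(priors)
    return W, b

# Synthetic data standing in for last-hidden-layer features (assumption).
rng = np.random.default_rng(0)
x0 = rng.normal(loc=[-3.0, 0.0], scale=1.0, size=(200, 2))
x1 = rng.normal(loc=[3.0, 0.0], scale=1.0, size=(200, 2))
features = np.vstack([x0, x1])
labels = np.repeat([0, 1], 200)

W, b = lda_init(features, labels, num_classes=2)
delta = features @ W.T + b                    # delta = W . x + b for each sample
predicted = delta.argmax(axis=1)              # c = arg max delta_c(x)
print("training accuracy:", (predicted == labels).mean())
```

In this sketch the returned `W` and `b` would serve as the initial weights and biases of a linear output layer, which is the role the cited passages assign to the derived linear classifier.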
Regarding amended Claim 12, 
Yosinski in view of Krahenbuhl, in further view of Alberti teaches
(Currently amended) The method of claim 10, 
wherein the estimating the initializing values is based on how data is distributed in the feature space (Examiner’s note: Under its broadest reasonable interpretation, this limitation broadly recites the approximation procedure based on the Gaussian distribution and a shared covariance recited in dependent Claim 10, where the Gaussian distribution and a shared covariance represent “how” the data is distributed in the feature space. As indicated earlier, Krahenbuhl teaches performing a within-layer normalization procedure for affine layers in a convolutional neural network (where the affine layers represent the fully-connected layers and thus the last hidden layers of a neural network), where this normalization procedure involves computing the per-channel sample mean and variance for a set of samples (representing a distribution of data) in order to rescale the weights in the affine layers. This normalization process to rescale the weights in the affine layers represents an evaluation of a distribution of the data in a feature space, where this evaluation involves normalizing the parameters according to a sample mean and variance, which makes the distribution a Gaussian distribution (Krahenbuhl p.4 Algorithm 1 and p.4 Section 3.1 1st-2nd paragraphs). Additionally, as indicated earlier, Alberti teaches using linear discriminant analysis to initialize the weights of a CNN layer-wise with data-based values, where linear discriminant analysis is further used to estimate a linear classifier (represented by a linear matrix equation) using within-class mean $\mu_c$ (representing class centroid statistics) and pooled covariance matrix $\Sigma_c$ (representing shared covariance matrix statistics) (Alberti p.1 col.2 last paragraph; p.3 col.1 4th paragraph; and p.3 Section 3 LDA as Classifier).).
Regarding previously presented Claim 17, 
Yosinski in view of Krahenbuhl as applied to Claim 14 teaches
(Previously presented) The system of claim 14.
While Yosinski in view of Krahenbuhl teaches a data-dependent weight initialization procedure that results in approximating the weights in all layers learning at the same rate (where this approximation to learn the weights at the same rate corresponds to a linear expression), Yosinski in view of Krahenbuhl does not explicitly teach
wherein the approximate solutions are resolved via results of a variant of a linear discriminant analysis algorithm.
Alberti teaches
wherein the approximate solutions are resolved via results of a variant of a linear discriminant analysis algorithm (Examiner’s note: As indicated earlier, Alberti teaches using linear discriminant analysis to initialize the weights of a CNN layer-wise with data-based values, where linear discriminant analysis is further used to estimate a linear classifier (represented by a linear matrix equation) (Alberti p.1 col.2 last paragraph; and p.3 Section 3 LDA as Classifier).).
Both Yosinski in view of Krahenbuhl and Alberti are analogous art since they both teach performing layer-wise weight initialization for deep neural networks.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the data-dependent weight initialization method taught in Yosinski in view of Krahenbuhl and enhance it to incorporate the linear discriminant analysis-based weight initialization method taught in Alberti as a way to initialize the weights for a neural network layer (including the output layer). The motivation to combine is taught in Alberti, as provided in the prior art claim mapping from Claim 4.
Regarding amended Claim 18, 
Yosinski in view of Krahenbuhl as applied to Claim 14 teaches
(Currently amended) The system of claim 14, wherein the output level initializing logic estimates the parameters of the output level by: 
… finding an approximate solution to the first classification task; approximating a distribution of features for the first classification task (Examiner’s note: Under its broadest reasonable interpretation, these two limitations broadly recite performing an estimation procedure involving estimating parameters for an output layer, where this estimation is based on an approximation based on a data distribution in a feature space from a hidden layer of an artificial neural network. As indicated earlier, Krahenbuhl teaches performing a within-layer normalization procedure for affine layers in a convolutional neural network (where the affine layers represent the fully-connected layers and thus the last hidden layers of a neural network), where this normalization procedure involves computing the per-channel sample mean and variance for a set of samples (representing a distribution of data) in order to rescale the weights in the affine layers, such that this normalization process to rescale the weights in the affine layers represents an evaluation of a distribution of the data in a feature space (Krahenbuhl p.4 Algorithm 1 and p.4 Section 3.1 1st-2nd paragraphs). As indicated earlier, Krahenbuhl teaches performing a data dependent initialization procedure for the weights in a neural network layer by estimating the expected norm of the gradient with respect to weights $\tilde{C}_{k,j}^{2}$ through an approximation involving computing an estimate of the columns of the weight matrix $W_k$ to enforce weight learning at the same rate (where this estimation of the columns of the weight matrix represents a linear approximation). Krahenbuhl further teaches first performing a within-layer initialization at the affine layers to normalize those weights at the affine layers, and then performing a between-layer normalization iterative procedure to compute the scale correction to correct the weights and biases for all layers using the estimation of the columns of the weight matrix to obtain roughly constant weight parameter change rates $C_{k,i}^{2}$ across all layers in a neural network. Krahenbuhl further teaches that although random weights were initially used, the described procedure is updated to support PCA-based initialization or k-means based weight initialization. Hence, this data dependent initialization process involving an estimation of weights across all layers (based on the evaluated within-layer initialization of normalized weights at the affine layers representing a data distribution in a feature space) corresponds to a process for “estimating initializing values for parameters of an output layer of the first DNN by finding an approximate solution to each of the one or more classification tasks, wherein the approximate solutions are based on a data distribution in the feature space of the first DNN” (Krahenbuhl p.3 Section 3 1st-3rd paragraphs; p.5 Algorithm 2 and p.5 2nd paragraph; and p.5 Section 3.3 Weight Initializations 1st paragraph).) …
While Yosinski in view of Krahenbuhl teaches a data-dependent weight initialization procedure that results in approximating the weights in all layers learning at the same rate (where this approximation to learn the weights at the same rate corresponds to a linear expression), Yosinski in view of Krahenbuhl does not explicitly teach
… deriving a linear classifier based on results of the approximating, the linear classifier being usable to initialize the parameters of the output layer.
Alberti teaches
… deriving a linear classifier based on results of the approximating, the linear classifier being usable to initialize the parameters of the output layer (Examiner’s note: Under its broadest reasonable interpretation in light of Applicant’s specification paragraph [0090] and associated equation (6), this limitation broadly recites the determination of a linear classifier, where the linear classifier is represented by a bias calculation based on a weight matrix and a corresponding class mean. As indicated earlier, Alberti teaches using linear discriminant analysis to initialize the weights of a CNN layer-wise with data-based values, where linear discriminant analysis is further used to estimate a linear classifier using within-class mean $\mu_c$ (representing class centroid statistics) and pooled covariance matrix $\Sigma_c$ (representing shared covariance matrix statistics). Alberti p.3 Equation 7, the classification rule $c = \arg\max \delta_c(x)$, and Equation 11 further teach the determination of an optimal linear classifier involving a bias calculation based on a weight matrix and a corresponding class mean (Alberti p.1 col.2 last paragraph; p.3 col.1 4th paragraph; and p.3 Section 3 LDA as Classifier). As indicated earlier, Alberti further teaches that this LDA initialization procedure is used in an iterative fashion to calculate the weights from the first layer to the last layer of a deep neural network (Alberti p.4 Section 3.3 Experimental Setup 3rd paragraph).).
Both Yosinski in view of Krahenbuhl and Alberti are analogous art since they both teach performing layer-wise weight initialization for deep neural networks.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the data-dependent weight initialization method taught in Yosinski in view of Krahenbuhl and enhance it to incorporate the linear discriminant analysis-based weight initialization method taught in Alberti as a way to initialize the weights for a neural network layer (including the output layer). The motivation to combine is taught in Alberti, as provided in the prior art claim mapping from Claim 4.
Regarding original Claim 19, 
Yosinski in view of Krahenbuhl, in further view of Alberti teaches
(Original) The system of claim 18, wherein the distribution is Gaussian (Examiner’s note: Under its broadest reasonable interpretation, this limitation broadly recites a normalized distribution based on mean and variance. As indicated earlier, Krahenbuhl teaches performing a within-layer normalization procedure for affine layers in a convolutional neural network (where the affine layers represent the fully-connected layers and thus the last hidden layers of a neural network), where this normalization procedure involves computing the per-channel sample mean and variance for a set of samples (representing a distribution of data) in order to rescale the weights in the affine layers. This normalization process to rescale the weights in the affine layers represents an evaluation of a distribution of the data in a feature space, where this evaluation involves normalizing the parameters according to a sample mean and variance, which makes the distribution a Gaussian distribution (Krahenbuhl p.4 Algorithm 1 and p.4 Section 3.1 1st-2nd paragraphs).).
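For illustration only, the within-layer normalization referenced in the note above may be sketched as follows. This is a simplified sketch and not Krahenbuhl's Algorithm 1 as published: it assumes a single affine layer z = X·Wᵀ + b evaluated on one batch of samples, and the names `normalize_affine_layer`, `W`, `b`, `X`, and `eps` are hypothetical.

```python
import numpy as np

def normalize_affine_layer(W, b, X, eps=1e-8):
    """Rescale an affine layer so each output channel has roughly zero
    sample mean and unit sample variance on the batch X.

    W: (n_out, n_in) weights, b: (n_out,) biases, X: (n_samples, n_in).
    eps is an assumed safeguard against division by zero.
    """
    z = X @ W.T + b               # pre-activations, shape (n_samples, n_out)
    mean = z.mean(axis=0)         # per-channel sample mean
    std = z.std(axis=0) + eps     # per-channel sample standard deviation
    W_new = W / std[:, None]      # rescale each output channel's weights
    b_new = (b - mean) / std      # shift so the channel mean becomes zero
    return W_new, b_new
```

After the rescaling, the layer output on the same batch equals (z - mean) / std for each channel, which is the sense in which the parameters are normalized according to a sample mean and variance.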
Regarding previously presented Claim 25, 
Yosinski in view of Krahenbuhl as applied to Claim 14 teaches
(Previously presented) The system of claim 14.
 While Yosinski in view of Krahenbuhl teaches a data-dependent weight initialization procedure that results in approximating the weights in all layers learning at the same rate (where this approximation to learn the weights at the same rate corresponds to a linear expression), Yosinski in view of Krahenbuhl does not explicitly teach
wherein the level initializing logic includes a linear classifier that is configured to initialize the parameters of the output level, wherein the linear classifier is derived based upon the data distribution in the feature space.
Alberti teaches
wherein the level initializing logic includes a linear classifier that is configured to initialize the parameters of the output level, wherein the linear classifier is derived based upon the data distribution in the feature space (Examiner’s note: Under its broadest reasonable interpretation, this limitation broadly recites a process that specifies a linear classifier to initialize the parameters of an output layer in a neural network. Alberti teaches using linear discriminant analysis to initialize the weights of a CNN layer-wise with data-based values, where linear discriminant analysis is further used to estimate a linear classifier using within-class mean $\mu_c$ (representing class centroid statistics) and pooled covariance matrix $\Sigma_c$ (representing shared covariance matrix statistics). Alberti p.3 Equations 7, $c = \arg\max \delta_c(x)$, and Equations 11 further teach the determination of a linear classifier involving a bias calculation based on a weight matrix and a corresponding class mean. Alberti further teaches the linear classifier is also based on prior probability $\pi_c$, where these prior probabilities represent the data distribution in a feature space for each class (Alberti p.1 col.2 last paragraph; p.3 col.1 4th paragraph: “… Let $\mu_c$ denote the within-class mean of class c …”; and p.3 Section 3 LDA as Classifier: “… one must compute the discriminant functions $\delta_c$ for each class c: $\delta_c = x^T \Sigma_c^{-1} \mu_c - \frac{1}{2} \mu_c^T \Sigma_c^{-1} \mu_c + \log(\pi_c)$ (7) where $\pi_c$ and $\Sigma_c$ are the prior probability [9] and the pooled covariance matrix, for the class c. … The entire vector 𝛅 can be computed in a matrix form (for all classes) given an input vector x: $\delta = W \cdot x + b$ … To initialize a neural layer to compute it we set the initial values of the bias $b_c$ to the constant part of Equation 7: $b_c = -\frac{1}{2} \mu_c^T \Sigma_c^{-1} \mu_c + \log(\pi_c)$ (11) and the rows of the weight matrix W to be the linear part of Equation 7, such that at the row c we have $\Sigma_c^{-1} \mu_c$.”). As indicated earlier, Alberti further teaches that this LDA initialization procedure is used in an iterative fashion to calculate the weights from the first layer to the last layer of a deep neural network (Alberti p.4 Section 3.3 Experimental Setup 3rd paragraph).).
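For illustration only, the initialization quoted above from Alberti Equations 7 and 11 may be sketched as follows. This is a minimal sketch under stated assumptions, not Alberti's implementation: it assumes a single pooled covariance matrix shared by all classes, adds a small ridge term `reg` (an assumption, not from the reference) so that the matrix is invertible, and uses hypothetical names `lda_init` and `lda_predict`.

```python
import numpy as np

def lda_init(X, y, reg=1e-6):
    """Derive weights W and biases b of a linear layer from LDA statistics.

    X: (n_samples, n_features) features, y: (n_samples,) class labels.
    Row c of W is Sigma^{-1} mu_c; b_c = -1/2 mu_c^T Sigma^{-1} mu_c + log(pi_c).
    """
    classes = np.unique(y)
    n, d = X.shape
    means = np.stack([X[y == c].mean(axis=0) for c in classes])   # mu_c
    priors = np.array([(y == c).mean() for c in classes])         # pi_c
    # Pooled within-class covariance (assumed shared across classes).
    centered = X - means[np.searchsorted(classes, y)]
    cov = centered.T @ centered / (n - len(classes))
    cov += reg * np.eye(d)        # assumed ridge term for invertibility
    W = np.linalg.solve(cov, means.T).T                           # (n_classes, d)
    b = -0.5 * np.sum(means * W, axis=1) + np.log(priors)
    return classes, W, b

def lda_predict(classes, W, b, X):
    """Evaluate delta = W x + b for each sample and return the arg max class."""
    delta = X @ W.T + b
    return classes[np.argmax(delta, axis=1)]
```

In this sketch the returned `W` and `b` would serve as the initial parameters of a fully-connected layer with one output per class, and `lda_predict` corresponds to selecting the class c that maximizes $\delta_c(x)$.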
Both Yosinski in view of Krahenbuhl and Alberti are analogous art since they both teach performing layer-wise weight initialization for deep neural networks.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the data-dependent weight initialization method taught in Yosinski in view of Krahenbuhl and enhance it to incorporate the linear discriminant analysis-based weight initialization method taught in Alberti as a way to initialize the weights for a neural network layer (including the output layer). The motivation to combine is taught in Alberti, as provided in the prior art claim mapping from Claim 4.
Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over 
Yosinski et al., How transferable are features in deep neural networks?, November 6 2014 [hereafter referred as Yosinski] in view of Krahenbuhl et al., Data-Dependent Initializations of Convolutional Neural Networks, September 22 2016 [hereafter referred as Krahenbuhl] as applied to Claim 1; in further view of Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012 [hereafter referred as Krizhevsky].
Regarding original Claim 7, 
Yosinski in view of Krahenbuhl as applied to Claim 1 teaches
(Original) The method of claim 1,
the at least one hidden layer comprises a plurality of hidden layers (Examiner’s note: Under its broadest reasonable interpretation, this limitation broadly recites the structure of a DNN. As indicated earlier, Yosinski p.4 Figure 1 teaches both base and target networks have a deep convolutional neural network structure that contains an input layer, an output layer, and one or more hidden layers between the input and output layer, with interconnections between the input and the first hidden layer, and the last hidden layer and the output layer, where these activations to produce these features represent a data transformation (Yosinski p.3 2nd paragraph; and p.4 Figure 1, including caption).) …
… a lowest one of the plurality of hidden layers receives an output from the input layer (Examiner’s note: Under its broadest reasonable interpretation, this limitation broadly recites the structure of a DNN. As indicated earlier, Yosinski p.4 Figure 1 teaches both base and target networks have a deep convolutional neural network structure that contains an input layer, an output layer, and one or more hidden layers between the input and output layer, with interconnections between the input and the first hidden layer, and the last hidden layer and the output layer, where the first hidden layer (in the example shown, $W_{A1}$) is connected to an input layer, and receives output from the input layer.) …
… the output layer receives an output from a highest one of the plurality of hidden layers (Examiner’s note: Under its broadest reasonable interpretation, this limitation broadly recites the structure of a DNN. As indicated earlier, Yosinski p.4 Figure 1 teaches both base and target networks have a deep convolutional neural network structure that contains an input layer, an output layer, and one or more hidden layers between the input and output layer, with interconnections between the input and the first hidden layer, and the last hidden layer and the output layer, where the last hidden layer (in the example shown, $W_{A8}$) is connected to an output layer, which receives output from the last hidden layer.).
	While Yosinski in view of Krahenbuhl teaches the structure of a deep convolutional neural network having an input layer with a set of nodes, an output layer with a set of nodes, and one or more hidden layers (Yosinski p.4 Figure 1), Yosinski in view of Krahenbuhl does not explicitly teach
… each hidden layer comprises a respective plurality of nodes, each node in a hidden layer being configured to perform a transformation on output of at least one node from an adjacent, lower layer …
Krizhevsky teaches
… each hidden layer comprises a respective plurality of nodes, each node in a hidden layer being configured to perform a transformation on output of at least one node from an adjacent, lower layer (Examiner’s note: Under its broadest reasonable interpretation, this limitation broadly recites the node connectivity in each hidden layer of a deep neural network. Krizhevsky p.5 Figure 2 teaches that the kernels in each convolutional layer and the fully-connected layers in a deep convolutional neural network (representing the hidden layers in a deep neural network) consist of neurons, where these neurons correspond to a plurality of nodes, and where each set of nodes in a hidden layer is connected to at least one node from the previous hidden layer, with ReLU non-linearity functions applied to the output of every convolutional and fully-connected layer. A person having ordinary skill in the art would understand that these ReLU non-linearity functions perform data transformations to generate the neuron activation patterns to be propagated to the next hidden layer (Krizhevsky p.3 Section 3.1; Section 3.2 1st paragraph: “… The parallelization scheme that we employ essentially puts half of the kernels (or neurons) on each GPU … the kernels of layer 3 take input from all kernel maps in layer 2 …”; and pp.4-5 Section 3.5 1st-3rd paragraphs: “… the overall architecture of our CNN. As depicted in Figure 2, the net contains eight layers with weights; the first five are convolutional and the remaining three are fully-connected. … The kernels of the second, fourth, and fifth convolutional layers are connected only to those kernel maps in the previous layer … The kernels of the third convolutional layer are connected to all kernel maps in the second layer. The neurons in the fully-connected layers are connected to all neurons in the previous layer. … The ReLU non-linearity is applied to the output of every convolutional and fully-connected layer. … The fully-connected layers have 4096 neurons each.”; and p.5 Figure 2).) …
Both Yosinski in view of Krahenbuhl and Krizhevsky are analogous art since they both teach classification using a deep convolutional neural network.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the deep convolutional neural network structure taught in Yosinski in view of Krahenbuhl and incorporate the deep convolutional neural network structure taught in Krizhevsky as a way to apply GPU parallelization of the convolution operation in each of the convolutional layers to improve training time for large datasets. The motivation to combine is taught in Krizhevsky, where the neuron connections in certain convolutional layers are arranged so that the GPUs read from and write to one another’s memory directly, bypassing host machine memory, thus improving the computational efficiency for training a system utilizing this deep convolutional neural network structure (Krizhevsky p.3 Section 3.2 1st-2nd paragraphs: “… It turns out that 1.2 million training examples are enough to train networks which are too big to fit on one GPU. Therefore we spread the net across two GPUs. Current GPUs are particularly well-suited to cross-GPU parallelization, as they are able to read from and write to one another’s memory directly, without going through host machine memory. The parallelization scheme that we employ essentially puts half of the kernels (or neurons) on each GPU, with one additional trick: the GPUs communicate only in certain layers … this allows us to precisely tune the amount of communication until it is an acceptable fraction of the amount of computation … The two-GPU net takes slightly less time to train than the one-GPU net.”).
Claim 13 is rejected under 35 U.S.C. 103 as being unpatentable over 
Yosinski et al., How transferable are features in deep neural networks?, November 6 2014 [hereafter referred as Yosinski] in view of Krahenbuhl et al., Data-Dependent Initializations of Convolutional Neural Networks, September 22 2016 [hereafter referred as Krahenbuhl], in further view of Alberti et al., Historical Document Image Segmentation with LDA-Initialized Deep Neural Networks, October 19 2017 [hereafter referred as Alberti] as applied to Claim 10; in further view of Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012 [hereafter referred as Krizhevsky].
Regarding previously presented Claim 13, 
Yosinski in view of Krahenbuhl, in further view of Alberti as applied to Claim 10 teaches
(Previously presented) The method of claim 10.
While Yosinski in view of Krahenbuhl, in further view of Alberti teaches a pooled covariance matrix (where the pooled covariance matrix represents the weights for a layer, Alberti p.3 col.1 2nd-4th paragraphs), Yosinski in view of Krahenbuhl, in further view of Alberti does not explicitly teach
… introducing a regularization term to a covariance matrix so as to minimize variability of covariance matrix estimation.
Krizhevsky teaches
… introducing a regularization term to a covariance matrix so as to minimize variability of covariance matrix estimation (Examiner’s note: Under its broadest reasonable interpretation in light of Applicant’s specification [0091], the limitation broadly recites applying a value to a covariance matrix containing eigenvalues and associated eigenvectors. Krizhevsky teaches a data augmentation method of applying random values $\alpha_i$ to eigenvectors $p_i$ and eigenvalues $\lambda_i$ to alter the intensities of the RGB channels in training images, as a way to reduce overfitting of image data, and as such, these $\alpha_i$ values represent a regularization term to minimize variability (Krizhevsky pp.5-6 Section 4.1 1st-3rd paragraphs: “The easiest and most common method to reduce overfitting on image data is to artificially enlarge the dataset using label-preserving transformations … We employ two distinct forms of data augmentation, both of which allow transformed images to be produced from the original images with very little computation, so the transformed images do not need to be stored on disk … The second form of data augmentation consists of altering the intensities of the RGB channels in the training images … we perform PCA on the set of RGB pixels … we add multiples of the found principal components, with magnitudes proportional to the corresponding eigenvalues times a random variable drawn from a Gaussian with mean zero and standard deviation 0.1. Therefore to each RGB image pixel … we add the following quantity: $[p_1, p_2, p_3][\alpha_1 \lambda_1, \alpha_2 \lambda_2, \alpha_3 \lambda_3]^T$ where $p_i$ and $\lambda_i$ are ith eigenvector and eigenvalue of the 3x3 covariance matrix of RGB pixel values, respectively, and $\alpha_i$ is the aforementioned random variable. … This scheme approximately captures an important property of natural images, namely, that object identity is invariant to changes in the intensity and color of the illumination. This scheme reduces the top-1 error rate by over 1%.”).).
Both Yosinski in view of Krahenbuhl, in further view of Alberti, and Krizhevsky are analogous art since each teaches image classification using a deep convolutional neural network.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the weight initialization method taught in Yosinski in view of Krahenbuhl, in further view of Alberti, and additionally incorporate the data augmentation method taught in Krizhevsky as a way to augment the image training data to avoid overfitting. The motivation to combine is taught in Krizhevsky: this data augmentation method prevents overfitting and reduces the error rate while altering the image data with minimal computation and storage requirements, thus improving the accuracy of the model's classification predictions while remaining computationally efficient with a small memory footprint (Krizhevsky pp.5-6 Section 4.1 1st-3rd paragraphs).
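For illustration only, the color augmentation quoted from Krizhevsky above can be sketched as follows. This is a minimal sketch and not Krizhevsky's or Applicant's implementation. The function name pca_color_augment, the use of NumPy, the per-image covariance (chosen here for brevity), and the default standard deviation of 0.1 for the random variable α_i are assumptions of this sketch; the quoted passage itself refers only to "the aforementioned random variable" and to the covariance matrix of RGB pixel values.

```python
import numpy as np


def pca_color_augment(image, rng, sigma=0.1):
    """Add [p1, p2, p3][a1*l1, a2*l2, a3*l3]^T to every RGB pixel.

    image: float array of shape (H, W, 3).
    rng:   a numpy.random.Generator supplying the random variables a_i.
    sigma: standard deviation of the zero-mean Gaussian draw (assumed value).

    Assumption of this sketch: the 3x3 covariance is computed from this
    single image rather than from a whole training set.
    """
    pixels = image.reshape(-1, 3)

    # 3x3 covariance matrix of the RGB pixel values.
    cov = np.cov(pixels, rowvar=False)

    # eigh handles symmetric matrices; column i of eigvecs is p_i and
    # eigvals[i] is the matching eigenvalue l_i.
    eigvals, eigvecs = np.linalg.eigh(cov)

    # One random variable a_i per principal component, drawn once per image.
    alphas = rng.normal(0.0, sigma, size=3)

    # [p1, p2, p3] @ [a1*l1, a2*l2, a3*l3]^T gives a single RGB offset.
    shift = eigvecs @ (alphas * eigvals)

    # The same offset is broadcast onto every pixel.
    return image + shift, shift
```

Because one offset is computed per image and added to every pixel, the sketch stores no additional images, which is consistent with the low computation and storage cost cited in the motivation to combine.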

Conclusion


The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure.
Long et al., Learning Transferable Features with Deep Adaptation Networks, arXiv:1502.02791v2, May 27, 2015, 9 pages, where Long teaches constructing a classifier for a deep convolutional neural network to learn transferable features (Long p.3 Section 3 and p.3 Figure 1), where the training of the deep convolutional neural network is based on the teachings from the Yosinski reference, and the fine-tuning includes adding a multi-layer adaptation regularizer as a penalty parameter to minimize the empirical CNN risk (Long pp.3-4 Section 3.1, subsection Deep Adaptation Networks, including equations (3) and (4)).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to WILLIAM WAI YIN KWAN whose telephone number is 303-297-4332. The examiner can normally be reached Monday-Friday 8:00am - 4:30pm PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li B Zhen can be reached on 571-272-3768. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/WILLIAM WAI YIN KWAN/Examiner, Art Unit 2121



/Li B. Zhen/Supervisory Patent Examiner, Art Unit 2121