DETAILED ACTION
This is the response to applicant’s amendment action regarding application number 15/945,888, filed April 5, 2018.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendments
The amendment filed August 25, 2021 has been entered. Examiner acknowledges receipt of Amendments to Application 15/945,888, which include: Amendments to the Claims pp.2-6, and Remarks pp.7-11 (containing applicant’s amendments). 
Regarding applicant’s Remarks on p.7, examiner has acknowledged Claims 1-4, 9-10, 13-14, and 17-18 have been amended. Examiner has acknowledged original Claims 8, 11, 15-16, and 20 have been canceled, and new Claims 21-25 have been added. Claims 1-7, 9-10, 12-14, 17-19, and 21-25 remain pending in the application. 
Regarding applicant’s Remarks on p.7, examiner has acknowledged applicant’s amendment to Claim 4, which has overcome the claim objection previously set forth in the Non-Final Office Action mailed March 25, 2021. However, examiner has noted that other claim amendments have introduced new claim objections, and these new claim objections will be identified in the specified section below.
Regarding applicant’s Remarks on p.8, examiner acknowledges applicant’s Amendments to the Claims have resolved the indefiniteness/lack of antecedent issues identified in Claims 3, 10-13, and 14, and therefore the respective §112(b) rejections previously set forth in the Non-Final Office Action mailed March 25, 2021 for Claims 3, 10-13, and 14 are withdrawn. 
Regarding applicant’s Remarks on p.8, examiner acknowledges applicant’s Amendments to the Claims have resolved the lack of written description issues identified in Claims 8 and 15-16, and therefore the respective §112(a) rejections previously set forth in the Non-Final Office Action mailed March 25, 2021 for Claims 8 and 15-16 are withdrawn. 

Response to Arguments
Examiner acknowledges receipt of Arguments to Application 15/945,888, which include: Remarks pp.7-11 (containing applicant’s arguments). 
Applicant's arguments regarding examiner’s 35 U.S.C. §102(a)(1) and 35 U.S.C. §103 rejections have been fully considered but they are not persuasive. Examiner has noted that the applicant has amended the claims to such an extent that the scope of the claims has changed, which necessitates further examination and re-evaluation of the amended and original claims, as well as the newly introduced claims. The additional rejections and updated claim mappings according to the applicant’s amended claims are provided in the sections indicated below.
Regarding applicant’s Remarks on p.8:
“Independent claim 1, as amended, recites a method of training a deep neural network, comprising: inputting training data into the deep neural network, wherein the deep neural network comprises multiple layers, the multiple layers including: an input layer that receives the training data; an output layer from which output is generated in a manner consistent with one or more classification tasks; and at least one hidden layer that is interconnected with the input layer and the output layer, that receives output from the input layer, and that outputs transformed data to a feature space between the at least one hidden layer and the output layer, wherein the at least one hidden layer is parameterized by parameters of a pretrained model; evaluating a distribution of the data in the feature space, wherein the distribution of the data in the feature space is based upon the parameters of the pretrained model that parameterize the at least one hidden layer; and initializing, non-randomly, parameters of the output layer based on the evaluated distribution of the data in the feature space. At least the features of claim 1 highlighted above are not disclosed or otherwise suggested by Akusok. 
In the "Background" section of the specification, it is noted that in conventional fine- tuning of deep neural networks (DNNs), parameters of lower-level layers of a DNN (that is to be trained) are initialized to have the same values as a pretrained model, and then the parameters of the last (output) layer of the DNN are set to random numbers sampled from a Gaussian distribution. The specification then notes that an approach described therein avoids inefficiencies that have plagued conventional DNN model fine-tuning strategies; in particular, the approach described in the specification does not randomly initialize the parameters of the last layer of the DNN - instead, values of the parameters of the last layer of the DNN are estimated based upon training data and task(s) of the last layer. The independent claims have been clarified to note that some layers of a DNN to be trained are parameterized by parameters of a pretrained DNN, while parameters of the last layer are initialized based upon operation of the DNN when provided with training data. 
With reference now to wherein the at least one hidden layer is parameterized by parameters of a pretrained model, such features are contrary as to how Akusok describes training ELMs. As described in Akusok, parameters of ELM hidden neurons are randomly generated; thus, in Akusok, hidden neurons are randomly assigned parameters. There is nothing in Akusok that suggests the possibility of neurons in a hidden layer of an ELM being parameterized with parameters from a pretrained model, and accordingly Akusok fails to disclose such features.
Claim 1, as amended, additionally recites evaluating a distribution of the data in the feature space, wherein the distribution of the data in the feature space is based upon the parameters of the pretrained model that parameterize the at least one hidden layer. The Office has cited page 1012, cols. 1 and 2 of Akusok as suggesting features relating to "evaluating a distribution of data in features space". It is respectfully submitted, however, that the portion of Akusok cited by the Office to support the rejection of this claim is not germane to initializing parameters of a DNN; instead, such portion is directed towards output of a DNN when input layer weights (assigned to nodes in an input layer of an ELM) are randomly generated. Hence, Akusok fails to disclose or otherwise suggest the above-highlighted features of claim 1. Finally, because Akusok does not disclose or suggest the features of claim 1 highlighted above, Akusok cannot fairly be characterized as disclosing or suggesting initializing, non-randomly, parameters of the output layer based on the evaluated distribution of the data in the feature space.
For at least the foregoing reasons, as Akusok fails to disclose each and every feature of claim 1, Akusok does not anticipate claim 1, and thus withdrawal of the rejection of claim 1 is requested. Claims 9 and 14, as amended, recite features similar to those set forth in claim 1, and withdrawal of the rejection of such claims is likewise requested.”
Examiner has considered this argument, and has found the argument to be not persuasive. Applicant’s above arguments are directed to the new claim limitations introduced in the amended claims, which requires further analysis and re-examination of the amended and related original claims. The additional rejections and updated claim mappings according to the applicant’s amended claims are provided in the sections indicated below.
With regards to applicant’s arguments that the Akusok reference does not teach the original un-amended claim limitation “initializing, non-randomly, the parameters of the output layer based on the evaluated distribution of the data in the feature space”, examiner notes that Akusok teaches performing a linear calculation of the weights for the output layer, with this calculation being interpreted as producing a non-random initialization of the parameters (e.g., weights) of the output layer. Examiner notes that Akusok calculates the output weights β based on a matrix transformation using projections into the hidden layer (where this projection from the hidden layer represents the distribution of the data feature space as shown in Akusok p.1013 Section II.C. Matrix Form of ELMs). This calculation is summarized in general terms in Akusok p.1012 col.2 Section II.B. Hidden Neurons, with the detailed calculations shown in Akusok p.1014 col.2 Section III.B. ELM Solution with Best Linear Unbiased Estimator. Examiner notes that paragraph [0014] in the applicant’s specification indicates that “The level initializing logic non-randomly initializes the parameters of the output level by resolving approximate solutions to the last layer, based on data distribution in the feature space.”, which indicates that the approximate solution calculations performed by the level initializing logic at the output level is sufficient enough to result in a non-random initialization, and hence does not explicitly require parameters from a pre-trained model to perform an initialization of the data distribution in the feature space in order for that calculation to be considered as a non-random initialization. Similarly, examiner also notes that paragraph [0053] in the applicant’s specification does not state that parameter information from a pre-trained model is required to perform this “non-random” estimation of the parameters of the last layer of the DNN model. 
Instead, paragraph [0053] only indicates that the estimation is based “on the training data and the task(s) of the last layer”. Examiner only finds a suggestion of a pre-trained model being used as part of the claimed invention in paragraph [0098] in the applicant’s specification, where it states: “… there is no analytical solution to finding an optimal set of parameters which can minimize the cross entropy loss. So, instead, of solving it directly, the weights $\{w_k'\}$ of the last linear layer of a pre-trained DNN can be used as reference.”. However, even in this paragraph, the presence of the phrase “can be used” indicates that the applicant’s claimed invention does not explicitly require the usage of a pre-trained model to approximate a solution of the parameters at the last layer. Hence, examiner finds that applicant’s arguments for the Akusok reference not teaching elements of the original set of claims to be not persuasive in light of the identified evidence indicated in the above-mentioned paragraphs in the applicant’s specification.
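For illustration only, the kind of linear calculation of output weights discussed above can be sketched as follows. This sketch is not taken from Akusok or from the applicant’s specification. It assumes a single hidden layer with a tanh activation, randomly generated hidden-layer parameters, and a least-squares solution β = H†T obtained with the Moore-Penrose pseudoinverse. The function name `elm_output_weights`, the activation, and the toy data are assumptions made for this example.

```python
import numpy as np

def elm_output_weights(X, T, n_hidden, rng):
    """Hypothetical helper: closed-form output weights for a basic ELM."""
    n_features = X.shape[1]
    # Hidden-layer parameters are drawn at random and are never trained.
    W = rng.standard_normal((n_features, n_hidden))
    b = rng.standard_normal(n_hidden)
    # H is the projection of the data into the hidden-layer feature space.
    H = np.tanh(X @ W + b)
    # Output weights are computed from H and T rather than sampled:
    # beta = pinv(H) @ T is the least-squares solution of H @ beta = T.
    beta = np.linalg.pinv(H) @ T
    return W, b, beta

# Toy data (assumption): 50 samples, 4 features, 2 one-hot encoded classes.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
T = np.eye(2)[(X[:, 0] > 0).astype(int)]
W, b, beta = elm_output_weights(X, T, n_hidden=10, rng=rng)
```

In this sketch only the hidden layer is random; the output weights are fully determined by the projected data H and the targets T, which is the sense in which such a calculation can be read as a non-random initialization.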

Claim Objections
The following claims are objected to because of the following informalities: 
Claim 21: The following claim limitation is missing the following term: “wherein the at least one hidden layer comprises multiple hidden layers, …”. Appropriate correction is required.
Claim 23: The following claim limitation “wherein multiple hidden layers of the deep neural network are parameterized with parameters of corresponding hidden layers of the pretrained deep neural network” contains the terms “multiple hidden layers of the deep neural network” and “corresponding hidden layers of the pretrained deep neural network”, which assume that the term “at least one hidden layer” from parent Claim 9 consists of a plurality of (multiple) hidden layers for both the deep neural network and pre-trained neural network. However, the term “at least one hidden layer” from parent Claim 9 can also indicate a deep neural network and a pre-trained neural network with just one hidden layer. Applicant is advised to modify the terms “multiple hidden layers of the deep neural network” and “corresponding hidden layers of the pretrained deep neural network” to be consistent with parent Claim 9. For the purposes of examination, this claim limitation will be interpreted as “wherein the at least one hidden layer of the deep neural network is parameterized with parameters of the corresponding at least one hidden layer of the pretrained deep neural network”. Appropriate correction is required.

Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

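The following is a quotation of the first paragraph of pre-AIA 35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.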

Claim 3 is rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention.
Applicant has amended Claim 3 to recite the following limitation: “wherein results of initializing are within 2% of an optimal solution for each classification task”. However, paragraph [0054] in the applicant’s specification is the only place in the specification which recites a percentage, and it states: “Further, the results of the initializing are close to the optimal solution to each classification task. Usually, after model initialization, further fine-tuning the model can give an additional 1-2% gain in accuracy when the parameters in the feature extraction layers are fixed.”. Examiner finds that this paragraph does not support the amended claim, where the amended claim now recites that the claimed invention identifies the results of initializing to be within 2% of an optimal solution. Paragraph [0054] merely states that further fine-tuning (after performing model initialization) can provide a 1-2% gain in accuracy (over an existing solution), which is not the same as indicating that the results of the initialization itself are within 2% (of an unidentified measurement) of an optimal solution. Furthermore, the specification does not indicate any method or series of steps to measure, quantify, or calculate this 2% “target achievement” towards an optimal solution, and the specification does not identify the baseline solution that it considers to be the reference “optimal solution” used for comparison to determine this 2% “target achievement”. Thus, this claim limitation in Claim 3 fails to comply with the written description requirement. For the purposes of examination, this claim limitation will be addressed accordingly in the context of the prior art.
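For illustration only, the concern that “within 2%” is tied to an unidentified measurement can be shown with two plausible readings of the phrase. The accuracy values, function names, and tolerances below are hypothetical and do not come from the specification or from any cited reference; the sketch only shows that the same pair of numbers can satisfy one reading and fail the other.

```python
def within_absolute(acc, optimal, tol_points=2.0):
    """Reading 1 (assumption): within 2 percentage points of the optimal accuracy."""
    return abs(optimal - acc) <= tol_points

def within_relative(acc, optimal, tol_frac=0.02):
    """Reading 2 (assumption): within 2% of the optimal accuracy's own value."""
    return abs(optimal - acc) <= tol_frac * optimal

# Hypothetical accuracies in percent, chosen only to expose the difference.
init_acc, optimal_acc = 78.2, 80.0

absolute_ok = within_absolute(init_acc, optimal_acc)  # gap of 1.8 points vs. 2.0 allowed
relative_ok = within_relative(init_acc, optimal_acc)  # gap of 1.8 points vs. 1.6 allowed
```

Under these assumed numbers the first reading is satisfied and the second is not, so the outcome depends on a measurement convention and a reference “optimal solution” that would both need to be identified.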

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1-3, 7, 9, 14, 18, and 21-25 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Cao et al., A Deep and Stable Extreme Learning Approach for Classification and Regression, Proceedings of ELM-2014 Volume 1, Proceedings in Adaptation, Learning and Optimization 3, DOI: 10.1007/978-3-319-14063-6_13, Springer International Publishing Switzerland 2015, pp.141-150 [hereafter referred as Cao].
Regarding amended Claim 1, Cao teaches
(Currently amended) A method of training a deep neural network, comprising:
inputting training data into the deep neural network, wherein the deep neural network comprises (Cao p.146 Figure 1(a) and (b): examiner’s note: Referring to Cao Figure 1(a) and (b), Cao teaches training a deep belief network to initialize parameters and a feature space H for a deep and  (Cao p.145 Section 4. Proposed Approach: “This section presents a new machine learning approach named deep and stable extreme learning machine (DS-ELM) …” 

[Figure: media_image1.png]
), 
the multiple layers including:
an input layer that receives the training data (Cao p.146 Figure 1(a) and (b): examiner’s note: Referring to Cao Figure 1, Cao teaches training a DS-ELM that has an input layer receiving input vector x (thus corresponding to “an input layer that receives the training data”).);
an output layer from which output is generated in a manner consistent with one or more classification tasks (Cao p.146 Figure 1(a) and (b): examiner’s note: Referring to Cao Figure 1, Cao teaches training a DS-ELM that has an output layer (indicated by “top layer units”) that produces output labels T based on input received from the last hidden layer, where the application of this DS-ELM is targeted for classification and regression tasks, thus Cao p.141 Section 1 Introduction: “The research introduced in this paper will focus on the performance of learning machines with respect to accuracy and stability of both classification and regression tasks.” and Cao p.147 Table 1 and Cao p.145 Section 5. Experiments and Analysis: “In order to extensively verify the performance of DS-ELM, a variety type of real-world data was chosen for each problem category (regression, binary classification, and multi-category classification).”).); and
at least one hidden layer that is interconnected with the input layer and the output layer, that receives output from the input layer, and that outputs transformed data to a feature space between the at least one hidden layer and the output layer, wherein the at least one hidden layer is parameterized by parameters of a pretrained model (Cao p.146 Figure 1(a) and (b): examiner’s note: Referring to Cao Figure 1, Cao teaches training a DS-ELM that has three hidden layers between the input and output layers in both the deep belief network and the deep extreme learning machine (corresponding to “at least one hidden layer that is interconnected with the input layer and the output layer …”). Cao further teaches training of the deep belief network DBN, where the input vector is processed by the hidden layers to calculate the feature space H and the initialized weights W, and where this training algorithm involves performing greedy learning and backpropagation to generate a transformation of represented features (thus corresponding to “…that receives output from the input layer, and that outputs transformed data to a feature space between the at least one hidden layer and the output layer”). Cao further teaches applying the learned parameters from the DBN (the learned weights) and the feature space H to further train the deep extreme learning machine by using this information to calculate the output layer weights β (represented as the weights between the last hidden layer and the output layer) for the deep extreme learning machine, thus corresponding to “… wherein the at least one hidden layer is parameterized by parameters of a pretrained model” (Cao pp.144-145 Section 3 Deep Belief Networks: “DBNs are probabilistic generative mod[e]ls, or alternatively a kind of deep neural network, composed of multiple latent variables (hidden units). … As illustrated in Fig. 1(a), each layer tries to model the distribution of its input. 
Every RBM has a layer containing visible nodes v that represent the data and a layer condition hidden nodes h that learn to represent input features capturing higher-order correlations in the data. … By starting with the data vector on the visible units and alternating several times between sampling from p(h|v,W) and p(v|h, W), it is easy to get the learning weights W. The learning algorithm for DBNs proposed by Hinton et al., [14,18] has two training phases; (1) greedy learning algorithm for transforming representations (unsupervised learning), and (2) Back-Fitting with the up-down algorithm (fine tune).” and Cao p.145 Section 4. Proposed Approach 1st paragraph: “Our overall intention is to use a quick-and-dirty DBN to generate a relatively stable feature space H that is fed into an ELM to calculate the output weights. … Step 1. Setup a DBM structure fed with input vector x; and perform a quick-and-dirty training based on the pre-defined DBN structure [cf. Fig. 1(a)] … Step 2. … The nodes in the top hidden layer can be viewed equal to hidden nodes in a typical ELM network; those hidden layer output matrix H is feature space of the input vector. the feature space H initiated via Step 1 with help of a DBN is then fed into a typical ELM solver to calculate the output weights 𝛃 with equation (3). [cf. Fig. 1(b)]”).); and
evaluating a distribution of the data in the feature space, wherein the distribution of the data in the feature space is based upon the parameters of the pretrained model that parameterize the at least one hidden layer (Cao p.146 Figure 1(a) and (b); Cao p.145 Section 4 Proposed Approach 1st paragraph: examiner’s note: Cao teaches training a DS-ELM method for initializing the hidden layers of a deep extreme learning machine by receiving a set of inputs $x_1, \ldots, x_N$ (corresponding to “a distribution of data”), and learning parameter weights (corresponding to “parameters of the pre-trained model”) and a feature space H (corresponding Cao pp.144-145 Section 3 Deep Belief Networks and Cao p.146 Figure 1(a)). Hence, the learned weights and feature space H from the pretrained DBN correspond to “evaluating a distribution of the data in the feature space, wherein the distribution of the data in the feature space is based upon the parameters of the pretrained model that parameterize the at least one hidden layer” (Cao p.145 Section 4. Proposed Approach 1st paragraph).); and
initializing, non-randomly, parameters of the output layer based on the evaluated distribution of the data in the feature space (Cao p.146 Figure 1(a) and (b); Cao p.145 Section 4 Proposed Approach 1st paragraph: examiner’s note: Cao teaches a method for calculating the output weights β for a DS-ELM (Cao p.145 Section 4. Proposed Approach 1st paragraph), where the method performs the calculation shown in Cao p.144 equation (3), based on the learned feature space H from the trained DBN network. This feature space H is based on the distribution of the data in the feature space (Cao pp.144-145 Section 3 Deep Belief Networks) derived from the pretraining of a DBN, and hence this calculation of the output weights represents a non-random initialization of the parameters of the output layer (as it is based on the learned weights of a pre-trained DBN as well as a linear matrix calculation based on a feature space H (Cao pp.143-144 Section 2. Extreme Learning Machine 2nd-3rd paragraphs)). Hence, this calculation method for the DS-ELM corresponds to “initializing, non-randomly, parameters of the output layer based on the evaluated distribution of the data in the feature space”.).
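For illustration only, the two-step flow described in the mapping above (hidden layers whose parameters come from a previously trained network, followed by a closed-form calculation of the output weights from the feature space H) can be sketched as follows. This is not Cao’s implementation, and Cao’s equation (3) is not reproduced in this action; the sketch assumes the usual ELM least-squares form β = H†T. The stand-in “pretrained” arrays, the sigmoid activation, the layer sizes, and the names `feature_space` and `init_output_layer` are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feature_space(X, hidden_params):
    """Propagate X through hidden layers whose parameters are supplied, not sampled here."""
    H = X
    for W, b in hidden_params:
        H = sigmoid(H @ W + b)
    return H

def init_output_layer(H, T):
    """Non-random output weights: least-squares solution of H @ beta = T."""
    return np.linalg.pinv(H) @ T

rng = np.random.default_rng(1)
# ASSUMPTION: these arrays only stand in for weights learned by a pretrained
# network (for example a DBN); no pretraining is performed in this sketch.
pretrained_params = [
    (rng.standard_normal((6, 8)), np.zeros(8)),
    (rng.standard_normal((8, 5)), np.zeros(5)),
]

# Toy data (assumption): 40 samples, 6 features, 3 one-hot encoded classes.
X = rng.standard_normal((40, 6))
T = np.eye(3)[rng.integers(0, 3, size=40)]

H = feature_space(X, pretrained_params)  # feature space fixed by the supplied parameters
beta = init_output_layer(H, T)           # output layer initialized from H and T
```

Given the same hidden-layer parameters and training data, `init_output_layer` always returns the same β, which is what distinguishes this step from sampling the output weights from a random distribution.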
Regarding amended Claim 2, Cao teaches
(Currently amended) The method of claim 1, wherein 
estimating parameter values of the output layer by finding an approximate solution to each classification task (Cao p.146 Figure 1(b): examiner’s note: Cao teaches training a DS-Cao p.145 Section 4. Proposed Approach 1st paragraph and Cao pp.143-144 Section 2. Extreme Learning Machine 2nd-3rd paragraphs), where the ELM solver performs the linear matrix calculation shown in Cao p.144 equation (3), and this ELM solver/linear matrix calculation is based on the extreme learning machine generalization and universal approximation properties (Cao p.143 Section 2 Extreme Learning Machine). The fact that the learned weights and feature space H from the trained DBN is applied to this ELM solver/linear matrix calculation to determine the output weights of a deep extreme learning machine makes this calculation an approximate solution, thus corresponding to “estimating parameter values of the output layer by finding an approximate solution to each classification task”.).  
Regarding amended Claim 3, Cao teaches
(Currently amended) The method of claim 1, 
wherein results of initializing are within 2% of an optimal solution for each classification task (Cao p.148 Table 2: examiner’s note: Under its broadest reasonable interpretation, the limitation “wherein results of initializing are within 2% of an optimal solution for each classification task” is interpreted to indicate that the performance metric for the training method using learned weights and feature space from a pre-trained deep neural network are within 2% of the performance metrics from a baseline training method (with the baseline training method interpreted to be “an optimal solution”). Referring to Cao Table 2, Cao teaches that the testing rate performance percentages from the DS-ELM training method (using the learned weights and feature space from a trained DBN) using different datasets (Vowel, Segment, Shuttle, DNA, Protein) are within 2% of either the traditional ELM training method or DBN training method (where either training method represents a baseline reference training method corresponding to an optimal solution, thus corresponding to “wherein results of initializing are within 2% of an optimal solution for each classification task”) (Cao pp.147-148 Section 5 Experiments and Analysis: “Table 2 shows the performance comparison of ELM, DBN, and DS-ELM for classification problems. It can be seen from binary classification tests that (1) DS-ELM tends to obtain the lowest standard deviation for five out of six datasets; … Observed from multi-category classification simulations, we found that (1) the performance of DS-ELM is more stable … than the other two approaches in all tested datasets; (2) for DNA … and Protein … datasets, DS-ELM method achieved better testing rate compared to the other two methods.”).).
Regarding original Claim 7, Cao teaches
(Original) The method of claim 1,
the at least one hidden layer comprises a plurality of hidden layers, each hidden layer comprises a respective plurality of nodes, each node in a hidden layer being configured to perform a transformation on output of at least one node from an adjacent, lower layer (Cao p.146 Figure 1(a) and (b): examiner’s note: Referring to Cao Figure 1, Cao teaches training a DS-ELM that has three hidden layers between the input and output layers in both the deep belief network and the deep extreme learning machine (corresponding to “the at least one hidden layer comprises a plurality of hidden layers”), and each hidden layer from both the deep belief network and deep extreme learning machine has the same plurality of nodes (corresponding to “each hidden layer comprises a respectively plurality of nodes…”). Cao further teaches training of the deep belief network DBN, where the input vector is processed by the hidden layers to calculate the feature space H and the initialized weights W, and where this training algorithm involves performing greedy learning and backpropagation to generate a transformation of represented features (thus corresponding to “… each node in a hidden layer being configured to perform a transformation on output of at least one node from an adjacent, lower layer”) (Cao pp.144-145 Section 3 Deep Belief Networks: “DBNs are probabilistic generative mod[e]ls, or alternatively a kind of deep neural network, composed of multiple latent variables (hidden units). … As illustrated in Fig. 1(a), each layer tries to model the distribution of its input. Every RBM has a layer containing visible nodes v that represent the data and a layer condition hidden nodes h that learn to represent input features capturing higher-order correlations in the data. … By starting with the data vector on the visible units and alternating several times between sampling from p(h|v,W) and p(v|h, W), it is easy to get the learning weights W. 
The learning algorithm for DBNs proposed by Hinton et al., [14,18] has two training phases; (1) greedy learning algorithm for transforming representations (unsupervised learning), and (2) Back-Fitting with the up-down algorithm (fine tune).”).); 
a lowest one of the plurality of hidden layers receives an output from the input layer (Cao p.146 Figure 1(a) and (b): examiner’s note: Referring to Cao Figure 1, Cao teaches training a DS-ELM that has an input layer receiving input vector x and producing output to be fed into the first hidden layer (corresponding to “a lowest one of the plurality of hidden layers receives an output from the input layer”).); and 
the output layer receives an output from a highest one of the plurality of hidden layers (Cao p.146 Figure 1(a) and (b): examiner’s note: Referring to Cao Figure 1, Cao teaches training a DS-ELM that has an output layer (indicated by “top layer units”) that produces output labels T based on input received from the last hidden layer, thus corresponding to “the output layer receives an output from a highest one of the plurality of hidden layers”).).
Regarding amended Claim 9, Cao teaches
(Currently amended) A method of 
a task-specific layer from which output is generated in a manner consistent with one or more classification tasks (Cao p.146 Figure 1(a) and (b): examiner’s note: Referring to Cao Figure 1, Cao teaches training a DS-ELM using a pre-trained deep belief network and a deep extreme learning machine, where both the deep extreme learning machine network and the deep belief network are deep neural networks that have an output layer (indicated by “top layer Cao pp.144-145 Section 3 Deep Belief Networks) that results in the lower layers of the hidden layer extracting and representing low-level features from the input data and the upper layers performing gradual refinement of previously learnt concepts (Cao p.142 2nd paragraph (Section 1 Introduction): “… DBNs are known to have good modeling ability for higher-order and highly non-linear statistical structure in the input [6]. … the first layers of DBNs are expected to extract relatively low-level features out of the input space while the upper layers are expected to gradually refine previously learnt concept to generate more abstract ones. … Because the output of the higher DBN layers can easily be used as the input of a supervised classifier, Ribeiro et al. [5] used an ELM classifier for classify the deep concepts …”). This refinement of previously learnt concepts is interpreted as aggregating the learnt low-level features into more complex and abstract functions or tasks, and the upper layers of a DBN being used in an earlier application as input to approximate a supervised classifier. 
Hence, the process of training the DBN using this greedy unsupervised training approach produces the upper layers of the hidden layer that can be used as input into an output layer, such that the upper layers of the hidden layer of a DBN corresponds to “a task-specific layer from which output is generated in a manner consistent with one or more classification tasks” (Cao p.141 Section 1 Introduction: “The research introduced in this paper will focus on the performance of learning machines with respect to accuracy and stability of both classification and regression tasks.” and Cao p.145 Section 5. Experiments and Analysis and Cao p.147 Table 1: “In order to extensively verify the performance of DS-ELM, a variety type of real-world data was chosen for each problem category (regression, binary classification, and multi-category classification).” and Cao p.145 Section 4 Proposed Approach – Auto-abstraction of deep concepts: “…the unsupervised pre-training of DBNs allows learning those complex functions by mapping the input to the output directly. Specifically speaking, the bottom layers are expected to extract and represent low-level features from the input data while the upper layers are expected to gradually refine previously learnt concepts [5].”).); and 
at least one hidden layer that is connected to the output layer and that outputs transformed data to a feature space between the at least one hidden layer and the task-specific layer, wherein the at least one hidden layer is parameterized with parameters of a pre-trained deep neural network, and further wherein the transformed data is based upon the parameters of the pre-trained deep neural network (Cao p.146 Figure 1(a) and (b): examiner’s note: Referring to Figure 1, Cao teaches training a DS-ELM that has three hidden layers between the input and output layers in both the deep belief network and the deep extreme learning machine (with the plurality of hidden layers in between the input and output layer corresponding to “at least one hidden layer that is connected to the output layer …”). Cao further teaches training of the deep belief network DBN, where the input vector is processed by the hidden layers to calculate the feature space H and the initialized weights W, and where this training algorithm involves performing greedy learning and backpropagation to generate a transformation of represented features (thus corresponding to “… that outputs transformed data to a feature space between the at least one hidden layer and the task-specific layer …”). 
Cao further teaches applying the learned parameters from the DBN (the learned weights) and the feature space H to further train the deep extreme learning machine by using this information to calculate the output layer weights β (represented as the weights between the last hidden layer and the output layer) for the deep extreme learning machine, thus corresponding to “… wherein the at least one hidden layer is parameterized with parameters of a pre-trained deep neural network, and further wherein the transformed data is based upon the parameters of the pre-trained deep neural network” (Cao pp.144-145 Section 3 Deep Belief Networks: “DBNs are probabilistic generative mod[e]ls, or alternatively a kind of deep neural network, composed of multiple latent variables (hidden units). … As illustrated in Fig. 1(a), each layer tries to model the distribution of its input. Every RBM has a layer containing visible nodes v that represent the data and a layer containing hidden nodes h that learn to represent input features capturing higher-order correlations in the data. … By starting with the data vector on the visible units and alternating several times between sampling from p(h|v,W) and p(v|h, W), it is easy to get the learning weights W. The learning algorithm for DBNs … has two training phases; (1) greedy learning algorithm for transforming representations (unsupervised learning), and (2) Back-Fitting with the up-down algorithm (fine tune).” and Cao p.145 Section 4. Proposed Approach 1st paragraph: “Our overall intention is to use a quick-and-dirty DBN to generate a relatively stable feature space H that is fed into an ELM to calculate the output weights. … Step 1. Setup a DBM structure fed with input vector x; and perform a quick-and-dirty training based on the pre-defined DBN structure [cf. Fig. 1(a)] … Step 2. 
… The nodes in the top hidden layer can be viewed equal to hidden nodes in a typical ELM network; those hidden layer output matrix H is feature space of the input vector. the feature space H initiated via Step 1 with help of a DBN is then fed into a typical ELM solver to calculate the output weights 𝛃 with equation (3). [cf. Fig. 1(b)]”).), 
the method comprising:
determining one or more tasks of the task-specific layer (Cao p.146 Figure 1(a) and (b): examiner’s note: As indicated earlier in Cao p.142 2nd paragraph (Section 1 Introduction), Cao teaches pre-training of the DBN to learn the parameters of the hidden layer and the feature space H by using a greedy unsupervised training approach (Cao pp.144-145 Section 3 Deep Belief Networks) that results in the lower layers of the hidden layer extracting and representing low-level features from the input data and the upper layers performing gradual refinement of previously learnt concepts, with this refinement of previously learnt concepts being interpreted Cao p.141 Section 1 Introduction: “The research introduced in this paper will focus on the performance of learning machines with respect to accuracy and stability of both classification and regression tasks.” and Cao p.145 Section 5. Experiments and Analysis and Cao p.147 Table 1: “In order to extensively verify the performance of DS-ELM, a variety type of real-world data was chosen for each problem category (regression, binary classification, and multi-category classification).” and Cao p.145 Section 4 Proposed Approach – Auto-abstraction of deep concepts: “…the unsupervised pre-training of DBNs allows learning those complex functions by mapping the input to the output directly. Specifically speaking, the bottom layers are expected to extract and represent low-level features from the input data while the upper layers are expected to gradually refine previously learnt concepts [5].”).); and 
estimating initializing values for parameters of the task-specific layer by finding an approximate solution to each of the one or more classification tasks (Cao p.146 Figure 1(b): examiner’s note: Cao teaches training a DS-ELM that applies the learned weights and feature space H from the trained DBN into an ELM solver to calculate the output weights (Cao p.145 Section 4. Proposed Approach 1st paragraph and Cao pp.143-144 Section 2. Extreme Learning Machine 2nd-3rd paragraphs), where the ELM solver performs the linear matrix calculation shown in Cao p.144 equation (3), and this ELM solver/linear matrix calculation is based on the extreme learning machine generalization and universal approximation properties (Cao p.143 Section 2 Extreme Learning Machine). The process of applying learned weights and feature space H from the trained DBN is applied to this ELM solver/linear matrix calculation ), 
wherein the approximate solutions are based on a data distribution in the feature space (Cao p.146 Figure 1(a) and (b); Cao p.145 Section 4 Proposed Approach 1st paragraph: examiner’s note: Cao teaches training a DS-ELM method for initializing the hidden layers of a deep extreme learning machine by receiving a set of inputs x_1, …, x_N, and learning parameter weights and a feature space H (corresponding to “a data distribution in the feature space”) by training a deep belief network DBN (Cao pp.144-145 Section 3 Deep Belief Networks and Cao p.146 Figure 1(a)), where the learned weights and feature space H from the pretrained DBN corresponds to “wherein the approximate solutions are based on a data distribution in the feature space” (Cao p.145 Section 4. Proposed Approach 1st paragraph).), and 
further wherein the data distribution is based upon the parameters of the pretrained deep neural network (Cao p.146 Figure 1(a) and (b); Cao p.145 Section 4 Proposed Approach 1st paragraph: examiner’s note: Cao teaches training a DS-ELM method for initializing the hidden layers of a deep extreme learning machine by receiving a set of inputs x_1, …, x_N (corresponding to “a distribution of data”), and learning parameter weights (corresponding to “parameters of the pre-trained model”) and a feature space H (corresponding to “a distribution of data in the feature space”) by training a deep belief network DBN (corresponding to the “pretrained model”) (Cao pp.144-145 Section 3 Deep Belief Networks and Cao p.146 Figure 1(a)), where the learned weights and feature space H from the pretrained DBN corresponds to “further wherein the data distribution is based upon the parameters of the pretrained deep neural network” (Cao p.145 Section 4. Proposed Approach 1st paragraph).).  
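For illustration only, the output-weight calculation that the examiner's notes above attribute to the ELM solver can be sketched as a linear least-squares solve over the feature space H. The sketch assumes that Cao equation (3), which is not reproduced in this Office Action, takes the usual ELM form β = H†T, where H† is the pseudo-inverse of the hidden-layer output matrix H and T holds the target labels. The small ridge term, the helper names (`transpose`, `matmul`, `solve`, `elm_output_weights`), and the toy values are hypothetical and are not taken from Cao or from the claims.

```python
def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve a @ x = b for square a (Gauss-Jordan with partial pivoting)."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [rv - f * cv for rv, cv in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def elm_output_weights(H, T, ridge=1e-8):
    """Least-squares output weights: solve (H^T H + ridge*I) beta = H^T T.

    H is the N x L feature-space matrix (one row per input sample) and
    T is the N x K target matrix. The ridge term is an assumption added
    only to keep the toy solve numerically stable.
    """
    Ht = transpose(H)
    gram = matmul(Ht, H)
    for i in range(len(gram)):
        gram[i][i] += ridge
    return solve(gram, matmul(Ht, T))


# Toy data (hypothetical): T is constructed as H @ B, so the solve
# should recover B up to the effect of the tiny ridge term.
B = [[1.0, -1.0], [0.5, 2.0]]
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
T = matmul(H, B)
beta = elm_output_weights(H, T)
```

Because the solve is closed-form, the output layer receives non-random starting values that depend only on H and T, which is the property the mapping above relies on.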
Regarding amended Claim 14, Cao teaches
A system comprising:
an artificial neural network (Cao p.146 Figure 1(a) and 1(b): examiner’s note: Referring to Cao Figure 1, Cao teaches training a deep and stable extreme learning machine (DS-ELM) using a trained deep belief network (DBN) and a deep extreme learning machine (ELM), both of which are deep neural networks and have input and output layers, and a plurality of hidden layers with a same plurality of nodes. Hence, the structure of the DS-ELM corresponds to “an artificial neural network”.), 
comprising: 
an input level of nodes that receives a set of features and applies a first non-linear function to the set of features to output a first set of modified values (Cao p.146 Figure 1(a) and (b): examiner’s note: Referring to Cao Figure 1, Cao teaches training a DS-ELM that has an input layer receiving input vector x that feeds into a first hidden layer (corresponding to “an input level of nodes”), where the plurality of hidden layers is implemented with the learned weight parameters and feature space H from a pretrained DBN (Cao pp.144-145 Section 3 Deep Belief Networks). Cao further teaches that each hidden layer is a simple unsupervised network that models the distribution of its input, with each layer containing visible nodes that represent the data and a layer containing hidden nodes that learn to represent input features (with this feature representation as the output of a hidden layer). This output is non-linear due to the sigmoidal hidden layer activation function used in the hidden nodes (Cao p.147 2nd paragraph (Section 5 Experiments and Analysis)). This repeated process of mapping the input into features (to produce an output that is fed into the next hidden layer) forming a joint distribution that receives input from the previous hidden layer to produce an output representation of features, and applies this output as input to the next hidden layer, thus corresponding to “an input level of nodes that receives a set of features and applies a first non-linear function to a set of features to output a first set of modified values” (Cao p.144 Section 3 Deep Belief Networks 2nd-3rd paragraphs: “DBNs can be viewed as a composition of simple, unsupervised networks … where each subnetwork’s hidden layer serves as the visible layer for the next. As is illustrated in Fig. 1(a), each layer tries to model the distribution of its input. 
Every RBM has a layer containing visible nodes v that represent the data and a layer containing hidden nodes h that learn to represent features capturing higher-order correlations in the data. [5] The topology of DBNs depicts a joint distribution based on observation input v and multiple hidden units h_1, h_2, …, h_L …”).); 
a hidden level of nodes that receives the first set of modified values and applies an intermediate non-linear function to the first set of modified values to obtain a first set of intermediate modified values, wherein the hidden level of nodes are assigned parameters of a pretrained artificial neural network (Cao p.146 Figure 1(a) and (b): examiner’s note: Referring to Cao Figure 1, Cao teaches training a DS-ELM that has an input layer receiving input vector x that feeds into a first hidden layer (corresponding to “an input level of nodes”), where the plurality of hidden layers is implemented with the learned weight parameters and feature space H from a pretrained DBN (Cao pp.144-145 Section 3 Deep Belief Networks). As indicated earlier (Cao p.144 Section 3 Deep Belief Networks 2nd-3rd paragraphs), Cao further teaches that each hidden layer is a simple unsupervised network that models the distribution of its input, with each layer containing visible nodes that represent the data and a layer containing hidden nodes that learn to represent input features (with this feature representation as the output of a hidden layer). This output is non-linear due to the sigmoidal hidden layer activation function used in the hidden nodes (Cao p.147 2nd paragraph (Section 5 Experiments and Analysis)). This repeated process of mapping the input into features (to produce an output that is fed into the next hidden layer) forming a joint distribution that receives input from the previous hidden layer to produce an output representation of features, and applies this output as input to the next hidden layer, thus corresponding to “a hidden level of nodes that receives the first set of modified values and applies an intermediate non-linear function to the first set of modified values to obtain a first set of intermediate modified values …”). 
This process of training the pre-Cao pp.144-145 Section 3 Deep Belief Networks 3rd-4th paragraphs: “… This matrix of symmetrically weighted connections is learned by an RBM which defines both p(v|h, W) and the prior distribution over hidden vectors, p(h|W) … By starting with the data vector on the visible units and alternating several times between sampling from p(h|v, W) and p(v|h,W), it is easy to get the learning weights W. The learning algorithm for DBNs … has two training phases: (1) greedy learning algorithm for transforming representations (unsupervised learning), and (2) Back-Fitting with the up-down algorithm (fine-tune).” and Cao p.145 Section 4. Proposed Approach 1st paragraph: “Our overall intention is to use a quick-and-dirty DBN to generate a relatively stable feature space H that is fed into an ELM to calculate the output weights. … Step 1. Setup a DBM structure fed with input vector x; and perform a quick-and-dirty training based on the pre-defined DBN structure [cf. Fig. 1(a)] … Step 2. … The nodes in the top hidden layer can be viewed equal to hidden nodes in a typical ELM network; those hidden layer output matrix H is feature space of the input vector. the feature space H initiated via Step 1 with help of a DBN is then fed into a typical ELM solver to calculate the output weights 𝛃 with equation (3). [cf. Fig. 1(b)]”).); 
an output level of nodes that receives the first set of intermediate modified values[[,]] and generates a set of output values, the output values being indicative of a pattern relating to a classification task of the output level (Cao p.146 Figure 1(a) and (b): examiner’s note: Referring to Cao Figure 1, Cao teaches training a DS-ELM that has an input layer receiving input vector x that feeds into a first hidden layer (corresponding to “an input level of nodes”), where the plurality of hidden layers is implemented with the learned weight parameters and feature space H from a pretrained DBN (Cao pp.144-145 Section 3 Deep Belief Networks). As indicated earlier (Cao p.144 Section 3 Deep Belief Networks 2nd-3rd paragraphs), Cao further teaches a repeated process of mapping the input into features (to produce an output that is fed into the next hidden layer) forming a joint distribution that receives input from the previous hidden layer to produce an output representation of features, and applies this output as input to the next hidden layer until the last hidden layer of the plurality of hidden layers is reached, thus corresponding to “an output level of nodes that receives the first set of intermediate modified values and generates a set of output values …”). The output from the last hidden layer serves as input to an output layer (indicated by “top layer units”) that produces output labels T based on input received from the last hidden layer. The application of this DS-ELM is targeted for classification and regression tasks, and the pre-training of the DBN to learn the parameters of the hidden layer and the feature space H involves using a greedy unsupervised training approach (Cao pp.144-145 Section 3 Deep Belief Networks) that results in the lower layers of the hidden layer extracting and representing low-level features from the input data and the upper layers performing gradual refinement of previously learnt concepts (as taught in Cao p.142 2nd paragraph (Section 1 Introduction)). 
This refinement of previously learnt concepts is interpreted as aggregating the learnt low-level features into more complex and abstract functions or tasks, with Cao Figure 1(b) showing the hidden layers of a pretrained DBN (including the upper layer of the hidden layer) being used as input to approximate a supervised classifier (based on an earlier motivation from an existing Ribeiro reference as taught in Cao p.142 2nd paragraph). Hence, the process of training the DBN using this greedy unsupervised training approach produces the upper layers of the hidden layer that can be used as input into an output layer, such that the upper layers of the hidden layer of a DBN corresponds to “… the output values being indicative of a pattern relating to a classification task of the output level” (Cao p.141 Section 1 Introduction: “The research introduced in this paper will focus on the performance of learning machines with respect to accuracy and stability of both classification and regression tasks.” and Cao p.145 Section 5. Experiments and Analysis and Cao p.147 Table 1: “In order to extensively verify the performance of DS-ELM, a variety type of real-world data was chosen for each problem category (regression, binary classification, and multi-category classification).” and Cao p.145 Section 4 Proposed Approach – Auto-abstraction of deep concepts: “…the unsupervised pre-training of DBNs allows learning those complex functions by mapping the input to the output directly. Specifically speaking, the bottom layers are expected to extract and represent low-level features from the input data while the upper layers are expected to gradually refine previously learnt concepts [5].”).); and 
level initializing logic that non-randomly initializes output level, wherein the approximate solutions to the output level are based on data distribution in the feature space, and further wherein the data distribution in the feature space is based upon the parameters of the pretrained artificial neural network assigned to the hidden level of nodes (Cao p.146 Figure 1(b): examiner’s note: Cao teaches training a DS-ELM method for initializing the hidden layers of a deep extreme learning machine by receiving a set of inputs x_1, …, x_N, and learning weight parameters (corresponding to “parameters of the pre-trained artificial neural network”) and a feature space H (corresponding to “the data distribution in the feature space”) by training a deep belief network DBN (corresponding to the “pretrained artificial neural network”) and applying these learned weight parameters and feature space H into an ELM solver to calculate the output weights (Cao p.145 Section 4. Proposed Approach 1st paragraph and Cao pp.143-144 Section 2. Extreme Learning Machine 2nd-3rd paragraphs). As indicated earlier, the ELM solver performs the linear matrix calculation identified in Cao p.144 equation (3) to calculate the output weights β (corresponding to “parameters of the output level”), and this ELM solver/linear matrix calculation is based on the extreme learning machine generalization and universal approximation properties (Cao p.143 Section 2 Extreme Learning Machine). This process of applying the learned weights and feature space H from the trained DBN is applied to this ELM ).  
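For illustration only, the layer-by-layer mapping described in the examiner's notes for Claim 14 (input vector x passed through successive sigmoidal hidden layers to produce the feature space H) can be sketched as follows. The weight and bias values below are hypothetical stand-ins for parameters that would come from a pretrained DBN; the function names and the two-layer shape are likewise assumptions made for the sketch and are not taken from Cao.

```python
import math


def sigmoid(z):
    """Sigmoidal activation, the non-linearity cited from Cao p.147."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(x, weights, biases):
    """Apply one hidden layer: each row of `weights` feeds one hidden node."""
    return [
        sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, biases)
    ]


def feature_space(inputs, layers):
    """Map every input vector through all hidden layers.

    Returns H, one row per input, where each row is the output of the
    top hidden layer for that input.
    """
    H = []
    for x in inputs:
        h = x
        for weights, biases in layers:
            h = layer_forward(h, weights, biases)
        H.append(h)
    return H


# Hypothetical stand-ins for pretrained hidden-layer parameters.
pretrained_layers = [
    ([[0.5, -0.5], [1.0, 1.0], [-1.0, 0.5]], [0.0, 0.0, 0.0]),  # 2 inputs -> 3 nodes
    ([[1.0, -1.0, 0.5], [0.0, 0.0, 0.0]], [0.0, 0.0]),          # 3 nodes -> 2 nodes
]
X = [[0.0, 0.0], [1.0, 2.0]]
H = feature_space(X, pretrained_layers)
```

Each row of H depends on the inputs only through the fixed hidden-layer parameters, so any quantity later computed from H is in turn based upon those parameters.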
Regarding amended Claim 18, Cao teaches
(Currently amended) The system of claim 14, wherein the output level initializing logic estimates the parameters of the output level by: 
finding an approximate solution to each classification task (Cao p.146 Figure 1(b): examiner’s note: Cao teaches training a DS-ELM that applies the learned weights and feature space H from the trained DBN into an ELM solver to calculate the output weights (Cao p.145 Section 4. Proposed Approach 1st paragraph and Cao pp.143-144 Section 2. Extreme Learning Machine 2nd-3rd paragraphs), where the ELM solver performs the linear matrix calculation shown in Cao p.144 equation (3), and this ELM solver/linear matrix calculation is based on the extreme learning machine generalization and universal approximation properties (Cao p.143 Section 2 Extreme Learning Machine). The process of applying learned weights and feature space H from the trained DBN is applied to this ELM solver/linear matrix calculation to determine the output weights of a deep and stable extreme learning machine (where this calculation represents an approximate solution) corresponds to a method for “finding an approximate solution to each classification task”.); 
approximating a distribution of features for each classification task (Cao p.146 Figure 1(b): examiner’s note: Cao teaches training a DS-ELM method for initializing the hidden layers of a deep extreme learning machine by receiving a set of inputs x_1, …, x_N, and learning weight parameters and a feature space H by training a deep belief network DBN (Cao p.145 Section 4. Proposed Approach 1st paragraph and Cao pp.143-144 Section 2. Extreme Learning Machine 2nd-3rd paragraphs). As indicated earlier (Cao p.144 Section 3 Deep Belief Networks 2nd-3rd paragraphs), Cao further teaches a repeated process of mapping the input into features (to produce an output that is fed into the next hidden layer) forming a joint distribution that receives input from the previous hidden layer to produce an output representation of features, and applies this output as input to the next hidden layer until the last hidden layer of the plurality of hidden layers is reached, thus corresponding to a method for “approximating a distribution of features for each classification task”.); and 
deriving a linear classifier[[,]] based on results of the approximating, the linear classifier being usable to initialize the parameters of the output layer (Cao p.146 Figure 1(a) and (b): examiner’s note: Referring to Cao Figure 1, Cao teaches training a DS-ELM that has an output layer (indicated by “top layer units”) that produces output labels T based on input received from the last hidden layer, where the application of this DS-ELM is targeted for classification and regression tasks, and where the pre-training of the DBN to learn the parameters of the hidden layer and the feature space H involves using a greedy unsupervised training approach (Cao pp.144-145 Section 3 Deep Belief Networks) that results in the lower layers of the hidden layer extracting and representing low-level features from the input data and the upper layers performing gradual refinement of previously learnt concepts (Cao p.142 2nd paragraph (Section 1 Introduction): “… DBNs are known to have good modeling ability for higher-order and highly non-linear statistical structure in the input [6]. … the first layers of DBNs are expected to extract relatively low-level features out of the input space while the upper layers are expected to gradually refine previously learnt concept to generate more abstract ones. … Because the output of the higher DBN layers can easily be used as the input of a supervised classifier, Ribeiro et al. [5] used an ELM classifier for classify the deep concepts …”). This refinement of previously learnt concepts is interpreted as aggregating the learnt low-level features into more complex and abstract functions or tasks, with Cao Figure 1(b) showing the hidden layers of a pretrained DBN (including the upper layer of the hidden layer) being used as input to approximate a supervised classifier (based on an earlier motivation from an existing Ribeiro reference as taught in Cao p.142 2nd paragraph). 
Hence this method shown in Cao Figure 1(b) and described in Cao p.145 Section 4 Proposed Approach Step 2 corresponds to a method for “deriving a linear classifier based on results of the approximating, the linear classifier being usable to initialize the parameters of the output layer”.).
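For illustration only, a linear classifier of the kind discussed for Claim 18 can be sketched as a weighted sum of the top hidden layer features followed by selection of the highest-scoring class. The names `class_scores` and `predict`, and the numeric values of `beta` and `H`, are hypothetical and chosen only to make the sketch runnable; they are not taken from Cao or from the claims.

```python
def class_scores(h, beta):
    """Linear scores for one feature vector h.

    beta[j][k] is the weight from hidden node j to output node k,
    so the score for class k is sum_j h[j] * beta[j][k].
    """
    n_out = len(beta[0])
    return [sum(h[j] * beta[j][k] for j in range(len(h))) for k in range(n_out)]


def predict(H, beta):
    """Return the index of the highest-scoring class for each row of H."""
    labels = []
    for h in H:
        scores = class_scores(h, beta)
        labels.append(max(range(len(scores)), key=scores.__getitem__))
    return labels


# Hypothetical output weights and feature-space rows.
beta = [[2.0, -1.0], [-1.0, 2.0]]
H = [[0.9, 0.1], [0.2, 0.8]]
labels = predict(H, beta)
```

The classifier is linear in the features H, so the same weight matrix can serve both as a classifier in its own right and as starting values for an output layer that is trained further.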
Regarding new Claim 21, Cao teaches
(New) The method of claim 1, 
wherein the at least one hidden layer comprises multiple hidden layers (Cao p.146 Figure 1(a) and (b): examiner’s note: Referring to Cao Figure 1, Cao teaches a DS-ELM that has three hidden layers between the input and output layers in both the deep belief network and the deep extreme learning machine (corresponding to “wherein the at least one hidden layer comprises multiple hidden layers”).), and 
further wherein the multiple hidden layers are parameterized by parameters of the pretrained deep neural network (Cao p.146 Figure 1(a) and (b): examiner’s note: Referring to Figure 1, Cao teaches a DS-ELM that has three hidden layers between the input and output layers in both the deep belief network and the deep extreme learning machine (corresponding to “multiple hidden layers”). Cao further teaches training of the deep belief network DBN, where the input vector is processed by the hidden layers to calculate the feature space H and the initialized weights W, and where this training algorithm involves performing greedy learning and backpropagation to generate a transformation of represented features, where this training  (Cao pp.144-145 Section 3 Deep Belief Networks: “DBNs are probabilistic generative mod[e]ls, or alternatively a kind of deep neural network, composed of multiple latent variables (hidden units). … As illustrated in Fig. 1(a), each layer tries to model the distribution of its input. Every RBM has a layer containing visible nodes v that represent the data and a layer condition hidden nodes h that learn to represent input features capturing higher-order correlations in the data. … By starting with the data vector on the visible units and alternating several times between sampling from p(h|v,W) and p(v|h, W), it is easy to get the learning weights W. The learning algorithm for DBNs proposed by Hinton et al., [14,18] has two training phases; (1) greedy learning algorithm for transforming representations (unsupervised learning), and (2) Back-Fitting with the up-down algorithm (fine tune).” and Cao p.145 Section 4. Proposed Approach 1st paragraph: “Our overall intention is to use a quick-and-dirty DBN to generate a relatively stable feature space H that is fed into an ELM to calculate the output weights. … Step 1. Setup a DBM structure fed with input vector x; and perform a quick-and-dirty training based on the pre-defined DBN structure [cf. Fig. 
1(a)] … Step 2. … The nodes in the top hidden layer can be viewed equal to hidden nodes in a typical ELM network; those hidden layer output matrix H is feature space of the input vector. the feature space H initiated via Step 1 with help of a DBN is then fed into a typical ELM solver to calculate the output weights 𝛃 with equation (3). [cf. Fig. 1(b)]”).).
Regarding new Claim 22, Cao teaches
(New) The method of claim 1, 
wherein the parameters of the output layer of the deep neural network are initialized based upon the deep neural network (Cao p.146 Figure 1(a) and (b); Cao p.145 Section 4 Proposed Approach 1st paragraph: examiner’s note: Cao teaches a DS-ELM method for β for a deep extreme learning machine Cao p.145 Section 4. Proposed Approach 1st paragraph), where the method performs the calculation shown in Cao p.144 equation (3), which is based on the learned feature space H from the trained DBN network. This feature space H is based on the distribution of the data in the feature space (Cao pp.144-145 Section 3 Deep Belief Networks), and hence this calculation of the output weights represents a non-random initialization of the parameters of the output layer, as it is based on the learned weights of a pre-trained DBN as well as a linear matrix calculation based on a feature space H (Cao pp.143-144 Section 2. Extreme Learning Machine 2nd-3rd paragraphs), thus corresponding to “wherein the parameters of the output layer of the deep neural network are initialized based upon the deep neural network …”).) …
… being trained to perform object recognition (Cao p.145 Section 4 Proposed Approach: “A typical example is object image classification problem which is especially challenging due to the fact that same object might appear different because of pose and illumination conditions; the low level visual features are far detached from the semantics of the scene, making it problem-prone when used to infer object presence … but the unsupervised pre-training of DBNs allows learning those complex functions by mapping the input to the output directly. Specifically speaking, the bottom layers are expected to extract and represent low-level features from the input data while the upper layers are expected to gradually refine previously learnt concepts [5].”).  
Regarding new Claim 23, Cao teaches
(New) The method of claim 9, 
wherein the at least one hidden layer of the deep neural network are parameterized with parameters of the corresponding at least one hidden layer of the pretrained deep neural network (Cao p.146 Figure 1(a) and (b): examiner’s note: Referring to Cao Figure 1, Cao teaches a DS-ELM that has three hidden layers between the input and output layers in both the deep belief network and the deep extreme learning machine (corresponding to “the at least β (represented as the weights between the last hidden layer and the output layer) for the deep extreme learning machine, where this applying of the learned weight parameters and feature space H to initialize the hidden layers of the deep extreme learning machine corresponds to “… wherein the at least one hidden layer of the deep neural network are parameterized with parameters of the corresponding at least one hidden layer of the pretrained deep neural network”) (Cao pp.144-145 Section 3 Deep Belief Networks: “DBNs are probabilistic generative mod[e]ls, or alternatively a kind of deep neural network, composed of multiple latent variables (hidden units). … As illustrated in Fig. 1(a), each layer tries to model the distribution of its input. Every RBM has a layer containing visible nodes v that represent the data and a layer condition hidden nodes h that learn to represent input features capturing higher-order correlations in the data. … By starting with the data vector on the visible units and alternating several times between sampling from p(h|v,W) and p(v|h, W), it is easy to get the learning weights W. The learning algorithm for DBNs proposed by Hinton et al., [14,18] has two training phases; (1) greedy learning algorithm for transforming representations (unsupervised learning), and (2) Back-Fitting with the up-down algorithm (fine tune).” and Cao p.145 Section 4. 
Proposed Approach 1st paragraph: “Our overall intention is to use a quick-and-dirty DBN to generate a relatively stable feature space H that is fed into an ELM to calculate the output weights. … Step 1. Setup a DBM structure fed with input vector x; and perform a quick-and-dirty training based on the pre-defined DBN structure [cf. Fig. 1(a)] … Step 2. … The nodes in the top hidden layer can be viewed equal to hidden nodes in a typical ELM network; those hidden layer output matrix H is feature space of the input vector. the feature space H initiated via Step 1 with help of a DBN is then fed into a typical ELM solver to calculate the output weights 𝛃 with equation (3). [cf. Fig. 1(b)]”).).
Regarding new Claim 24, Cao teaches
(New) The method of claim 9, 
wherein the deep neural network and the pretrained deep neural network have an identical structure in that the deep neural network and the pretrained deep neural network include a same number of hidden layers and a same number of nodes in the hidden layers (Cao p.146 Figure 1(a) and (b): examiner’s note: Referring to Cao Figure 1, Cao teaches a DS-ELM that has three hidden layers between the input and output layers in both the deep belief network and the deep extreme learning machine, and each hidden layer from both the deep belief network and deep extreme learning machine has the same plurality of nodes (corresponding to “wherein the deep neural network and the pretrained deep neural network have an identical structure in that the deep neural network and the pretrained deep neural network include a same number of hidden layers and a same number of nodes in the hidden layers”).).
Regarding new Claim 25, Cao teaches
(New) The system of claim 14, wherein the level initializing logic includes a linear classifier that is configured to initialize the parameters of the output level, wherein the linear classifier is derived based upon the data distribution in the feature space (Cao p.146 Figure 1(a) and (b): examiner’s note: Referring to Cao Figure 1, Cao teaches training a DS-ELM that has an output layer (indicated by “top layer units”) that produces output labels T based on input received from the last hidden layer, where the application of this DS-ELM is targeted for classification and regression tasks, and where the pre-training of the DBN to learn the parameters of the hidden layer and the feature space H involves using a greedy learning algorithm (Cao pp.144-145 Section 3 Deep Belief Networks) that results in the lower layers of the hidden layer extracting and representing low-level features from the input data and the upper layers performing gradual refinement of previously learnt concepts (Cao p.142 2nd paragraph (Section 1 Introduction): “… DBNs are known to have good modeling ability for higher-order and highly non-linear statistical structure in the input [6]. … the first layers of DBNs are expected to extract relatively low-level features out of the input space while the upper layers are expected to gradually refine previously learnt concept to generate more abstract ones. … Because the output of the higher DBN layers can easily be used as the input of a supervised classifier, Ribeiro et al. [5] used an ELM classifier for classify the deep concepts …”). This refinement of previously learnt concepts is interpreted as aggregating the learnt low-level features into more complex and abstract functions or tasks, with Cao Figure 1(b) showing the hidden layers of a pretrained DBN (including the upper layer of the hidden layer) being used as input to approximate a supervised classifier (based on an earlier motivation from an existing Ribeiro reference as taught in Cao p.142 2nd paragraph). 
Hence this method shown in Cao Figure 1(b) and described in Cao p.145 Section 4 Proposed Approach Step 2 corresponds to “wherein the level initializing logic includes a linear classifier that is configured to initialize the parameters of the output level, wherein the linear classifier is derived based upon the data distribution in the feature space”.).

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
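A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.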
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Cao et al., A Deep and Stable Extreme Learning Approach for Classification and Regression, Proceedings of ELM-2014 Volume 1, Proceedings in Adaptation, Learning and Optimization 3, DOI: 10.1007/978-3-319-14063-6_13, Springer International Publishing Switzerland 2015, pp.141-150 [hereafter referred as Cao] in view of Akusok et al., High-Performance Extreme Learning Machines: A Complete Toolbox for Big Data Applications, July 17 2015, IEEE Access Volume 3, 2015, pp.1011-1025 [hereafter referred as Akusok].
Regarding amended Claim 4, Cao teaches
The method of claim 1,
wherein 
approximating a distribution of features for each classification task (Cao p.146 Figure 1(b): examiner’s note: Cao teaches training a DS-ELM method for initializing the hidden layers of a deep extreme learning machine by receiving a set of inputs $x_1, \ldots, x_N$, and learning weight parameters (corresponding to “parameters of the pre-trained artificial neural network”) and a feature space H (corresponding to “the data distribution in the feature space”) by training a deep belief network DBN (corresponding to the “pretrained artificial neural network”) and applying these learned weight parameters and feature space H to an ELM solver to calculate the output weights (Cao p.145 Section 4. Proposed Approach 1st paragraph and Cao pp.143-144 Section 2. Extreme Learning Machine 2nd-3rd paragraphs). As indicated earlier (Cao p.144 Section 3 Deep Belief Networks 2nd-3rd paragraphs), Cao further teaches a repeated process of mapping the input into features (to produce an output that is fed into the next hidden layer) forming a joint distribution that receives input from the previous hidden layer to produce an output representation of features, and applies this output as input to the next hidden layer until the last hidden layer of the plurality of hidden layers is reached, thus corresponding to a method for “approximating a distribution of features for each classification task”, as this method applies for binary classification as well as multiple category classification (Cao p.147 Table 1 (Section 5 Experiments and Analysis), corresponding to “each classification task”).) …
deriving an … linear classifier[[,]] based upon results of the approximating, the … linear classifier being usable to update the parameters of the output layer of the DNN model (Cao p.146 Figure 1(a) and (b): examiner’s note: Referring to Cao Figure 1, Cao teaches training a DS-ELM that has an output layer (indicated by “top layer units”) that produces output labels T based on input received from the last hidden layer, where the application of this DS-ELM is targeted for classification and regression tasks, and where the pre-training of the DBN to learn the parameters of the hidden layer and the feature space H involves using a greedy learning algorithm (Cao pp.144-145 Section 3 Deep Belief Networks) that results in the lower layers of the hidden layer extracting and representing low-level features from the input data and the upper layers performing gradual refinement of previously learnt concepts (Cao p.142 2nd paragraph (Section 1 Introduction): “… DBNs are known to have good modeling ability for higher-order and highly non-linear statistical structure in the input [6]. … the first layers of DBNs are expected to extract relatively low-level features out of the input space while the upper layers are expected to gradually refine previously learnt concept to generate more abstract ones. … Because the output of the higher DBN layers can easily be used as the input of a supervised classifier, Ribeiro et al. [5] used an ELM classifier for classify the deep concepts …”). This refinement of previously learnt concepts is interpreted as aggregating the learnt low-level features into more complex and abstract functions or tasks, with Cao Figure 1(b) showing the hidden layers of a pretrained DBN (including the upper layer of the hidden layer) being used as input to approximate a supervised classifier (based on an earlier motivation from an existing Ribeiro reference as taught in Cao p.142 2nd paragraph). 
Hence this method shown in Cao Figure 1(b) and described in Cao p.145 Section 4 Proposed Approach Step 2 corresponds to a method for “deriving an … linear classifier based upon results of the approximating, the … linear classifier being usable to update the parameters of the output layer of the DNN model”.).
While Cao teaches deriving a linear classifier, Cao does not explicitly teach
… deriving an optimal linear classifier[[,]] …
Akusok teaches
… deriving an optimal linear classifier[[,]] (Examiner’s note: Akusok teaches calculating an estimate of output weights β (shown in Akusok Eq.13) involving estimations based on the feature space (represented by matrix H) and the output target (represented by T) (as shown in Akusok Eq.12), where the best linear unbiased estimator (shown in Akusok Eq.11) and the corresponding regularization term applied to the inverse matrix $H^{T}$ of the feature space results (Akusok p.1014 col.2-p.1015 col.1 Section III.B. ELM Solution with Best Linear Unbiased Estimator and Section III.C. Numerical Stability of an ELM Solution with Correlation Matrices: “The best linear unbiased estimator gives the optimal least squares solution to the matrix Xβ = T for stochastic vectors x and t combined into the corresponding matrices. It uses two theoretical correlation matrices $E[x^{T}x] = C_{xx}$, $E[x^{T}t] = C_{xt}$ (10) which are assumed to be known. The best linear unbiased estimator of T, denoted by Y, is then $Y = C_{xx}^{-1} C_{xt} X = \beta X$. (11) The inverse of Cxx exists because x is a stochastic variable for which $C_{xx} = E[x^{T}x]$ has a full rank. The ELM problem has a finite amount of projected data samples H and corresponding targets T, so the correlation matrices are replaced by their estimations $C_{xx} \approx H^{T}H = \Omega_{h}$, $C_{xt} \approx H^{T}T = \Omega_{t}$ (12), and the ELM output weights are computed from those estimates $\beta = (H^{T}H)^{-1} H^{T}T = \Omega_{h}^{-1} \Omega_{t}$ (13). … If numerical instabilities are faced in the inverse, a regularization term is applied to the correlation matrix $\Omega_{h} = H^{T}H + \alpha I$, where α is a small positive constant. This approach is called Ridge Regression …”).).
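For illustration only, the calculation quoted above from Akusok Eq. 12-13, together with the ridge regularization term, may be sketched as follows. This is a minimal sketch and is not code from Akusok or Cao: the function names (`transpose`, `matmul`, `solve`, `elm_output_weights`) and the toy matrices are assumptions made for the example, and a plain Gauss-Jordan solver stands in for whatever numerical routine the references actually use.

```python
def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [[float(v) for v in a[i]] + [float(v) for v in b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular; a larger alpha may help")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def elm_output_weights(H, T, alpha=0.0):
    """Compute beta = (H^T H + alpha I)^(-1) H^T T.

    H is the hidden-layer output matrix (samples x hidden nodes) and T is
    the target matrix (samples x outputs). alpha = 0 gives Eq. 13 as quoted;
    alpha > 0 adds the ridge term to the correlation matrix Omega_h.
    """
    Ht = transpose(H)
    omega_h = matmul(Ht, H)   # estimate of C_xx
    omega_t = matmul(Ht, T)   # estimate of C_xt
    for i in range(len(omega_h)):
        omega_h[i][i] += alpha
    return solve(omega_h, omega_t)


# Toy data (hypothetical): three samples, two hidden nodes, one output.
H = [[1, 0], [0, 1], [1, 1]]
T = [[1], [2], [3]]
print(elm_output_weights(H, T))  # prints [[1.0], [2.0]]
```

With `alpha=1.0` the same toy data yields smaller weights, which shows the shrinkage effect of the ridge term on the computed output weights.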
Both Cao and Akusok are analogous art since they both teach extreme learning machines and calculating the output weight values using approximate solutions to derive linear classifiers.  
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the calculations for deriving a linear classifier for the DS-ELM taught in Cao and further incorporate the calculations with regularization shown in the best linear unbiased estimator calculation taught in Akusok to derive an optimal classifier. The motivation to combine is taught in Akusok, as extreme learning machines provide fast techniques to initialize and train a neural network, thus improving the training efficiency of a neural network algorithm when training with large sets of data (“big data”), as well as (Akusok p.1011 col.2 3rd paragraph-p.1012 col.1 2nd paragraph (Section I. Introduction): “Extreme Learning Machines are well suited for solving Big Data [18] problems because their solution is so rapidly obtained … Extreme Learning Machines also benefit greatly from model structure selection and regularization, which reduces the negative effects of random initialization and over-fitting. The methods include $L_{1}$ [27], [28] and $L_{2}$ [29] regularization, as well as other methods [30] like handling imbalance classification [31].”).
Claims 5-6, 10, 12-13, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Cao et al., A Deep and Stable Extreme Learning Approach for Classification and Regression, Proceedings of ELM-2014 Volume 1, Proceedings in Adaptation, Learning and Optimization 3, DOI: 10.1007/978-3-319-14063-6_13, Springer International Publishing Switzerland 2015, pp.141-150 [hereafter referred as Cao] in view of Akusok et al., High-Performance Extreme Learning Machines: A Complete Toolbox for Big Data Applications, July 17 2015, IEEE Access Volume 3, 2015, pp.1011-1025 [hereafter referred as Akusok]; in further view of Huang et al., Discriminative clustering via extreme learning machine, June 19 2015, Elsevier, Neural Networks 70 (2015), pp.1-8 [hereafter referred as Huang].
Regarding original Claim 5, Cao in view of Akusok as applied to Claim 4 teaches
(Original) The method of claim 4, wherein 
each distribution is Gaussian (Examiner’s note: Akusok teaches the extreme learning machine hidden layer neurons support various non-linear transformation functions, including radial basis function neurons that exhibit Gaussian behavior (corresponding to “each distribution is Gaussian”) (Akusok p.1013 col.1 1st paragraph II.B. Hidden Neurons: “The hidden layer is not constrained to have only one type of transformation function in neurons. Different functions can be used (sigmoid, hyperbolic tangent, threshold, etc.)”; Akusok p.1013 col.1 2nd paragraph II.B. Hidden Neurons: “Another type of neurons commonly present in ELMs is the Radial Basis Function (RBF) neurons [32]. They use distances to centroids as inputs to the hidden layer, instead of a linear projections. The non-linear projection function is applied as usual. ELMs with RBF neurons compute predictions based on similar training data samples, which helps solving tasks with a complex dependency between data features and targets. Any function (norm) of distances between samples and centroids can be used, for instance L2, L1 or L∞ norms.”; and Akusok p.1018 col.2 IV.M. How to Use Gaussian (RBF) Neurons: “The ELM toolbox has Gaussian neurons. Centroids are given instead of a projection matrix W and kernel widths in a bias vector b. There are three kinds of distance functions: L2 (Euclidean), L1 and L∞.”).).
However, Cao in view of Akusok does not teach
[each distribution] … shares a same covariance, and does not share a same mean.
Huang teaches
[each distribution] … shares a same covariance, and does not share a same mean (Examiner’s note: Huang teaches using an extreme learning machine to perform discriminative clustering based on linear discriminant analysis, where the output weights are learned by performing linear discriminant analysis on the hidden layer output, and where clustering involves grouping of outputs with different mean (corresponding to “[each distribution] … does not share a same mean”), with the hidden layer scatter matrices representing the shared covariance (corresponding to “[each distribution] … shares a same covariance”) (Huang p.4 col.2 Section 4.2 ELM clustering based on LDA: “Inspired by the DisCluster algorithm (Ding & Li, 2007), we extend ELM for discriminative clustering based on LDA. The idea is to perform LDA and k-means in the output space of ELM alternatively. Since the transformation matrix learned by LDA is a linear mapping, it can be absorbed by the output weight matrix of ELM, and we can directly learn the output weight β by performing LDA on the hidden layer output of ELM. … Basically, the hidden layer output matrix H can be viewed as the new data matrix, and its within-class and between-class scatter matrices can be computed similarly as that in standard LDA.”).).
Both Cao in view of Akusok and Huang are analogous art as both describe the usage of extreme learning machines to perform classification and to determine initialization of output weights.
It would have been obvious to a person having ordinary skill in the art before the effective filing date to take the extreme learning machine method of initializing output weights of Cao in view of Akusok and enhance it with the extreme learning machine method of initializing output weights of Huang to perform the initialization of output weights using a linear discriminant method for classified distributions that share a same covariance and have different mean. The motivation to combine is taught in Huang, as a way to leverage the benefits of extreme learning machine (high-efficiency, ease of implementation, capability to handle multi-classification problems) with linear discriminant analysis, with the combination shown in Table 2 of Huang having the added benefit of outperforming other clustering classification methods such as k-means, thus making this combined solution an improvement for solving multi-classification problems (Huang p.2 col.1, 4th paragraph: “The motivation is to take advantage of ELM, and to design clustering algorithms which inherit its salient advantages, such as high efficiency, easiness of implementation and capable of handling multi-class data set.” and Huang p.6 col.1, 6th paragraph; p.6 Table 2: “ELMCIter, ELMCLDA and ELMCKM outperform the baseline methods, k-means and ELM k-means, on most data sets.”).
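For illustration only, the relationship relied upon above (Gaussian class distributions that share a same covariance but have different means give rise to a linear classifier) may be sketched as follows. This is a textbook linear discriminant calculation under stated assumptions, not the ELM clustering algorithm of Huang, which alternates LDA and k-means; the function names (`solve`, `fit_shared_covariance_classifier`, `predict`) and the toy data are hypothetical.

```python
import math


def solve(a, b):
    """Solve a @ x = b for a vector b by Gauss-Jordan elimination."""
    n = len(a)
    aug = [[float(v) for v in a[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("shared covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


def fit_shared_covariance_classifier(X, y):
    """Derive per-class linear weights from class centroids and a pooled covariance.

    Assumes each class is Gaussian with its own mean and one shared covariance.
    Returns (weights, biases) with w_k = S^(-1) mu_k and
    b_k = -0.5 * mu_k . w_k + log(prior_k).
    """
    classes = sorted(set(y))
    n, d = len(X), len(X[0])
    means, counts = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        counts[c] = len(rows)
        means[c] = [sum(col) / len(rows) for col in zip(*rows)]  # class centroid
    # Pooled (shared) within-class covariance estimate.
    cov = [[0.0] * d for _ in range(d)]
    for x, label in zip(X, y):
        diff = [xi - mi for xi, mi in zip(x, means[label])]
        for i in range(d):
            for j in range(d):
                cov[i][j] += diff[i] * diff[j]
    cov = [[v / (n - len(classes)) for v in row] for row in cov]
    weights, biases = {}, {}
    for c in classes:
        w = solve(cov, means[c])
        weights[c] = w
        biases[c] = (-0.5 * sum(wi * mi for wi, mi in zip(w, means[c]))
                     + math.log(counts[c] / n))
    return weights, biases


def predict(weights, biases, x):
    """Return the class whose linear score w_k . x + b_k is largest."""
    return max(weights, key=lambda c: sum(wi * xi for wi, xi in zip(weights[c], x)) + biases[c])


# Toy feature vectors (hypothetical): two classes with different centroids.
X = [[0, 0], [1, 0], [0, 1], [1, 1], [4, 4], [5, 4], [4, 5], [5, 5]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = fit_shared_covariance_classifier(X, y)
print(predict(w, b, [0.2, 0.3]), predict(w, b, [4.8, 4.1]))  # prints 0 1
```

The weights depend only on the class centroids and the single pooled covariance matrix, so the resulting decision rule is linear in the features.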
Regarding original Claim 6, Cao in view of Akusok as applied to Claim 4 teaches
(Original) The method of claim 4.
However, Cao in view of Akusok does not teach
wherein the approximating is based on at least one of class centroid statistics and shared covariance matrix statistics.  
Huang teaches
wherein the approximating is based on at least one of class centroid statistics and shared covariance matrix statistics (Examiner’s note: Huang teaches using an extreme learning machine to perform discriminative clustering based on linear discriminant analysis, where the output weights are learned by performing linear discriminant analysis on the hidden layer output, and where clustering involves grouping of outputs with different mean (corresponding to “at least one of class centroid statistics”), with the hidden layer scatter matrices representing the shared covariance (corresponding to “shared covariance matrix statistics”) (Huang p.4 col.2 Section 4.2 ELM clustering based on LDA: “Inspired by the DisCluster algorithm (Ding & Li, 2007), we extend ELM for discriminative clustering based on LDA. The idea is to perform LDA and k-means in the output space of ELM alternatively. Since the transformation matrix learned by LDA is a linear mapping, it can be absorbed by the output weight matrix of ELM, and we can directly learn the output weight β by performing LDA on the hidden layer output of ELM. … Basically, the hidden layer output matrix H can be viewed as the new data matrix, and its within-class and between-class scatter matrices can be computed similarly as that in standard LDA.”).).  
Both Cao in view of Akusok and Huang are analogous art as both describe the usage of extreme learning machines to perform classification and to determine initialization of output weights.
It would have been obvious to a person having ordinary skill in the art before the effective filing date to take the extreme learning machine method of initializing output weights of Cao in view of Akusok and enhance it with the extreme learning machine method of initializing output weights of Huang to perform the initialization of output weights using a linear discriminant method for classified distributions that share a same covariance and have different mean. The motivation to combine is taught in Huang, as a way to leverage the benefits of extreme learning machine (high-efficiency, ease of implementation, capability to handle multi-classification problems) with linear discriminant analysis, with the combination shown in Table 2 of Huang having the added benefit of outperforming other clustering classification methods such as k-means, thus making this combined solution an improvement for solving multi-classification problems (Huang p.2 col.1, 4th paragraph: “The motivation is to take advantage of ELM, and to design clustering algorithms which inherit its salient advantages, such as high efficiency, easiness of implementation and capable of handling multi-class data set.” and Huang p.6 col.1, 6th paragraph; p.6 Table 2: “ELMCIter, ELMCLDA and ELMCKM outperform the baseline methods, k-means and ELM k-means, on most data sets.”).
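For illustration, the shared-covariance, different-mean classifier discussed for Claim 6 can be sketched as follows. This is a minimal sketch of textbook linear discriminant analysis with equal class priors, not code from Huang, Cao, or Akusok; the names `solve`, `lda_weights`, and `predict`, and the list-of-lists feature matrix `H`, are assumptions made for the example. It computes class centroids and a pooled within-class covariance, then derives one linear weight vector and bias per class.

```python
# Illustrative sketch only: generic shared-covariance LDA on a feature matrix H.
# All names are hypothetical and are not taken from the cited references.

def solve(a, b):
    """Solve a x = b for a square, non-singular a (Gauss-Jordan, partial pivoting)."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def lda_weights(H, labels):
    """Return per-class linear weights and biases, assuming equal class priors."""
    d = len(H[0])
    classes = sorted(set(labels))
    # Class centroid statistics: one mean vector per class.
    means = {}
    for c in classes:
        rows = [h for h, y in zip(H, labels) if y == c]
        means[c] = [sum(col) / len(rows) for col in zip(*rows)]
    # Shared covariance statistics: pooled within-class covariance.
    cov = [[0.0] * d for _ in range(d)]
    for h, y in zip(H, labels):
        diff = [hi - mi for hi, mi in zip(h, means[y])]
        for i in range(d):
            for j in range(d):
                cov[i][j] += diff[i] * diff[j]
    dof = len(H) - len(classes)
    cov = [[v / dof for v in row] for row in cov]
    # Linear discriminant per class: w_c = cov^-1 mu_c, b_c = -0.5 * mu_c . w_c
    weights, biases = {}, {}
    for c in classes:
        w = solve(cov, means[c])
        weights[c] = w
        biases[c] = -0.5 * sum(wi * mi for wi, mi in zip(w, means[c]))
    return weights, biases


def predict(h, weights, biases):
    """Pick the class with the largest linear score w_c . h + b_c."""
    def score(c):
        return sum(wi * hi for wi, hi in zip(weights[c], h)) + biases[c]
    return max(weights, key=score)
```

In this sketch the resulting weights and biases could serve as candidate initial values for a linear output layer; whether that matches any particular claimed initialization is not asserted here.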
Regarding amended Claim 10, Cao teaches
(Currently amended) The method of claim 9, wherein the estimating the initializing values includes:
approximating a distribution of the features for each class of data (Cao p.146 Figure 1(b): examiner’s note: Cao teaches training a DS-ELM method for initializing the hidden layers of a deep extreme learning machine by receiving a set of inputs $x_1, \ldots, x_N$, and learning weight parameters (corresponding to “parameters of the pre-trained artificial neural network”) and a feature space H (corresponding to “the data distribution in the feature space”) by training a deep belief network DBN (corresponding to the “pretrained artificial neural network”) and applying these learned weight parameters and feature space H to an ELM solver to calculate the output weights (Cao p.145 Section 4. Proposed Approach 1st paragraph and Cao pp.143-144 Section 2. Extreme Learning Machine 2nd-3rd paragraphs). As indicated earlier (Cao p.144 Section 3 Deep Belief Networks 2nd-3rd paragraphs), Cao further teaches a repeated process of mapping the input into features (to produce an output that is fed into the next hidden layer) forming a joint distribution that receives input from the previous hidden layer to produce an output representation of features, and applies this output as input to the next hidden layer until the last hidden layer of the plurality of hidden layers is reached, thus corresponding to a method for “approximating a distribution of the features for each class of data”, as this method applies for Cao p.147 Table 1 (Section 5 Experiments and Analysis, corresponding to “each class of data”).) …
… deriving a linear classifier based on the distribution (Cao p.146 Figure 1(a) and (b): examiner’s note: Referring to Cao Figure 1, Cao teaches training a DS-ELM that has an output layer (indicated by “top layer units”) that produces output labels T based on input received from the last hidden layer, where the application of this DS-ELM is targeted for classification and regression tasks, and where the pre-training of the DBN to learn the parameters of the hidden layer and the feature space H (corresponding to a data distribution in the feature space) involves using a greedy unsupervised training approach (Cao pp.144-145 Section 3 Deep Belief Networks) that results in the lower layers of the hidden layer extracting and representing low-level features from the input data and the upper layers performing gradual refinement of previously learnt concepts (Cao p.142 2nd paragraph (Section 1 Introduction): “… DBNs are known to have good modeling ability for higher-order and highly non-linear statistical structure in the input [6]. … the first layers of DBNs are expected to extract relatively low-level features out of the input space while the upper layers are expected to gradually refine previously learnt concept to generate more abstract ones. … Because the output of the higher DBN layers can easily be used as the input of a supervised classifier, Ribeiro et al. [5] used an ELM classifier for classify the deep concepts …”). This refinement of previously learnt concepts is interpreted as aggregating the learnt low-level features into more complex and abstract functions or tasks, with Cao Figure 1(b) showing the hidden layers of a pretrained DBN (including the upper layer of the hidden layer, the learned weight parameters and the feature space H) being used as input to approximate a supervised classifier (based on an earlier motivation from an existing Ribeiro reference as taught in Cao p.142 2nd paragraph). 
Hence this method shown in Cao Figure 1(b) and described in Cao p.145 Section 4 Proposed Approach Step 2 corresponds to a method for “deriving a linear classifier based on the distribution”.); and 
calculating the initializing values of the task-specific layer (Cao p.146 Figure 1(b): examiner’s note: Cao teaches a DS-ELM that applies the learned weights and feature space H from the trained DBN into an ELM solver to calculate the output weights (Cao p.145 Section 4. Proposed Approach 1st paragraph and Cao pp.143-144 Section 2. Extreme Learning Machine 2nd-3rd paragraphs), where the ELM solver performs the linear matrix calculation shown in Cao p.144 equation (3), and where this ELM solver/linear matrix calculation is based on the extreme learning machine generalization and universal approximation properties (Cao p.143 Section 2 Extreme Learning Machine). The fact that the learned weights and feature space H from the trained DBN is applied to this ELM solver/linear matrix calculation to determine the output weights of a deep extreme learning machine makes this calculation an approximate solution to the linear classifier, thus corresponding to “calculating the initializing values of the task-specific layer using the derived linear classifier”.).
However, Cao does not teach
… the distributions having Gaussian distributions …
Akusok teaches
… the distributions having Gaussian distributions (Examiner’s note: Akusok teaches the extreme learning machine hidden layer neurons support various non-linear transformation functions, including radial basis function neurons that exhibit Gaussian behavior (corresponding to “the distributions having Gaussian distributions”) (Akusok p.1013 col.1 1st paragraph II.B. Hidden Neurons: “The hidden layer is not constrained to have only one type of transformation function in neurons. Different functions can be used (sigmoid, hyperbolic tangent, threshold, etc.)”; Akusok p.1013 col.1 2nd paragraph II.B. Hidden Neurons: “Another type of neurons commonly present in ELMs is the Radial Basis Function (RBF) neurons [32]. They use distances to centroids as inputs to the hidden layer, instead of a linear projections. The non-linear projection function is applied as usual. ELMs with RBF neurons compute predictions based on similar training data samples, which helps solving tasks with a complex dependency between data features and targets. Any function (norm) of distances between samples and centroids can be used, for instance L2, L1 or L∞ norms.”; and Akusok p.1018 col.2 IV.M. How to Use Gaussian (RBF) Neurons: “The ELM toolbox has Gaussian neurons. Centroids are given instead of a projection matrix W and kernel widths in a bias vector b. There are three kinds of distance functions: L2 (Euclidean), L1 and L∞.”).) … 
Both Cao and Akusok are analogous art since they both teach calculating output weight values at the output layer using approximate solutions.  
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the calculations for deriving a linear classifier for the DS-ELM taught in Cao and further incorporate the detailed calculations with regularization shown in the best linear unbiased estimator calculation taught in Akusok as a way to improve the linear classifier to derive an optimal classifier. The motivation to combine is taught in Akusok, as extreme learning machines provide fast techniques to initialize and train a neural network, thus improving the training efficiency of a neural network algorithm when training with large sets of data (“big data”), as well as accommodating the regularization in the calculation of the best linear unbiased estimator to offset any negative effects due to over-fitted training data, thus making the system utilizing the trained neural network more resilient and robust to various types of training data (Akusok p.1011 col.2 3rd paragraph-p.1012 col.1 2nd paragraph (Section I. Introduction): “Extreme Learning Machines are well suited for solving Big Data [18] problems because their solution is so rapidly obtained … Extreme Learning Machines also benefit greatly from model structure selection and regularization, which reduces the negative effects of random initialization and over-fitting. The methods include $L_1$ [27], [28] and $L_2$ [29] regularization, as well as other methods [30] like handling imbalance classification [31].”).
Cao in view of Akusok does not teach
… the distributions having … a shared covariance…
Huang teaches
… the distributions having … a shared covariance (Examiner’s note: Huang teaches using an extreme learning machine to perform discriminative clustering based on linear discriminant analysis, where the output weights are learned by performing linear discriminant analysis on the hidden layer output, and where clustering involves grouping of outputs with different mean, with the hidden layer scatter matrices representing the shared covariance (Huang p.4 col.2 Section 4.2 ELM clustering based on LDA: “Inspired by the DisCluster algorithm (Ding & Li, 2007), we extend ELM for discriminative clustering based on LDA. The idea is to perform LDA and k-means in the output space of ELM alternatively. Since the transformation matrix learned by LDA is a linear mapping, it can be absorbed by the output weight matrix of ELM, and we can directly learn the output weight β by performing LDA on the hidden layer output of ELM. … Basically, the hidden layer output matrix H can be viewed as the new data matrix, and its within-class and between-class scatter matrices can be computed similarly as that in standard LDA.”).) …
Both Cao in view of Akusok and Huang are analogous art as both describe the usage of extreme learning machines to perform classification and to determine initialization of output weights.
It would have been obvious to a person having ordinary skill in the art before the effective filing date to take the extreme learning machine method of initializing output weights of Cao in view of Akusok and enhance it with the extreme learning machine method of initializing output weights of Huang to perform the initialization of output weights using a linear discriminant method for classified distributions that share a same covariance and have different mean. The motivation to combine is taught in Huang, as a way to leverage the benefits of extreme learning machine (high-efficiency, ease of implementation, capability to handle multi-classification problems) with linear discriminant analysis, with the combination shown in Table 2 of Huang having the added benefit of outperforming other clustering classification methods such as k-means, thus making this combined solution an improvement for solving multi-classification problems (Huang p.2 col.1, 4th paragraph: “The motivation is to take advantage of ELM, and to design clustering algorithms which inherit its salient advantages, such as high efficiency, easiness of implementation and capable of handling multi-class data set.” and Huang p.6 col.1, 6th paragraph; p.6 Table 2: “ELMCIter, ELMCLDA and ELMCKM outperform the baseline methods, k-means and ELM k-means, on most data sets.”).
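The ELM output-weight calculation cited from Cao for Claim 10 is characterized above only as a linear matrix calculation. A minimal sketch of the standard ELM least-squares solution is given below, assuming the calculation takes the usual form β = H†T and solving it here through the normal equations (HᵀH)β = HᵀT. The function names and the toy matrices are hypothetical and are not taken from Cao.

```python
# Illustrative sketch only: least-squares output weights for a fixed hidden-layer
# output matrix H and targets T. Assumes H has full column rank so that H^T H
# is invertible. All names are hypothetical.

def solve(a, b):
    """Solve a x = b for a square, non-singular a (Gauss-Jordan, partial pivoting)."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def elm_output_weights(H, T):
    """Return beta minimizing ||H beta - T||^2 for a single output column T."""
    n_hidden = len(H[0])
    # Normal equations: (H^T H) beta = H^T T
    HtH = [[sum(h[i] * h[j] for h in H) for j in range(n_hidden)]
           for i in range(n_hidden)]
    HtT = [sum(h[i] * t for h, t in zip(H, T)) for i in range(n_hidden)]
    return solve(HtH, HtT)


# Toy example: three samples, two hidden units, targets generated by beta = [2, -1].
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
T = [2.0, -1.0, 1.0]
beta = elm_output_weights(H, T)
```

The hidden-layer parameters are held fixed in this sketch; only the output weights are solved for, which is why a single linear solve suffices.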
Regarding original Claim 12, Cao in view of Akusok, in further view of Huang teaches
(Original) The method of claim 10, 
wherein the determining is based on how data is distributed in the feature space (Cao p.146 Figure 1(a) and (b): examiner’s note: Referring to Cao Figure 1, Cao teaches training a DS-ELM that has an output layer (indicated by “top layer units”) that produces output labels T based on input received from the last hidden layer, where the application of this DS-ELM is targeted for classification and regression tasks, and where the pre-training of the DBN to learn the parameters of the hidden layer and the feature space H involves using a greedy unsupervised training approach (Cao pp.144-145 Section 3 Deep Belief Networks) that results in the lower layers of the hidden layer extracting and representing low-level features from the input data and the upper layers performing gradual refinement of previously learnt concepts (Cao p.142 2nd paragraph (Section 1 Introduction): “… DBNs are known to have good modeling ability for higher-order and highly non-linear statistical structure in the input [6]. … the first layers of DBNs are expected to extract relatively low-level features out of the input space while the upper layers are expected to gradually refine previously learnt concept to generate more abstract ones. … Because the output of the higher DBN layers can easily be used as the input of a supervised classifier, Ribeiro et al. [5] used an ELM classifier for classify the deep concepts …”). This refinement of previously learnt concepts is interpreted as aggregating the learnt low-level features into more complex and abstract functions or tasks (Cao p.145 Section 4 Proposed Approach – Auto-abstraction of deep concepts 1st paragraph), and as such, the output from the last hidden layer to the output layer corresponds to “wherein the determining is based on how data is distributed in the feature space” (Cao p.141 Section 1 Introduction: “The research introduced in this paper will focus on the performance of learning machines with respect to accuracy and stability of both classification and regression tasks.” and Cao p.145 Section 5. Experiments and Analysis and Cao p.147 Table 1: “In order to extensively verify the performance of DS-ELM, a variety type of real-world data was chosen for each problem category (regression, binary classification, and multi-category classification).” and Cao p.145 Section 4 Proposed Approach – Auto-abstraction of deep concepts: “…the unsupervised pre-training of DBNs allows learning those complex functions by mapping the input to the output directly. Specifically speaking, the bottom layers are expected to extract and represent low-level features from the input data while the upper layers are expected to gradually refine previously learnt concepts [5].”).).  
Regarding amended Claim 13, Cao in view of Akusok, in further view of Huang teaches
(Currently amended) The method of claim 10, further comprising
introducing a regularization term to a covariance matrix so as to minimize variability of covariance matrix estimation ([Huang p.4 col.2 Section 4.2 ELM clustering based on LDA: examiner’s note: Huang teaches using an extreme learning machine to perform discriminative clustering based on linear discriminant analysis, where the output weights are learned by performing linear discriminant analysis on the hidden layer output (“Inspired by the DisCluster algorithm (Ding & Li, 2007), we extend ELM for discriminative clustering based on LDA. The idea is to perform LDA and k-means in the output space of ELM alternatively. Since the transformation matrix learned by LDA is a linear mapping, it can be absorbed by the output weight matrix of ELM, and we can directly learn the output weight β by performing LDA on the hidden layer output of ELM. … Basically, the hidden layer output matrix H can be viewed as the new data matrix, and its within-class and between-class scatter matrices can be computed similarly as that in standard LDA.”).] [Huang p.3 col.2 Section 3.2 Discriminative clustering via LDA: examiner’s note: Huang teaches in eq.9 and eq.10 in Huang Section 3.2 the applying of a ridge term λId, where λ is a regularization term, to the within-class scatter matrix (“It can be observed that both Σb and Σw are functions of the label matrix Y, since Y decides which cluster a sample is assigned to. For high dimensional data, a ridge term λId (Id is the identity matrix of dimension d) is added to the within-class scatter matrix to avoid numeric problems.”).] [Huang p.5 col.1, Section 4.2 ELM clustering based on LDA: examiner’s note: Huang teaches using LDA minimizes the within-class distortion (corresponding to “introducing a regularization term to a covariance matrix so as to minimize variability of covariance matrix estimation”) (“Since LDA minimizes the within-class distortion, and maximizes between class discrimination, the algorithm is able to find cluster structure in the ELM feature space.”).]).
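The ridge term λId that Huang adds to the within-class scatter matrix can be illustrated with a small sketch. The helper names and the toy data below are assumptions for the example and do not come from Huang; the sketch only shows that a singular scatter matrix becomes invertible once the ridge term is added, which is one kind of numeric problem such a term is commonly used to avoid.

```python
# Illustrative sketch only: ridge-regularized within-class scatter matrix.
# All names and data are hypothetical.

def within_class_scatter(H, labels):
    """Sum over samples of (h - class mean)(h - class mean)^T."""
    d = len(H[0])
    means = {}
    for c in set(labels):
        rows = [h for h, y in zip(H, labels) if y == c]
        means[c] = [sum(col) / len(rows) for col in zip(*rows)]
    S = [[0.0] * d for _ in range(d)]
    for h, y in zip(H, labels):
        diff = [hi - mi for hi, mi in zip(h, means[y])]
        for i in range(d):
            for j in range(d):
                S[i][j] += diff[i] * diff[j]
    return S


def add_ridge(S, lam):
    """Return S + lam * I, leaving S itself unchanged."""
    return [[v + (lam if i == j else 0.0) for j, v in enumerate(row)]
            for i, row in enumerate(S)]


def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


# The second feature is exactly twice the first, so the scatter matrix is singular.
H = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
labels = [0, 0, 0]
S_w = within_class_scatter(H, labels)
S_ridge = add_ridge(S_w, 0.1)
```

With these toy values the unregularized determinant is zero and the regularized one is positive, so only the regularized matrix can be inverted in a subsequent LDA step.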
Regarding original Claim 19, Cao teaches
(Original) The system of claim 18.
However, Cao does not teach
wherein each distribution is Gaussian …
Akusok teaches
wherein each distribution is Gaussian (This claim limitation is similar in scope as Claim 5, and hence is rejected under similar rationale.) …
Both Cao and Akusok are analogous art since they both teach calculating output weight values at the output layer using approximate solutions.  
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the calculations for deriving a linear classifier for the DS-ELM taught in Cao and further incorporate the detailed calculations with regularization shown in the best linear unbiased estimator calculation taught in Akusok as a way to improve the linear classifier to derive an optimal classifier. The motivation to combine is taught in Akusok, as extreme learning machines provide fast techniques to initialize and train a neural network, thus improving the training efficiency of a neural network algorithm when training with large sets of data (“big data”), as well as accommodating the regularization in the calculation of the best linear unbiased estimator to offset any negative effects due to over-fitted training data, thus making the system utilizing the trained neural network more resilient and robust to various types of training data (Akusok p.1011 col.2 3rd paragraph-p.1012 col.1 2nd paragraph (Section I. Introduction): “Extreme Learning Machines are well suited for solving Big Data [18] problems because their solution is so rapidly obtained … Extreme Learning Machines also benefit greatly from model structure selection and regularization, which reduces the negative effects of random initialization and over-fitting. The methods include $L_1$ [27], [28] and $L_2$ [29] regularization, as well as other methods [30] like handling imbalance classification [31].”).
However, Cao in view of Akusok does not teach
… shares a same covariance, and does not share a same mean, or
wherein each approximate solution is based on at least one of class centroid statistics and shared covariance matrix statistics.
Huang teaches
… shares a same covariance, and does not share a same mean (This claim limitation is similar in scope as Claim 5, and hence is rejected under similar rationale.), or 
wherein each approximate solution is based on at least one of class centroid statistics and shared covariance matrix statistics (This claim limitation is similar in scope as Claim 6, and hence is rejected under similar rationale.).  
Both Cao in view of Akusok and Huang are analogous art as both describe the usage of extreme learning machines to perform classification and to determine initialization of output weights.
It would have been obvious to a person having ordinary skill in the art before the effective filing date to take the extreme learning machine method of initializing output weights of Cao in view of Akusok and enhance it with the extreme learning machine method of initializing output weights of Huang to perform the initialization of output weights using a linear discriminant method for classified distributions that share a same covariance and have different mean. The motivation to combine is taught in Huang, as a way to leverage the benefits of extreme learning machine (high-efficiency, ease of implementation, capability to handle multi-classification problems) with linear discriminant analysis, with the combination shown in Table 2 of Huang having the added benefit of outperforming other clustering classification methods such as k-means, thus making this combined solution an improvement for solving multi-classification problems (Huang p.2 col.1, 4th paragraph: “The motivation is to take advantage of ELM, and to design clustering algorithms which inherit its salient advantages, such as high efficiency, easiness of implementation and capable of handling multi-class data set.” and Huang p.6 col.1, 6th paragraph; p.6 Table 2: “ELMCIter, ELMCLDA and ELMCKM outperform the baseline methods, k-means and ELM k-means, on most data sets.”).
Claim 17 is rejected under 35 U.S.C. 103 as being unpatentable over Cao et al., A Deep and Stable Extreme Learning Approach for Classification and Regression, Proceedings of ELM-2014 Volume 1, Proceedings in Adaptation, Learning and Optimization 3, DOI: .
Regarding amended Claim 17, Cao teaches
The system of claim 14.
However, Cao does not teach
wherein the approximate solutions are resolved via results of a variant of a linear discriminant analysis algorithm.  
Huang teaches
wherein the approximate solutions are resolved via results of a variant of a linear discriminant analysis algorithm (Examiner’s note: Huang teaches using an extreme learning machine to perform discriminative clustering based on linear discriminant analysis, where the output weights are learned by performing linear discriminant analysis on the hidden layer output (corresponding to “the approximate solutions are resolved via results of a variant of a linear discriminant analysis algorithm”), and where clustering involves grouping of outputs with different mean, with the hidden layer scatter matrices representing the shared covariance (Huang p.4 col.2, Section 4.2 ELM clustering based on LDA: “Inspired by the DisCluster algorithm (Ding & Li, 2007), we extend ELM for discriminative clustering based on LDA. The idea is to perform LDA and k-means in the output space of ELM alternatively. Since the transformation matrix learned by LDA is a linear mapping, it can be absorbed by the output weight matrix of ELM, and we can directly learn the output weight β by performing LDA on the hidden layer output of ELM. … Basically, the hidden layer output matrix H can be viewed as the new data matrix, and its within-class and between-class scatter matrices can be computed similarly as that in standard LDA.”).).  
Both Cao and Huang are analogous art as both describe the usage of extreme learning machines to perform classification and to determine initialization of output weights.
It would have been obvious to a person having ordinary skill in the art before the effective filing date to take the extreme learning machine method of initializing output weights of Cao and enhance it with the extreme learning machine method of initializing output weights of Huang to perform the initialization of output weights using a linear discriminant method for classified distributions that share a same covariance and have different mean. The motivation to combine is taught in Huang, as a way to leverage the benefits of extreme learning machine (high-efficiency, ease of implementation, capability to handle multi-classification problems) with linear discriminant analysis, with the combination shown in Table 2 of Huang having the added benefit of outperforming other clustering classification methods such as k-means, thus making this combined solution an improvement for solving multi-classification problems (Huang p.2 col.1, 4th paragraph: “The motivation is to take advantage of ELM, and to design clustering algorithms which inherit its salient advantages, such as high efficiency, easiness of implementation and capable of handling multi-class data set.” and Huang p.6 col.1, 6th paragraph; p.6 Table 2: “ELMCIter, ELMCLDA and ELMCKM outperform the baseline methods, k-means and ELM k-means, on most data sets.”).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Cheng et al., Revisit Multinominal Logistic Regression in Deep Learning: Data Dependent Model Initialization for Image Recognition, arXiv:1809.06131v1, September 17, 2018.
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to WILLIAM WAI YIN KWAN whose telephone number is (303)297-4332. The examiner can normally be reached Monday-Friday 8:00am - 4:30pm PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li B Zhen can be reached on 571-272-3768. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about 

/WILLIAM WAI YIN KWAN/Examiner, Art Unit 2121



/Li B. Zhen/Supervisory Patent Examiner, Art Unit 2121