DETAILED ACTION
This is the first office action regarding application number 15/945,888, filed April 5, 2018.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

Claim Objections
Claim 4 is objected to because of the following informality: a missing word in the following claim limitation “…deriving an optimal linear classifier, based on results of the approximating, …” in line 3. Appropriate correction is required.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 3, 10-13, and 14-19 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Regarding Claim 3,
The term "results of the initializing are close to the optimal solution" in claim 3 is a relative term which renders the claim indefinite. The term "close to the optimal solution" is not defined by the claim, the specification does not provide a standard for ascertaining the requisite degree, and one of ordinary skill in the art would not be reasonably apprised of the scope of the invention. Paragraph [0054] in the specification states: “One consequence of this novel approach is that it is not constrained to the low learning rate for the parameters in the non-linear feature extraction layers, which are required in conventional approaches so that the randomly initialized parameters in the last layer do not ruin the pre-trained model. Further the results of the initializing are close to the optimal solution to each classification task.”, but the specification fails to further define the boundaries of what is considered “close to the optimal solution”, either expressed in terms of a percentage, or in terms of an absolute amount, or some other measurement of how one skilled in the relevant art would measure the initializing results to be “close to the optimal solution”. The specification describes in exact terms deriving an “optimal linear classifier” for initializing the parameters of an artificial neural network, but the specification fails to define the metes and bounds of the term “close to” with respect to an optimal solution. A person of ordinary skill in the relevant art would be able to reasonably initialize the parameters of an artificial neural network, but one would not be able to measure or determine whether their initialization results qualify as being “close to the optimal solution” according to the claimed invention, due to the indefiniteness of this claim.
Regarding Claim 10, 
Claim 10 recites the limitation "The method of claim 9, wherein the resolving includes:..." in line 1. There is insufficient antecedent basis for this limitation in the claim, since there is no mention of a “resolving” claim limitation found in Claim 9. It is unclear whether this “resolving” is referring to the “determining one or more tasks of the task-specific layer” or the “estimating initializing values for parameters of the task-specific layer by finding an approximate solution…” claim limitations found in Claim 9, or to an unidentified claim limitation. For the purposes of examination, this claim limitation will be interpreted as “The method of claim 9, wherein the [estimating] includes:…”, due to the similarity of the claim language found in claim 4.
Regarding Claim 13, 
The term "in the absence of sufficient training data" in claim 13 is a relative term which renders the claim indefinite. The term "sufficient training data" is not defined by the claim, the specification does not provide a standard for ascertaining the requisite degree, and one of ordinary skill in the art would not be reasonably apprised of the scope of the invention. Paragraph [0091] in the specification states: “Importantly, the present invention avoids the problems of the high variability of covariance matrix estimation in the absence of sufficient training data.”, but the specification fails to further clarify the boundaries or conditions that would result in the input training data for the artificial neural network to be deemed “sufficient”, either expressed as a percentage, or in terms of an absolute amount, or some other measurement of what is considered “sufficient training data”, whether it pertains to the quality of the data, or whether it pertains to the resulting performance of the artificial neural network. A person of ordinary skill in the relevant art would be able to train an artificial neural network with what would be considered to be “sufficient training data”, but given the lack of metes and bounds for the term “sufficient training data”, one would not be able to determine if their amount of “sufficient training data” would satisfy the claimed invention’s criteria of “sufficient training data” for introducing the regularization term according to the claimed invention, due to the indefiniteness of this claim.
Regarding Claim 14, 
Claim 14 recites the limitation "A system comprising: an artificial neural network, comprising: an input level of nodes that receives the set of features..." in line 1. There is insufficient antecedent basis for this limitation in the claim, since Claim 14 is an independent claim with no prior discussion of “a set of features”. For the purposes of examination, this claim limitation will be interpreted as “[a] set of features”.
Claim 14 also recites the limitation “level initializing logic that non-randomly initializes the parameters of the output level by resolving approximate solutions to the last layer, based on data distribution in the feature space.” There is insufficient antecedent basis for the two terms “to the last layer” and “the feature space”, since Claim 14 is an independent claim with no prior discussion of “to a last layer” or “a feature space”. With regards to the term “to the last layer”, it is unclear whether this term is referring to the earlier disclosed hidden level, a last layer in the hidden level (i.e., a hidden level comprising one or more layers), or to the output level itself (as it is the last layer in an artificial neural network). For purposes of examination, this term “to the last layer” will be interpreted as “to the last layer [of the DNN model]” (i.e., the output level). Using this interpretation, it then follows that the term “the feature space” will be interpreted as being the feature space between the hidden level and the last layer in the DNN model (i.e., the output level).
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

The following is a quotation of the first paragraph of pre-AIA 35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.

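Claims 8 and 15-16 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention.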
Regarding Claim 8, the claim recites “The method of claim 1, further comprising initializing the one or more of the hidden layers using estimates and/or solutions from general training models.” Paragraph [0105] states: “In block 640, the values of the parameters of the hidden layers are initialized using estimates and/or solutions from general training models.”, but the specification does not further describe or provide support as to the type of estimates and solutions from general training models the claimed invention is referencing to be used for the hidden layer. The specification must describe and support the claims such that the public is informed of the boundaries of what constitutes infringement of the patent, as well as determining whether the claimed invention meets all the criteria for patentability by distinctly claiming the subject matter which the inventor regards as the invention. See MPEP 2163. Given that there is no support of this limitation present in the specification, this claim limitation fails to comply with the written description requirement. For the purposes of examination, this limitation “using estimates and/or solutions from general training models” will not be given any patentable weight in terms of searching for prior art.
Regarding Claim 15, the claim recites “The system of claim 14, wherein the level initializing logic initializes the parameters of the hidden level using values from general training models.” Paragraph [0105] states: “In block 640, the values of the parameters of the hidden layers are initialized using estimates and/or solutions from general training models.”, but the specification does not further describe or provide support as to the type of estimates and solutions from general training models the claimed invention is referencing to be used for the hidden layer. The specification must describe and support the claims such that the public is informed of the boundaries of what constitutes infringement of the patent, as well as determining whether the claimed invention meets all the criteria for patentability by distinctly claiming the subject matter which the inventor regards as the invention. See MPEP 2163. Given that there is no support of this limitation present in the specification, this claim limitation fails to comply with the written description requirement. For the purposes of examination, this limitation “using values from general training models” will not be given any patentable weight in terms of searching for prior art.
Regarding Claim 16, the claim recites “The system of claim 14, wherein the level initializing logic is a first level initializing logic, wherein the system further comprises a second level initializing logic that initializes the parameters of the hidden level using values from general training models.” The specification does not mention the usage of a first level and second level initialization logic to perform the initializing of parameters of the hidden level. Paragraph [0014] re-states the claim language from Claim 14: “According to still another aspect of the present invention, there is provided a system that includes: an artificial neural network; and level initializing logic. The artificial neural network includes: an input level of nodes that receives the set of features and applies a first non-linear function to the set of features to output a first set of modified values; a hidden level of nodes that receives the first set of modified values and applies an intermediate non-linear function to the first set of modified values to obtain a first set of intermediate modified values; and an output level of nodes that receives the first set of intermediate modified values, and generates a set of output values, the output values being indicative of a pattern relating to the image recognition tasks of the output level. The level initializing logic non-randomly initializes the parameters of the output level by resolving approximate solutions to the last layer, based on data distribution in the feature space.” The “first level initializing logic” in Claim 16 corresponds to the level-initializing logic stated in paragraph [0014] and in Claim 14 (that performs the non-random initializing of parameters of the output level), but there is no mention of a second level initializing logic, or a discussion of level initializing logic to initialize the parameters of the hidden level using values from general training models.
Furthermore, paragraph [0105] states: “In block 640, the values of the parameters of the hidden layers are initialized using estimates and/or solutions from general training models.”, but the specification does not further describe or provide support as to the type of estimates and solutions from general training models the claimed invention is referencing to be used for the hidden level. The specification must describe and support the claims such that the public is informed of the boundaries of what constitutes infringement of the patent, as well as determining whether the claimed invention meets all the criteria for patentability by distinctly claiming the subject matter which the inventor regards as the invention. See MPEP 2163. Given that there is no support of this limitation present in the specification, this claim limitation fails to comply with the written description requirement. For the purposes of examination, the limitations “a second level initializing logic” and “using values from general training models” will not be given any patentable weight in terms of searching for prior art.

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1-4, 8-9, 14-16, and 18-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Akusok et al., High-Performance Extreme Learning Machines: A Complete Toolbox for Big Data Applications, July 17 2015, IEEE Access Volume 3, 2015, pp.1011-1025 [hereafter referred to as Akusok].
Regarding Claim 1, Akusok teaches
A method of training a deep neural network, comprising:
inputting training data into a deep neural network comprising multiple layers that are parameterized by a plurality of parameters, the multiple layers including ([Akusok p.1012 Figure 1; p.1012 col.1 Section II.A. ELM Model: referring to Figure 1, a three-layer extreme learning machine (“a deep neural network comprising multiple layers”), with an input layer receiving data X1, X2, X3 (“inputting training data into a deep neural network”), an output layer, an interconnected hidden layer between the input and output layer, with parameters b, $w_{1,1}, \ldots, w_{3,1}$, and $\beta_{1,1}, \ldots, \beta_{5,1}$ as respective biases and input weights and output weights for the network (“multiple layers that are parameterized by a plurality of parameters”) (“An ELM is a fast training method for SLFN networks (Figure 1). A SLFN has three layers of neurons, but the name Single comes from the only layer of non-linear neurons in the model: the hidden layer. Input layer provides data features and performs no computations, while an output layer is linear without a transformation function and without bias. In the ELM method, input layer weights W and biases b are set randomly and never adjusted (random distribution of the weights is discussed in section III-A). Because the input weights are fixed, the output weights β are independent of them (unlike in Back-propagation [13] training method) and have a direct solution without iteration. For a linear output layer, such solution is also linear and very fast to compute.”).]):
an input layer that receives training data ([Akusok p.1012 Figure 1; p.1012 col.1 Section II.A. ELM Model: referring to Figure 1, a three-layer extreme learning machine, with an input layer receiving data X1, X2, X3 (“an input layer that receives training data”), an output layer, an interconnected hidden layer between the input and output layer, with parameters b, $w_{1,1}, \ldots, w_{3,1}$, and $\beta_{1,1}, \ldots, \beta_{5,1}$ as respective biases and input weights and output weights for the network (“An ELM is a fast training method for SLFN networks (Figure 1). A SLFN has three layers of neurons, but the name Single comes from the only layer of non-linear neurons in the model: the hidden layer. Input layer provides data features and performs no computations, while an output layer is linear without a transformation function and without bias. In the ELM method, input layer weights W and biases b are set randomly and never adjusted (random distribution of the weights is discussed in section III-A). Because the input weights are fixed, the output weights β are independent of them (unlike in Back-propagation [13] training method) and have a direct solution without iteration. For a linear output layer, such solution is also linear and very fast to compute.”).]);
an output layer from which output is generated in a manner consistent with one or more classification tasks ([Akusok p.1012 Figure 1; p.1012 col.1 Section II.A. ELM Model: referring to Figure 1, a three-layer extreme learning machine, with an input layer receiving data X1, X2, X3, an output layer (“an output layer from which output is generated”), an interconnected hidden layer between the input and output layer, with parameters b, $w_{1,1}, \ldots, w_{3,1}$, and $\beta_{1,1}, \ldots, \beta_{5,1}$ as respective biases and input weights and output weights for the network (“An ELM is a fast training method for SLFN networks (Figure 1). A SLFN has three layers of neurons, but the name Single comes from the only layer of non-linear neurons in the model: the hidden layer. Input layer provides data features and performs no computations, while an output layer is linear without a transformation function and without bias. In the ELM method, input layer weights W and biases b are set randomly and never adjusted (random distribution of the weights is discussed in section III-A). Because the input weights are fixed, the output weights β are independent of them (unlike in Back-propagation [13] training method) and have a direct solution without iteration. For a linear output layer, such solution is also linear and very fast to compute.”).] [Akusok p.1011 col.2 2nd paragraph Section I. Introduction: extreme learning machines support classification problems (“output is generated in a manner consistent with one or more classification tasks”) (“ELMs are also easily adapted for classification problems [3]. For multiclass classification, the index of the output node with the highest output indicates the predicted label of input. Then the predicted class is assigned by the maximum output of an ELM. Multi-label classification [17] is handled similarly, but the predicted classes are assigned by all outputs, which are greater than some threshold value”).]); and
at least one hidden layer that is interconnected with the input layer and the output layer, that receives output from the input layer, and that outputs transformed data to a feature space between the at least one hidden layer and the output layer ([Akusok p.1012 Figure 1; p.1012 col.1 Section II.A. ELM Model: referring to Figure 1, a three-layer extreme learning machine, with an input layer receiving data X1, X2, X3, an output layer, an interconnected hidden layer between the input and output layer (“at least one hidden layer that is interconnected with the input layer and the output layer, that receives output from the input layer”), with parameters b, $w_{1,1}, \ldots, w_{3,1}$, and $\beta_{1,1}, \ldots, \beta_{5,1}$ as respective biases and input weights and output weights for the network (“An ELM is a fast training method for SLFN networks (Figure 1). A SLFN has three layers of neurons, but the name Single comes from the only layer of non-linear neurons in the model: the hidden layer. Input layer provides data features and performs no computations, while an output layer is linear without a transformation function and without bias. In the ELM method, input layer weights W and biases b are set randomly and never adjusted (random distribution of the weights is discussed in section III-A). Because the input weights are fixed, the output weights β are independent of them (unlike in Back-propagation [13] training method) and have a direct solution without iteration. For a linear output layer, such solution is also linear and very fast to compute.”).] [Akusok p.1012 col.1-col.2 Section II.A. ELM Model: the output of the hidden layer represents a feature space (“Random input layer weights improve the generalization properties of the solution of a linear output layer, because they produce almost orthogonal (weakly correlated) hidden layer features. The solution of a linear system is always in a span of inputs. If the range of solution weights is limited, orthogonal inputs provide a larger solution space volume with these constrained weights. Small norms of the weights tend to make the system more stable and noise resistant as errors in input will not be amplified in the output of the linear system with smaller coefficients. Thus random hidden layer generates weakly correlated hidden layer features, which allow for a solution with a small norm and a good generalization performance.”).] [Akusok p.1012 col.2 Section II.B. Hidden Neurons: the hidden layer performing a non-linear transformation of the input data into a different representation, with the output of the hidden layer representing a feature space (“at least one hidden layer … that outputs transformed data to a feature space between the at least one hidden layer and the output layer”), and evaluating the feature space to find the output layer weights (“Hidden neurons transform the input data into a different representation. The transformation is done in two steps. First, the data is projected into the hidden layer using the input layer weights and biases. Second, the projected data is transformed. A non-linear transformation function greatly increases the learning capabilities of an ELM, because it is the only place where a non-linear part can be added in ELM method. After transformation, the data in the hidden layer representation h (see Figure 1) is used for finding output layer weights.”).]);
evaluating a distribution of the data in the feature space ([Akusok p.1012 col.1-col.2 Section II.A. ELM Model: the output of the hidden layer represents a feature space (“Random input layer weights improve the generalization properties of the solution of a linear output layer, because they produce almost orthogonal (weakly correlated) hidden layer features. The solution of a linear system is always in a span of inputs. If the range of solution weights is limited, orthogonal inputs provide a larger solution space volume with these constrained weights. Small norms of the weights tend to make the system more stable and noise resistant as errors in input will not be amplified in the output of the linear system with smaller coefficients. Thus random hidden layer generates weakly correlated hidden layer features, which allow for a solution with a small norm and a good generalization performance.”).] [Akusok p.1012 col.2 Section II.B. Hidden Neurons: the hidden layer performing a non-linear transformation of the input data into a different representation (“a distribution of the data”), with the output of the hidden layer representing a feature space, and evaluating the feature space to find the output layer weights (“evaluating a distribution of the data in the feature space”) (“Hidden neurons transform the input data into a different representation. The transformation is done in two steps. First, the data is projected into the hidden layer using the input layer weights and biases. Second, the projected data is transformed. A non-linear transformation function greatly increases the learning capabilities of an ELM, because it is the only place where a non-linear part can be added in ELM method. After transformation, the data in the hidden layer representation h (see Figure 1) is used for finding output layer weights.”).]); and
initializing, non-randomly, the parameters of the output layer based on the evaluated distribution of the data in the feature space ([Akusok p.1012 col.1-col.2 Section II.A. ELM Model: the output of the hidden layer represents a feature space (“Random input layer weights improve the generalization properties of the solution of a linear output layer, because they produce almost orthogonal (weakly correlated) hidden layer features. The solution of a linear system is always in a span of inputs. If the range of solution weights is limited, orthogonal inputs provide a larger solution space volume with these constrained weights. Small norms of the weights tend to make the system more stable and noise resistant as errors in input will not be amplified in the output of the linear system with smaller coefficients. Thus random hidden layer generates weakly correlated hidden layer features, which allow for a solution with a small norm and a good generalization performance.”).] [Akusok p.1012 col.2 Section II.B. Hidden Neurons: the hidden layer performing a non-linear transformation of the input data into a different representation, with the output of the hidden layer representing a feature space, and evaluating the feature space to find the output layer weights (“initializing, non-randomly, the parameters of the output layer based on the evaluated distribution of the data in the feature space”) (“Hidden neurons transform the input data into a different representation. The transformation is done in two steps. First, the data is projected into the hidden layer using the input layer weights and biases. Second, the projected data is transformed. A non-linear transformation function greatly increases the learning capabilities of an ELM, because it is the only place where a non-linear part can be added in ELM method. After transformation, the data in the hidden layer representation h (see Figure 1) is used for finding output layer weights.”).]).
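For illustration only, the ELM procedure described in the passages of Akusok cited above (random input weights and biases that are never adjusted, a non-linear hidden layer producing the representation h, and output weights β obtained by a direct linear solution) can be sketched as follows. This is a minimal sketch and is not code from Akusok or from the application: the function names `elm_fit` and `elm_predict`, the tanh activation, the small ridge term, the hidden-layer size, and the toy data are all assumptions chosen for the example.

```python
import numpy as np


def elm_fit(X, T, n_hidden=20, ridge=1e-3, seed=0):
    """Fit a single-hidden-layer ELM and return (W, b, beta).

    All names and default values here are hypothetical, chosen for illustration.
    """
    rng = np.random.default_rng(seed)
    # Input weights W and biases b are drawn randomly and never adjusted.
    W = rng.standard_normal((X.shape[1], n_hidden))
    b = rng.standard_normal(n_hidden)
    # Hidden-layer representation h: projection, then a non-linear transform
    # (tanh is an assumed choice of activation).
    H = np.tanh(X @ W + b)
    # Output weights: direct, non-iterative least-squares solution.
    # The ridge term is an assumption added to keep the system well conditioned.
    A = H.T @ H + ridge * np.eye(n_hidden)
    beta = np.linalg.solve(A, H.T @ T)
    return W, b, beta


def elm_predict(X, W, b, beta):
    """Predicted class is the index of the output node with the highest output."""
    H = np.tanh(X @ W + b)
    return np.argmax(H @ beta, axis=1)


# Toy data made up for this example: two features, two classes.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
labels = np.array([0, 1, 1, 0])
T = np.eye(2)[labels]  # one-hot targets, one output node per class

W, b, beta = elm_fit(X, T)
pred = elm_predict(X, W, b, beta)
```

In this sketch the output-layer parameters are computed from the hidden-layer features H rather than drawn at random, which is the aspect of the ELM model relied upon in the mapping above.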
Regarding Claim 2, Akusok teaches
The method of claim 1, wherein the initializing the parameters comprises
estimating parameter values of the output layer by finding an approximate solution to each classification task ([Akusok p.1014 col.2 Section III.B. ELM Solution with Best Linear Unbiased Estimator: calculating an estimate of output weights 𝛃 (shown in Eq.13) using the best linear unbiased estimator (shown in Eq.11), which provides the optimal least-squares solution, involving estimations based on the feature space (represented by H) and the output target (represented by T), as shown in Eq.12 (“estimating parameter values of the output layer by finding an approximate solution to each classification task”) (“The best linear unbiased estimator gives the optimal least squares solution to the matrix X𝛃 = T for stochastic vectors x and t combined into the corresponding matrices. It uses two theoretical correlation matrices $E[x^T x] = C_{xx}$, $E[x^T t] = C_{xt}$ (10) which are assumed to be known. The best linear unbiased estimator of T, denoted by Y, is then $Y = C_{xx}^{-1} C_{xt} X = \beta X$. (11) The inverse of $C_{xx}$ exists because x is a stochastic variable for which $C_{xx} = E[x^T x]$ has a full rank. The ELM problem has a finite amount of projected data samples H and corresponding targets T, so the correlation matrices are replaced by their estimations $C_{xx} \approx H^T H = \Omega_h$, $C_{xt} \approx H^T T = \Omega_t$ (12), and the ELM output weights are computed from those estimates $\beta = (H^T H)^{-1} H^T T = \Omega_h^{-1} \Omega_t$ (13).”).]).
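For illustration only, the computation described in Akusok Eq.12 and Eq.13 (estimating the correlation matrices from H and T, then solving for the output weights) can be sketched as follows. The function name elm_output_weights, the synthetic data, and the use of a linear solve in place of an explicit matrix inverse are assumptions made for this sketch; the sketch also assumes that H^T H has full rank.

```python
import numpy as np

def elm_output_weights(H, T):
    """Least-squares output weights following the form of Eq. 12 and Eq. 13."""
    omega_h = H.T @ H   # estimate of C_xx (Eq. 12)
    omega_t = H.T @ T   # estimate of C_xt (Eq. 12)
    # Eq. 13: beta = omega_h^{-1} omega_t, computed with a linear solve
    # rather than forming the inverse explicitly (an assumed design choice).
    return np.linalg.solve(omega_h, omega_t)

# Hypothetical sizes: 50 samples, 5 hidden features, 2 output targets.
rng = np.random.default_rng(0)
H = rng.normal(size=(50, 5))
beta_true = rng.normal(size=(5, 2))
T = H @ beta_true                    # noise-free targets for the check below
beta = elm_output_weights(H, T)
```

With noise-free targets generated from a known weight matrix, the computed weights recover that matrix up to floating-point error, which is the sense in which the estimator gives the least-squares solution.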
Regarding Claim 3, Akusok teaches
The method of claim 1, wherein results of the initializing are close to the optimal solution to each classification task ([Akusok p.1014 col.2 Section III.B. ELM Solution with Best Linear Unbiased Estimator: calculating an estimate of output weights 𝛃 (shown in Eq.13) (“results of the initializing”) using the best linear unbiased estimator (shown in Eq.11), which provides the optimal least-squares solution (“close to the optimal solution”), involving estimations based on the feature space (represented by H) and the output target (represented by T) (“to each classification task”), as shown in Eq.12 (“The best linear unbiased estimator gives the optimal least squares solution to the matrix X𝛃 = T for stochastic vectors x and t combined into the corresponding matrices. It uses two theoretical correlation matrices $E[x^T x] = C_{xx}$, $E[x^T t] = C_{xt}$ (10) which are assumed to be known. The best linear unbiased estimator of T, denoted by Y, is then $Y = C_{xx}^{-1} C_{xt} X = \beta X$. (11) The inverse of $C_{xx}$ exists because x is a stochastic variable for which $C_{xx} = E[x^T x]$ has a full rank. The ELM problem has a finite amount of projected data samples H and corresponding targets T, so the correlation matrices are replaced by their estimations $C_{xx} \approx H^T H = \Omega_h$, $C_{xt} \approx H^T T = \Omega_t$ (12), and the ELM output weights are computed from those estimates $\beta = (H^T H)^{-1} H^T T = \Omega_h^{-1} \Omega_t$ (13).”).]).
Regarding Claim 4, Akusok teaches
The method of claim 1, wherein the initializing the parameters comprises:
approximating a distribution of features for each classification task ([Akusok p.1014 col.2 Section III.B. ELM Solution with Best Linear Unbiased Estimator: calculating an estimate of output weights 𝛃 (shown in Eq.13) using the best linear unbiased estimator (shown in Eq.11), which provides the optimal least-squares solution, involving estimations based on the feature space (represented by H) and the output target (represented by T), as shown in Eq.12 (“approximating a distribution of features for each classification task”) (“The best linear unbiased estimator gives the optimal least squares solution to the matrix X𝛃 = T for stochastic vectors x and t combined into the corresponding matrices. It uses two theoretical correlation matrices $E[x^T x] = C_{xx}$, $E[x^T t] = C_{xt}$ (10) which are assumed to be known. The best linear unbiased estimator of T, denoted by Y, is then $Y = C_{xx}^{-1} C_{xt} X = \beta X$. (11) The inverse of $C_{xx}$ exists because x is a stochastic variable for which $C_{xx} = E[x^T x]$ has a full rank. The ELM problem has a finite amount of projected data samples H and corresponding targets T, so the correlation matrices are replaced by their estimations $C_{xx} \approx H^T H = \Omega_h$, $C_{xt} \approx H^T T = \Omega_t$ (12), and the ELM output weights are computed from those estimates $\beta = (H^T H)^{-1} H^T T = \Omega_h^{-1} \Omega_t$ (13).”).]); and 
deriving an optimal linear classifier, based on results of the approximating, the optimal linear classifier being usable to initialize the parameters of the output layer of the DNN model ([Akusok p.1014 col.2 Section III.B. ELM Solution with Best Linear Unbiased Estimator: calculating an estimate of output weights 𝛃 (shown in Eq.13) (“initialize the parameters of the output layer of the DNN model”) using the best linear unbiased estimator (shown in Eq.11), which provides the optimal least-squares solution (“deriving an optimal linear classifier”), involving estimations based on the feature space (represented by H) and the output target (represented by T), as shown in Eq.12 (“based on results of the approximating, the optimal linear classifier being usable to initialize the parameters of the output layer of the DNN model”) (“The best linear unbiased estimator gives the optimal least squares solution to the matrix X𝛃 = T for stochastic vectors x and t combined into the corresponding matrices. It uses two theoretical correlation matrices $E[x^T x] = C_{xx}$, $E[x^T t] = C_{xt}$ (10)
                     which are assumed to be known. The best linear unbiased estimator of T, denoted by Y, is then                         
                            Y
                            =
                            
                                
                                    C
                                
                                
                                    x
                                    x
                                
                                
                                    -
                                    1
                                
                            
                            
                                
                                    C
                                
                                
                                    x
                                    t
                                
                                
                            
                            X
                            =
                            β
                            X
                            .
                             
                            (
                            11
                            )
                        
                     The inverse of Cxx exists because x is a stochastic variable for which                         
                            
                                
                                    C
                                
                                
                                    x
                                    x
                                
                            
                            =
                             
                            E
                            
                                
                                    
                                        
                                            x
                                        
                                        
                                            T
                                        
                                    
                                    x
                                
                            
                        
                     has a full rank. The ELM problem has a finite amount of projected data samples H and corresponding targets T, so the correlation matrices are replaced by their estimations                         
                            
                                
                                    C
                                
                                
                                    x
                                    x
                                
                            
                            ≈
                            
                                
                                    
                                        
                                            H
                                        
                                        
                                            T
                                        
                                    
                                    H
                                    =
                                    Ω
                                
                                
                                    h
                                
                            
                            ,
                             
                            
                                
                                    C
                                
                                
                                    x
                                    t
                                
                            
                            ≈
                            
                                
                                    
                                        
                                            H
                                        
                                        
                                            T
                                        
                                    
                                    H
                                    =
                                    Ω
                                
                                
                                    t
                                
                            
                             
                            (
                            12
                            )
                        
                    , and the ELM output weights are computed from those estimates                         
                            β
                            =
                             
                            
                                
                                    
                                        
                                            
                                                
                                                    H
                                                
                                                
                                                    T
                                                
                                            
                                            H
                                        
                                    
                                
                                
                                    -
                                    1
                                
                            
                            
                                
                                    
                                        
                                            H
                                        
                                        
                                            T
                                        
                                    
                                    T
                                
                            
                            =
                             
                            
                                
                                    Ω
                                
                                
                                    h
                                
                                
                                    -
                                    1
                                
                            
                            
                                
                                    Ω
                                
                                
                                    t
                                
                                
                            
                             
                            (
                            13
                            )
                        
                    .”).]).  
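For illustration only, the computation quoted in Eqs. 12-13 can be sketched in Python with NumPy. This is a minimal sketch and not code from Akusok or from the application: the function name `elm_output_weights`, the toy data, the layer sizes, and the tanh hidden layer are all assumptions.

```python
import numpy as np

def elm_output_weights(H, T):
    """Solve (H^T H) beta = H^T T for beta, as in Eq. 13.

    Assumes H^T H has full rank, so the linear system has a unique solution.
    """
    omega_h = H.T @ H  # sample estimate of C_xx (Eq. 12)
    omega_t = H.T @ T  # sample estimate of C_xt (Eq. 12)
    return np.linalg.solve(omega_h, omega_t)

# Hypothetical toy problem: 50 samples, 3 inputs, 5 hidden neurons, 2 outputs.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
W = rng.normal(size=(3, 5))   # random input weights (assumed distribution)
b = rng.normal(size=5)        # random biases (assumed distribution)
H = np.tanh(X @ W + b)        # hidden-layer features (tanh is an assumption)
T = rng.normal(size=(50, 2))  # placeholder targets
beta = elm_output_weights(H, T)
```

Solving the linear system directly, rather than forming the explicit inverse, is a numerical design choice of this sketch; the result is the same least-squares solution that Eq. 13 expresses.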
Regarding Claim 8, Akusok teaches 
The method of claim 1, further comprising
initializing the one or more of the hidden layers ([Akusok p.1012 Figure 1; p.1012 col.1 Section II.A. ELM Model: randomly setting the weights and biases between the input layer and hidden layer (“initializing the one or more of the hidden layers”) (“In the ELM method, input layer weights W and biases b are set randomly and never adjusted”).]).
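For illustration only, the random hidden-layer initialization described in the cited passage could look like the following Python sketch. The function names, the standard-normal distribution, the fixed seed, and the tanh transformation are assumptions; the quoted text states only that the weights and biases are set randomly and never adjusted.

```python
import numpy as np

def init_hidden_layer(n_inputs, n_hidden, seed=0):
    """Draw input weights W and biases b once; they are not trained afterwards."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n_inputs, n_hidden))  # assumed standard-normal draws
    b = rng.normal(size=n_hidden)
    return W, b

def hidden_features(X, W, b):
    """Project the inputs into the hidden layer, then apply a non-linear function."""
    return np.tanh(X @ W + b)  # tanh is an assumed transformation function

# Sizes chosen to mirror Figure 1 as described above: 3 inputs, 5 hidden neurons.
W, b = init_hidden_layer(n_inputs=3, n_hidden=5)
X = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, -1.0],
              [0.5, 0.5, 0.5],
              [2.0, -2.0, 0.0]])
H = hidden_features(X, W, b)  # one row of hidden-layer features per sample
```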
Regarding Claim 9, Akusok teaches
A method of computing initializing parameters of a task-specific layer of a deep neural network comprising: 
a task-specific layer from which output is generated in a manner consistent with one or more classification tasks ([Akusok p.1012 Figure 1; p.1012 col.1 Section II.A. ELM Model: referring to Figure 1, a three-layer extreme learning machine, with an input layer receiving data X1, X2, X3, an output layer (“a task-specific layer from which output is generated”), an interconnected hidden layer between the input and output layer with parameters b,                         
                            
                                
$w_{1,1}$, …, $w_{3,1}$, and $\beta_{1,1}$ … $\beta_{5,1}$
                     as respective biases and input weights and output weights for the network (“An ELM is a fast training method for SLFN networks (Figure 1). A SLFN has three layers of neurons, but the name Single comes from the only layer of non-linear neurons in the model: the hidden layer. Input layer provides data features and performs no computations, while an output layer is linear without a transformation function and without bias. In the ELM method, input layer weights W and biases b are set randomly and never adjusted (random distribution of the weights is discussed in section III-A). Because the input weights are fixed, the output weights β are independent of them (unlike in Back-propagation [13] training method) and have a direct solution without iteration. For a linear output layer, such solution is also linear and very fast to compute.”).] [Akusok p.1011 col.2 2nd paragraph Section I. Introduction: extreme learning machines support classification problems (“output is generated in a manner consistent with one or more classification tasks”) (“ELMs are also easily adapted for classification problems [3]. For multiclass classification, the index of the output node with the highest output indicates the predicted label of input. Then the predicted class is assigned by the maximum output of an ELM. Multi-label classification [17] is handled similarly, but the predicted classes are assigned by all outputs, which are greater than some threshold value”).]); and 
at least one hidden layer that is connected to the output layer and that outputs transformed data to a feature space between the at least one hidden layer and the task-specific layer ([Akusok p.1012 Figure 1; p.1012 col.1 Section II.A. ELM Model: referring to Figure 1, a three-layer extreme learning machine, with an input layer receiving data X1, X2, X3, an output layer, an interconnected hidden layer between the input and output layer (“at least one hidden layer that is connected to the output layer”) with parameters b,                         
                            
                                
$w_{1,1}$, …, $w_{3,1}$, and $\beta_{1,1}$ … $\beta_{5,1}$
                     as respective biases and input weights and output weights for the network (“An ELM is a fast training method for SLFN networks (Figure 1). A SLFN has three layers of neurons, but the name Single comes from the only layer of non-linear neurons in the model: the hidden layer. Input layer provides data features and performs no computations, while an output layer is linear without a transformation function and without bias. In the ELM method, input layer weights W and biases b are set randomly and never adjusted (random distribution of the weights is discussed in section III-A). Because the input weights are fixed, the output weights β are independent of them (unlike in Back-propagation [13] training method) and have a direct solution without iteration. For a linear output layer, such solution is also linear and very fast to compute.”).] [Akusok p.1012 col.1-col.2 Section II.A. ELM Model: the output of the hidden layer represents a feature space between the hidden layer and the output layer (“task-specific layer”) (“Random input layer weights improve the generalization properties of the solution of a linear output layer, because they produce almost orthogonal (weakly correlated) hidden layer features. The solution of a linear system is always in a span of inputs. If the range of solution weights is limited, orthogonal inputs provide a larger solution space volume with these constrained weights. Small norms of the weights tend to make the system more stable and noise resistant as errors in input will not be amplified in the output of the linear system with smaller coefficients. Thus random hidden layer generates weakly correlated hidden layer features, which allow for a solution with a small norm and a good generalization performance.”).] [Akusok p.1012 col.2 Section II.B. 
Hidden Neurons: the hidden layer performing a non-linear transformation of the input data into a different representation, with the output of the hidden layer representing a feature space (“at least one hidden layer … that outputs transformed data to a feature space between the at least one hidden layer and the task-specific layer”), and evaluating the feature space to find the output layer weights (“Hidden neurons transform the input data into a different representation. The transformation is done in two steps. First, the data is projected into the hidden layer using the input layer weights and biases. Second, the projected data is transformed. A non-linear transformation function greatly increases the learning capabilities of an ELM, because it is the only place where a non-linear part can be added in ELM method. After transformation, the data in the hidden layer representation h (see Figure 1) is used for finding output layer weights.”).]), 
the method comprising:
determining one or more tasks of the task-specific layer ([Akusok p.1012 col.2 Section II.B. Hidden Neurons: the hidden layer performing a non-linear transformation of the input data into a different representation, with the output of the hidden layer representing a feature space (“determining one or more tasks of the task-specific layer”), and evaluating the feature space to find the output layer weights (“Hidden neurons transform the input data into a different representation. The transformation is done in two steps. First, the data is projected into the hidden layer using the input layer weights and biases. Second, the projected data is transformed. A non-linear transformation function greatly increases the learning capabilities of an ELM, because it is the only place where a non-linear part can be added in ELM method. After transformation, the data in the hidden layer representation h (see Figure 1) is used for finding output layer weights.”).]); and 
estimating initializing values for parameters of the task-specific layer by finding an approximate solution to each of the one or more classification tasks, based on the data distribution in the feature space ([Akusok p.1012 col.2 Section II.B. Hidden Neurons: the hidden layer performing a non-linear transformation of the input data into a different representation, with the output of the hidden layer representing a feature space (“the data distribution in the feature space”), and evaluating the feature space to find the output layer weights (“estimating initializing values for parameters of the task-specific layer … based on the data distribution in the feature space”) (“Hidden neurons transform the input data into a different representation. The transformation is done in two steps. First, the data is projected into the hidden layer using the input layer weights and biases. Second, the projected data is transformed. A non-linear transformation function greatly increases the learning capabilities of an ELM, because it is the only place where a non-linear part can be added in ELM method. After transformation, the data in the hidden layer representation h (see Figure 1) is used for finding output layer weights.”).] [Akusok p.1014 col.2 Section III.B. ELM Solution with Best Linear Unbiased Estimator: calculating an estimate of output weights 𝛃 (shown in Eq.13) using the best linear unbiased estimator (shown in Eq.11), which provides the optimal least-squares solution, involving estimations based on the feature space (represented by H) and the output target (represented by T), as shown in Eq.12 (“estimating initializing values for parameters of the task-specific layer by finding an approximate solution to each classification task, based on the data distribution in the feature space”) (“The best linear unbiased estimator gives the optimal least squares solution to the matrix X𝛃 = T for stochastic vectors x and t combined into the corresponding matrices. It uses two theoretical correlation matrices                         
$E[x^T x] = C_{xx}, \quad E[x^T t] = C_{xt}$ (10) which are assumed to be known. The best linear unbiased estimator of T, denoted by Y, is then $Y = C_{xx}^{-1} C_{xt} X = \beta X.$ (11) The inverse of $C_{xx}$ exists because x is a stochastic variable for which $C_{xx} = E[x^T x]$ has a full rank. The ELM problem has a finite amount of projected data samples H and corresponding targets T, so the correlation matrices are replaced by their estimations $C_{xx} \approx H^T H = \Omega_h, \quad C_{xt} \approx H^T T = \Omega_t$ (12), and the ELM output weights are computed from those estimates $\beta = (H^T H)^{-1} H^T T = \Omega_h^{-1} \Omega_t$ (13).”).]).  
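For illustration only, the label assignment described in the quoted Akusok introduction (maximum output for multiclass classification, thresholded outputs for multi-label classification) can be sketched in Python as follows. The function names, the example output matrix, and the threshold value of 0.5 are assumptions and do not come from the reference.

```python
import numpy as np

def predict_multiclass(Y):
    """Assign each sample the index of the output node with the highest output."""
    return np.argmax(Y, axis=1)

def predict_multilabel(Y, threshold=0.5):
    """Assign each sample every class whose output exceeds the threshold."""
    return Y > threshold

# Hypothetical network outputs for two samples and three classes.
Y = np.array([[0.1, 0.7, 0.2],
              [0.9, 0.3, 0.6]])
labels = predict_multiclass(Y)  # class index per sample
multi = predict_multilabel(Y)   # boolean class-membership mask per sample
```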
Regarding Claim 14, Akusok teaches
A system comprising:
an artificial neural network, comprising ([Akusok p.1012 Figure 1; p.1012 col.1 Section II.A. ELM Model: referring to Figure 1, a three-layer extreme learning machine (“an artificial neural network”), with an input layer receiving data X1, X2, X3, an output layer, an interconnected hidden layer between the input and output layer, with parameters b,                         
                            
                                
$w_{1,1}$, …, $w_{3,1}$, and $\beta_{1,1}$ … $\beta_{5,1}$
                     as respective biases and input weights and output weights for the network (“An ELM is a fast training method for SLFN networks (Figure 1). A SLFN has three layers of neurons, but the name Single comes from the only layer of non-linear neurons in the model: the hidden layer. Input layer provides data features and performs no computations, while an output layer is linear without a transformation function and without bias. In the ELM method, input layer weights W and biases b are set randomly and never adjusted (random distribution of the weights is discussed in section III-A). Because the input weights are fixed, the output weights β are independent of them (unlike in Back-propagation [13] training method) and have a direct solution without iteration. For a linear output layer, such solution is also linear and very fast to compute.”).]): 
an input level of nodes that receives the set of features and applies a first non-linear function to [a] set of features to output a first set of modified values ([Akusok p.1012 Figure 1; p.1012 col.1 Section II.A. ELM Model: referring to Figure 1, a three-layer extreme learning machine, with an input layer receiving data X1, X2, X3 (“an input level of nodes that receives [a] set of features”), an output layer, an interconnected hidden layer between the input and output layer, with parameters b,                         
                            
                                
$w_{1,1}, \ldots, w_{3,1}$, and $\beta_{1,1}, \ldots, \beta_{5,1}$
                     as respective biases and input weights and output weights for the network; the input level of nodes performs no computations, which is interpreted to use a non-linear step or threshold activation function to generate its output to the hidden layer (“applies a first non-linear function to the set of features to output a first set of modified values”) (“An ELM is a fast training method for SLFN networks (Figure 1). A SLFN has three layers of neurons, but the name Single comes from the only layer of non-linear neurons in the model: the hidden layer. Input layer provides data features and performs no computations, while an output layer is linear without a transformation function and without bias. In the ELM method, input layer weights W and biases b are set randomly and never adjusted (random distribution of the weights is discussed in section III-A). Because the input weights are fixed, the output weights β are independent of them (unlike in Back-propagation [13] training method) and have a direct solution without iteration. For a linear output layer, such solution is also linear and very fast to compute.”).]); 
a hidden level of nodes that receives the first set of modified values and applies an intermediate non-linear function to the first set of modified values to obtain a first set of intermediate modified values ([Akusok p.1012 Figure 1; p.1012 col.1 Section II.A. ELM Model: referring to Figure 1, a three-layer extreme learning machine, with an input layer receiving data X1, X2, X3, an output layer, an interconnected hidden layer between the input and output layer, with parameters b,                         
                            
                                
$w_{1,1}, \ldots, w_{3,1}$, and $\beta_{1,1}, \ldots, \beta_{5,1}$
                     as respective biases and input weights and output weights for the network (“a hidden level of nodes that receives the first set of modified values”) (“An ELM is a fast training method for SLFN networks (Figure 1). A SLFN has three layers of neurons, but the name Single comes from the only layer of non-linear neurons in the model: the hidden layer. Input layer provides data features and performs no computations, while an output layer is linear without a transformation function and without bias. In the ELM method, input layer weights W and biases b are set randomly and never adjusted (random distribution of the weights is discussed in section III-A). Because the input weights are fixed, the output weights β are independent of them (unlike in Back-propagation [13] training method) and have a direct solution without iteration. For a linear output layer, such solution is also linear and very fast to compute.”).] [Akusok p.1012 col.2 Section II.B. Hidden Neurons: the hidden layer performing a non-linear transformation of the input data into a different representation (“applies an intermediate non-linear function to the first set of modified values to obtain a first set of intermediate modified values”), with the output of the hidden layer representing a feature space, and evaluating the feature space to find the output layer weights (“Hidden neurons transform the input data into a different representation. The transformation is done in two steps. First, the data is projected into the hidden layer using the input layer weights and biases. Second, the projected data is transformed. A non-linear transformation function greatly increases the learning capabilities of an ELM, because it is the only place where a non-linear part can be added in ELM method. After transformation, the data in the hidden layer representation h (see Figure 1) is used for finding output layer weights.”).]); 
an output level of nodes that receives the first set of intermediate modified values, and generates a set of output values, the output values being indicative of a pattern relating to a classification task of the output level ([Akusok p.1012 Figure 1; p.1012 col.1 Section II.A. ELM Model: referring to Figure 1, a three-layer extreme learning machine, with an input layer receiving data X1, X2, X3, an output layer (“an output level of nodes that receives the first set of intermediate modified values, and generates a set of output values”), an interconnected hidden layer between the input and output layer, with parameters b,                         
                            
                                
$w_{1,1}, \ldots, w_{3,1}$, and $\beta_{1,1}, \ldots, \beta_{5,1}$
                     as respective biases and input weights and output weights for the network (“An ELM is a fast training method for SLFN networks (Figure 1). A SLFN has three layers of neurons, but the name Single comes from the only layer of non-linear neurons in the model: the hidden layer. Input layer provides data features and performs no computations, while an output layer is linear without a transformation function and without bias. In the ELM method, input layer weights W and biases b are set randomly and never adjusted (random distribution of the weights is discussed in section III-A). Because the input weights are fixed, the output weights β are independent of them (unlike in Back-propagation [13] training method) and have a direct solution without iteration. For a linear output layer, such solution is also linear and very fast to compute.”).] [Akusok p.1011 col.2 2nd paragraph, Section I. Introduction: extreme learning machines support classification problems (“output values being indicative of a pattern relating to a classification task of the output level”) (“ELMs are also easily adapted for classification problems [3]. For multiclass classification, the index of the output node with the highest output indicates the predicted label of input. Then the predicted class is assigned by the maximum output of an ELM. Multi-label classification [17] is handled similarly, but the predicted classes are assigned by all outputs, which are greater than some threshold value”).]); and 
level initializing logic that non-randomly initializes the parameters of the output level by resolving approximate solutions to the last layer [of the DNN model], based on data distribution in the feature space ([Akusok p.1012 col.1-col.2 Section II.A. ELM Model: the output of the hidden layer represents a feature space (“Random input layer weights improve the generalization properties of the solution of a linear output layer, because they produce almost orthogonal (weakly correlated) hidden layer features. The solution of a linear system is always in a span of inputs. If the range of solution weights is limited, orthogonal inputs provide a larger solution space volume with these constrained weights. Small norms of the weights tend to make the system more stable and noise resistant as errors in input will not be amplified in the output of the linear system with smaller coefficients. Thus random hidden layer generates weakly correlated hidden layer features, which allow for a solution with a small norm and a good generalization performance.”).] [Akusok p.1012 col.2 Section II.B. Hidden Neurons: the hidden layer performing a non-linear transformation of the input data into a different representation, with the output of the hidden layer representing a feature space, and evaluating the feature space to find the output layer weights (“level initializing logic that non-randomly initializes the parameters of the output level … based on data distribution in the feature space”) (“Hidden neurons transform the input data into a different representation. The transformation is done in two steps. First, the data is projected into the hidden layer using the input layer weights and biases. Second, the projected data is transformed. A non-linear transformation function greatly increases the learning capabilities of an ELM, because it is the only place where a non-linear part can be added in ELM method. 
After transformation, the data in the hidden layer representation h (see Figure 1) is used for finding output layer weights.”).] [Akusok p.1014 col.2 Section III.B. ELM Solution with Best Linear Unbiased Estimator: calculating an estimate of output weights 𝛃 (shown in Eq.13) using the best linear unbiased estimator (shown in Eq.11), which provides the optimal least-squares solution, involving estimations based on the feature space (represented by H) and the output target (represented by T), as shown in Eq.12 (“resolving approximate solutions to the last layer [of the DNN model]”) (“The best linear unbiased estimator gives the optimal least squares solution to the matrix X𝛃 = T for stochastic vectors x and t combined into the corresponding matrices. It uses two theoretical correlation matrices $E[x^{T}x] = C_{xx},\ E[x^{T}t] = C_{xt}$ (10) which are assumed to be known. The best linear unbiased estimator of T, denoted by Y, is then $Y = C_{xx}^{-1} C_{xt} X = \beta X$. (11) The inverse of $C_{xx}$ exists because x is a stochastic variable for which $C_{xx} = E[x^{T}x]$ has a full rank. The ELM problem has a finite amount of projected data samples H and corresponding targets T, so the correlation matrices are replaced by their estimations $C_{xx} \approx H^{T}H = \Omega_{h},\ C_{xt} \approx H^{T}T = \Omega_{t}$ (12), and the ELM output weights are computed from those estimates $\beta = (H^{T}H)^{-1} H^{T}T = \Omega_{h}^{-1}\Omega_{t}$ (13).”).]).  
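For illustration only, the computation described in the Akusok passages cited above (input weights and biases set randomly and never adjusted, a non-linear hidden layer producing H, and output weights obtained as in Eq. 13) can be sketched in plain Python. The tanh transformation, the layer sizes, the synthetic two-class data, and every function name below are assumptions made for this sketch and are not taken from Akusok or from the application; a small Gauss-Jordan routine stands in for whatever linear solver an actual implementation might use.

```python
import math
import random


def hidden_features(X, W, b):
    """Project each sample with fixed weights W and biases b, then apply tanh."""
    return [
        [math.tanh(sum(x_i * w_ij for x_i, w_ij in zip(x, w_j)) + b_j)
         for w_j, b_j in zip(W, b)]
        for x in X
    ]


def solve(A, B):
    """Solve A @ Z = B by Gauss-Jordan elimination with partial pivoting.

    Assumes A is square and full rank (a zero pivot raises ZeroDivisionError).
    """
    n = len(A)
    M = [row_a[:] + row_b[:] for row_a, row_b in zip(A, B)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    return [row[n:] for row in M]


def elm_output_weights(H, T):
    """Return beta = (H^T H)^{-1} H^T T, the form quoted as Eq. 13."""
    n_hidden, n_out = len(H[0]), len(T[0])
    omega_h = [[sum(h[i] * h[j] for h in H) for j in range(n_hidden)]
               for i in range(n_hidden)]
    omega_t = [[sum(h[i] * t[k] for h, t in zip(H, T)) for k in range(n_out)]
               for i in range(n_hidden)]
    return solve(omega_h, omega_t)


def predict(H, beta):
    """Linear output layer; the predicted class is the index of the largest output."""
    n_out = len(beta[0])
    outputs = [[sum(h_i * row[k] for h_i, row in zip(h, beta)) for k in range(n_out)]
               for h in H]
    return [max(range(n_out), key=o.__getitem__) for o in outputs]


rng = random.Random(0)            # fixed seed so the sketch is repeatable
n_inputs, n_hidden = 2, 5         # illustrative sizes (assumption)

# Synthetic two-class data (assumption): label is the side of the line x0 + x1 = 0.
X = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(40)]
labels = [0 if x[0] + x[1] < 0 else 1 for x in X]
T = [[1.0 if k == y else 0.0 for k in range(2)] for y in labels]  # one-hot targets

# Input weights and biases are drawn once at random and never adjusted.
W = [[rng.uniform(-2, 2) for _ in range(n_inputs)] for _ in range(n_hidden)]
b = [rng.uniform(-2, 2) for _ in range(n_hidden)]

H = hidden_features(X, W, b)      # hidden layer representation (feature space)
beta = elm_output_weights(H, T)   # direct, non-iterative solution for output weights
labels_pred = predict(H, beta)
accuracy = sum(p == y for p, y in zip(labels_pred, labels)) / len(labels)
print(f"training accuracy: {accuracy:.2f}")
```

In this sketch the only trained quantity is beta, which is computed in one step from H and T rather than by iterative updates, consistent with the quoted statement that the output weights "have a direct solution without iteration."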
Regarding Claim 15, Akusok teaches
The system of claim 14, wherein the level initializing logic initializes the parameters of the hidden level ([Akusok p.1012 Figure 1; p.1012 col.1 Section II.A. ELM Model: randomly setting the weights and biases between the input layer and hidden layer (“level initializing logic initializes the parameters of the hidden level”) (“In the ELM method, input layer weights W and biases b are set randomly and never adjusted”).]).  
Regarding Claim 16, Akusok teaches
The system of claim 14, wherein the level initializing logic is a first level initializing logic ([Akusok p.1012 col.2 Section II.B. Hidden Neurons: the hidden layer performing a non-linear transformation of the input data into a different representation, with the output of the hidden layer representing a feature space, and evaluating the feature space to find the output layer weights (“the level initializing logic is a first level initializing logic”) (“Hidden neurons transform the input data into a different representation. The transformation is done in two steps. First, the data is projected into the hidden layer using the input layer weights and biases. Second, the projected data is transformed. A non-linear transformation function greatly increases the learning capabilities of an ELM, because it is the only place where a non-linear part can be added in ELM method. After transformation, the data in the hidden layer representation h (see Figure 1) is used for finding output layer weights.”).] [Akusok p.1014 col.2 Section III.B. ELM Solution with Best Linear Unbiased Estimator: calculating an estimate of output weights 𝛃 (shown in Eq.13) using the best linear unbiased estimator (shown in Eq.11), which provides the optimal least-squares solution, involving estimations based on the feature space (represented by H) and the output target (represented by T), as shown in Eq.12 (“The best linear unbiased estimator gives the optimal least squares solution to the matrix X𝛃 = T for stochastic vectors x and t combined into the corresponding matrices. It uses two theoretical correlation matrices                         
$E[x^{T}x] = C_{xx},\ E[x^{T}t] = C_{xt}$ (10) which are assumed to be known. The best linear unbiased estimator of T, denoted by Y, is then $Y = C_{xx}^{-1} C_{xt} X = \beta X$. (11) The inverse of $C_{xx}$ exists because x is a stochastic variable for which $C_{xx} = E[x^{T}x]$ has a full rank. The ELM problem has a finite amount of projected data samples H and corresponding targets T, so the correlation matrices are replaced by their estimations $C_{xx} \approx H^{T}H = \Omega_{h},\ C_{xt} \approx H^{T}T = \Omega_{t}$ (12), and the ELM output weights are computed from those estimates $\beta = (H^{T}H)^{-1} H^{T}T = \Omega_{h}^{-1}\Omega_{t}$ (13).”).]), 
wherein the system further comprises a ([Akusok p.1012 Figure 1; p.1012 col.1 Section II.A. ELM Model: randomly setting the weights and biases between the input layer and hidden layer (“level initializing logic that initializes the parameters of the hidden level”) (“In the ELM method, input layer weights W and biases b are set randomly and never adjusted”).]).
Regarding Claim 18, Akusok teaches
The system of claim 14, wherein the output level initializing logic estimates parameter values of the output level by: 
finding an approximate solution to each classification task ([Akusok p.1014 col.2 Section III.B. ELM Solution with Best Linear Unbiased Estimator: calculating an estimate of output weights 𝛃 (shown in Eq.13) using the best linear unbiased estimator (shown in Eq.11), which provides the optimal least-squares solution, involving estimations based on the feature space (represented by H) and the output target (represented by T), as shown in Eq.12 (“finding an approximate solution to each classification task”) (“The best linear unbiased estimator gives the optimal least squares solution to the matrix X𝛃 = T for stochastic vectors x and t combined into the corresponding matrices. It uses two theoretical correlation matrices $E[x^T x] = C_{xx}$, $E[x^T t] = C_{xt}$ (10) which are assumed to be known. The best linear unbiased estimator of T, denoted by Y, is then $Y = C_{xx}^{-1} C_{xt} X = \beta X$. (11) The inverse of $C_{xx}$ exists because x is a stochastic variable for which $C_{xx} = E[x^T x]$ has a full rank. The ELM problem has a finite amount of projected data samples H and corresponding targets T, so the correlation matrices are replaced by their estimations $C_{xx} \approx H^T H = \Omega_h$, $C_{xt} \approx H^T T = \Omega_t$ (12), and the ELM output weights are computed from those estimates $\beta = (H^T H)^{-1} H^T T = \Omega_h^{-1} \Omega_t$ (13).”).]); 
approximating a distribution of features for each classification task ([Akusok p.1014 col.2 Section III.B. ELM Solution with Best Linear Unbiased Estimator: calculating an estimate of output weights 𝛃 (shown in Eq.13) using the best linear unbiased estimator (shown in Eq.11), which provides the optimal least-squares solution, involving estimations based on the feature space (represented by H) (“approximating a distribution of features for each classification task”) and the output target (represented by T), as shown in Eq.12 (“The best linear unbiased estimator gives the optimal least squares solution to the matrix X𝛃 = T for stochastic vectors x and t combined into the corresponding matrices. It uses two theoretical correlation matrices $E[x^T x] = C_{xx}$, $E[x^T t] = C_{xt}$ (10) which are assumed to be known. The best linear unbiased estimator of T, denoted by Y, is then $Y = C_{xx}^{-1} C_{xt} X = \beta X$. (11) The inverse of $C_{xx}$ exists because x is a stochastic variable for which $C_{xx} = E[x^T x]$ has a full rank. The ELM problem has a finite amount of projected data samples H and corresponding targets T, so the correlation matrices are replaced by their estimations $C_{xx} \approx H^T H = \Omega_h$, $C_{xt} \approx H^T T = \Omega_t$ (12), and the ELM output weights are computed from those estimates $\beta = (H^T H)^{-1} H^T T = \Omega_h^{-1} \Omega_t$ (13).”).]); and 
deriving an optimal linear classifier, based on results of the approximating, the optimal linear classifier being usable to initialize the parameters of the output layer of the DNN model ([Akusok p.1014 col.2 Section III.B. ELM Solution with Best Linear Unbiased Estimator: calculating an estimate of output weights 𝛃 (shown in Eq.13) (“initialize the parameters of the output layer of the DNN model”) using the best linear unbiased estimator (shown in Eq.11), which provides the optimal least-squares solution (“deriving an optimal linear classifier”), involving estimations based on the feature space (represented by H) and the output target (represented by T), as shown in Eq.12 (“based on results of the approximating, the optimal linear classifier being usable to initialize the parameters of the output layer of the DNN model”) (“The best linear unbiased estimator gives the optimal least squares solution to the matrix X𝛃 = T for stochastic vectors x and t combined into the corresponding matrices. It uses two theoretical correlation matrices $E[x^T x] = C_{xx}$, $E[x^T t] = C_{xt}$ (10) which are assumed to be known. The best linear unbiased estimator of T, denoted by Y, is then $Y = C_{xx}^{-1} C_{xt} X = \beta X$. (11) The inverse of $C_{xx}$ exists because x is a stochastic variable for which $C_{xx} = E[x^T x]$ has a full rank. The ELM problem has a finite amount of projected data samples H and corresponding targets T, so the correlation matrices are replaced by their estimations $C_{xx} \approx H^T H = \Omega_h$, $C_{xt} \approx H^T T = \Omega_t$ (12), and the ELM output weights are computed from those estimates $\beta = (H^T H)^{-1} H^T T = \Omega_h^{-1} \Omega_t$ (13).”).]).  
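The computation quoted from Akusok Eq.12 and Eq.13 can be sketched numerically as follows. This is a minimal illustration only, assuming NumPy; the function name `elm_output_weights`, the data shapes, and the random inputs are hypothetical and are taken neither from Akusok nor from the application.

```python
import numpy as np

def elm_output_weights(H, T):
    """Estimate output weights beta from hidden-layer outputs H and targets T.

    Follows the quoted Eq.12 and Eq.13: Omega_h = H^T H, Omega_t = H^T T,
    and beta = Omega_h^{-1} Omega_t.
    """
    omega_h = H.T @ H  # estimate of the correlation matrix C_xx
    omega_t = H.T @ T  # estimate of the correlation matrix C_xt
    # Solve Omega_h @ beta = Omega_t instead of forming the inverse explicitly.
    return np.linalg.solve(omega_h, omega_t)

# Hypothetical data: 200 samples, 5 inputs, 10 hidden neurons, 3 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
W = rng.normal(size=(5, 10))   # input weights, set randomly and never adjusted
b = rng.normal(size=10)        # biases, set randomly and never adjusted
H = np.tanh(X @ W + b)         # hidden-layer outputs
T = np.eye(3)[rng.integers(0, 3, size=200)]  # one-hot class targets

beta = elm_output_weights(H, T)
```

Here `beta` has one column of output weights per class, and it satisfies the normal equations `H.T @ H @ beta = H.T @ T` up to floating-point error.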
Regarding Claim 19,
The system of claim 18, 
wherein each distribution is Gaussian, shares a same covariance, and does not share a same mean (This claim limitation is similar in scope as Claim 5, and hence is rejected under similar rationale.), or 
wherein each approximate solution is based on at least one of class centroid statistics and shared covariance matrix statistics (This claim limitation is similar in scope as Claim 6, and hence is rejected under similar rationale.).  
Regarding Claim 20, Akusok teaches
A system comprising
one or more computing devices ([Akusok p.1020 col.2-p.1021 col.1 Section V.D. Performance on Large Datasets: extreme learning machine running the ELM toolbox on a workstation (“one or more computing devices”), with a workstation being a computing device that has storage for storing the ELM toolbox program, to perform the classification of large datasets (“Large datasets are classified with the toolbox on a workstation with 4-core 4GHz CPU and GTX Titan Black GPU. Additional experiments show runtime comparison with a cluster node having two 8-core 2.6GHz CPUs, and with a Macbook Air laptop having a 2-core 1.4GHz CPU. Dataset is split into training and test sets, stored in HDF5 format. They are processed by HPELM toolbox class on both CPU (up to 4096 hidden neurons) and GPU (up to 19,000 hidden neurons, limited by the GPU memory).”).]) and 
one or more storage devices storing instructions that are operable ([Akusok p.1020 col.2-p.1021 col.1 Section V.D. Performance on Large Datasets: extreme learning machine running the ELM toolbox (“instructions”) on a workstation, with a workstation being a computing device that has storage for storing the ELM toolbox program (“one or more storage devices storing instructions that are operable”), to perform the classification of large datasets (“Large datasets are classified with the toolbox on a workstation with 4-core 4GHz CPU and GTX Titan Black GPU. Additional experiments show runtime comparison with a cluster node having two 8-core 2.6GHz CPUs, and with a Macbook Air laptop having a 2-core 1.4GHz CPU. Dataset is split into training and test sets, stored in HDF5 format. They are processed by HPELM toolbox class on both CPU (up to 4096 hidden neurons) and GPU (up to 19,000 hidden neurons, limited by the GPU memory).”).]), 
when executed by the one or more computing devices, to cause the one or more computing devices to perform the method of claim 9 (This claim limitation is similar in scope to Claim 9, and hence is rejected under a similar rationale.).  
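For illustration only, the output-weight computation quoted from Akusok Eq. 13, β = (HᵀH)⁻¹HᵀT, can be sketched as follows. This is a minimal NumPy sketch and not the ELM toolbox's actual code; the function name `elm_output_weights`, the tanh hidden layer, and the synthetic data are assumptions made for the example.

```python
import numpy as np

def elm_output_weights(H, T):
    """Solve the normal equations of Akusok Eq. 13 for the output weights.

    H: (n_samples, n_hidden) hidden-layer outputs.
    T: (n_samples, n_outputs) targets.
    """
    omega_h = H.T @ H  # estimate of C_xx (Eq. 12)
    omega_t = H.T @ T  # estimate of C_xt (Eq. 12)
    # Solving the linear system avoids forming the explicit inverse.
    return np.linalg.solve(omega_h, omega_t)

# Hypothetical data: a random, fixed hidden layer with tanh neurons.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
W = rng.normal(size=(5, 20))   # random input weights (not trained)
b = rng.normal(size=20)        # random biases (not trained)
H = np.tanh(X @ W + b)

beta_true = rng.normal(size=(20, 3))
T = H @ beta_true              # noise-free targets, so beta is recoverable
beta = elm_output_weights(H, T)
```

Only the output weights are solved for; the hidden-layer weights stay at their random values, which is the property the rejection relies on when mapping β to the last-layer parameters.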

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 5-6, 10-13, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Akusok et al., High-Performance Extreme Learning Machines: A Complete Toolbox for Big Data Applications, July 17 2015, IEEE Access Volume 3, 2015, pp.1011-1025 [hereafter referred to as Akusok] as applied to Claim 4, Claim 9, and Claim 14, in view of Huang et al., Discriminative clustering via extreme learning machine, June 19 2015, Elsevier, Neural Networks 70 (2015), pp.1-8 [hereafter referred to as Huang].
Regarding Claim 5, Akusok as applied to Claim 4 teaches
The method of claim 4, 
wherein each distribution is Gaussian ([Akusok p.1013 col.1 1st paragraph Section II.B. Hidden Neurons: extreme learning machine hidden layer neurons supporting various non-linear transformation functions (“The hidden layer is not constrained to have only one type of transformation function in neurons. Different functions can be used (sigmoid, hyperbolic tangent, threshold, etc.)”).] [Akusok p.1013 col.1 2nd paragraph Section II.B. Hidden Neurons: “Another type of neurons commonly present in ELMs is the Radial Basis Function (RBF) neurons [32]. They use distances to centroids as inputs to the hidden layer, instead of a linear projections. The non-linear projection function is applied as usual. ELMs with RBF neurons compute predictions based on similar training data samples, which helps solving tasks with a complex dependency between data features and targets. Any function (norm) of distances between samples and centroids can be used, for instance L2, L1 or L∞ norms.”] [Akusok p.1018, Section IV M. How to Use Gaussian (RBF) Neurons: radial basis function neurons exhibit Gaussian behavior (“each distribution is Gaussian”) (“The ELM toolbox has Gaussian neurons. Centroids are given instead of a projection matrix W and kernel widths in a bias vector b. There are three kinds of distance functions: L2 (Euclidean), L1 and L∞.”).]).
However, Akusok does not teach
[each distribution] … shares a same covariance, and does not share a same mean.
Huang teaches
[each distribution] … shares a same covariance, and does not share a same mean ([Huang p.4 col.2 Section 4.2 ELM clustering based on LDA: using an extreme learning machine to perform discriminative clustering based on linear discriminant analysis, where the output weights are learned by performing linear discriminant analysis on the hidden layer output, and where clustering involves grouping of outputs with different mean (“[each distribution] … does not share a same mean”), with the hidden layer scatter matrices representing the shared covariance (“[each distribution] … shares a same covariance”) (“Inspired by the DisCluster algorithm (Ding & Li, 2007), we extend ELM for discriminative clustering based on LDA. The idea is to perform LDA and k-means in the output space of ELM alternatively. Since the transformation matrix learned by LDA is a linear mapping, it can be absorbed by the output weight matrix of ELM, and we can directly learn the output weight β by performing LDA on the hidden layer output of ELM. … Basically, the hidden layer output matrix H can be viewed as the new data matrix, and its within-class and between-class scatter matrices can be computed similarly as that in standard LDA.”).]).
Both Akusok and Huang are analogous art as both describe the usage of extreme learning machines to perform classification and to determine initialization of output weights.
([Huang p.2 col.1, 4th paragraph: “The motivation is to take advantage of ELM, and to design clustering algorithms which inherit its salient advantages, such as high efficiency, easiness of implementation and capable of handling multi-class data set.”] [Huang p.6 col.1, 6th paragraph; p.6 Table 2: “ELMCIter, ELMCLDA and ELMCKM outperform the baseline methods, k-means and ELM k-means, on most data sets.”]).
Regarding Claim 6, Akusok as applied to Claim 4 teaches
The method of claim 4.
However, Akusok does not teach
wherein the approximating is based on at least one of class centroid statistics and shared covariance matrix statistics.  
Huang teaches
wherein the approximating is based on at least one of class centroid statistics and shared covariance matrix statistics ([Huang p.4 col.2 Section 4.2 ELM clustering based on LDA: using an extreme learning machine to perform discriminative clustering based on linear discriminant analysis, where the output weights are learned by performing linear discriminant analysis on the hidden layer output, and where clustering involves grouping of outputs with different mean (“at least one of class centroid statistics”), with the hidden layer scatter matrices representing the shared covariance (“shared covariance matrix statistics”) (“Inspired by the DisCluster algorithm (Ding & Li, 2007), we extend ELM for discriminative clustering based on LDA. The idea is to perform LDA and k-means in the output space of ELM alternatively. Since the transformation matrix learned by LDA is a linear mapping, it can be absorbed by the output weight matrix of ELM, and we can directly learn the output weight β by performing LDA on the hidden layer output of ELM. … Basically, the hidden layer output matrix H can be viewed as the new data matrix, and its within-class and between-class scatter matrices can be computed similarly as that in standard LDA.”).]).  
Both Akusok and Huang are analogous art as both describe the usage of extreme learning machines to perform classification and to determine initialization of output weights.
It would have been obvious to a person having ordinary skill in the art before the effective filing date to take the extreme learning machine method of initializing output weights of Akusok and enhance it with the extreme learning machine method of initializing output weights of Huang to perform the initialization of output weights using a linear discriminant method for classified distributions that share a same covariance and have different means. The motivation to combine is taught in Huang, as a way to leverage the benefits of extreme learning machine (high efficiency, ease of implementation, capability to handle multi-classification problems) with linear discriminant analysis, with the combination shown in Table 2 of Huang having the added benefit of outperforming other clustering classification methods such as k-means, thus making this combined solution an improvement for solving multi-classification problems ([Huang p.2 col.1, 4th paragraph: “The motivation is to take advantage of ELM, and to design clustering algorithms which inherit its salient advantages, such as high efficiency, easiness of implementation and capable of handling multi-class data set.”] [Huang p.6 col.1, 6th paragraph; p.6 Table 2: “ELMCIter, ELMCLDA and ELMCKM outperform the baseline methods, k-means and ELM k-means, on most data sets.”]).
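For illustration only, a linear classifier derived from class centroid statistics and a shared covariance matrix (standard linear discriminant analysis under Gaussian class distributions with different means) can be sketched as below. This is a textbook construction under stated assumptions, not code from Huang, Akusok, or the application; the function name `lda_linear_classifier` and the synthetic two-class data are hypothetical.

```python
import numpy as np

def lda_linear_classifier(features, labels):
    """Return (weights, bias) of the LDA discriminant for each class.

    Assumes every class is Gaussian with its own mean (centroid) and a
    covariance matrix shared by all classes.
    """
    classes = np.unique(labels)
    centroids = np.stack([features[labels == c].mean(axis=0) for c in classes])
    # Pooled within-class covariance: center each sample on its class centroid.
    centered = features - centroids[np.searchsorted(classes, labels)]
    shared_cov = centered.T @ centered / (len(features) - len(classes))
    priors = np.array([(labels == c).mean() for c in classes])
    # Column k of weights is shared_cov^{-1} @ centroid_k.
    weights = np.linalg.solve(shared_cov, centroids.T)
    bias = -0.5 * np.sum(centroids.T * weights, axis=0) + np.log(priors)
    return weights, bias

# Hypothetical features: two classes with different means, same covariance.
rng = np.random.default_rng(0)
class0 = rng.normal(loc=[-2.0, 0.0], scale=1.0, size=(100, 2))
class1 = rng.normal(loc=[2.0, 0.0], scale=1.0, size=(100, 2))
features = np.vstack([class0, class1])
labels = np.repeat([0, 1], 100)

weights, bias = lda_linear_classifier(features, labels)
predictions = np.argmax(features @ weights + bias, axis=1)
accuracy = (predictions == labels).mean()
```

The resulting `weights` and `bias` have the shape of a final linear layer, which is why such a closed-form classifier can serve as an initialization of output weights in the sense discussed in the combination rationale.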
Regarding Claim 10, Akusok as applied to Claim 9 teaches
The method of claim 9, wherein the [estimating] includes:
approximating a distribution of the features for each class of data ([Akusok p.1014 col.2 III.B ELM Solution with Best Linear Unbiased Estimator: calculating an estimate of output weights 𝛃 (shown in Eq.13) using the best linear unbiased estimator (shown in Eq.11), which provides the optimal least-squares solution, involving estimations based on the feature space (represented by H) and the output target (represented by T), as shown in Eq.12 (“approximating a distribution of the features for each class of data”) (“The best linear unbiased estimator gives the optimal least squares solution to the matrix X𝛃 = T for stochastic vectors x and t combined into the corresponding matrices. It uses two theoretical correlation matrices $E[x^T x] = C_{xx}$, $E[x^T t] = C_{xt}$ (10) which are assumed to be known. The best linear unbiased estimator of T, denoted by Y, is then $Y = C_{xx}^{-1} C_{xt} X = \beta X$. (11) The inverse of Cxx exists because x is a stochastic variable for which $C_{xx} = E[x^T x]$ has a full rank. The ELM problem has a finite amount of projected data samples H and corresponding targets T, so the correlation matrices are replaced by their estimations $C_{xx} \approx H^T H = \Omega_h$, $C_{xt} \approx H^T T = \Omega_t$ (12), and the ELM output weights are computed from those estimates $\beta = (H^T H)^{-1} H^T T = \Omega_h^{-1} \Omega_t$ (13).”).]), 
the distributions having Gaussian distributions … ([Akusok p.1013 col.1 1st paragraph II.B. Hidden Neurons: extreme learning machine hidden layer neurons supporting various non-linear transformation functions (“The hidden layer is not constrained to have only one type of transformation function in neurons. Different functions can be used (sigmoid, hyperbolic tangent, threshold, etc.)”).] [Akusok p.1013 col.1 2nd paragraph II.B. Hidden Neurons: “Another type of neurons commonly present in ELMs is the Radial Basis Function (RBF) neurons [32]. They use distances to centroids as inputs to the hidden layer, instead of a linear projections. The non-linear projection function is applied as usual. ELMs with RBF neurons compute predictions based on similar training data samples, which helps solving tasks with a complex dependency between data features and targets. Any function (norm) of distances between samples and centroids can be used, for instance L2, L1 or L∞ norms.”] [Akusok p.1018 col.2 IV.M. How to Use Gaussian (RBF) Neurons: radial basis function neurons exhibit Gaussian behavior (“the distributions having Gaussian distributions”) (“The ELM toolbox has Gaussian neurons. Centroids are given instead of a projection matrix W and kernel widths in a bias vector b. There are three kinds of distance functions: L2 (Euclidean), L1 and L∞.”).]); 
deriving a linear classifier based on the distribution ([Akusok p.1014 col.2 III.B ELM solution with Best Linear Unbiased Estimator: calculating an estimate of output weights 𝛃 (shown in Eq.13) using the best linear unbiased estimator (shown in Eq.11), which provides the optimal least-squares solution, involving estimations based on the feature space (represented by H) and the output target (represented by T), as shown in Eq.12 (“deriving a linear classifier based on the distribution”) (“The best linear unbiased estimator gives the optimal least squares solution to the matrix X𝛃 = T for stochastic vectors x and t combined into the corresponding matrices. It uses two theoretical correlation matrices $E[x^T x] = C_{xx}$, $E[x^T t] = C_{xt}$ (10) which are assumed to be known. The best linear unbiased estimator of T, denoted by Y, is then $Y = C_{xx}^{-1} C_{xt} X = \beta X$. (11) The inverse of Cxx exists because x is a stochastic variable for which $C_{xx} = E[x^T x]$ has a full rank. The ELM problem has a finite amount of projected data samples H and corresponding targets T, so the correlation matrices are replaced by their estimations $C_{xx} \approx H^T H = \Omega_h$, $C_{xt} \approx H^T T = \Omega_t$ (12), and the ELM output weights are computed from those estimates $\beta = (H^T H)^{-1} H^T T = \Omega_h^{-1} \Omega_t$ (13).”).]); and 
calculating initializing parameters of the last layer of the DNN model using the derived linear classifier ([Akusok p.1014 col.2 III.B ELM solution with Best Linear Unbiased Estimator: calculating an estimate of output weights 𝛃 (shown in Eq.13) (“calculating initializing parameters of the last layer of the DNN model”) using the best linear unbiased estimator (shown in Eq.11) (“using the derived linear classifier”), which provides the optimal least-squares solution, involving estimations based on the feature space (represented by H) and the output target (represented by T), as shown in Eq.12 (“The best linear unbiased estimator gives the optimal least squares solution to the matrix X𝛃 = T for stochastic vectors x and t combined into the corresponding matrices. It uses two theoretical correlation matrices $E[x^T x] = C_{xx}$, $E[x^T t] = C_{xt}$ (10) which are assumed to be known. The best linear unbiased estimator of T, denoted by Y, is then                         
                            Y
                            =
                            
                                
                                    C
                                
                                
                                    x
                                    x
                                
                                
                                    -
                                    1
                                
                            
                            
                                
                                    C
                                
                                
                                    x
                                    t
                                
                                
                            
                            X
                            =
                            β
                            X
                            .
                             
                            (
                            11
                            )
                        
                     The inverse of Cxx exists because x is a stochastic variable for which                         
                            
                                
                                    C
                                
                                
                                    x
                                    x
                                
                            
                            =
                             
                            E
                            
                                
                                    
                                        
                                            x
                                        
                                        
                                            T
                                        
                                    
                                    x
                                
                            
                        
                     has a full rank. The ELM problem has a finite amount of projected data samples H and corresponding targets T, so the correlation matrices are replaced by their estimations                         
                            
                                
                                    C
                                
                                
                                    x
                                    x
                                
                            
                            ≈
                            
                                
                                    
                                        
                                            H
                                        
                                        
                                            T
                                        
                                    
                                    H
                                    =
                                    Ω
                                
                                
                                    h
                                
                            
                            ,
                             
                            
                                
                                    C
                                
                                
                                    x
                                    t
                                
                            
                            ≈
                            
                                
                                    
                                        
                                            H
                                        
                                        
                                            T
                                        
                                    
                                    H
                                    =
                                    Ω
                                
                                
                                    t
                                
                            
                             
                            (
                            12
                            )
                        
                    , and the ELM output weights are computed from those estimates                         
                            β
                            =
                             
                            
                                
                                    
                                        
                                            
                                                
                                                    H
                                                
                                                
                                                    T
                                                
                                            
                                            H
                                        
                                    
                                
                                
                                    -
                                    1
                                
                            
                            
                                
                                    
                                        
                                            H
                                        
                                        
                                            T
                                        
                                    
                                    T
                                
                            
                            =
                             
                            
                                
                                    Ω
                                
                                
                                    h
                                
                                
                                    -
                                    1
                                
                            
                            
                                
                                    Ω
                                
                                
                                    t
                                
                                
                            
                             
                            (
                            13
                            )
                        
                    .”).]).
However, Akusok does not teach
… and a shared covariance…
Huang teaches
… and a shared covariance… ([Huang p.4 col.2 Section 4.2 ELM clustering based on LDA: using an extreme learning machine to perform discriminative clustering based on linear discriminant analysis, where the output weights are learned by performing linear discriminant analysis on the hidden layer output, and where clustering involves grouping of outputs with different mean, with the hidden layer scatter matrices representing the shared covariance (“Inspired by the DisCluster algorithm (Ding & Li, 2007), we extend ELM for discriminative clustering based on LDA. The idea is to perform LDA and k-means in the output space of ELM alternatively. Since the transformation matrix learned by LDA is a linear mapping, it can be absorbed by the output weight matrix of ELM, and we can directly learn the output weight β by performing LDA on the hidden layer output of ELM. … Basically, the hidden layer output matrix H can be viewed as the new data matrix, and its within-class and between-class scatter matrices can be computed similarly as that in standard LDA.”).])
Both Akusok and Huang are analogous art as both describe the usage of extreme learning machines to perform classification and to determine initialization of output weights.
It would have been obvious to a person having ordinary skill in the art before the effective filing date to take the extreme learning machine method of initializing output weights of Akusok and enhance it with the extreme learning machine method of initializing output weights of Huang, so as to perform the initialization of output weights using a linear discriminant method for classified distributions that share the same covariance and have different means. The motivation to combine is taught in Huang as a way to leverage the benefits of the extreme learning machine (high efficiency, ease of implementation, capability to handle multi-class problems) together with linear discriminant analysis. Table 2 of Huang shows the added benefit that the combination outperforms other clustering methods such as k-means, making the combined solution an improvement for solving multi-class problems ([Huang p.2 col.1, 4th paragraph: “The motivation is to take advantage of ELM, and to design clustering algorithms which inherit its salient advantages, such as high efficiency, easiness of implementation and capable of handling multi-class data set.”] [Huang p.6 col.1, 6th paragraph; p.6 Table 2: “ELMCIter, ELMCLDA and ELMCKM outperform the baseline methods, k-means and ELM k-means, on most data sets.”]).
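For illustration only, a linear discriminant classifier for class distributions that have different means and one shared covariance can be sketched in Python with NumPy as below. This is the textbook Gaussian LDA form, not necessarily the procedure of Huang or of the application; the function name `lda_linear_classifier`, the toy data, and the use of class priors in the bias term are assumptions introduced here.

```python
import numpy as np

def lda_linear_classifier(F, y, n_classes):
    """Textbook LDA: per-class means plus one pooled (shared) covariance.

    Returns weights W (d x K) and biases b (K,) so that the predicted class
    is argmax over k of F @ W[:, k] + b[k].
    """
    n, d = F.shape
    means = np.stack([F[y == k].mean(axis=0) for k in range(n_classes)])
    centered = F - means[y]                          # subtract each sample's class mean
    sigma = centered.T @ centered / (n - n_classes)  # pooled covariance estimate
    priors = np.bincount(y, minlength=n_classes) / n
    W = np.linalg.solve(sigma, means.T)              # column k is Sigma^{-1} mu_k
    b = -0.5 * np.sum(means.T * W, axis=0) + np.log(priors)
    return W, b

# Toy features (hypothetical): two classes, same covariance, different means.
rng = np.random.default_rng(0)
F = np.vstack([rng.normal(size=(100, 2)) + [3.0, 0.0],
               rng.normal(size=(100, 2)) + [-3.0, 0.0]])
y = np.repeat([0, 1], 100)

W, b = lda_linear_classifier(F, y, n_classes=2)
pred = np.argmax(F @ W + b, axis=1)
```

In such a sketch, `W` and `b` would play the role of the initial last-layer parameters, with `F` standing in for the hidden-layer (feature space) output.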
Regarding Claim 11, Akusok in view of Huang teaches
The method of claim 10, 
wherein the linear classifier is an optimal solution ([Akusok p.1014 col.2 III.B. ELM Solution with Best Linear Unbiased Estimator: calculating an estimate of output weights 𝛃 (shown in Eq.13) using the best linear unbiased estimator (shown in Eq.11), which provides the optimal least-squares solution (“the linear classifier is an optimal solution”), involving estimations based on the feature space (represented by H) and the output target (represented by T), as shown in Eq.12 (“The best linear unbiased estimator gives the optimal least squares solution to the matrix X𝛃 = T for stochastic vectors x and t combined into the corresponding matrices. It uses two theoretical correlation matrices $E[x^{\top} x] = C_{xx}$, $E[x^{\top} t] = C_{xt}$ (10) which are assumed to be known. The best linear unbiased estimator of T, denoted by Y, is then $Y = C_{xx}^{-1} C_{xt} X = \beta X$. (11) The inverse of $C_{xx}$ exists because x is a stochastic variable for which $C_{xx} = E[x^{\top} x]$ has a full rank. The ELM problem has a finite amount of projected data samples H and corresponding targets T, so the correlation matrices are replaced by their estimations $C_{xx} \approx H^{\top} H = \Omega_h$, $C_{xt} \approx H^{\top} T = \Omega_t$ (12), and the ELM output weights are computed from those estimates $\beta = (H^{\top} H)^{-1} H^{\top} T = \Omega_h^{-1} \Omega_t$ (13).”).]).
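For illustration only, the computation quoted from Akusok Eq. 12 and Eq. 13 can be sketched in Python with NumPy as below. The function name `elm_output_weights`, the toy data, the array shapes, and the tanh hidden layer are assumptions introduced here and do not appear in Akusok or in the application.

```python
import numpy as np

def elm_output_weights(H, T):
    """Least-squares output weights: beta = (H^T H)^{-1} H^T T (Eq. 13)."""
    omega_h = H.T @ H    # estimate of C_xx (Eq. 12)
    omega_t = H.T @ T    # estimate of C_xt (Eq. 12)
    # Solve the linear system instead of forming an explicit inverse.
    return np.linalg.solve(omega_h, omega_t)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))    # toy inputs (hypothetical)
W = rng.normal(size=(5, 20))     # random, fixed input-layer weights
bias = rng.normal(size=20)       # random, fixed hidden-layer biases
H = np.tanh(X @ W + bias)        # hidden-layer output (feature space)
T = rng.normal(size=(200, 3))    # toy targets (hypothetical)

beta = elm_output_weights(H, T)
```

The sketch assumes that H^T H is invertible, matching the full-rank condition stated in the quoted passage.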
Regarding Claim 12, Akusok in view of Huang teaches
The method of claim 10, 
wherein the determining is based on how data is distributed in the feature space ([Akusok p.1013 col.1 1st paragraph II.B. Hidden Neurons: extreme learning machine hidden layer neurons supporting various non-linear transformation functions (“The hidden layer is not constrained to have only one type of transformation function in neurons. Different functions can be used (sigmoid, hyperbolic tangent, threshold, etc.)”).] [Akusok p.1013 col.1 2nd paragraph II.B. Hidden Neurons: “Another type of neurons commonly present in ELMs is the Radial Basis Function (RBF) neurons [32]. They use distances to centroids as inputs to the hidden layer, instead of a linear projection. The non-linear projection function is applied as usual. ELMs with RBF neurons compute predictions based on similar training data samples, which helps solving tasks with a complex dependency between data features and targets. Any function (norm) of distances between samples and centroids can be used, for instance L2, L1 or L∞ norms.”] [Akusok p.1018 col.2 IV.M. How to Use Gaussian (RBF) Neurons: radial basis function neurons exhibit Gaussian behavior (“the distributions having Gaussian distributions”) (“The ELM toolbox has Gaussian neurons. Centroids are given instead of a projection matrix W and kernel widths in a bias vector b. There are three kinds of distance functions: L2 (Euclidean), L1 and L∞.”).] [Akusok p.1012 col.2 II.B. Hidden Neurons: the hidden layer performing a non-linear transformation of the input data into a different representation, with the output of the hidden layer representing a feature space between the hidden layer and output layer (“determining one or more tasks of the task-specific layer”; “the determining is based on how data is distributed in the feature space”), and evaluating the feature space to find the output layer weights (“Hidden neurons transform the input data into a different representation. The transformation is done in two steps. First, the data is projected into the hidden layer using the input layer weights and biases. Second, the projected data is transformed. A non-linear transformation function greatly increases the learning capabilities of an ELM, because it is the only place where a non-linear part can be added in ELM method. After transformation, the data in the hidden layer representation h (see Figure 1) is used for finding output layer weights.”).]).
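For illustration only, a hidden layer of Gaussian (RBF) neurons that takes distances to centroids as input can be sketched in Python with NumPy as below. The function name `rbf_hidden_layer`, the activation form exp(-(d/width)^2), the three distance options (Euclidean, Manhattan, Chebyshev), and the toy data are assumptions introduced here; the sketch is not the Akusok toolbox implementation.

```python
import numpy as np

def rbf_hidden_layer(X, centroids, widths, norm="l2"):
    """Map samples X (n x d) to RBF activations (n x m) for m centroids."""
    diff = X[:, None, :] - centroids[None, :, :]   # pairwise differences, (n, m, d)
    if norm == "l2":
        dist = np.sqrt((diff ** 2).sum(axis=2))    # Euclidean distance
    elif norm == "l1":
        dist = np.abs(diff).sum(axis=2)            # Manhattan distance
    elif norm == "linf":
        dist = np.abs(diff).max(axis=2)            # Chebyshev distance
    else:
        raise ValueError(f"unknown norm: {norm}")
    return np.exp(-(dist / widths) ** 2)           # assumed Gaussian activation

X = np.array([[0.0, 0.0], [3.0, 4.0]])             # toy samples (hypothetical)
centroids = np.array([[0.0, 0.0], [3.0, 4.0]])     # one centroid per neuron
widths = np.array([1.0, 1.0])                      # one kernel width per neuron

H_l2 = rbf_hidden_layer(X, centroids, widths, "l2")
H_l1 = rbf_hidden_layer(X, centroids, widths, "l1")
H_linf = rbf_hidden_layer(X, centroids, widths, "linf")
```

Each activation depends only on how far a sample lies from a centroid, which is one way the hidden-layer representation reflects how data is distributed in the feature space.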
Regarding Claim 13, Akusok in view of Huang teaches
The method of claim 10, further comprising
introducing a regularization term to a covariance matrix so as to minimize variability of covariance matrix estimation in the absence of sufficient training data ([Akusok p.1014 col.1, 4th paragraph: applying regularization to an extreme learning machine to prevent over-fitting of training data, which can occur if there is not enough variable training data (“introducing a regularization term … in the absence of sufficient training data”) (“Model structure selection prevents ELM from learning noise from data and over-fitting. It does so by artificially limiting the learning ability of an ELM. A training dataset has multiple instances of inputs, and the corresponding targets, which are generated by the projected data and an added noise. The noise term includes both random noise and projection from features not present in the inputs. Learning particular data samples with the associated noise is called over-fitting. An over-fitted ELM model has worse generalization performance (prediction performance on new data), which can be measured using a validation set of data. A model structure selection process finds an optimal generalization performance by changing the amount of model parameters or applying regularization to the model.”).] [Huang p.4 col.2 Section 4.2 ELM clustering based on LDA: using an extreme learning machine to perform discriminative clustering based on linear discriminant analysis, where the output weights are learned by performing linear discriminant analysis on the hidden layer output, and where clustering involves grouping of outputs with different mean, with the hidden layer scatter matrices representing the shared covariance (“Inspired by the DisCluster algorithm (Ding & Li, 2007), we extend ELM for discriminative clustering based on LDA. The idea is to perform LDA and k-means in the output space of ELM alternatively. Since the transformation matrix learned by LDA is a linear mapping, it can be absorbed by the output weight matrix of ELM, and we can directly learn the output weight β by performing LDA on the hidden layer output of ELM. … Basically, the hidden layer output matrix H can be viewed as the new data matrix, and its within-class and between-class scatter matrices can be computed similarly as that in standard LDA.”).] [Huang p.3 col.2 Section 3.2 Discriminative clustering via LDA: referring to eq.9 and eq.10 in Section 3.2, applying a ridge term λId, where λ is a regularization term, to the within-class scatter matrix (“It can be observed that both Σb and Σw are functions of the label matrix Y, since Y decides which cluster a sample is assigned to. For high dimensional data, a ridge term λId (Id is the identity matrix of dimension d) is added to the within-class scatter matrix to avoid numeric problems.”).] [Huang p.5 col.1, Section 4.2 ELM clustering based on LDA: using LDA minimizes the within-class distortion (“introducing a regularization term to a covariance matrix so as to minimize variability of covariance matrix estimation”) (“Since LDA minimizes the within-class distortion, and maximizes between class discrimination, the algorithm is able to find cluster structure in the ELM feature space.”).]).
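For illustration only, the effect of adding a ridge term λI_d to a within-class scatter matrix can be sketched in Python with NumPy as below. The function names `within_class_scatter` and `add_ridge`, the value of λ, and the toy data (fewer samples than feature dimensions) are assumptions introduced here and are not taken from Huang or from the application.

```python
import numpy as np

def within_class_scatter(F, y, n_classes):
    """Sum over classes of the scatter of samples around their class mean."""
    d = F.shape[1]
    S_w = np.zeros((d, d))
    for k in range(n_classes):
        Fk = F[y == k]
        centered = Fk - Fk.mean(axis=0)
        S_w += centered.T @ centered
    return S_w

def add_ridge(S_w, lam):
    """Add the ridge term lam * I_d to the scatter matrix."""
    return S_w + lam * np.eye(S_w.shape[0])

rng = np.random.default_rng(0)
F = rng.normal(size=(6, 10))          # 6 samples, 10 features: insufficient data
y = np.array([0, 0, 0, 1, 1, 1])

S_w = within_class_scatter(F, y, n_classes=2)   # rank-deficient, not invertible
S_reg = add_ridge(S_w, lam=1e-2)                # full rank after the ridge term
```

With too few samples the unregularized matrix is singular, so it cannot be inverted to derive a classifier; the ridge term lifts every eigenvalue by λ and makes the inverse well defined.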
Regarding Claim 17, Akusok as applied to Claim 14 teaches
The system of claim 14.
However, Akusok does not teach
wherein the approximate solutions are resolved via result of a variant of a linear discriminant analysis algorithm.  
Huang teaches
wherein the approximate solutions are resolved via result of a variant of a linear discriminant analysis algorithm ([Huang p.4 col.2, Section 4.2 ELM clustering based on LDA: using an extreme learning machine to perform discriminative clustering based on linear discriminant analysis, where the output weights are learned by performing linear discriminant analysis on the hidden layer output (“the approximate solutions are resolved via result of a variant of a linear discriminant analysis algorithm”), and where clustering involves grouping of outputs with different mean, with the hidden layer scatter matrices representing the shared covariance (“Inspired by the DisCluster algorithm (Ding & Li, 2007), we extend ELM for discriminative clustering based on LDA. The idea is to perform LDA and k-means in the output space of ELM alternatively. Since the transformation matrix learned by LDA is a linear mapping, it can be absorbed by the output weight matrix of ELM, and we can directly learn the output weight β by performing LDA on the hidden layer output of ELM. … Basically, the hidden layer output matrix H can be viewed as the new data matrix, and its within-class and between-class scatter matrices can be computed similarly as that in standard LDA.”).]).  
Both Akusok and Huang are analogous art as both describe the usage of extreme learning machines to perform classification and to determine initialization of output weights.
It would have been obvious to a person having ordinary skill in the art before the effective filing date to take the extreme learning machine method of initializing output weights of Akusok and enhance it with the extreme learning machine method of initializing output weights of Huang, so as to perform the initialization of output weights using a linear discriminant method for classified distributions that share the same covariance and have different means. The motivation to combine is taught in Huang as a way to leverage the benefits of the extreme learning machine (high efficiency, ease of implementation, capability to handle multi-class problems) together with linear discriminant analysis. Table 2 of Huang shows the added benefit that the combination outperforms other clustering methods such as k-means, making the combined solution an improvement for solving multi-class problems ([Huang p.2 col.1, 4th paragraph: “The motivation is to take advantage of ELM, and to design clustering algorithms which inherit its salient advantages, such as high efficiency, easiness of implementation and capable of handling multi-class data set.”] [Huang p.6 col.1, 6th paragraph; p.6 Table 2: “ELMCIter, ELMCLDA and ELMCKM outperform the baseline methods, k-means and ELM k-means, on most data sets.”]).
Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Akusok et al., High-Performance Extreme Learning Machines: A Complete Toolbox for Big Data Applications, July 17, 2015, IEEE Access Volume 3, 2015, pp.1011-1025 [hereafter referred to as Akusok] as applied to Claim 1, in view of Xiao et al., A Multiple Hidden Layers Extreme Learning Machine Method and Its Application, December 13, 2017, Hindawi, Mathematical Problems in Engineering, Volume 2017, pp.1-10 [hereafter referred to as Xiao].
Regarding Claim 7, Akusok as applied to Claim 1 teaches
The method of claim 1.
However, Akusok does not teach
the at least one hidden layer comprises a plurality of hidden layers; 
each hidden layer comprises a respective plurality of nodes, each node in a hidden layer being configured to perform a transformation on output of at least one node from an adjacent, lower layer; 
a lowest one of the plurality of hidden layers receives an output from the input layer; and 
the output layer receives an output from a highest one of the plurality of hidden layers.
Xiao teaches
the at least one hidden layer comprises a plurality of hidden layers ([Xiao Figure 3; p.4 col.2 2nd paragraph Section 4. Multihidden-Layer ELM: referring to Figure 3, Xiao shows an extreme learning machine with three hidden layers (“the at least one hidden layer comprises a plurality of hidden layers”), with each hidden layer consisting of a plurality of nodes and being connected to an output from an adjacent prior hidden layer, with the first hidden layer connected to the input layer, and the third hidden layer connected to an output layer (“Thus we propose an algorithm named multiple hidden layers extreme learning machine (MELM). The structure of the MELM (select the three-hidden-layer ELM for example) is illustrated in Figure 3. … The structure of the three-hidden-layer ELM has input layer, three hidden layers, and output layer.”).]);
each hidden layer comprises a respective plurality of nodes, each node in a hidden layer being configured to perform a transformation on output of at least one node from an adjacent, lower layer ([Xiao Figure 3; p.4 col.2 2nd paragraph Section 4. Multihidden-Layer ELM: referring to Figure 3, Xiao shows an extreme learning machine with three hidden layers, with each hidden layer consisting of a plurality of nodes (“each hidden layer comprises a respective plurality of nodes”) and being connected to an output from an adjacent prior hidden layer (“each node in a hidden layer being configured to perform a transformation on output of at least one node from an adjacent, lower layer”), with the first hidden layer connected to the input layer, and the third hidden layer connected to an output layer (“Thus we propose an algorithm named multiple hidden layers extreme learning machine (MELM). The structure of the MELM (select the three-hidden-layer ELM for example) is illustrated in Figure 3. … The structure of the three-hidden-layer ELM has input layer, three hidden layers, and output layer.”).]);
a lowest one of the plurality of hidden layers receives an output from the input layer ([Xiao Figure 3; p.4 col.2 2nd paragraph Section 4. Multihidden-Layer ELM: referring to Figure 3, Xiao shows an extreme learning machine with three hidden layers, with each hidden layer consisting of a plurality of nodes and being connected to an output from an adjacent prior hidden layer, with the first hidden layer connected to the input layer (“a lowest one of the plurality of hidden layers receives an output from the input layer”), and the third hidden layer connected to an output layer (“Thus we propose an algorithm named multiple hidden layers extreme learning machine (MELM). The structure of the MELM (select the three-hidden-layer ELM for example) is illustrated in Figure 3. … The structure of the three-hidden-layer ELM has input layer, three hidden layers, and output layer.”).]); and
the output layer receives an output from a highest one of the plurality of hidden layers ([Xiao Figure 3; p.4 col.2 2nd paragraph Section 4. Multihidden-Layer ELM: referring to Figure 3, Xiao shows an extreme learning machine with three hidden layers, with each hidden layer consisting of a plurality of nodes and being connected to an output from an adjacent prior hidden layer, with the first hidden layer connected to the input layer, and the third hidden layer connected to an output layer (“the output layer receives an output from a highest one of the plurality of hidden layers”) (“Thus we propose an algorithm named multiple hidden layers extreme learning machine (MELM). The structure of the MELM (select the three-hidden-layer ELM for example) is illustrated in Figure 3. … The structure of the three-hidden-layer ELM has input layer, three hidden layers, and output layer.”).]).

It would have been obvious to a person having ordinary skill in the art before the effective filing date to take the single hidden layer extreme learning machine of Akusok and enhance it with the multihidden layer extreme learning machine of Xiao to implement an extreme learning machine (artificial neural network) with multiple hidden layers. The motivation to combine is taught in Xiao, as the multiple hidden layer structure of an extreme learning machine helps improve classification performance of the network when compared to a single hidden layer extreme learning machine, as shown in Figure 6 of Xiao ([Xiao p.8 Figure 6; p.8 col.2-p.9 col.1&2 Section 6. Conclusion: “At the same time, the MELM network structure also improves the average accuracy of training and testing performance compared to the ELM and TELM network structure. … In the datasets classification problems, the average accuracy of the multiple classifications is significantly higher than that of the ELM and TELM network structure. In such cases, the MELM is able to improve the performance of the network structure.”]). 
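For illustration only, the layer arrangement recited in Claim 7 (input layer, a lowest hidden layer fed by the input layer, successive hidden layers, and an output layer fed by the highest hidden layer) can be sketched as follows. This is a minimal sketch and not code from Akusok or Xiao; the layer sizes, the sigmoid transformation, the random weights, and all function names are assumptions chosen for the example.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Return (weights, biases) for one fully connected layer with random values."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.uniform(-1.0, 1.0) for _ in range(n_out)]
    return weights, biases


def apply_layer(layer, inputs, activation):
    """Each node transforms the outputs of the nodes in the adjacent, lower layer."""
    weights, biases = layer
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def identity(z):
    return z


def forward(hidden_layers, output_layer, x):
    """Input layer -> lowest hidden layer -> ... -> highest hidden layer -> output layer."""
    activations = x  # output of the input layer
    for layer in hidden_layers:  # lowest hidden layer first
        activations = apply_layer(layer, activations, sigmoid)
    # The output layer receives the output of the highest hidden layer.
    return apply_layer(output_layer, activations, identity)


rng = random.Random(0)
sizes = [4, 5, 5, 5]  # hypothetical: 4 inputs, three hidden layers of 5 nodes each
hidden_layers = [make_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
output_layer = make_layer(sizes[-1], 3, rng)  # hypothetical: 3 output classes
scores = forward(hidden_layers, output_layer, [0.1, 0.2, 0.3, 0.4])
```

The sketch shows only the connectivity between layers; it does not reproduce the MELM training procedure of Xiao or the output weight computation of Akusok.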

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Cheng et al., Revisit Multinomial Logistic Regression in Deep Learning: Data Dependent Model Initialization for Image Recognition, arXiv:1809.06131v1, September 17, 2018.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to WILLIAM WAI YIN KWAN whose telephone number is 303-297-4332.  The examiner can normally be reached on Monday-Friday 8:00am - 4:30pm PT.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li Zhen, can be reached at 571-272-3768.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/WILLIAM WAI YIN KWAN/
Examiner, Art Unit 2121



/Li B. Zhen/
Supervisory Patent Examiner, Art Unit 2121