Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This action is in response to amendments and remarks filed on 08/05/2022. In the current amendments, claims 1, 11, and 12 are amended. Claims 1-12 are pending and have been examined.
In response to amendments and remarks filed on 08/05/2022, the 35 U.S.C. 112(a) rejection of claims 3-10 and the 35 U.S.C. 112(b) rejection of claims 3-10 made in the previous Office Action have been withdrawn.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 06/21/2022 has been entered.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 4, and 11-12 are rejected under 35 U.S.C. 103 as being unpatentable over Kalamkar et al. (US 11094029 B2) in view of Tuske et al. (“Integrating Gaussian Mixtures Into Deep Neural Networks: Softmax Layer With Hidden Variables”).
Regarding Claim 1,
Kalamkar et al. teaches a data analysis apparatus, comprising (Kalamkar et al., FIG. 1 and Col. 3 Lines 20-24, “The computing system 100 includes a processing subsystem 101 having one or more processor(s) 102 and a system memory 104 communicating via an interconnection path that may include a memory hub 105” teaches a computer with computer components (corresponds to data analysis apparatus)).
a processor (Kalamkar et al., FIG. 22 and Col. 46 Lines 64-65, “The processor 2202 and the GPGPU 2220 can be any of the processors and GPGPU/parallel processors” teaches a processor).
a memory comprising instructions, when executed by the processor, cause the processor to (Kalamkar et al., Col. 46 Lines 66-67, “The processor 2202 can execute instructions for a compiler 2215 stored in system memory 2212” teaches a system memory that stores instructions that are executed by the processor).
use a first neural network that includes a first input layer, a first output layer, and a first intermediate layer having at least two layers between the first input layer and the first output layer, the first intermediate layer being configured to give data from a previous layer and a first learning parameter to a first activation function for calculation and output a calculation result to a subsequent layer (Kalamkar et al., Col. 23 Lines 20-38, “A feedforward network may be implemented as an acyclic graph in which the nodes are arranged in layers. Typically, a feedforward network topology includes an input layer and an output layer that are separated by at least one hidden layer. The hidden layer transforms input received by the input layer into a representation that is useful for generating output in the output layer. The network nodes are fully connected via edges to the nodes in adjacent layers, but there are no edges between nodes within each layer. Data received at the nodes of an input layer of a feedforward network are propagated (i.e., “fed forward”) to the nodes of the output layer via an activation function that calculates the states of the nodes of each successive layer in the network based on coefficients (“weights”) respectively associated with each of the edges connecting the layers. Depending on the specific model being represented by the algorithm being executed, the output from the neural network algorithm can take various forms” teaches a feedforward network (corresponds to the first neural network) that includes an input layer and an output layer separated by at least one hidden layer (corresponds to the first intermediate layer) in between. Kalamkar et al. further teaches data and coefficients (corresponds to the learning parameter) being propagated from the input layer to the output layer through an activation function to output calculation results to each successive layer (corresponds to the subsequent layer) in the network. Col. 27 Lines 11-16, “The exemplary neural networks described above can be used to perform deep learning. Deep learning is machine learning using deep neural networks. The deep neural networks used in deep learning are artificial neural networks composed of multiple hidden layers, as opposed to shallow neural networks that include only a single hidden layer” teaches utilizing the feedforward neural network to perform deep learning for deep neural networks. Deep neural networks are composed of multiple hidden layers (corresponds to the two layers in between the input layer and output layer)).
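For illustration only, the feedforward propagation described in the cited passage may be sketched as follows. This is a minimal sketch and is not code from Kalamkar et al.; the function names (`sigmoid`, `dense_layer`, `feedforward`), the choice of a sigmoid activation, and all weight values are assumptions made for the example. Each layer gives the previous layer's data and its weights to an activation function and passes the result to the next layer, with two hidden layers between input and output.

```python
import math

def sigmoid(z):
    # Assumed activation function; the cited passage does not name one.
    return 1.0 / (1.0 + math.exp(-z))

def dense_layer(inputs, weights, biases, activation=sigmoid):
    """One fully connected layer: activation(sum(w * x) + b) for each node."""
    return [
        activation(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]

def feedforward(x, layers):
    """Propagate x through every (weights, biases) pair and keep each layer's output."""
    outputs = []
    current = x
    for weights, biases in layers:
        current = dense_layer(current, weights, biases)
        outputs.append(current)
    return outputs

# Hypothetical network: 2 inputs -> hidden (2 nodes) -> hidden (2 nodes) -> output (1 node).
layers = [
    ([[0.5, -0.5], [0.25, 0.75]], [0.0, 0.0]),
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 1.0]], [0.0]),
]
per_layer = feedforward([1.0, 2.0], layers)
```

Here `per_layer` holds the calculation result of each layer in order, so the last entry is the network output and the earlier entries are the hidden-layer results.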
generate degenerated output data based on the calculation result from each of the first intermediate layer (Kalamkar et al., Col. 23 Lines 41-45, “Training a neural network involves … using a set of training data representing a problem being modeled by the network, and adjusting the weights until the network model performs with a minimal error for all instances of the training data set” teaches training the neural network by running the network to generate a difference/error (degenerated output) between the output (based on the calculation result) and the desired result).
receive degenerated output data derived from each of the first intermediate layer, set a weight of each layer in the first intermediate layer based on the degenerated output data and a second learning parameter, and output said weight to the first output layer (Kalamkar et al., Col. 23 Lines 46-56, “during a supervised learning training process for a neural network, the output produced by the network in response to the input representing an instance in a training data set is compared to the “correct” labeled output for that instance, an error signal representing the difference between the output and the labeled output is calculated, and the weights associated with the connections are adjusted to minimize that error as the error signal is backward propagated through the layers of the network. The network is considered “trained” when the errors for each of the outputs generated from the instances of the training data set are minimized” teaches a training phase where the weights are adjusted (corresponds to set a weight) based on the training dataset and calculated output and error (corresponds to the degenerated output data and a second learning parameter)).
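For illustration only, the error-driven weight adjustment described in the cited passage may be sketched as follows. This is a simplified sketch under stated assumptions and is not code from Kalamkar et al.: it trains a single linear unit with a per-sample gradient step rather than backward propagating the error through multiple layers, and the names (`train_linear_unit`, `mean_squared_error`), the learning rate, and the training samples are all hypothetical.

```python
def train_linear_unit(samples, weights, learning_rate=0.1, epochs=200):
    """Adjust weights to reduce the error between output and labeled output."""
    weights = list(weights)
    for _ in range(epochs):
        for inputs, target in samples:
            output = sum(w * x for w, x in zip(weights, inputs))
            error = output - target  # error signal for this training instance
            # Gradient step on 0.5 * error**2 with respect to each weight.
            weights = [w - learning_rate * error * x for w, x in zip(weights, inputs)]
    return weights

def mean_squared_error(samples, weights):
    """Average squared difference between the unit's output and the labels."""
    total = 0.0
    for inputs, target in samples:
        output = sum(w * x for w, x in zip(weights, inputs))
        total += (output - target) ** 2
    return total / len(samples)

# Hypothetical labeled training set generated by y = 2*x1 - x2.
samples = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
initial = [0.0, 0.0]
trained = train_linear_unit(samples, initial)
```

After training, `mean_squared_error(samples, trained)` is much smaller than `mean_squared_error(samples, initial)`, which mirrors the passage's statement that the network is considered trained when the errors on the training instances are minimized.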
weight the calculation result with the weight of each layer of the first intermediate layer based on the degenerated output data (Kalamkar et al., Col. 23 Lines 20-38, “A feedforward network may be implemented as an acyclic graph in which the nodes are arranged in layers. Typically, a feedforward network topology includes an input layer and an output layer that are separated by at least one hidden layer. The hidden layer transforms input received by the input layer into a representation that is useful for generating output in the output layer. The network nodes are fully connected via edges to the nodes in adjacent layers, but there are no edges between nodes within each layer. Data received at the nodes of an input layer of a feedforward network are propagated (i.e., “fed forward”) to the nodes of the output layer via an activation function that calculates the states of the nodes of each successive layer in the network based on coefficients (“weights”)” teaches feedforward propagation, e.g., weighting intermediate results based on weights calculated from the error/degenerated output).
Kalamkar et al. does not appear to explicitly teach calculate prediction data based on each weighted output data and a third learning parameter.
However, Tuske et al. teaches calculate prediction data based on each weighted output data and a third learning parameter (Tuske et al., Section 2.3 Para. 1, “Grouping the parameters of a state, Eq. 3 can be realized by already existing NN building blocks as a softmax layer followed by a sum-pooling over a region. In the case of maximum approximation the last layer becomes a max-pooling” teaches calculating the output of the output layer. Section 2.3 Para. 4, “Because of the huge softmax layer, the low-rank factorization of the last weight matrix through linear BN layer is inevitable as it was proposed also for NN with more than 10k outputs” teaches the weight matrix contributing to the output of the network. Eq. 1 and Section 2.1 Para. 1, “with model parameters θ = {w_s, b_s}, where w_s ∈ R^N and b_s ∈ R are state specific parameters. The f(x) : R^M → R^N corresponds to the feature function such as linear, polynomial or any non-linear feature mapping, e.g. another tandem model [11, 12, 13, 14, 15]. Within the neural network framework Eq. 1 corresponds to the softmax output layer: w_s, b_s form the last weight matrix and bias vector, the rest of the network up to the output of the last hidden layer forms the feature function f” teaches the parameters (corresponds to the third learning parameter) contributing to the output of the network).
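For illustration only, the softmax output layer referenced in the cited passage may be sketched as follows. This is a minimal sketch and is not code from Tuske et al.; it assumes Eq. 1 takes the standard softmax form, in which the score of each state is the dot product of that state's weight vector with the feature vector f(x) plus that state's bias. The names (`softmax_output`, `features`, `weight_matrix`, `bias_vector`) and all numeric values are hypothetical.

```python
import math

def softmax_output(features, weight_matrix, bias_vector):
    """Posterior over states from the last hidden layer's output f(x)."""
    scores = [
        sum(w * f for w, f in zip(state_weights, features)) + b
        for state_weights, b in zip(weight_matrix, bias_vector)
    ]
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical feature vector f(x) and last-layer parameters for two states.
features = [0.2, -0.4, 1.0]
weight_matrix = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]
bias_vector = [0.1, -0.1]

posteriors = softmax_output(features, weight_matrix, bias_vector)
predicted_state = max(range(len(posteriors)), key=posteriors.__getitem__)
```

The prediction is the state with the highest posterior, so the last weight matrix and bias vector directly determine the network output.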
Kalamkar et al. and Tuske et al. are analogous art because they are from the same field of endeavor and are from the same problem-solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Kalamkar et al. with Tuske et al., with motivation to calculate prediction data based on each weighted output data and a third learning parameter. “On small scale, the joint training of tandem BN-GMM through generalized softmax layer always resulted in better recognition performance than any of our hybrid baselines. Furthermore, large scale experiments verified that the proposed BN-LMM model with hidden variables could achieve similar performance with fewer output targets than a classic hybrid system” (Tuske et al., Conclusion). The proposed teaching is beneficial in that it results in better recognition performance and can achieve similar performance with fewer output targets.
Regarding Claim 4,
Kalamkar et al. in view of Tuske et al. teaches the data analysis apparatus according to claim 1, wherein:
Kalamkar et al. further teaches the data analysis apparatus is further configured to adjust the first learning parameter, the second learning parameter, and the third learning parameter when training data is given to the first input layer (Kalamkar et al., FIG. 1 and Col. 3 Lines 20-24, “The computing system 100 includes a processing subsystem 101 having one or more processor(s) 102 and a system memory 104 communicating via an interconnection path that may include a memory hub 105” teaches a computer with computer components (corresponds to data analysis apparatus). FIG. 14C and Col. 35 Lines 1-17, “As shown in FIG. 14C, hybrid parallelism can be performed in which a partitioning is performed across activations and weights to minimize skewed matrices. For a layer of a neural network, the input data 1402, weight data 1404, and/or activation data 1406 is partitioned and distributed across multiple compute nodes (e.g., Node 0-Node 3). Node 0 receives a first block of input data 1402A and weight data 1404A. Compute operations are performed at Node 0 to generate a first partial activation 1406A. Likewise, Node1 receives a second block of input data 1402B and weight data 1404B. Compute operations are performed at Node 1 to generate a second partial activation 1406B. Node 2 can perform compute operations on third input data 1402C and weight data 1406C to generate a third partial activation 1406C” teaches input data, activation data, and weight data (corresponds to learning parameter) being distributed across Node 0-Node 3 (corresponds to the first-third learning parameters) for a layer of the neural network (corresponds to the input layer)).
Regarding Claim 11,
Kalamkar et al. teaches a data analysis method using a first neural network that includes a first input layer, a first output layer, and a first intermediate layer having at least two layers between the first input layer and the first output layer, the first intermediate layer being configured to give data from a previous layer and a first learning parameter to a first activation function for calculation and output a calculation result to a subsequent layer (Kalamkar et al., Col. 23 Lines 20-38, “A feedforward network may be implemented as an acyclic graph in which the nodes are arranged in layers. Typically, a feedforward network topology includes an input layer and an output layer that are separated by at least one hidden layer. The hidden layer transforms input received by the input layer into a representation that is useful for generating output in the output layer. The network nodes are fully connected via edges to the nodes in adjacent layers, but there are no edges between nodes within each layer. Data received at the nodes of an input layer of a feedforward network are propagated (i.e., “fed forward”) to the nodes of the output layer via an activation function that calculates the states of the nodes of each successive layer in the network based on coefficients (“weights”) respectively associated with each of the edges connecting the layers. Depending on the specific model being represented by the algorithm being executed, the output from the neural network algorithm can take various forms” teaches a feedforward network (corresponds to the first neural network) that includes an input layer and an output layer separated by at least one hidden layer (corresponds to the first intermediate layer) in between. Kalamkar et al. further teaches data and coefficients (corresponds to the learning parameter) being propagated from the input layer to the output layer through an activation function to output calculation results to each successive layer (corresponds to the subsequent layer) in the network. Col. 27 Lines 11-16, “The exemplary neural networks described above can be used to perform deep learning. Deep learning is machine learning using deep neural networks. The deep neural networks used in deep learning are artificial neural networks composed of multiple hidden layers, as opposed to shallow neural networks that include only a single hidden layer” teaches utilizing the feedforward neural network to perform deep learning for deep neural networks. Deep neural networks are composed of multiple hidden layers (corresponds to the two layers in between the input layer and output layer)).
wherein the data analysis apparatus includes a processor and a storage device to store the first neural network wherein the processor is configured to conduct (Kalamkar et al., FIG. 1 and Col. 3 Lines 18-26, “FIG. 1 is a block diagram illustrating a computing system 100 configured to implement one or more aspects of the embodiments described herein. The computing system 100 includes a processing subsystem 101 having one or more processor(s) 102 and a system memory 104 communicating via an interconnection path that may include a memory hub 105. The memory hub 105 may be a separate component within a chipset component or may be integrated within the one or more processor(s) 102” teaches a processor and a system memory that stores a set of trainable machine learning parameters and a library to facilitate data transmission during distributed training of the neural network).
a degenerated process to generate degenerated output data based on the calculation result from each of the first intermediate layer (Kalamkar et al., Col. 23 Lines 41-45, “Training a neural network involves selecting a network topology, using a set of training data representing a problem being modeled by the network, and adjusting the weights until the network model performs with a minimal error for all instances of the training data set” teaches training the neural network which includes generating the degenerated output data based on the feedforward calculations on a set of training data).
a setting process to receive degenerated output data derived from each of the first intermediate layer, set a weight of each layer in the first intermediate layer based on the degenerated output data and a second learning parameter, and output said weight to the first output layer (Kalamkar et al., Col. 23 Lines 46-56, “during a supervised learning training process for a neural network, the output produced by the network in response to the input representing an instance in a training data set is compared to the “correct” labeled output for that instance, an error signal representing the difference between the output and the labeled output is calculated, and the weights associated with the connections are adjusted to minimize that error as the error signal is backward propagated through the layers of the network. The network is considered “trained” when the errors for each of the outputs generated from the instances of the training data set are minimized” teaches a training phase where the weights are adjusted (corresponds to set a weight) based on the training dataset and calculated output and error (corresponds to the degenerated output data and a second learning parameter)).
a weighting process to weight the calculation result with the weight of each layer of the first intermediate layer that was set in the setting process (Kalamkar et al., Col. 23 Lines 20-38, “A feedforward network may be implemented as an acyclic graph in which the nodes are arranged in layers. Typically, a feedforward network topology includes an input layer and an output layer that are separated by at least one hidden layer. The hidden layer transforms input received by the input layer into a representation that is useful for generating output in the output layer. The network nodes are fully connected via edges to the nodes in adjacent layers, but there are no edges between nodes within each layer. Data received at the nodes of an input layer of a feedforward network are propagated (i.e., “fed forward”) to the nodes of the output layer via an activation function that calculates the states of the nodes of each successive layer in the network based on coefficients (“weights”)” teaches feedforward propagation, e.g., weighting intermediate results based on weights calculated from the error/degenerated output).
Kalamkar et al. does not appear to explicitly teach a calculation process to calculate prediction data based on each output data that was weighted in the weighting process and a third learning parameter.
However, Tuske et al. teaches a calculation process to calculate prediction data based on each output data that was weighted in the weighting process and a third learning parameter (Tuske et al., Section 2.3 Para. 1, “Grouping the parameters of a state, Eq. 3 can be realized by already existing NN building blocks as a softmax layer followed by a sum-pooling over a region. In the case of maximum approximation the last layer becomes a max-pooling” teaches calculating the output of the output layer. Section 2.3 Para. 4, “Because of the huge softmax layer, the low-rank factorization of the last weight matrix through linear BN layer is inevitable as it was proposed also for NN with more than 10k outputs” teaches the weight matrix contributing to the output of the network. Eq. 1 and Section 2.1 Para. 1, “with model parameters θ = {w_s, b_s}, where w_s ∈ R^N and b_s ∈ R are state specific parameters. The f(x) : R^M → R^N corresponds to the feature function such as linear, polynomial or any non-linear feature mapping, e.g. another tandem model [11, 12, 13, 14, 15]. Within the neural network framework Eq. 1 corresponds to the softmax output layer: w_s, b_s form the last weight matrix and bias vector, the rest of the network up to the output of the last hidden layer forms the feature function f” teaches the parameters (corresponds to the third learning parameter) contributing to the output of the network).
Kalamkar et al. and Tuske et al. are analogous art because they are from the same field of endeavor and are from the same problem-solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Kalamkar et al. with Tuske et al., with motivation to have a calculation process to calculate prediction data based on each output data that was weighted in the weighting process and a third learning parameter. “On small scale, the joint training of tandem BN-GMM through generalized softmax layer always resulted in better recognition performance than any of our hybrid baselines. Furthermore, large scale experiments verified that the proposed BN-LMM model with hidden variables could achieve similar performance with fewer output targets than a classic hybrid system” (Tuske et al., Conclusion). The proposed teaching is beneficial in that it results in better recognition performance and can achieve similar performance with fewer output targets.
Regarding Claim 12,
Kalamkar et al. teaches a non-transitory recording medium having stored thereon a data analysis program that causes a processor to conduct prescribed processes, the processor being able to access a storage device having stored therein a first neural network that includes a first input layer, a first output layer, and a first intermediate layer having at least two layers between the first input layer and the first output layer, the first intermediate layer being configured to give data from a previous layer and a first learning parameter to a first activation function for calculation and output a calculation result to a subsequent layer, the non-transitory recording medium being readable by the processor, the data analysis program causing the processor to execute (Kalamkar et al., Col. 64 Lines 37-43, “One or more aspects of at least one embodiment may be implemented by representative code stored on a machine-readable medium which represents and/or defines logic within an integrated circuit such as a processor. For example, the machine-readable medium may include instructions which represent various logic within the processor” teaches a non-transitory machine-readable medium within a processor. FIG. 1 and Col. 3 Lines 18-26, “FIG. 1 is a block diagram illustrating a computing system 100 configured to implement one or more aspects of the embodiments described herein. The computing system 100 includes a processing subsystem 101 having one or more processor(s) 102 and a system memory 104 communicating via an interconnection path that may include a memory hub 105. The memory hub 105 may be a separate component within a chipset component or may be integrated within the one or more processor(s) 102” teaches a processor and a system memory that stores a set of trainable machine learning parameters and a library to facilitate data transmission during distributed training of the neural network. Col. 23 Lines 20-38, “A feedforward network may be implemented as an acyclic graph in which the nodes are arranged in layers. Typically, a feedforward network topology includes an input layer and an output layer that are separated by at least one hidden layer. The hidden layer transforms input received by the input layer into a representation that is useful for generating output in the output layer. The network nodes are fully connected via edges to the nodes in adjacent layers, but there are no edges between nodes within each layer. Data received at the nodes of an input layer of a feedforward network are propagated (i.e., “fed forward”) to the nodes of the output layer via an activation function that calculates the states of the nodes of each successive layer in the network based on coefficients (“weights”) respectively associated with each of the edges connecting the layers. Depending on the specific model being represented by the algorithm being executed, the output from the neural network algorithm can take various forms” teaches a feedforward network (corresponds to the first neural network) that includes an input layer and an output layer separated by at least one hidden layer (corresponds to the first intermediate layer) in between. Kalamkar et al. further teaches data and coefficients (corresponds to the learning parameter) being propagated from the input layer to the output layer through an activation function to output calculation results to each successive layer (corresponds to the subsequent layer) in the network. Col. 27 Lines 11-16, “The exemplary neural networks described above can be used to perform deep learning. Deep learning is machine learning using deep neural networks. The deep neural networks used in deep learning are artificial neural networks composed of multiple hidden layers, as opposed to shallow neural networks that include only a single hidden layer” teaches utilizing the feedforward neural network to perform deep learning for deep neural networks. Deep neural networks are composed of multiple hidden layers (corresponds to the two layers in between the input layer and output layer)).
a degeneration process to generate degenerated output data based on the calculation result from each of the first intermediate layer (Kalamkar et al., Col. 23 Lines 41-45, “Training a neural network involves selecting a network topology, using a set of training data representing a problem being modeled by the network, and adjusting the weights until the network model performs with a minimal error for all instances of the training data set” teaches training the neural network which includes generating the degenerated output data based on the feedforward calculations on a set of training data).
a setting processing for receiving degenerated output data derived from each of the first intermediate layer, set a weight of each layer in the first intermediate layer based on the degenerated output data and a second learning parameter, and output the weight to the first output layer (Kalamkar et al., Col. 23 Lines 46-56, “during a supervised learning training process for a neural network, the output produced by the network in response to the input representing an instance in a training data set is compared to the “correct” labeled output for that instance, an error signal representing the difference between the output and the labeled output is calculated, and the weights associated with the connections are adjusted to minimize that error as the error signal is backward propagated through the layers of the network. The network is considered “trained” when the errors for each of the outputs generated from the instances of the training data set are minimized” teaches a training phase where the weights are adjusted (corresponds to set a weight) based on the training dataset and calculated output and error (corresponds to the degenerated output data and a second learning parameter)).
a weighting processing for weighting the calculation result with the weight of each layer of the first intermediate layer that was set in the setting process (Kalamkar et al., Col. 23 Lines 20-38, “A feedforward network may be implemented as an acyclic graph in which the nodes are arranged in layers. Typically, a feedforward network topology includes an input layer and an output layer that are separated by at least one hidden layer. The hidden layer transforms input received by the input layer into a representation that is useful for generating output in the output layer. The network nodes are fully connected via edges to the nodes in adjacent layers, but there are no edges between nodes within each layer. Data received at the nodes of an input layer of a feedforward network are propagated (i.e., “fed forward”) to the nodes of the output layer via an activation function that calculates the states of the nodes of each successive layer in the network based on coefficients (“weights”)” teaches feedforward propagation, e.g., weighting intermediate results based on weights calculated from the error/degenerated output).
Kalamkar et al. does not appear to explicitly teach a calculation processing for calculating prediction data based on each output data that was weighted in the weighting process and a third learning parameter.
However, Tuske et al. teaches a calculation processing for calculating prediction data based on each output data that was weighted in the weighting process and a third learning parameter (Tuske et al., Section 2.3 Para. 1, “Grouping the parameters of a state, Eq. 3 can be realized by already existing NN building blocks as a softmax layer followed by a sum-pooling over a region. In the case of maximum approximation the last layer becomes a max-pooling” teaches calculating the output of the output layer. Section 2.3 Para. 4, “Because of the huge softmax layer, the low-rank factorization of the last weight matrix through linear BN layer is inevitable as it was proposed also for NN with more than 10k outputs” teaches the weight matrix contributing to the output of the network. Eq. 1 and Section 2.1 Para. 1, “with model parameters θ = {w_s, b_s}, where w_s ∈ R^N and b_s ∈ R are state specific parameters. The f(x) : R^M → R^N corresponds to the feature function such as linear, polynomial or any non-linear feature mapping, e.g. another tandem model [11, 12, 13, 14, 15]. Within the neural network framework Eq. 1 corresponds to the softmax output layer: w_s, b_s form the last weight matrix and bias vector, the rest of the network up to the output of the last hidden layer forms the feature function f” teaches the parameters (corresponds to the third learning parameter) contributing to the output of the network).
Kalamkar et al. in view of Tuske et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filling date of the claimed invention to modify the invention of Kalamkar et al. with Tuske et al., with motivation to have a calculation processing for calculating prediction data based on each output 5data that was weighted in the weighting process and a third learning parameter. “On small scale, the joint training of tandem BN-GMM through generalized softmax layer always resulted in better recognition performance than any of our hybrid baselines. Furthermore, large scale experiments verified that the proposed BN-LMM model with hidden variables could achieve similar performance with fewer output targets than a classic hybrid system” (Tuske et al., Conclusion). The proposed teaching is beneficial in that it results in better recognition performance and can achieve similar performance with fewer output targets.
Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over Kalamkar et al. in view of Tuske et al. and in further view of Sawada et al. (US 10832128 B2).
Regarding Claim 2,
Kalamkar et al. in view of Tuske et al. teaches the data analysis apparatus according to claim 1, 
Kalamkar et al. in view of Tuske et al. does not appear to explicitly teach wherein the data analysis apparatus receives output data from the first input layer, sets a weight of each first intermediate layer based on the output data and the second learning parameter, and outputs said weight to the first output layer.
However, Sawada et al. teaches wherein the data analysis apparatus receives output data from the first input layer, sets a weight of each first intermediate layer based on the output data and the second learning parameter, and outputs said weight to the first output layer (Sawada et al., Col. 9 Lines 5-12, “In the neural network apparatus 100, a weighted sum computation is performed by the units 105 in the hidden layers 102 and the output layer 103 by using the weight W=[w1, w2, . . . ] in response to the units 105 in the input layer 101 being fed with element values of input data X=[x1, x2, . . . ], and element values of output data Y=[y1, y2, . . . ] are output from the units 105 in the output layer 103” teaches determining the weighted sum from utilizing the units in the output layer and hidden layers (corresponds to first intermediate layer), by using the weights in response of the element values of output data w2 (corresponds to the second learning parameter) to the output layer).
Kalamkar et al. in view of Sawada et al. in view of Tuske et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Kalamkar et al. and Sawada et al. with Tuske et al., with motivation wherein the setting unit receives output data from the first input layer, sets a weight of each first intermediate layer based on the output data and the second learning parameter, and outputs said weight to the first output layer. “Accordingly, a transfer learning apparatus is obtained which saves the time and effort for changing the configuration and weight values of the neural network apparatus by using the transfer target data items during transfer learning and which is free from unwanted effects, such as overfitting and a decrease in the recognition accuracy that may occur as a result of changing the configuration and the weight values” (Sawada et al., Col. 3 Lines 5-11). The proposed teaching is beneficial in that it saves time and effort for changing the configuration and weight values of the neural network during transfer learning, which prevents overfitting and a decrease in the recognition accuracy.
Claims 3 and 5 are rejected under 35 U.S.C. 103 as being unpatentable over Kalamkar et al. in view of Tuske et al. and in further view of Kasahara (US 20170147921 A1)
Regarding Claim 3,
Kalamkar et al. in view of Tuske et al. teaches the data analysis apparatus according to claim 1, wherein
Kalamkar et al. further teaches the data analysis apparatus is configured to receive output data from each first intermediate layer, reduce the number of dimensions of each output data, and output each degenerated output data (Kalamkar et al., FIG. 1 and Col. 3 Lines 20-24, “The computing system 100 includes a processing subsystem 101 having one or more processor(s) 102 and a system memory 104 communicating via an interconnection path that may include a memory hub 105” teaches a computer with computer components (corresponds to data analysis apparatus). Col. 29 Lines 46-52, “A deep belief network (DBN) is a generative neural network that is composed of multiple layers of stochastic (random) variables. DBNs can be trained layer-by-layer using greedy unsupervised learning. The learned weights of the DBN can then be used to provide pre-train neural networks by determining an optimal initial set of weights for the neural network” teaches training layer by layer (corresponds to the first intermediate layer) using unsupervised training. Col. 30 Lines 28-35, “Unsupervised training can be used to generate a self-organizing map, which is a type of trained neural network 1107 capable of performing operations useful in reducing the dimensionality of data. Unsupervised training can also be used to perform anomaly detection, which allows the identification of data points in an input dataset that deviate from the normal patterns of the data” teaches the unsupervised training utilized for reducing the dimensionality of data).
Kalamkar et al. in view of Tuske et al. does not appear to explicitly teach wherein the data analysis apparatus receives each degenerated output data, sets a weight of each first intermediate layer based on said degenerated output data and the second learning parameter, and outputs the weight to the first output layer
However, Kasahara teaches wherein the data analysis apparatus receives each degenerated output data, sets a weight of each first intermediate layer based on said degenerated output data and the second learning parameter, and outputs the weight to the first output layer (Kasahara, Para. [0039], “the learning performing unit 24 causes a stacked autoencoder to learn (i.e., optimize) parameters (e.g., weight parameters between layers) used in the multilayer neural network, by backpropagation” teaches setting weight parameters between layers (corresponds to the first intermediate layer in the middle to the output layer) based on the optimized weight parameter (corresponds to the second learning parameter). FIG. 5 and Para. [0041], “As illustrated in FIG. 5, an autoencoder is known as a method for dimensionality reduction (or dimensionality compression) using the neural network 20. An autoencoder can reduce the number of neurons in a middle layer to become smaller than the dimensionality in an input layer, thereby achieving dimensionality reduction so that the input data is reconstructed with less dimensionality” teaches dimensionality reduction method (corresponds to degenerated output data) utilizing the neural network).
Kalamkar et al. in view of Tuske et al. in view of Kasahara are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Kalamkar et al. and Tuske et al. with Kasahara, with motivation wherein the data analysis apparatus receives each degenerated output data, sets a weight of each first intermediate layer based on said degenerated output data and the second learning parameter, and outputs the weight to the first output layer. “An embodiment has an object to provide a learning apparatus, a recording medium, and a learning method that improves accuracy of learning results” (Kasahara, Para. [0023]). The proposed teaching is beneficial in that it improves the accuracy of the learning results.
Regarding Claim 5,
Kalamkar et al. in view of Tuske et al. in view of Kasahara teaches the data analysis apparatus according to claim 3, further comprising
Kalamkar et al. further teaches the data analysis apparatus is further configured to receive output data from each first intermediate layer, reduce the number of dimensions of each output data, and output each second degenerated output data (Kalamkar et al., FIG. 1 and Col. 3 Lines 20-24, “The computing system 100 includes a processing subsystem 101 having one or more processor(s) 102 and a system memory 104 communicating via an interconnection path that may include a memory hub 105” teaches a computer with computer components (corresponds to data analysis apparatus). Col. 29 Lines 46-52, “A deep belief network (DBN) is a generative neural network that is composed of multiple layers of stochastic (random) variables. DBNs can be trained layer-by-layer using greedy unsupervised learning. The learned weights of the DBN can then be used to provide pre-train neural networks by determining an optimal initial set of weights for the neural network” teaches training layer by layer (corresponds to the first intermediate layer) using unsupervised training. Col. 30 Lines 28-35, “Unsupervised training can be used to generate a self-organizing map, which is a type of trained neural network 1107 capable of performing operations useful in reducing the dimensionality of data. Unsupervised training can also be used to perform anomaly detection, which allows the identification of data points in an input dataset that deviate from the normal patterns of the data” teaches the unsupervised training utilized for reducing the dimensionality of data).
Kasahara further teaches wherein the data analysis apparatus weights each second degenerated output data based on the weight of each first intermediate layer (Kasahara, Para. [0041], “As illustrated in FIG. 5, an autoencoder is known as a method for dimensionality reduction (or dimensionality compression) using the neural network 20. An autoencoder can reduce the number of neurons in a middle layer to become smaller than the dimensionality in an input layer, thereby achieving dimensionality reduction so that the input data is reconstructed with less dimensionality” teaches dimensionality reduction using the neural network for each layer. The input data (corresponds to the output data of the previous layer) is reconstructed with less dimensionality. Para. [0039], “Specifically, the learning performing unit 24 causes a stacked autoencoder to learn (i.e., optimize) parameters (e.g., weight parameters between layers) used in the multilayer neural network, by backpropagation” teaches the autoencoder optimizing the weight parameters between layers (corresponds to the first intermediate layer)).
Kalamkar et al. in view of Tuske et al. in view of Kasahara are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Kalamkar et al. and Tuske et al. with Kasahara, with motivation wherein the data analysis apparatus weights each second degenerated output data based on the weight of each first intermediate layer. “An embodiment has an object to provide a learning apparatus, a recording medium, and a learning method that improves accuracy of learning results” (Kasahara, Para. [0023]). The proposed teaching is beneficial in that it improves the accuracy of the learning results.
Claims 6-10 are rejected under 35 U.S.C. 103 as being unpatentable over Kalamkar et al. in view of Tuske et al. and in further view of Mendoza et al. (“Towards Automatically-Tuned Neural Networks”).
Regarding Claim 6,
Kalamkar et al. in view of Tuske et al. teaches the data analysis apparatus according to claim 4, wherein the data analysis apparatus is configured to
Kalamkar et al. further teaches adjust the fourth learning parameter using a second neural network including a second input layer that receives the training data, a second output layer that outputs a hyperparameter of the first neural network, and a second intermediate layer interposed between the second input layer and the second output layer, the second intermediate layer being configured to give data from a previous layer and a fourth learning parameter to a second activation function for calculation and output a calculation result to a subsequent layer, when the training data is given to the second input layer (Kalamkar et al., FIG. 1 and Col. 3 Lines 20-24, “The computing system 100 includes a processing subsystem 101 having one or more processor(s) 102 and a system memory 104 communicating via an interconnection path that may include a memory hub 105” teaches a computer with computer components (corresponds to data analysis apparatus). Col. 27 Lines 37-51, “Once the neural network is structured, a learning model can be applied to the network to train the network to perform specific tasks. The learning model describes how to adjust the weights within the model to reduce the output error of the network. Backpropagation of errors is a common method used to train neural networks. An input vector is presented to the network for processing. The output of the network is compared to the desired output using a loss function and an error value is calculated for each of the neurons in the output layer. The error values are then propagated backwards until each neuron has an associated error value which roughly represents its contribution to the original output.
The network can then learn from those errors using an algorithm, such as the stochastic gradient descent algorithm, to update the weights of the of the neural network” teaches adjusting the weights associated with connection (corresponds to the fourth learning parameter) to minimize error of output generated from propagation backwards to train neural networks (corresponds to the second neural network). Col. 26 Lines 59-67 and Col. 27 Lines 1-2, “Recurrent neural networks (RNNs) are a family of feedforward neural networks that include feedback connections between layers. RNNs enable modeling of sequential data by sharing parameter data across different parts of the neural network. The architecture for a RNN includes cycles. The cycles represent the influence of a present value of a variable on its own value at a future time, as at least a portion of the output data from the RNN is used as feedback for processing subsequent input in a sequence. This feature makes RNNs particularly useful for language processing due to the variable nature in which language data can be composed” teaches the Recurrent Neural Network (RNN) that consist of an input layer and an output layer (corresponds to the second input and output layer) separated by two hidden layers (corresponds to the second intermediate layer) in between for feedback. Col. 23 Lines 29-35, “Data received at the nodes of an input layer of a feedforward network are propagated (i.e., “fed forward”) to the nodes of the output layer via an activation function that calculates the states of the nodes of each successive layer in the network based on coefficients (“weights”) respectively associated with each of the edges connecting the layers” teaches data and coefficients (corresponds to the learning parameter) being propagated from the input layer to the output layer through an activation function to output calculation results to successive layer (corresponds to the subsequent layer) in the network).
… adjust the first learning parameter, the second learning parameter, and the third learning parameter when the training data is given to the first input layer of the first neural network after the structure thereof is determined (Kalamkar et al., FIG. 14C and Col. 35 Lines 1-17, “As shown in FIG. 14C, hybrid parallelism can be performed in which a partitioning is performed across activations and weights to minimize skewed matrices. For a layer of a neural network, the input data 1402, weight data 1404, and/or activation data 1406 is partitioned and distributed across multiple compute nodes (e.g., Node 0-Node 3). Node 0 receives a first block of input data 1402A and weight data 1404A. Compute operations are performed at Node 0 to generate a first partial activation 1406A. Likewise, Node1 receives a second block of input data 1402B and weight data 1404B. Compute operations are performed at Node 1 to generate a second partial activation 1406B. Node 2 can perform compute operations on third input data 1402C and weight data 1406C to generate a third partial activation 1406C” teaches input data, activation data, and weight data (corresponds to learning parameter) being distributed across Node 0-Node 3 (corresponds to the first-third learning parameters) for a layer of the neural network (corresponds to the input layer)).
Kalamkar et al. in view of Tuske et al. does not appear to explicitly teach output the hyperparameter from the second output layer by giving the training data to the second input layer of the second neural network after the fourth learning parameter is adjusted and determine a structure of the first neural network based on the hyperparameter.
However, Mendoza et al. teaches output the hyperparameter from the second output layer by giving the training data to the second input layer of the second neural network after the fourth learning parameter is adjusted (Mendoza et al., Section 2 Pg. 59-60, “Following Bergstra et al. (2011) and Domhan et al. (2015), we distinguish between layer-independent network hyperparameters that control the architecture and training procedure and per-layer hyperparameters that are set for each layer. In total, we optimize 63 hyperparameters (see Table 1), using the same configuration space for all types of supervised learning (binary, multiclass and multilabel classification, as well as regression). Sparse datasets also share the same configuration space. (Since neural networks cannot handle datasets in sparse representation out of the box, we transform the data into a dense representation on a per-batch basis prior to feeding it to the neural network.) The per-layer hyperparameters of layer k are conditionally dependent on the number of layers being at least k. For practical reasons, we constrain the number of layers to be between one and six: firstly, we aim to keep the training time of a single configuration low, and secondly each layer adds eight per-layer hyperparameters to the configuration space, such that allowing additional layers would further complicate the configuration process. The most common way to optimize the internal weights of neural networks is via stochastic gradient descent (SGD) using partial derivatives calculated with backpropagation. Standard SGD crucially depends on the correct setting of the learning rate hyperparameter. To lessen this dependency, various algorithms (solvers) for stochastic gradient descent have been proposed.
We include the following well-known methods from the literature in the configuration space of Auto-Net: vanilla stochastic gradient descent (SGD), stochastic gradient descent with momentum (Momentum), Adam (Kingma and Ba, 2014), Adadelta (Zeiler, 2012), Nesterov momentum (Nesterov, 1983) and Adagrad (Duchi et al., 2011). Additionally, we used a variant of the vSGD optimizer from Schaul et al. (2014), dubbed “smorm”, in which the estimate of the Hessian is replaced by an estimate of the squared gradient (calculated as in the RMSprop procedure). Each of these methods comes with a learning rate α and an own set of hyperparameters, for example Adam’s momentum vectors β1 and β2. Each solver’s hyperparameter(s) are only active if the corresponding solver is chosen” teaches the training of the neural network. Each training iteration creates a neural network (corresponds to the second neural network). Mendoza et al. further teaches optimizing the internal weights of neural networks (corresponds to adjusting the fourth learning parameter)).
determine a structure of the first neural network based on the hyperparameter (Mendoza et al., Table 1 and Section 2 Pg. 59, “Following Bergstra et al. (2011) and Domhan et al. (2015), we distinguish between layer-independent network hyperparameters that control the architecture and training procedure and per-layer hyperparameters that are set for each layer. In total, we optimize 63 hyperparameters (see Table 1), using the same configuration space for all types of supervised learning (binary, multiclass and multilabel classification, as well as regression). Sparse datasets also share the same configuration space” teaches optimizing hyperparameters that corresponds to the structure in Table 1 for the neural networks (corresponds to the first neural network)).
Kalamkar et al. in view of Tuske et al. in view of Mendoza et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Kalamkar et al. and Tuske et al. with Mendoza et al., with motivation to output the hyperparameter from the second output layer by giving the training data to the second input layer of the second neural network after the fourth learning parameter is adjusted and determine a structure of the first neural network based on the hyperparameter. “In this work, we present a first version of AutoNet, which provides automatically-tuned feed-forward neural networks without any human intervention. We report results on datasets from the recent AutoML challenge showing that ensembling Auto-Net with Auto-sklearn can perform better than either approach alone and report the first results on winning competition datasets against human experts with automatically-tuned neural networks” (Mendoza et al., Abstract). The proposed teaching is beneficial in that it provides automatically-tuned feed-forward neural networks without any human intervention and performs better than either approach alone.
Regarding Claim 7,
Kalamkar et al. in view of Tuske et al. in view of Mendoza et al. teaches the data analysis apparatus according to claim 6
Mendoza et al. further teaches wherein the hyperparameter is to determine a pattern of elements constituting the first neural network (Mendoza et al., Table 1 and Section 2 Pg. 59, “Following Bergstra et al. (2011) and Domhan et al. (2015), we distinguish between layer-independent network hyperparameters that control the architecture and training procedure and per-layer hyperparameters that are set for each layer. In total, we optimize 63 hyperparameters (see Table 1), using the same configuration space for all types of supervised learning (binary, multiclass and multilabel classification, as well as regression). Sparse datasets also share the same configuration space” teaches utilizing the hyperparameters to determine the structure in Table 1 for the neural networks (corresponds to the first neural network) that is made up of a specific pattern of elements).
Kalamkar et al. in view of Tuske et al. in view of Mendoza et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Kalamkar et al. and Tuske et al. with Mendoza et al., with motivation to output the hyperparameter from the second output layer by giving the training data to the second input layer of the second neural network after the fourth learning parameter is adjusted and determine a structure of the first neural network based on the hyperparameter. “In this work, we present a first version of AutoNet, which provides automatically-tuned feed-forward neural networks without any human intervention. We report results on datasets from the recent AutoML challenge showing that ensembling Auto-Net with Auto-sklearn can perform better than either approach alone and report the first results on winning competition datasets against human experts with automatically-tuned neural networks” (Mendoza et al., Abstract). The proposed teaching is beneficial in that it provides automatically-tuned feed-forward neural networks without any human intervention and performs better than either approach alone.
Regarding Claim 8,
Kalamkar et al. in view of Tuske et al. in view of Mendoza et al. teaches the data analysis apparatus according to claim 7
Mendoza et al. further teaches wherein said hyperparameter that is to determine the pattern is a parameter indicating a type of the first activation function (Mendoza et al., Table 1 and Section 2 Pg. 59, “Following Bergstra et al. (2011) and Domhan et al. (2015), we distinguish between layer-independent network hyperparameters that control the architecture and training procedure and per-layer hyperparameters that are set for each layer. In total, we optimize 63 hyperparameters (see Table 1), using the same configuration space for all types of supervised learning (binary, multiclass and multilabel classification, as well as regression). Sparse datasets also share the same configuration space” teaches the per-layer-hyperparameters that indicate activation-type (corresponds to the first activation function)).
Kalamkar et al. in view of Tuske et al. in view of Mendoza et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Kalamkar et al. and Tuske et al. with Mendoza et al., with motivation wherein said hyperparameter that is to determine the pattern is a parameter indicating a type of the first activation function. “In this work, we present a first version of AutoNet, which provides automatically-tuned feed-forward neural networks without any human intervention. We report results on datasets from the recent AutoML challenge showing that ensembling Auto-Net with Auto-sklearn can perform better than either approach alone and report the first results on winning competition datasets against human experts with automatically-tuned neural networks” (Mendoza et al., Abstract). The proposed teaching is beneficial in that it provides automatically-tuned feed-forward neural networks without any human intervention and performs better than either approach alone.
Regarding Claim 9,
Kalamkar et al. in view of Tuske et al. in view of Mendoza et al. teaches the data analysis apparatus according to claim 6, 
Mendoza et al. further teaches wherein the hyperparameter is to determine a sequence of elements constituting the first neural network (Mendoza et al., Table 1 and Section 2 Pg. 59, “Following Bergstra et al. (2011) and Domhan et al. (2015), we distinguish between layer-independent network hyperparameters that control the architecture and training procedure and per-layer hyperparameters that are set for each layer. In total, we optimize 63 hyperparameters (see Table 1), using the same configuration space for all types of supervised learning (binary, multiclass and multilabel classification, as well as regression). Sparse datasets also share the same configuration space” teaches optimizing hyperparameters that determine the structure in Table 1 for the neural networks (corresponds to the first neural network) that consists of a sequence of elements).
Kalamkar et al. in view of Tuske et al. in view of Mendoza et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Kalamkar et al. and Tuske et al. with Mendoza et al., with motivation wherein the hyperparameter is to determine a sequence of elements constituting the first neural network. “In this work, we present a first version of AutoNet, which provides automatically-tuned feed-forward neural networks without any human intervention. We report results on datasets from the recent AutoML challenge showing that ensembling Auto-Net with Auto-sklearn can perform better than either approach alone and report the first results on winning competition datasets against human experts with automatically-tuned neural networks” (Mendoza et al., Abstract). The proposed teaching is beneficial in that it provides automatically-tuned feed-forward neural networks without any human intervention and performs better than either approach alone.
Regarding Claim 10,
Kalamkar et al. in view of Tuske et al. in view of Mendoza et al. teaches the data analysis apparatus according to claim 9, 
Mendoza et al. further teaches wherein said hyperparameter that is to determine the sequence is a parameter indicating the number of layers in the first intermediate layer (Mendoza et al., Table 1 and Section 2 Pg. 59-60, “Following Bergstra et al. (2011) and Domhan et al. (2015), we distinguish between layer-independent network hyperparameters that control the architecture and training procedure and per-layer hyperparameters that are set for each layer. In total, we optimize 63 hyperparameters (see Table 1), using the same configuration space for all types of supervised learning (binary, multiclass and multilabel classification, as well as regression). Sparse datasets also share the same configuration space. (Since neural networks cannot handle datasets in sparse representation out of the box, we transform the data into a dense representation on a per-batch basis prior to feeding it to the neural network.) The per-layer hyperparameters of layer k are conditionally dependent on the number of layers being at least k. For practical reasons, we constrain the number of layers to be between one and six: firstly, we aim to keep the training time of a single configuration low, and secondly each layer adds eight per-layer hyperparameters to the configuration space, such that allowing additional layers would further complicate the configuration process. The most common way to optimize the internal weights of neural networks is via stochastic gradient descent (SGD) using partial derivatives calculated with backpropagation. Standard SGD crucially depends on the correct setting of the learning rate hyperparameter. To lessen this dependency, various algorithms (solvers) for stochastic gradient descent have been proposed.
We include the following well-known methods from the literature in the configuration space of Auto-Net: vanilla stochastic gradient descent (SGD), stochastic gradient descent with momentum (Momentum), Adam (Kingma and Ba, 2014), Adadelta (Zeiler, 2012), Nesterov momentum (Nesterov, 1983) and Adagrad (Duchi et al., 2011). Additionally, we used a variant of the vSGD optimizer from Schaul et al. (2014), dubbed “smorm”, in which the estimate of the Hessian is replaced by an estimate of the squared gradient (calculated as in the RMSprop procedure). Each of these methods comes with a learning rate α and an own set of hyperparameters, for example Adam’s momentum vectors β1 and β2. Each solver’s hyperparameter(s) are only active if the corresponding solver is chosen” teaches the structure of the neural network with the first intermediate layer. Mendoza et al. further teaches hyperparameters that determine the number of layers (corresponds to the first intermediate layer)).
Kalamkar et al. in view of Tuske et al. in view of Mendoza et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Kalamkar et al. and Tuske et al. with Mendoza et al., with motivation wherein said hyperparameter that is to determine the sequence is a parameter indicating the number of layers in the first intermediate layer. “In this work, we present a first version of AutoNet, which provides automatically-tuned feed-forward neural networks without any human intervention. We report results on datasets from the recent AutoML challenge showing that ensembling Auto-Net with Auto-sklearn can perform better than either approach alone and report the first results on winning competition datasets against human experts with automatically-tuned neural networks” (Mendoza et al., Abstract). The proposed teaching is beneficial in that it provides automatically-tuned feed-forward neural networks without any human intervention and performs better than either approach alone.

Response to Arguments
Applicant's arguments filed 08/05/2022 with respect to the 35 U.S.C. 103 rejection of claims 1-12 have been fully considered but they are not persuasive. Applicant asserts that “The proposed Kalamkar-Sawada-Tuske combination fails to disclose, teach, or suggest all the limitations of independent Claim 1. Specifically, Kalamkar and Sawada fail to teach generating degenerated output data and setting a weight of each layer in the first intermediate layer based on the degenerated data as required by claim 1” (Remarks, Pg. 9).
Examiner’s Response:
The Examiner respectfully disagrees. Kalamkar et al. teaches “generate degenerated output data based on the calculation result from each of the first intermediate layer” (Kalamkar et al., Col. 23 Lines 41-45, “Training a neural network involves selecting a network topology, using a set of training data representing a problem being modeled by the network, and adjusting the weights until the network model performs with a minimal error for all instances of the training data set” teaches training the neural network (corresponds to generating the degenerated output data) based on a set of training data (corresponds to the calculation result from each of the first intermediate layer). Col. 23 Lines 46-56, “during a supervised learning training process for a neural network, the output produced by the network in response to the input representing an instance in a training data set is compared to the “correct” labeled output for that instance, an error signal representing the difference between the output and the labeled output is calculated, and the weights associated with the connections are adjusted to minimize that error as the error signal is backward propagated through the layers of the network. The network is considered “trained” when the errors for each of the outputs generated from the instances of the training data set are minimized” teaches a training process where the weights are adjusted (corresponds to set a weight) based on the training dataset and calculated output (corresponds to the degenerated output data and a second learning parameter)). This teaches that the weight is indeed set during the training phase based on the degenerated output data that was calculated.
Applicant asserts that “The Examiner further alleges that Sawada "teaches determining the weighted sum from utilizing the units in the output layer and hidden layer (corresponds to first intermediate layer), by using the weights in response of the element values of output data)." Office Action at 16. However, the weights in Sawada each correspond to an element value of input data (x1, x2 ...) and an element value of output data (y1, y2 ...). The weights in Sawada do not correspond to "each layer in the first intermediate layer" and are not based on the degenerated output data” (Remarks, Pg. 9).
Examiner’s Response:
The Examiner agrees that Sawada et al. does not teach “weight the calculation result with the weight of each layer of the first intermediate layer based on the degenerated output data”. However, Kalamkar et al., Col. 23 Lines 20-38, “A feedforward network may be implemented as an acyclic graph in which the nodes are arranged in layers. Typically, a feedforward network topology includes an input layer and an output layer that are separated by at least one hidden layer. The hidden layer transforms input received by the input layer into a representation that is useful for generating output in the output layer. The network nodes are fully connected via edges to the nodes in adjacent layers, but there are no edges between nodes within each layer. Data received at the nodes of an input layer of a feedforward network are propagated (i.e., “fed forward”) to the nodes of the output layer via an activation function that calculates the states of the nodes of each successive layer in the network based on coefficients (“weights”) respectively associated with each of the edges connecting the layers. Depending on the specific model being represented by the algorithm being executed, the output from the neural network algorithm can take various forms” teaches data and coefficients (corresponds to weight) being propagated from the input layer to the output layer through an activation function to output calculation results to each successive layer (corresponds to the subsequent layer) in the network. Col. 23 Lines 46-56, “during a supervised learning training process for a neural network, the output produced by the network in response to the input representing an instance in a training data set is compared to the “correct” labeled output for that instance, an error signal representing the difference between the output and the labeled output is calculated, and the weights associated with the connections are adjusted to minimize that error as the error signal is backward propagated through the layers of the network. The network is considered “trained” when the errors for each of the outputs generated from the instances of the training data set are minimized” teaches a training phase for determining the weight based on the training dataset and calculated output (corresponds to the degenerated output data).

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Henry T Nguyen whose telephone number is (571)272-8860. The examiner can normally be reached Monday-Friday 8:00am-5:30pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kamran Afshar, can be reached at (571) 272-7796. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/HENRY TRONG NGUYEN/
Examiner, Art Unit 2125
/BRIAN M SMITH/
Primary Examiner, Art Unit 2122