Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This is the initial Office action issued in response to patent application 15/906,807, filed on 02/27/2018. Claims 1-12, as originally filed, are currently pending and have been considered below. Claims 1, 11, and 12 are independent claims.

Information Disclosure Statement
The information disclosure statement (IDS) was submitted on 02/27/2018. The submission is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Priority
Receipt is acknowledged of certified copies of papers required by 37 CFR 1.55.
Applicant cannot rely upon the certified copy of the foreign priority application to overcome this rejection because a translation of said application has not been made of record in accordance with 37 CFR 1.55. See MPEP §§ 215 and 216.
In particular, Applicant is reminded of the requirements set forth in 37 C.F.R. 1.55(g)(3)-(4), Claim for foreign priority:
“(3) An English language translation of a non-English language foreign application is not required except:
(i) When the application is involved in an interference (see § 41.202 of this chapter) or derivation (see part 42 of this chapter) proceeding;
(ii) When necessary to overcome the date of a reference relied upon by the examiner; or
(iii) When specifically required by the examiner.
(4) If an English language translation of a non-English language foreign application is required, it must be filed together with a statement that the translation of the certified copy is accurate” (emphasis added).
	Since an English language translation of Application No. JP 2017-083608 has not been made of record, the Examiner notes that prior art references with filing date or publication date prior to the instant Application’s filing date of 02/27/2018 are considered applicable prior art references.

Specification
The disclosure is objected to because of the following informalities: 
Specification page 9, line 14, “The learning unit 260” should be “The learning unit 261”. 
Appropriate correction is required.

Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function.

Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. 
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder 

Claim 1: 
a setting unit configured to receive output data from the first input layer, set a weight of each layer in the first intermediate layer based on the output data and a second learning parameter, and output said weight to the first output layer (Specification page 11, lines 9-13, reiterates the function but does not describe the structure)
a weight processing unit included in the first output layer, the weight processing unit being configured to weight each output data with the weight of each layer of the first intermediate layer that was set by the setting unit (Specification page 11, lines 14-19, reiterates the function but does not describe the structure)
a calculation unit included in the first output layer, the calculation unit being configured to calculate prediction data based on each output data that was weighted by the weight processing unit and a third learning parameter (Specification page 11, lines 15-19, reiterates the function but does not describe the structure)
Claim 3:
a first degeneration unit configured to receive output data from each first intermediate layer, reduce the number of dimensions of each output data, and
Claim 4:
a learning unit configured to adjust the first learning parameter, the second learning parameter, and the third learning parameter when training data is given to the first input layer (Specification page 10 and lines 17-19 reiterates the function, but does not provide description of the structure)
Claim 5:
a second degeneration unit configured to receive output data from each first intermediate layer, reduce the number of dimensions of each output data, and output each degenerated output data to the weight processing unit (Specification page 28 and lines 12-15 reiterates the function, but does not provide description of the structure)

Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may:  (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the 

Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

The following is a quotation of the first paragraph of pre-AIA  35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.

Claims 1-10 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA ), first paragraph, as failing to comply with the written description requirements. The claim(s) contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA  35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention.

Each of the limitations in claims 1-10 that contains the following generic placeholders:
setting unit
weight processing unit
calculation unit 
first degeneration unit
learning unit
second degeneration unit

invokes 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. However, the written description fails to disclose the corresponding structure, material, or acts for performing the entire claimed function and to clearly link the structure, material, or acts to the function. In particular, the corresponding description found in the Specification (see Section 8 of the Office Action) for each of the generic placeholders listed above substantially reiterates the claim language and does not describe the structure that performs the corresponding function. Therefore, each of claims 1-10 is rejected under 35 U.S.C. 112(a) for lack of written description. See MPEP 2181 (IV) (“the means- (or step-) plus- function claim must still be analyzed to determine whether there exists corresponding adequate support for such claim limitation under 35 U.S.C. 112(a) or pre-AIA  35 U.S.C. 112, first paragraph. In considering whether there is 35 U.S.C. 112(a) or pre-AIA  35 U.S.C. 112, first paragraph support for the claim limitation, the examiner must consider whether the specification describes the claimed invention in sufficient detail to establish that the inventor or joint inventor(s) had possession of the claimed invention as of the application’s filing date”).

The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.



Claims 1-10 are rejected under 35 U.S.C. 112(b) or pre-AIA  35 U.S.C. 112, second paragraph, as being indefinite or failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention.

	Each of the limitations in claims 1-10 that contains the following generic placeholders:
setting unit
weight processing unit
calculation unit 
first degeneration unit
learning unit
second degeneration unit

invokes 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. However, the written description fails to disclose the corresponding structure, material, or acts for performing the entire claimed function and to clearly link the structure, material, or acts to the function. In particular, the corresponding description found in the Specification (see Section 8 of the Office Action) of each of the generic placeholders listed above substantially reiterates the claim language and does not provide description of the 
Applicant may:
(a)        Amend the claim so that the claim limitation will no longer be interpreted as a limitation under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph; 
(b)        Amend the written description of the specification such that it expressly recites what structure, material, or acts perform the entire claimed function, without introducing any new matter (35 U.S.C. 132(a)); or 
(c)        Amend the written description of the specification such that it clearly links the structure, material, or acts disclosed therein to the function recited in the claim, without introducing any new matter (35 U.S.C. 132(a)).
If applicant is of the opinion that the written description of the specification already implicitly or inherently discloses the corresponding structure, material, or acts and clearly links them to the function so that one of ordinary skill in the art would recognize what structure, material, or acts perform the claimed function, applicant should clarify the record by either: 
(a)        Amending the written description of the specification such that it expressly recites the corresponding structure, material, or acts for performing the claimed function and clearly links or associates the structure, material, or acts to the claimed function, without introducing any new matter (35 U.S.C. 132(a)); or 
(b)        Stating on the record what the corresponding structure, material, or acts, which are implicitly or inherently set forth in the written description of the specification, 

Each of the dependent claims is rejected based on the same rationale as the claim from which it depends.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2, 4, and 11-12 are rejected under 35 U.S.C. 103 as being unpatentable over Kalamkar et al. (US 11094029 B2) in view of Sawada et al. (US 10832128 B2) and further in view of Tuske et al. (“Integrating Gaussian Mixtures Into Deep Neural Networks: Softmax Layer With Hidden Variables”).
Regarding Claim 1,
Kalamkar et al. teaches a data analysis apparatus using a first neural network that includes a first input layer, a first output layer, and a first intermediate layer having at least two layers between the first input layer and the first output layer, the first intermediate layer being configured to give data from a previous layer and a first learning parameter to a first activation function for calculation and output a calculation result to a subsequent layer, the data analysis apparatus comprising (Kalamkar et al., Col. 23 Lines 20-38, “A feedforward network may be implemented as an acyclic graph in which the nodes are arranged in layers. Typically, a feedforward network topology includes an input layer and an output layer that are separated by at least one hidden layer. The hidden layer transforms input received by the input layer into a representation that is useful for generating output in the output layer. The network nodes are fully connected via edges to the nodes in adjacent layers, but there are no edges between nodes within each layer. Data received at the nodes of an input layer of a feedforward network are propagated (i.e., “fed forward”) to the nodes of the output layer via an activation function that calculates the states of the nodes of each successive layer in the network based on coefficients (“weights”) respectively associated with each of the edges connecting the layers. Depending on the specific model being represented by the algorithm being executed, the output from the neural network algorithm can take various forms” teaches a feedforward network (corresponds to the first neural network) that includes an input layer and an output layer separated by at least one hidden layer (corresponds to the first intermediate layer) in between. Kalamkar et al. also further teaches data and coefficients (corresponds to the learning parameter) being propagated from the input layer to the output layer through an activation function to output calculation results to successive layer (corresponds to the subsequent layer) in the network. Col. 27 Lines 11-16, “The exemplary neural networks described above can be used to perform deep learning. Deep learning is machine learning using deep neural networks. The deep neural networks used in deep learning are artificial neural networks composed of multiple hidden layers, as opposed to shallow neural networks that include only a single hidden layer” teaches utilizing the feedforward neural network to perform deep learning for deep neural networks. Deep neural networks are composed of multiple hidden layers (corresponds to the two layers in between the input layer and output layer)).
a setting unit configured to receive output data from the first input layer, set a weight of each layer in the first intermediate layer based on the output data and a second learning parameter, and output said weight to the first output layer (Kalamkar et al., Col. 23 Lines 25-35, “The hidden layer transforms input received by the input layer into a representation that is useful for generating output in the output layer. The network nodes are fully connected via edges to the nodes in adjacent layers, but there are no edges between nodes within each layer. Data received at the nodes of an input layer of a feedforward network are propagated (i.e., “fed forward”) to the nodes of the output layer via an activation function that calculates the states of the nodes of each successive layer in the network based on coefficients (“weights”) respectively associated with each of the edges connecting the layers” teaches receiving data from the input layer to set the representation (corresponds to weight) of the edges for
Kalamkar et al. does not appear to explicitly teach a weight processing unit included in the first output layer, the weight processing unit being configured to weight each output data with the weight of each layer of the first intermediate layer that was set by the setting unit
However, Sawada et al., teaches a weight processing unit included in the first output layer, the weight processing unit being configured to weight each output data with the weight of each layer of the first intermediate layer that was set by the setting unit (Sawada et al., Col. 9 Lines 5-12, “In the neural network apparatus 100, a weighted sum computation is performed by the units 105 in the hidden layers 102 and the output layer 103 by using the weight W=[w1, w2, . . . ] in response to the units 105 in the input layer 101 being fed with element values of input data X=[x1, x2, . . . ], and element values of output data Y=[y1, y2, . . . ] are output from the units 105 in the output layer 103” teaches determining the weighted sum from utilizing the units in the output layer and hidden layer (corresponds to first intermediate layer), by using the weights in response of the element values of output data).
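For illustration only, the weighted sum computation described in the quoted passage of Sawada et al. can be sketched as follows. This is a minimal sketch, not the reference's implementation: the function names, the layer sizes, and the numeric weights are hypothetical, and no bias or activation function is applied because the quoted passage mentions only the weighted sum.

```python
from typing import List

def weighted_sum_layer(x: List[float], W: List[List[float]]) -> List[float]:
    """Each unit j outputs the weighted sum y_j = sum_i W[j][i] * x[i]."""
    return [sum(w_ji * x_i for w_ji, x_i in zip(row, x)) for row in W]

def forward(x: List[float], layers: List[List[List[float]]]) -> List[float]:
    """Feed input data X through the hidden layers and then the output layer."""
    for W in layers:
        x = weighted_sum_layer(x, W)
    return x

# Hypothetical example values (assumptions, not taken from the reference).
X = [1.0, 2.0]                                     # input data X = [x1, x2]
hidden_W = [[0.5, -1.0], [2.0, 0.0], [1.0, 1.0]]   # one hidden layer with three units
output_W = [[1.0, 1.0, 1.0]]                       # output layer with one unit
Y = forward(X, [hidden_W, output_W])               # output data Y = [y1]
```

In this sketch the same routine serves the hidden layers and the output layer, which mirrors the passage's statement that the units in both perform the weighted sum using the weight W.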
Kalamkar et al. and Sawada et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify
Kalamkar et al. in view of Sawada et al. does not appear to explicitly teach a calculation unit included in the first output layer, the calculation unit being configured to calculate prediction data based on each output data that was weighted by the weight processing unit and a third learning parameter.
However, Tuske et al., teaches a calculation unit included in the first output layer, the calculation unit being configured to calculate prediction data based on each output data that was weighted by the weight processing unit and a third learning parameter (Tuske et al., Section 2.3 Para. 1, “Grouping the parameters of a state, Eq. 3 can be realized by already existing NN building blocks as a softmax layer followed by a sum-pooling over a region. In the case of maximum approximation the last layer becomes a max-pooling” teaches calculating the output of the output layer. Section 2.3 Para. 4, “Because of the huge softmax layer, the low-rank factorization of the last weight matrix through linear BN layer is inevitable as it was proposed also for NN with more than 10k outputs” teaches the weight matrix contributing to the output of the network. Eq. 1 and Section 2.1 Para. 1, “with model parameters θ = {w_s, b_s}, where w_s ∈ R^N and b_s ∈ R are state specific parameters. The f(x): R^M → R^N corresponds to the feature function such as linear, polynomial or any non-linear feature mapping, e.g. another tandem model [11, 12, 13, 14, 15]. Within the neural network framework Eq. 1 corresponds to the softmax output layer: w_s, b_s form the last weight matrix and bias vector, the rest of the network up to the output of the last hidden layer forms the feature function f” teaches the parameters (corresponds to the third learning parameter) contributing to the output of the network).
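For illustration only, the softmax output layer described in the quoted passage of Tuske et al. can be sketched as follows, assuming Eq. 1 has the standard log-linear form in which the score of state s is w_s · f(x) + b_s and the scores are normalized with a softmax. The function name and the numeric values are hypothetical and do not appear in the reference; the sum-pooling and max-pooling variants mentioned in Section 2.3 are not modeled.

```python
import math
from typing import List

def softmax_output(features: List[float],
                   weights: List[List[float]],
                   biases: List[float]) -> List[float]:
    """Posterior over states from the last-hidden-layer output f(x).

    weights[s] plays the role of w_s and biases[s] the role of b_s.
    """
    scores = [sum(w * f for w, f in zip(w_s, features)) + b_s
              for w_s, b_s in zip(weights, biases)]
    m = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical values: two states, two-dimensional feature vector f(x).
posteriors = softmax_output([1.0, 0.0],
                            [[0.0, 1.0], [0.0, 2.0]],
                            [0.0, 0.0])
```

Both states receive a score of zero for this input, so the sketch assigns them equal probability; changing w_s or b_s for one state shifts the prediction toward or away from that state.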
Kalamkar et al., Sawada et al., and Tuske et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Kalamkar et al. and Sawada et al. with Tuske et al., with motivation to have a calculation unit included in the first output layer, the calculation unit being configured to calculate prediction data based on each output data that was weighted by the weight processing unit and a third learning parameter. “On small scale, the joint training of tandem BN-GMM through generalized softmax layer always resulted in better recognition performance than any of our hybrid baselines. Furthermore, large scale experiments verified that the proposed BN-LMM
Regarding Claim 2,
Kalamkar et al. in view of Sawada et al. in view of Tuske et al. teaches the data analysis apparatus according to claim 1, 
Sawada et al. further teaches wherein the setting unit receives output data from the first input layer, sets a weight of each first intermediate layer based on the output data and the second learning parameter, and outputs said weight to the first output layer (Sawada et al., Col. 9 Lines 5-12, “In the neural network apparatus 100, a weighted sum computation is performed by the units 105 in the hidden layers 102 and the output layer 103 by using the weight W=[w1, w2, . . . ] in response to the units 105 in the input layer 101 being fed with element values of input data X=[x1, x2, . . . ], and element values of output data Y=[y1, y2, . . . ] are output from the units 105 in the output layer 103” teaches determining the weighted sum from utilizing the units in the output layer and hidden layers (corresponds to first intermediate layer), by using the weights in response of the element values of output data w2 (corresponds to the second learning parameter) to the output layer).
Kalamkar et al., Sawada et al., and Tuske et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the
Regarding Claim 4,
Kalamkar et al. in view of Sawada et al. in view of Tuske et al. teaches the data analysis apparatus according to claim 1, further comprising
Kalamkar et al. further teaches a learning unit configured to adjust the first learning parameter, the second learning parameter, and the third learning parameter when training data is given to the first input layer (Kalamkar et al., FIG. 14C and Col. 35 Lines 1-17, “As shown in FIG. 14C, hybrid parallelism can be performed in which a partitioning is performed across activations and weights to minimize skewed matrices. For a layer of a neural network, the input data 1402, weight data 1404, and/or activation data 1406 is partitioned and distributed across multiple compute nodes (e.g., Node 0-Node 3). Node 0 receives a first block of input data 1402A and weight data 1404A. Compute operations are performed at Node 0 to generate a first partial activation 1406A. Likewise, Node1 receives a second block of input data 1402B and weight data 1404B. Compute operations are performed at Node 1 to generate a second partial activation 1406B. Node 2 can perform compute operations on third input data 1402C and weight data 1406C to generate a third partial activation 1406C” teaches input data, activation data, and weight data (corresponds to learning parameter) being distributed across Node 0-Node 3 (corresponds to the first-third learning parameters) for a layer of the neural network (corresponds to the input layer)).
Regarding Claim 11,
Kalamkar et al. teaches a data analysis method using a first neural network that includes a first input layer, a first output layer, and a first intermediate layer having at least two layers between the first input layer and the first output layer, the first intermediate layer being configured to give data from a previous layer and a first learning parameter to a first activation function for calculation and output a calculation result to a subsequent layer (Kalamkar et al., Col. 23 Lines 20-38, “A feedforward network may be implemented as an acyclic graph in which the nodes are arranged in layers. Typically, a feedforward network topology includes an input layer and an output layer that are separated by at least one hidden layer. The hidden layer transforms input received by the input layer into a representation that is useful for generating output in the output layer. The network nodes are fully connected via edges to the nodes in adjacent layers, but there are no edges between nodes within each layer. Data received at the nodes of an input layer of a feedforward network are propagated (i.e., “fed forward”) to the nodes of the output layer via an activation function that calculates the states of the nodes of each successive layer in the network based on coefficients (“weights”) respectively associated with each of the edges connecting the layers. Depending on the specific model being represented by the algorithm being executed, the output from the neural network algorithm can take various forms” teaches a feedforward network (corresponds to the first neural network) that includes an input layer and an output layer separated by at least one hidden layer (corresponds to the first intermediate layer) in between. Kalamkar et al. also further teaches data and coefficients (corresponds to the learning parameter) being propagated from the input layer to the output layer through an activation function to output calculation results to successive layer (corresponds to the subsequent layer) in the network. Col. 27 Lines 11-16, “The exemplary neural networks described above can be used to perform deep learning. Deep learning is machine learning using deep neural networks. The deep neural networks used in deep learning are artificial neural networks composed of multiple hidden layers, as opposed to shallow neural networks that include only a single hidden layer” teaches utilizing the feedforward neural network to perform deep learning for deep neural networks. Deep neural networks are composed of multiple hidden layers (corresponds to the two layers in between the input layer and output layer)).
wherein the data analysis apparatus includes a processor and a storage device to store the first neural network wherein the processor is configured to conduct (Kalamkar et al., FIG. 1 and Col. 3 Lines 18-26, “FIG. 1 is a block diagram illustrating a computing system 100 configured to implement one or more aspects of the embodiments described herein. The computing system 100 includes a processing subsystem 101 having one or more processor(s) 102 and a system memory 104 communicating via an interconnection path that may include a memory hub 105. The memory hub 105 may be a separate component within a chipset component or may be integrated within the one or more processor(s) 102” teaches a processor and a system memory that stores a set of trainable machine learning parameters and a library to facilitate data transmission during distributed training of the neural network).
a setting process to receive output data from the first input layer, set a weight of each layer in the first intermediate layer based on the output data and a second learning parameter, and output said weight to the first output layer (Kalamkar et al., Col. 23 Lines 25-35, “The hidden layer transforms input received by the input layer into a representation that is useful for generating output in the output layer. The network nodes are fully connected via edges to the nodes in adjacent layers, but there are no edges between nodes within each layer. Data received at the nodes of an input layer of a feedforward network are propagated (i.e., “fed forward”) to the nodes of the output layer via an activation function that calculates the states of the nodes of each successive layer in the network based on coefficients (“weights”) respectively associated with each of the edges connecting the layers” teaches receiving data from the input layer to set the representation (corresponds to weight) of the edges for each layer based on the data propagated to the nodes of the
Kalamkar et al. does not appear to explicitly teach a weighting process to weight each output data with the weight of each layer of the first intermediate layer that was set in the setting process
However, Sawada et al., teaches a weighting process to weight each output data with the weight of each layer of the first intermediate layer that was set in the setting process (Sawada et al., Col. 9 Lines 5-12, “In the neural network apparatus 100, a weighted sum computation is performed by the units 105 in the hidden layers 102 and the output layer 103 by using the weight W=[w1, w2, . . . ] in response to the units 105 in the input layer 101 being fed with element values of input data X=[x1, x2, . . . ], and element values of output data Y=[y1, y2, . . . ] are output from the units 105 in the output layer 103” teaches determining the weighted sum from utilizing the units in the output layer and hidden layer (corresponds to first intermediate layer), by using the weights in response of the element values of output data).
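Kalamkar et al. in view of Sawada et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Kalamkar et al. with Sawada et al., with motivation to have a weighting process to weight each output data with the weight of each layer of the first intermediate layer that was set in the setting process. “Accordingly, a transfer learning apparatus is obtained which saves the time and effort for changing the configuration and weight values of the neural network apparatus by using the transfer target data items during transfer learning and which is free from unwanted effects, such as overfitting and a decrease in the recognition accuracy that may occur as a result of changing the configuration and the weight values” (Sawada et al., Col. 3 Lines 5-11).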
Kalamkar et al. in view of Sawada et al. does not appear to explicitly teach a calculation process to calculate prediction data based on each output data that was weighted in the weighting process and a third learning parameter.
However, Tuske et al., teaches a calculation process to calculate prediction data based on each output data that was weighted in the weighting process and a third learning parameter (Tuske et al., Section 2.3 Para. 1, “Grouping the parameters of a state, Eq. 3 can be realized by already existing NN building blocks as a softmax layer followed by a sum-pooling over a region. In the case of maximum approximation the last layer becomes a max-pooling” teaches calculating the output of the output layer. Section 2.3 Para. 4, “Because of the huge softmax layer, the low-rank factorization of the last weight matrix through linear BN layer is inevitable as it was proposed also for NN with more than 10k outputs” teaches the weight matrix contributing to the output of the network. Eq. 1 and Section 2.1 Para. 1, “with model parameters θ = {ws, bs}, where ws ∈ R N and bs ∈ R are state specific parameters. The f(x) : R M → R N corresponds to the feature function such as linear, polynomial or any non-linear feature mapping, e.g. another tandem model [11, 12, 13, 14, 15]. Within the neural network framework Eq. 1 corresponds to the softmax output layer: ws, bs form the last weight matrix and bias vector, the rest of the network up to the output of the last hidden layer forms the feature function f” teaches the parameters (corresponds to the third learning parameter) contributing to the output of the network).  
Kalamkar et al. in view of Sawada et al. in view of Tuske et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Kalamkar et al. and Sawada et al. with Tuske et al., with motivation to have a calculation process to calculate prediction data based on each output data that was weighted in the weighting process and a third learning parameter. “On small scale, the joint training of tandem BN-GMM through generalized softmax layer always resulted in better recognition performance than any of our hybrid baselines. Furthermore, large scale experiments verified that the proposed BN-LMM model with hidden variables could achieve similar performance with fewer output targets than a classic hybrid system” (Tuske et al., Conclusion). The proposed teaching is beneficial in that it results in better recognition performance and can achieve similar performance with fewer output targets.
Regarding Claim 12,
Kalamkar et al. teaches a non-transitory recording medium having stored thereon a data analysis program that causes a processor to conduct prescribed Kalamkar et al., Col. 64 Lines 37-43, “One or more aspects of at least one embodiment may be implemented by representative code stored on a machine-readable medium which represents and/or defines logic within an integrated circuit such as a processor. For example, the machine-readable medium may include instructions which represent various logic within the processor” teaches a non-transitory machine-readable medium within a processor. FIG. 1 and Col. 3 Lines 18-26, “FIG. 1 is a block diagram illustrating a computing system 100 configured to implement one or more aspects of the embodiments described herein. The computing system 100 includes a processing subsystem 101 having one or more processor(s) 102 and a system memory 104 communicating via an interconnection path that may include a memory hub 105. The memory hub 105 may be a separate component within a chipset component or may be integrated within the one or more processor(s) 102” teaches a processor and a system memory that stores a set of trainable machine learning parameters and a library to facilitate data transmission during distributed training of the neural network. Col. 23 Lines 20-38, “A feedforward network may be implemented as an acyclic graph in which the nodes are arranged in layers. Typically, a feedforward network topology includes an input layer and an output layer that are separated by at least one hidden layer. The hidden layer transforms input received by the input layer into a representation that is useful for generating output in the output layer. The network nodes are fully connected via edges to the nodes in adjacent layers, but there are no edges between nodes within each layer. 
Data received at the nodes of an input layer of a feedforward network are propagated (i.e., “fed forward”) to the nodes of the output layer via an activation function that calculates the states of the nodes of each successive layer in the network based on coefficients (“weights”) respectively associated with each of the edges connecting the layers. Depending on the specific model being represented by the algorithm being executed, the output from the neural network algorithm can take various forms” teaches a feedforward network (corresponds to the first neural network) that includes an input layer and an output layer separated by at least one hidden layer (corresponds to the first intermediate layer) in between. Kalamkar et al. also further teaches data and coefficients (corresponds to the learning parameter) being propagated from the input layer to the output layer through an activation function to output calculation results to successive layer (corresponds to the subsequent layer) in the network. Col. 27 Lines 11-16, “The exemplary neural networks described above can be used to perform deep learning. Deep learning is machine learning using deep neural networks. The deep neural networks used in deep learning are artificial neural networks composed of multiple hidden layers, as opposed to shallow neural networks that include only a single hidden layer” teaches utilizing the feedforward neural network 
a setting processing for receiving output data from the first input layer, set a weight of each layer in the first intermediate layer based on the output data and a second learning parameter, and output the weight to the first output layer (Kalamkar et al., Col. 23 Lines 25-35, “The hidden layer transforms input received by the input layer into a representation that is useful for generating output in the output layer. The network nodes are fully connected via edges to the nodes in adjacent layers, but there are no edges between nodes within each layer. Data received at the nodes of an input layer of a feedforward network are propagated (i.e., “fed forward”) to the nodes of the output layer via an activation function that calculates the states of the nodes of each successive layer in the network based on coefficients (“weights”) respectively associated with each of the edges connecting the layers” teaches receiving data from the input layer to set the representation (corresponds to weight) of the edges for each layer based on the data propagated to the nodes of the output layer from the input layer nodes and the coefficients (corresponds to the weights of the second learning parameter)).
Kalamkar et al. does not appear to explicitly teach a weighting processing for weighting each output data with the weight of each layer of the first intermediate layer that was set in the setting process.
Sawada et al., teaches a weighting processing for weighting each output data with the weight of each layer of the first intermediate layer that was set in the setting process (Sawada et al., Col. 9 Lines 5-12, “In the neural network apparatus 100, a weighted sum computation is performed by the units 105 in the hidden layers 102 and the output layer 103 by using the weight W=[w1, w2, . . . ] in response to the units 105 in the input layer 101 being fed with element values of input data X=[x1, x2, . . . ], and element values of output data Y=[y1, y2, . . . ] are output from the units 105 in the output layer 103” teaches determining the weighted sum from utilizing the units in the output layer and hidden layer (corresponds to first intermediate layer), by using the weights in response of the element values of output data).
Kalamkar et al. in view of Sawada et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Kalamkar et al. with Sawada et al., with motivation to have a weighting processing for weighting each output data with the weight of each layer of the first intermediate layer that was set in the setting process. “Accordingly, a transfer learning apparatus is obtained which saves the time and effort for changing the configuration and weight values of the neural network apparatus by using the transfer target data items during transfer learning and which is free from unwanted effects, such as overfitting and a decrease in the recognition accuracy that may occur as a result of changing the configuration and the weight values” (Sawada et al., Col. 3 Lines 5-11).
Kalamkar et al. in view of Sawada et al. does not appear to explicitly teach a calculation processing for calculating prediction data based on each output data that was weighted in the weighting process and a third learning parameter.
However, Tuske et al., teaches a calculation processing for calculating prediction data based on each output data that was weighted in the weighting process and a third learning parameter (Tuske et al., Section 2.3 Para. 1, “Grouping the parameters of a state, Eq. 3 can be realized by already existing NN building blocks as a softmax layer followed by a sum-pooling over a region. In the case of maximum approximation the last layer becomes a max-pooling” teaches calculating the output of the output layer. Section 2.3 Para. 4, “Because of the huge softmax layer, the low-rank factorization of the last weight matrix through linear BN layer is inevitable as it was proposed also for NN with more than 10k outputs” teaches the weight matrix contributing to the output of the network. Eq. 1 and Section 2.1 Para. 1, “with model parameters θ = {ws, bs}, where ws ∈ R N and bs ∈ R are state specific parameters. The f(x) : R M → R N corresponds to the feature function such as linear, polynomial or any non-linear feature mapping, e.g. another tandem model [11, 12, 13, 14, 15]. Within the neural network framework Eq. 1 corresponds to the softmax output layer: ws, bs form the last weight matrix and bias vector, the rest of the network up to the output of the last hidden layer forms the feature function f” teaches the parameters (corresponds to the third learning parameter) contributing to the output of the network).
Kalamkar et al. in view of Sawada et al. in view of Tuske et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Kalamkar et al. and Sawada et al. with Tuske et al., with motivation to have a calculation processing for calculating prediction data based on each output data that was weighted in the weighting process and a third learning parameter. “On small scale, the joint training of tandem BN-GMM through generalized softmax layer always resulted in better recognition performance than any of our hybrid baselines. Furthermore, large scale experiments verified that the proposed BN-LMM model with hidden variables could achieve similar performance with fewer output targets than a classic hybrid system” (Tuske et al., Conclusion). The proposed teaching is beneficial in that it results in better recognition performance and can achieve similar performance with fewer output targets.
Claims 3 and 5 are rejected under 35 U.S.C. 103 as being unpatentable over Kalamkar et al. in view of Sawada et al. in view of Tuske et al. and in further view of Kasahara (US 20170147921 A1).
Regarding Claim 3,
Kalamkar et al. in view of Sawada et al. in view of Tuske et al. teaches the data analysis apparatus according to claim 1, further comprising
Kalamkar et al. further teaches a first degeneration unit configured to receive output data from each first intermediate layer, reduce the number of dimensions of each output data, and output each degenerated output data to the setting unit (Kalamkar et al., Col. 29 Lines 46-52, “A deep belief network (DBN) is a generative neural network that is composed of multiple layers of stochastic (random) variables. DBNs can be trained layer-by-layer using greedy unsupervised learning. The learned weights of the DBN can then be used to provide pre-train neural networks by determining an optimal initial set of weights for the neural network” teaches training layer by layer (corresponds to the first intermediate layer) using unsupervised training. Col. 30 Lines 28-35, “Unsupervised training can be used to generate a self-organizing map, which is a type of trained neural network 1107 capable of performing operations useful in reducing the dimensionality of data. Unsupervised training can also be used to perform anomaly detection, which allows the identification of data points in an input dataset that deviate from the normal patterns of the data” teaches the unsupervised training utilized for reducing the dimensionality of data).
Kalamkar et al. in view of Sawada et al. in view of Tuske et al. does not appear to explicitly teach wherein the setting unit receives each output data that was degenerated by the first degeneration unit, sets a weight of each first intermediate layer based on said degenerated output data and the second learning parameter, and outputs the weight to the first output layer.
However, Kasahara teaches wherein the setting unit receives each output data that was degenerated by the first degeneration unit, sets a weight of each first intermediate layer based on said degenerated output data and the second learning parameter, and outputs the weight to the first output layer (Kasahara, Para. [0039], “the learning performing unit 24 causes a stacked autoencoder to learn (i.e., optimize) parameters (e.g., weight parameters between layers) used in the multilayer neural network, by backpropagation” teaches setting weight parameters between layers (corresponds to the first intermediate layer in the middle to the output layer) based on the optimized weight parameter (corresponds to the second learning parameter). FIG. 5 and Para. [0041], “As illustrated in FIG. 5, an autoencoder is known as a method for dimensionality reduction (or dimensionality compression) using the neural network 20. An autoencoder can reduce the number of neurons in a middle layer to become smaller than the dimensionality in an input layer, thereby achieving dimensionality reduction so that the input data is reconstructed with less dimensionality” teaches dimensionality reduction method (corresponds to degenerated output data) utilizing the neural network).
Kalamkar et al. in view of Sawada et al. in view of Tuske et al. in view of Kasahara are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Kalamkar et al., Sawada et al., and Tuske et al. with Kasahara, with motivation wherein the setting unit receives each output data that was degenerated by the first degeneration unit, sets a weight of each first intermediate layer based on said degenerated output data and the second learning parameter, and outputs the weight to the first output layer. “An 
Regarding Claim 5,
Kalamkar et al. in view of Sawada et al. in view of Tuske et al. in view of Kasahara teaches the data analysis apparatus according to claim 3, further comprising
Kalamkar et al. further teaches a second degeneration unit configured to receive output data from each first intermediate layer, reduce the number of dimensions of each output data, and output each degenerated output data to the weight processing unit (Kalamkar et al., Col. 29 Lines 46-52, “A deep belief network (DBN) is a generative neural network that is composed of multiple layers of stochastic (random) variables. DBNs can be trained layer-by-layer using greedy unsupervised learning. The learned weights of the DBN can then be used to provide pre-train neural networks by determining an optimal initial set of weights for the neural network” teaches training layer by layer (corresponds to the first intermediate layer) using unsupervised training. Col. 30 Lines 28-35, “Unsupervised training can be used to generate a self-organizing map, which is a type of trained neural network 1107 capable of performing operations useful in reducing the dimensionality of data. Unsupervised training can also be used to perform anomaly detection, which allows the identification of data points in an input dataset that deviate from the normal patterns of the data” teaches the unsupervised training utilized for reducing the dimensionality of data).
Kasahara further teaches wherein the weight processing unit weights each degenerated output data from the second degeneration unit based on the weight of each first intermediate layer (Kasahara, Para. [0041], “As illustrated in FIG. 5, an autoencoder is known as a method for dimensionality reduction (or dimensionality compression) using the neural network 20. An autoencoder can reduce the number of neurons in a middle layer to become smaller than the dimensionality in an input layer, thereby achieving dimensionality reduction so that the input data is reconstructed with less dimensionality” teaches dimensionality reduction using the neural network for each layer. The input data (corresponds to the output data of the previous layer) is reconstructed with less dimensionality. Para. [0039], “Specifically, the learning performing unit 24 causes a stacked autoencoder to learn (i.e., optimize) parameters (e.g., weight parameters between layers) used in the multilayer neural network, by backpropagation” teaches the autoencoder optimizing the weight parameters between layers (corresponds to the first intermediate layer)).
Kalamkar et al. in view of Sawada et al. in view of Tuske et al. in view of Kasahara are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Kalamkar et al., Sawada et al., and Tuske et al. with Kasahara, with motivation wherein the weight processing unit weights each degenerated output data from the second degeneration unit based on the weight of each first intermediate layer. “An embodiment has an object .
Claims 6-10 are rejected under 35 U.S.C. 103 as being unpatentable over Kalamkar et al. in view of Sawada et al. in view of Tuske et al. and in further view of Mendoza et al. (“Towards Automatically-Tuned Neural Networks”).
Regarding Claim 6,
Kalamkar et al. in view of Sawada et al. in view of Tuske et al. teaches the data analysis apparatus according to claim 4, wherein the learning unit is configured to
Kalamkar et al. further teaches adjust the fourth learning parameter using a second neural network including a second input layer that receives the training data, a second output layer that outputs a hyperparameter of the first neural network, and a second intermediate layer interposed between the second input layer and the second output layer, the second intermediate layer being configured to give data from a previous layer and a fourth learning parameter to a second activation function for calculation and output a calculation result to a subsequent layer, when the training data is given to the second input layer (Kalamkar et al., Col. 27 Lines 37-51, “Once the neural network is structured, a learning model can be applied to the network to train the network to perform specific tasks. The learning model describes how to adjust the weights within the model to reduce the output error of the network. Backpropagation of errors is a common method used to train neural networks. An input vector is presented to the network for processing. The output of the network is compared to the desired output using a loss function and an error value is calculated for each of the neurons in the output layer. The error values are then propagated backwards until each neuron has an associated error value which roughly represents its contribution to the original output. The network can then learn from those errors using an algorithm, such as the stochastic gradient descent algorithm, to update the weights of the of the neural network” teaches adjusting the weights associated with connection (corresponds to the fourth learning parameter) to minimize error of output generated from propagation backwards to train neural networks (corresponds to the second neural network). Col. 26 Lines 59-67 and Col. 27 Lines 1-2, “Recurrent neural networks (RNNs) are a family of feedforward neural networks that include feedback connections between layers. 
RNNs enable modeling of sequential data by sharing parameter data across different parts of the neural network. The architecture for a RNN includes cycles. The cycles represent the influence of a present value of a variable on its own value at a future time, as at least a portion of the output data from the RNN is used as feedback for processing subsequent input in a sequence. This feature makes RNNs particularly useful for language processing due to the variable nature in which language data can be composed” teaches the Recurrent Neural Network (RNN) that consist of an input layer and an output layer (corresponds to the second input and output layer) separated by two hidden layers (corresponds to the second intermediate layer) in between for feedback. Col. 23 Lines 29-35, “Data received at the nodes of an input layer of a feedforward network are propagated (i.e., “fed forward”) to the nodes of the output layer via an activation function that calculates the states of the nodes of each successive layer in the network based on coefficients (“weights”) respectively associated with each of the edges connecting the layers” teaches data and coefficients (corresponds to the learning parameter) being propagated from the input layer to the output layer through an activation function to output calculation results to successive layer (corresponds to the subsequent layer) in the network).
… adjust the first learning parameter, the second learning parameter, and the third learning parameter when the training data is given to the first input layer of the first neural network after the structure thereof is determined (Kalamkar et al., FIG. 14C and Col. 35 Lines 1-17, “As shown in FIG. 14C, hybrid parallelism can be performed in which a partitioning is performed across activations and weights to minimize skewed matrices. For a layer of a neural network, the input data 1402, weight data 1404, and/or activation data 1406 is partitioned and distributed across multiple compute nodes (e.g., Node 0-Node 3). Node 0 receives a first block of input data 1402A and weight data 1404A. Compute operations are performed at Node 0 to generate a first partial activation 1406A. Likewise, Node1 receives a second block of input data 1402B and weight data 1404B. Compute operations are performed at Node 1 to generate a second partial activation 1406B. Node 2 can perform compute operations on third input data 1402C and weight data 1406C to generate a third partial activation 1406C” teaches input data, activation data, and weight data (corresponds to learning parameter) being distributed across Node 0-Node 3 (corresponds to the first-third learning parameters) for a layer of the neural network (corresponds to the input layer)).
Kalamkar et al. in view of Sawada et al. in view of Tuske et al. does not appear to explicitly teach output the hyperparameter from the second output layer by giving the training data to the second input layer of the second neural network after the fourth learning parameter is adjusted and determine a structure of the first neural network based on the hyperparameter.
However, Mendoza et al., teaches output the hyperparameter from the second output layer by giving the training data to the second input layer of the second neural network after the fourth learning parameter is adjusted (Mendoza et al., Section 2 Pg. 59-60, “Following Bergstra et al. (2011) and Domhan et al. (2015), we distinguish between layer-independent network hyperparameters that control the architecture and training procedure and per-layer hyperparameters that are set for each layer. In total, we optimize 63 hyperparameters (see Table 1), using the same configuration space for all types of supervised learning (binary, multiclass and multilabel classification, as well as regression). Sparse datasets also share the same configuration space. (Since neural networks cannot handle datasets in sparse representation out of the box, we transform the data into a dense representation on a per-batch basis prior to feeding it to the neural network.) The per-layer hyperparameters of layer k are conditionally dependent on the number of layers being at least k. For practical reasons, we constrain the number of layers to be between one and six: firstly, we aim to keep the training time of a single configuration low, and secondly each layer adds eight per-layer hyperparameters to the configuration space, such that allowing additional layers would further complicate the configuration process. The most common way to optimize the internal weights of neural networks is via stochastic gradient descent (SGD) using partial derivatives calculated with backpropagation. Standard SGD crucially depends on the correct setting of the learning rate hyperparameter. To lessen this dependency, various algorithms (solvers) for stochastic gradient descent have been proposed. 
We include the following well-known methods from the literature in the configuration space of Auto-Net: vanilla stochastic gradient descent (SGD), stochastic gradient descent with momentum (Momentum), Adam (Kingma and Ba, 2014), Adadelta (Zeiler, 2012), Nesterov momentum (Nesterov, 1983) and Adagrad (Duchi et al., 2011). Additionally, we used a variant of the vSGD optimizer from Schaul et al. (2014), dubbed “smorm”, in which the estimate of the Hessian is replaced by an estimate of the squared gradient (calculated as in the RMSprop procedure). Each of these methods comes with a learning rate α and an own set of hyperparameters, for example Adam’s momentum vectors β1 and β2. Each solver’s hyperparameter(s) are only active if the corresponding solver is chosen” teaches the training of the neural network. Each training iteration creates a neural network (corresponds to the second neural network). Mendoza et al. further teaches optimizing the internal weights of neural networks (corresponds to adjusting the fourth learning parameter)).
determine a structure of the first neural network based on the hyperparameter (Mendoza et al., Table 1 and Section 2 Pg. 59, “Following Bergstra et al. (2011) and Domhan et al. (2015), we distinguish between layer-independent network hyperparameters that control the architecture and training procedure and per-layer hyperparameters that are set for each layer. In total, we optimize 63 hyperparameters (see Table 1), using the same configuration space for all types of supervised learning (binary, multiclass and multilabel classification, as well as regression). Sparse datasets also share the same configuration space” teaches optimizing hyperparameters that corresponds to the structure in Table 1 for the neural networks (corresponds to the first neural network)).
Kalamkar et al. in view of Sawada et al. in view of Tuske et al. in view of Mendoza et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Kalamkar et al., Sawada et al., and Tuske et al. with Mendoza et al., with motivation to output the hyperparameter from the second output layer by giving the training data to the second input layer of the second neural network after the fourth learning parameter is adjusted and determine a structure of the first neural network based on the hyperparameter. “In this work, we present a first version of AutoNet, which provides automatically-tuned feed-forward neural networks without any human intervention. We report results on datasets from the recent AutoML challenge showing that ensembling Auto-Net with Auto-sklearn can perform better than either approach alone and report the first results on winning competition datasets against human experts with automatically-tuned neural networks” (Mendoza et al., Abstract). The proposed teaching is beneficial in that it provides automatically-tuned feed-forward neural networks without any human intervention and performs better than either approach alone.
Regarding Claim 7,
Kalamkar et al. in view of Sawada et al. in view of Tuske et al. in view of Mendoza et al. teaches the data analysis apparatus according to claim 6
Mendoza et al. further teaches wherein the hyperparameter is to determine a pattern of elements constituting the first neural network (Mendoza et al., Table 1 and Section 2 Pg. 59, “Following Bergstra et al. (2011) and Domhan et al. (2015), we distinguish between layer-independent network hyperparameters that control the architecture and training procedure and per-layer hyperparameters that are set for each layer. In total, we optimize 63 hyperparameters (see Table 1), using the same configuration space for all types of supervised learning (binary, multiclass and multilabel classification, as well as regression). Sparse datasets also share the same configuration space” teaches utilizing the hyperparameters to determine the structure in Table 1 for the neural networks (corresponds to the first neural network) that is made up of a specific pattern of elements).
Kalamkar et al. in view of Sawada et al. in view of Tuske et al. in view of Mendoza et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Kalamkar et al., Sawada et al., and Tuske et al. with Mendoza et al., with motivation to output the hyperparameter from the second output layer by giving the training data to the second input layer of the second neural network after the fourth learning parameter is adjusted and determine a structure of the first neural network based on the hyperparameter. “In this work, we present a first version of AutoNet, which provides automatically-tuned 
Regarding Claim 8,
Kalamkar et al. in view of Sawada et al. in view of Tuske et al. in view of Mendoza et al. teaches the data analysis apparatus according to claim 7,
Mendoza et al. further teaches wherein said hyperparameter that is to determine the pattern is a parameter indicating a type of the first activation function (Mendoza et al., Table 1 and Section 2 Pg. 59, “Following Bergstra et al. (2011) and Domhan et al. (2015), we distinguish between layer-independent network hyperparameters that control the architecture and training procedure and per-layer hyperparameters that are set for each layer. In total, we optimize 63 hyperparameters (see Table 1), using the same configuration space for all types of supervised learning (binary, multiclass and multilabel classification, as well as regression). Sparse datasets also share the same configuration space” teaches the per-layer hyperparameters that indicate activation-type (corresponds to the first activation function)).
Kalamkar et al. in view of Sawada et al. in view of Tuske et al. in view of Mendoza et al. are analogous art because they are from the same field of endeavor and 
Regarding Claim 9,
Kalamkar et al. in view of Sawada et al. in view of Tuske et al. in view of Mendoza et al. teaches the data analysis apparatus according to claim 6, 
Mendoza et al. further teaches wherein the hyperparameter is to determine a sequence of elements constituting the first neural network (Mendoza et al., Table 1 and Section 2 Pg. 59, “Following Bergstra et al. (2011) and Domhan et al. (2015), we distinguish between layer-independent network hyperparameters that control the architecture and training procedure and per-layer hyperparameters that are set for each layer. In total, we optimize 63 hyperparameters (see Table 1), using the same configuration space for all types of supervised learning (binary, multiclass and multilabel classification, as well as regression). Sparse datasets also share the same configuration space” teaches optimizing hyperparameters that determine the structure in Table 1 for the neural networks (corresponds to the first neural network) that consist of a sequence of elements).
Kalamkar et al. in view of Sawada et al. in view of Tuske et al. in view of Mendoza et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Kalamkar et al., Sawada et al., and Tuske et al. with Mendoza et al., with motivation wherein the hyperparameter is to determine a sequence of elements constituting the first neural network. “In this work, we present a first version of AutoNet, which provides automatically-tuned feed-forward neural networks without any human intervention. We report results on datasets from the recent AutoML challenge showing that ensembling Auto-Net with Auto-sklearn can perform better than either approach alone and report the first results on winning competition datasets against human experts with automatically-tuned neural networks” (Mendoza et al., Abstract). The proposed teaching is beneficial in that it provides automatically-tuned feed-forward neural networks without any human intervention and performs better than either approach alone.
Regarding Claim 10,
Kalamkar et al. in view of Sawada et al. in view of Tuske et al. in view of Mendoza et al. teaches the data analysis apparatus according to claim 9, 
Mendoza et al. further teaches wherein said hyperparameter that is to determine the sequence is a parameter indicating the number of layers in the first intermediate layer (Mendoza et al., Table 1 and Section 2 Pg. 59-60, “Following Bergstra et al. (2011) and Domhan et al. (2015), we distinguish between layer-independent network hyperparameters that control the architecture and training procedure and per-layer hyperparameters that are set for each layer. In total, we optimize 63 hyperparameters (see Table 1), using the same configuration space for all types of supervised learning (binary, multiclass and multilabel classification, as well as regression). Sparse datasets also share the same configuration space. (Since neural networks cannot handle datasets in sparse representation out of the box, we transform the data into a dense representation on a per-batch basis prior to feeding it to the neural network.) The per-layer hyperparameters of layer k are conditionally dependent on the number of layers being at least k. For practical reasons, we constrain the number of layers to be between one and six: firstly, we aim to keep the training time of a single configuration low, and secondly each layer adds eight per-layer hyperparameters to the configuration space, such that allowing additional layers would further complicate the configuration process. The most common way to optimize the internal weights of neural networks is via stochastic gradient descent (SGD) using partial derivatives calculated with backpropagation. Standard SGD crucially depends on the correct setting of the learning rate hyperparameter. To lessen this dependency, various algorithms (solvers) for stochastic gradient descent have been proposed. 
We include the following well-known methods from the literature in the configuration space of Auto-Net: vanilla stochastic gradient descent (SGD), stochastic gradient descent with momentum (Momentum), Adam (Kingma and Ba, 2014), Adadelta (Zeiler, 2012), Nesterov momentum (Nesterov, 1983) and Adagrad (Duchi et al., 2011). Additionally, we used a variant of the vSGD optimizer from Schaul et al. (2014), dubbed “smorm”, in which the estimate of the Hessian is replaced by an estimate of the squared gradient (calculated as in the RMSprop procedure). Each of these methods comes with a learning rate α and an own set of hyperparameters, for example Adam’s momentum vectors β1 and β2. Each solver’s hyperparameter(s) are only active if the corresponding solver is chosen” teaches the structure of the neural network with the first intermediate layer. Mendoza et al. further teaches hyperparameters that determine the number of layers (corresponds to the first intermediate layer)).
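For illustration only, and not as a reproduction of Auto-Net, the conditional dependence described in the quoted passage (the per-layer hyperparameters of layer k are active only when the number of layers is at least k, with one to six layers and eight per-layer hyperparameters per layer) may be sketched as follows. The function name and the placeholder hyperparameter names are hypothetical and do not appear in Mendoza et al.

```python
from typing import List

# Illustrative sketch only; every name below is a hypothetical placeholder,
# not an identifier from Mendoza et al. or from Auto-Net.

MIN_LAYERS = 1          # lower bound on the number of layers, per the quoted passage
MAX_LAYERS = 6          # upper bound on the number of layers, per the quoted passage
PER_LAYER_HP_COUNT = 8  # each layer adds eight per-layer hyperparameters


def active_per_layer_hyperparameters(num_layers: int) -> List[str]:
    """Return placeholder names of the per-layer hyperparameters that are active.

    The hyperparameters of layer k are active only when num_layers >= k.
    """
    if not MIN_LAYERS <= num_layers <= MAX_LAYERS:
        raise ValueError(
            f"num_layers must be between {MIN_LAYERS} and {MAX_LAYERS}"
        )
    return [
        f"layer_{k}_hp_{i}"
        for k in range(1, num_layers + 1)          # only layers 1..num_layers exist
        for i in range(1, PER_LAYER_HP_COUNT + 1)  # eight placeholders per layer
    ]


if __name__ == "__main__":
    # Print how many per-layer hyperparameters are active for each layer count.
    for n in range(MIN_LAYERS, MAX_LAYERS + 1):
        print(n, len(active_per_layer_hyperparameters(n)))
```

Under these assumptions, a three-layer configuration activates 24 per-layer hyperparameters and a six-layer configuration activates 48, so the value chosen for the number of layers determines which per-layer hyperparameters take part in the configuration.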
Kalamkar et al. in view of Sawada et al. in view of Tuske et al. in view of Mendoza et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Kalamkar et al., Sawada et al., and Tuske et al. with Mendoza et al., with motivation wherein said hyperparameter that is to determine the sequence is a parameter indicating the number of layers in the first intermediate layer. “In this work, we present a first version of AutoNet, which provides automatically-tuned feed-forward neural networks without any human intervention. We report results on datasets from the recent AutoML challenge showing that ensembling Auto-Net with Auto-sklearn can perform better than either 
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Henry T Nguyen whose telephone number is (571)272-8860. The examiner can normally be reached Monday-Friday 8:00am-4:00pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kamran Afshar, can be reached at (571) 272-7796. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For 


/HENRY TRONG NGUYEN/Examiner, Art Unit 2125                                                                                                                                                                                                        

/KAMRAN AFSHAR/Supervisory Patent Examiner, Art Unit 2125