DETAILED ACTION
Applicant has amended claims 1 and 19-20. Claims 1-4, 7-9, and 11-20 are pending.

Response to Arguments
Applicant’s arguments filed November 8, 2021 with respect to claims 1-4, 7-9, and 11-20 have been considered but are moot in view of the new ground(s) of rejection. The amended claims resulted in changes to the scope and contents; therefore, the grounds of rejection are modified accordingly. It is noted that the previously applied prior art references remain in effect.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1-4, 7-9, and 11-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention.
Claim 1 now recites the limitation “using a mini batch formed of a learning data group based on a configuration pattern that defines class ratio of learning data each of respective classes defined for a plurality of pieces of learning data forming the mini batch to the classes in the mini batch” in lines 4-7. However, the examiner cannot clearly ascertain if the claimed “the mini batch” in line 7 of claim 1 corresponds to the claimed “a first mini batch” recited in line 10 of claim 1, the claimed “a second mini batch” recited in line 11 of claim 1, or a different mini batch from the claimed “a mini batch” in line 4 of claim 1, which renders the claim indefinite.
For examination purposes the examiner has interpreted the claimed “using a mini batch formed of a learning data group based on a configuration pattern that defines class ratio of learning data each of respective classes defined for a plurality of pieces of learning data forming the mini batch to the classes in the mini batch”, in lines 4-7 of claim 1, as “using a mini batch formed of a learning data group based on a configuration pattern that defines class ratio of learning data each of respective classes defined for a plurality of pieces of learning data forming the formed mini batch to the classes in the formed mini batch”.
Claim 1 now recites the limitation “determine a new configuration pattern to be utilized for subsequent learning, based on a first learning result that is a result of learning of the neural network using a first mini batch in the learning” in lines 8-10. However, the examiner was not able to find where the claimed “determine a new configuration pattern to be utilized for subsequent learning” was found in the specification of the instant application as originally filed.
For example, Par. [0018-19] of the application indicate that “pattern generation unit 202 generates a plurality of configuration patterns. Here, the configuration pattern represents the pattern of a breakdown of learning data included in a mini batch… pattern determination unit 204 determines one configuration from among the plurality of configuration patterns, as a configuration pattern to be used for learning… the learning unit 207 evaluates a learning result, using the evaluation set. The evaluation value updating unit 208 updates the evaluation value of the configuration pattern, based on an evaluation result of the evaluation set”. Par. [0022-24] of the application also indicate that “the pattern generation unit 202 generates a plurality of configuration patterns. The configuration pattern is information that indicates the proportion of each class of learning data included in a mini batch… the pattern determination unit 204 selects one configuration pattern as the configuration pattern of a processing target, from the plurality of configuration patterns stored… the pattern determination unit 204 ,… The pattern determination unit 204 updates (changes) the probability of selection of each configuration pattern based on the evaluation score, and selects one configuration pattern from among the plurality of configuration patterns, utilizing the updated probability”. Par. [0028-32] of the application also indicate that “evaluation value updating unit 208 updates the evaluation score stored… by calculating an evaluation score based on a learning result for the evaluation set. 
The evaluation score calculated here corresponds to the learning result… Because the evaluation score is updated to a value other than the initial value in and after the second iteration, the probability corresponding to the evaluation score changes, and a configuration pattern corresponding to a learning result is selected… embodiment determines the configuration pattern to be utilized for the next learning, based on the learning result using the mini batch”.
Although the specification indicates that the processing apparatus “selects a configuration pattern of a processing target, based on the evaluation score”, “determines the configuration pattern to be utilized for the next learning”, and “updates (changes) the probability of selection of each configuration pattern based on the evaluation score, and selects one configuration pattern from among the plurality of configuration patterns, utilizing the updated probability”, as indicated above, the examiner was not able to find where the claimed “determine a new configuration pattern to be utilized for subsequent learning” was found in the specification, which renders the claim indefinite.
For examination purposes the examiner has interpreted the claimed “determine a new configuration pattern to be utilized for subsequent learning, based on a first learning result that is a result of learning of the neural network using a first mini batch in the learning”, in lines 8-10 of claim 1, as “determine one configuration pattern, from among a plurality of configuration patterns, as a configuration pattern to be utilized for subsequent learning, based on a first learning result that is a result of learning of the neural network using a first mini batch in the learning”. 
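By way of illustration only, and not as a characterization of applicant's actual implementation, the selection mechanism described in the cited paragraphs of the specification (a fixed plurality of configuration patterns, each with an evaluation score that drives its probability of selection) may be sketched as follows. The function names, the score-proportional weighting, and the sample data are assumptions introduced for this sketch; the specification as quoted states only that the probability of selection is updated based on the evaluation score.

```python
import random


def select_pattern(patterns, scores, rng):
    """Pick the index of one existing pattern.

    Assumption: the selection probability is proportional to the
    evaluation score. No new pattern is created here; one is chosen
    from among the plurality already generated.
    """
    total = sum(scores)
    weights = [s / total for s in scores]
    return rng.choices(range(len(patterns)), weights=weights, k=1)[0]


def build_mini_batch(pattern, data_by_class, batch_size, rng):
    """Form a mini batch whose class proportions follow the pattern.

    `pattern` maps a class label to its ratio in the mini batch.
    Counts are rounded, so the total may differ slightly from
    `batch_size` when the ratios do not divide evenly.
    """
    batch = []
    for label, ratio in pattern.items():
        count = round(ratio * batch_size)
        samples = rng.choices(data_by_class[label], k=count)
        batch.extend((x, label) for x in samples)
    rng.shuffle(batch)
    return batch


# Hypothetical data: two classes and two candidate configuration patterns.
rng = random.Random(0)
patterns = [{"a": 0.5, "b": 0.5}, {"a": 0.25, "b": 0.75}]
scores = [1.0, 1.0]  # initial evaluation scores (assumed equal)
data_by_class = {"a": [1, 2, 3], "b": [4, 5, 6]}

idx = select_pattern(patterns, scores, rng)
batch = build_mini_batch(patterns[idx], data_by_class, 8, rng)

# After learning with `batch`, the score of the selected pattern would be
# updated from the evaluation result; the value 2.0 is a placeholder.
scores[idx] = 2.0
```

Under this reading, each iteration selects one of the previously generated patterns rather than deriving a pattern that did not exist before, which is the distinction drawn in the interpretation above.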
Claim 1 recites the limitation “a first learning result that is a result of learning of the neural network using a first mini batch” in lines 9-10.
However, the examiner cannot clearly ascertain if the claimed “a first mini batch” in lines 9-10 of claim 1 corresponds to the claimed “a mini batch” previously recited in line 5 of claim 1, or a different mini batch from the claimed “a mini batch” in line 5 of claim 1, which renders the claim indefinite.
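For examination purposes the examiner has interpreted the claimed “a first learning result that is a result of learning of the neural network using a first mini batch” in lines 9-10 of claim 1, as “a first learning result that is a result of learning of the neural network using a first mini batch different from the mini batch”.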
Claim 1 recites the limitation “generate a second mini batch based on the configuration pattern to be utilized for subsequent learning” in lines 11-14. However, the examiner cannot clearly ascertain if the claimed “the configuration pattern” corresponds to the claimed “a configuration pattern that defines a ratio of each respective classes defined for a plurality of pieces of learning data forming the mini batch to the classes in the mini batch” recited in line 5 of claim 1, “a new configuration pattern [one configuration pattern, from among a plurality of configuration patterns, as a configuration pattern] to be utilized for subsequent learning” recited in lines 7-8 of claim 1, or both, which renders the claim indefinite.
For examination purposes the examiner has interpreted the claimed “generate a second mini batch based on the configuration pattern to be utilized for subsequent learning” in lines 11-14 of claim 1, as “generate a second mini batch based on the determined configuration pattern to be utilized for subsequent learning”.
Claim 1 recites the limitation “that learning data included in the first mini batch formed of a learning data group” in lines 13-15. 
However, the examiner cannot clearly ascertain if the claimed “a learning data group” in lines 13-15 of claim 1 corresponds to the claimed “a learning data group based on a configuration pattern” previously recited in lines 4-5 of claim 1, or a different learning data group from the claimed “a learning data group” in lines 4-5 of claim 1, which renders the claim indefinite. 
For examination purposes the examiner has interpreted the claimed “a learning data group” in lines 13-15 of claim 1, as “a second learning data group, different from said learning data group”.
Claims 2-4, 7-9, and 11-18 are rejected by virtue of being dependent upon rejected base claim 1.
Claim 2 recites the limitation “determines a configuration pattern among the generated plurality of configuration patterns, as the configuration pattern to be utilized for consequent learning” in lines 4-6. However, the examiner cannot clearly ascertain if the claimed “the configuration pattern” corresponds to the claimed “a configuration pattern that defines a ratio of each respective classes defined for a plurality of pieces of learning data forming the mini batch to the classes in the mini batch” recited in line 5 of claim 1, “a new configuration pattern [one configuration pattern, from among a plurality of configuration patterns, as a configuration pattern] to be utilized for subsequent learning” recited in lines 7-8 of claim 1, or “a configuration pattern among the generated plurality of configuration patterns” recited in lines 4-5 of claim 2, which renders the claim indefinite.
For examination purposes the examiner has interpreted the claimed “determines a configuration pattern among the generated plurality of configuration patterns, as the configuration pattern to be utilized for consequent learning” in lines 4-6 of claim 2, as “determines a configuration pattern among the generated plurality of configuration patterns, as the determined configuration pattern to be utilized for consequent learning”.
Claims 3-4 are rejected by virtue of being dependent upon rejected base claim 2.
Claim 3 recites the limitation “the configuration pattern to be utilized for subsequent learning” in lines 3-4. However, the examiner cannot clearly ascertain if the claimed “the configuration pattern” corresponds to the claimed “a configuration pattern that defines a ratio of each respective classes defined for a plurality of pieces of learning data forming the mini batch to the classes in the mini batch” recited in line 5 of claim 1, “a new configuration pattern [one configuration pattern, from among a plurality of configuration patterns, as a configuration pattern] to be utilized for subsequent learning” recited in lines 7-8 of claim 1, or “a configuration pattern among the generated plurality of configuration patterns” recited in lines 4-5 of claim 2, which renders the claim indefinite.
For examination purposes the examiner has interpreted the claimed “the configuration pattern to be utilized for subsequent learning” in lines 3-7 of claim 3, as “a probability that each of the generated plurality of configuration patterns is determined as the determined configuration pattern to be utilized for subsequent learning”.
Claim 7 recites the limitation “wherein the configuration pattern includes an evaluation score” in lines 1-2. However, the examiner cannot clearly ascertain if the claimed “the configuration pattern” corresponds to the claimed “a configuration pattern that defines a ratio of each respective classes defined for a plurality of pieces of learning data forming the mini batch to the classes in the mini batch” recited in line 5 of claim 1, “a new configuration pattern [one configuration pattern, from among a plurality of configuration patterns, as a configuration pattern] to be utilized for subsequent learning” recited in lines 7-8 of claim 1, or both, which renders the claim indefinite.
For examination purposes the examiner has interpreted the claimed “wherein the configuration pattern includes an evaluation score” in lines 1-2 of claim 7, as “wherein the determined configuration pattern includes an evaluation score”.
Claim 14 recites the limitation “the configuration pattern” in line 3. However, the examiner cannot clearly ascertain if the claimed “the configuration pattern” corresponds to the claimed “a configuration pattern that defines a ratio of each respective classes defined for a plurality of pieces of learning data forming the mini batch to the classes in the mini batch” recited in line 5 of claim 1, “a new configuration pattern [one configuration pattern, from among a plurality of configuration patterns, as a configuration pattern] to be utilized for subsequent learning” recited in lines 7-8 of claim 1, or both, which renders the claim indefinite.
For examination purposes the examiner has interpreted the claimed “the configuration pattern” in line 3 of claim 14, as “the determined configuration pattern”.
Claim 18 recites the limitation “the configuration pattern” in line 4. However, the examiner cannot clearly ascertain if the claimed “the configuration pattern” corresponds to the claimed “a configuration pattern that defines a ratio of each respective classes defined for a plurality of pieces of learning data forming the mini batch to the classes in the mini batch” recited in line 5 of claim 1, “a new configuration pattern [one configuration pattern, from among a plurality of configuration patterns, as a configuration pattern] to be utilized for subsequent learning” recited in lines 7-8 of claim 1, or both, which renders the claim indefinite.
For examination purposes the examiner has interpreted the claimed “the configuration pattern” in line 4 of claim 18, as “the determined configuration pattern”.
Claim 13 recites the limitation “receives a learning set of learning data of the mini batch” in line 2. However, the examiner cannot clearly ascertain if the claimed “the mini batch” in line 2 of claim 13 corresponds to the claimed “a first mini batch” recited in line 10 of claim 1, the claimed “a second mini batch” recited in line 11 of claim 1, or a different mini batch from the claimed “a mini batch” in line 4 of claim 1, which renders the claim indefinite.
For examination purposes the examiner has interpreted the claimed “receives a learning set of learning data of the mini batch”, in line 2 of claim 13, as “receives a learning set of learning data of the second mini batch”.
Claim 15 recites the limitation “using data different from learning data included in the mini batch” in lines 3-4. However, the examiner cannot clearly ascertain if the claimed “the mini batch” in lines 3-4 of claim 15 corresponds to the claimed “a first mini batch” recited in line 10 of claim 1, the claimed “a second mini batch” recited in line 11 of claim 1, or a different mini batch from the claimed “a mini batch” in line 4 of claim 1, which renders the claim indefinite.
For examination purposes the examiner has interpreted the claimed “using data different from learning data included in the mini batch”, in lines 3-4 of claim 15, as “using data different from learning data included in the second mini batch”.
Claim 16 recites the limitation “performs the learning, using the mini batch” in line 5. However, the examiner cannot clearly ascertain if the claimed “the mini batch” in line 5 of claim 16 corresponds to the claimed “a first mini batch” recited in line 10 of claim 1, the claimed “a second mini batch” recited in line 11 of claim 1, or a different mini batch from the claimed “a mini batch” in line 4 of claim 1, which renders the claim indefinite.
For examination purposes the examiner has interpreted the claimed “performs the learning, using the mini batch”, in line 5 of claim 16, as “performs the learning, using the first mini batch”.
Claim 19 now recites the limitation “using a mini batch formed of a learning data group based on a configuration pattern that defines class ratio of learning data each of respective classes defined for a plurality of pieces of learning data forming the mini batch to the classes in the mini batch” in lines 3-6. However, the examiner cannot clearly ascertain if the claimed “the mini batch” in lines 5-6 of claim 19 corresponds to the claimed “a first mini batch” recited in line 9 of claim 19, the claimed “a second mini batch” recited in line 10 of claim 19, or a different mini batch from the claimed “a mini batch” in line 3 of claim 19, which renders the claim indefinite.
For examination purposes the examiner has interpreted the claimed “using a mini batch formed of a learning data group based on a configuration pattern that defines class ratio of learning data each of respective classes defined for a plurality of pieces of learning data forming the mini batch to the classes in the mini batch”, in lines 3-6 of claim 19, as “using a mini batch formed of a learning data group based on a configuration pattern that defines class ratio of learning data each of respective classes defined for a plurality of pieces of learning data forming the formed mini batch to the classes in the formed mini batch”.
Claim 19 now recites the limitation “determining a new configuration pattern to be utilized for subsequent learning, based on a first learning result that is a result of learning of the neural network using a first mini batch in the learning” in lines 7-9. However, the examiner was not able to find where the claimed “determining a new configuration pattern to be utilized for subsequent learning” was found in the specification of the instant application as originally filed.
For example, Par. [0018-19] of the application indicate that “pattern generation unit 202 generates a plurality of configuration patterns. Here, the configuration pattern represents the pattern of a breakdown of learning data included in a mini batch… pattern determination unit 204 determines one configuration from among the plurality of configuration patterns, as a configuration pattern to be used for learning… the learning unit 207 evaluates a learning result, using the evaluation set. The evaluation value updating unit 208 updates the evaluation value of the configuration pattern, based on an evaluation result of the evaluation set”. Par. [0022-24] of the application also indicate that “the pattern generation unit 202 generates a plurality of configuration patterns. The configuration pattern is information that indicates the proportion of each class of learning data included in a mini batch… the pattern determination unit 204 selects one configuration pattern as the configuration pattern of a processing target, from the plurality of configuration patterns stored… the pattern determination unit 204 ,… The pattern determination unit 204 updates (changes) the probability of selection of each configuration pattern based on the evaluation score, and selects one configuration pattern from among the plurality of configuration patterns, utilizing the updated probability”. Par. [0028-32] of the application also indicate that “evaluation value updating unit 208 updates the evaluation score stored… by calculating an evaluation score based on a learning result for the evaluation set. 
The evaluation score calculated here corresponds to the learning result… Because the evaluation score is updated to a value other than the initial value in and after the second iteration, the probability corresponding to the evaluation score changes, and a configuration pattern corresponding to a learning result is selected… embodiment determines the configuration pattern to be utilized for the next learning, based on the learning result using the mini batch”.
Although the specification indicates that the processing apparatus “selects a configuration pattern of a processing target, based on the evaluation score”, “determines the configuration pattern to be utilized for the next learning”, and “updates (changes) the probability of selection of each configuration pattern based on the evaluation score, and selects one configuration pattern from among the plurality of configuration patterns, utilizing the updated probability”, as indicated above, the examiner was not able to find where the claimed “determining a new configuration pattern to be utilized for subsequent learning” was found in the specification, which renders the claim indefinite.
For examination purposes the examiner has interpreted the claimed “determining a new configuration pattern to be utilized for subsequent learning, based on a first learning result that is a result of learning of the neural network using a first mini batch in the learning” in lines 7-9 of claim 19, as “determining one configuration pattern, from among a plurality of configuration patterns, as a configuration pattern to be utilized for subsequent learning, based on a first learning result that is a result of learning of the neural network using a first mini batch in the learning”. 
Claim 19 recites the limitation “a first learning result that is a result of learning of the neural network using a first mini batch in the learning” in lines 9-10.
However, the examiner cannot clearly ascertain if the claimed “a first mini batch” in lines 8-9 of claim 19 corresponds to the claimed “a mini batch” previously recited in line 3 of claim 19, or a different mini batch from the claimed “a mini batch” in line 3 of claim 19, which renders the claim indefinite.
For examination purposes the examiner has interpreted the claimed “a first learning result that is a result of learning of the neural network using a first mini batch in the learning” in lines 9-10 of claim 19, as “a first learning result that is a result of learning of the neural network using a first mini batch different from the mini batch”.
Claim 19 recites the limitation “generating a second mini batch formed of a learning data group based on the configuration pattern to be utilized for subsequent learning” in lines 10-12. However, the examiner cannot clearly ascertain if the claimed “the configuration pattern” corresponds to the claimed “a configuration pattern that defines a ratio of each respective classes defined for a plurality of pieces of learning data forming the mini batch to the classes in the mini batch” recited in lines 3-6 of claim 19, “a new configuration pattern [one configuration pattern, from among a plurality of configuration patterns, as a configuration pattern] to be utilized for subsequent learning” recited in lines 7-8 of claim 19, or both, which renders the claim indefinite.
For examination purposes the examiner has interpreted the claimed “generating a second mini batch formed of a learning data group based on the configuration pattern to be utilized for subsequent learning” in lines 10-12 of claim 19, as “generate a second mini batch based on the new configuration pattern [the one configuration pattern] to be utilized for subsequent learning”.
Claim 19 recites the limitation “generating a second mini batch formed of a learning data group” in line 10.
However, the examiner cannot clearly ascertain if the claimed “a learning data group” in line 10 of claim 19 corresponds to the claimed “a learning data group based on a configuration pattern” previously recited in lines 2-3 of claim 19, or a different learning data group from the claimed “a learning data group” in lines 2-3 of claim 19, which renders the claim indefinite.
For examination purposes the examiner has interpreted the claimed “a learning data group” in line 10 of claim 19, as “a second learning data group, different from said learning data group”.
Claim 20 now recites the limitation “using a mini batch formed of a learning data group based on a configuration pattern that defines class ratio of learning data each of respective classes defined for a plurality of pieces of learning data forming the mini batch to the classes in the mini batch” in lines 4-7. However, the examiner cannot clearly ascertain if the claimed “the mini batch” in line 7 of claim 20 corresponds to the claimed “a first mini batch” recited in line 10 of claim 20, the claimed “a second mini batch” recited in line 11 of claim 20, or a different mini batch from the claimed “a mini batch” in line 4 of claim 20, which renders the claim indefinite.
For examination purposes the examiner has interpreted the claimed “using a mini batch formed of a learning data group based on a configuration pattern that defines class ratio of learning data each of respective classes defined for a plurality of pieces of learning data forming the mini batch to the classes in the mini batch”, in lines 4-7 of claim 20, as “using a mini batch formed of a learning data group based on a configuration pattern that defines class ratio of learning data each of respective classes defined for a plurality of pieces of learning data forming the formed mini batch to the classes in the formed mini batch”.
Claim 20 now recites the limitation “determine a new configuration pattern to be utilized for subsequent learning, based on a first learning result that is a result of learning of the neural network using a first mini batch in the learning” in lines 8-10. However, the examiner was not able to find where the claimed “determine a new configuration pattern to be utilized for subsequent learning” was found in the specification of the instant application as originally filed.
For example, Par. [0018-19] of the application indicate that “pattern generation unit 202 generates a plurality of configuration patterns. Here, the configuration pattern represents the pattern of a breakdown of learning data included in a mini batch… pattern determination unit 204 determines one configuration from among the plurality of configuration patterns, as a configuration pattern to be used for learning… the learning unit 207 evaluates a learning result, using the evaluation set. The evaluation value updating unit 208 updates the evaluation value of the configuration pattern, based on an evaluation result of the evaluation set”. Par. [0022-24] of the application also indicate that “the pattern generation unit 202 generates a plurality of configuration patterns. The configuration pattern is information that indicates the proportion of each class of learning data included in a mini batch… the pattern determination unit 204 selects one configuration pattern as the configuration pattern of a processing target, from the plurality of configuration patterns stored… the pattern determination unit 204 ,… The pattern determination unit 204 updates (changes) the probability of selection of each configuration pattern based on the evaluation score, and selects one configuration pattern from among the plurality of configuration patterns, utilizing the updated probability”. Par. [0028-32] of the application also indicate that “evaluation value updating unit 208 updates the evaluation score stored… by calculating an evaluation score based on a learning result for the evaluation set. 
The evaluation score calculated here corresponds to the learning result… Because the evaluation score is updated to a value other than the initial value in and after the second iteration, the probability corresponding to the evaluation score changes, and a configuration pattern corresponding to a learning result is selected… embodiment determines the configuration pattern to be utilized for the next learning, based on the learning result using the mini batch”.
Although the specification indicates that the processing apparatus “selects a configuration pattern of a processing target, based on the evaluation score”, “determines the configuration pattern to be utilized for the next learning”, and “updates (changes) the probability of selection of each configuration pattern based on the evaluation score, and selects one configuration pattern from among the plurality of configuration patterns, utilizing the updated probability”, as indicated above, the examiner was not able to find where the claimed “determine a new configuration pattern to be utilized for subsequent learning” was found in the specification, which renders the claim indefinite.
For examination purposes the examiner has interpreted the claimed “determine a new configuration pattern to be utilized for subsequent learning, based on a first learning result that is a result of learning of the neural network using a first mini batch in the learning”, in lines 8-10 of claim 20, as “determine one configuration pattern, from among a plurality of configuration patterns, as a configuration pattern to be utilized for subsequent learning, based on a first learning result that is a result of learning of the neural network using a first mini batch in the learning”.
Claim 20 recites the limitation “a first learning result that is a result of learning of the neural network using a first mini batch in the learning” in lines 9-10.
However, the examiner cannot clearly ascertain if the claimed “a first mini batch” in lines 9-10 of claim 20 corresponds to the claimed “a mini batch” previously recited in line 4 of claim 20, or a different mini batch from the claimed “a mini batch” in line 4 of claim 20, which renders the claim indefinite.
For examination purposes the examiner has interpreted the claimed “a first learning result that is a result of learning of the neural network using a first mini batch in the learning” in lines 9-10 of claim 20, as “a first learning result that is a result of learning of the neural network using a first mini batch different from the mini batch”.
Claim 20 recites the limitation “generate a second mini batch formed of a learning data group based on the configuration pattern to be utilized for subsequent learning” in lines 10-13. However, the examiner cannot clearly ascertain if the claimed “the configuration pattern” corresponds to the claimed “a configuration pattern that defines a ratio of each respective classes defined for a plurality of pieces of learning data forming the mini batch to the classes in the mini batch” recited in lines 4-7 of claim 20, “a new configuration pattern [one configuration pattern, from among a plurality of configuration patterns, as a configuration pattern] to be utilized for subsequent learning” recited in lines 8-9 of claim 20, or both, which renders the claim indefinite.
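For examination purposes the examiner has interpreted the claimed “generate a second mini batch formed of a learning data group based on the configuration pattern to be utilized for subsequent learning” in lines 10-13 of claim 20, as “generate a second mini batch based on the new configuration pattern [the one configuration pattern] to be utilized for subsequent learning”.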
Claim 20 recites the limitation “generating a second mini batch formed of a learning data group” in line 12. 
However, the examiner cannot clearly ascertain if the claimed “a learning data group” in line 12 of claim 20 corresponds to the claimed “a learning data group based on a configuration pattern” previously recited in lines 4-5 of claim 20, or a different learning data group from the claimed “a learning data group” in lines 4-5 of claim 20, which renders the claim indefinite.
For examination purposes the examiner has interpreted the claimed “a learning data group” in line 12 of claim 20, as “a second learning data group, different from said learning data group”.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 2, 7-9, 11, 14-16, and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Roy et al. (US PG Pub. No. 2018/0300631 A1), hereafter referred to as Roy, in view of Iyengar et al. (US PG Pub. No. 2017/0300829 A1), hereafter referred to as Iyengar. 

Regarding claim 1, Roy discloses an apparatus (Par. [0003]: a method for analyzing patterns in a data stream and taking an action based on the analysis; Par. [0150]: Embodiments of invention also relate to apparatuses for performing the operations herein) comprising: 
one or more processors configured to function (Par. [0149-151]: "processing" or "computing" or "calculating" or "determining" or "displaying" or the like, refer to the action and processes of a computer system or computing platform, or similar electronic computing device(s)… Embodiments of invention also relate to apparatuses for performing the operations herein. Some apparatuses may be specially constructed for the required purposes, or may comprise a general purpose computer(s) selectively activated or configured by a computer program stored in the computer(s)… Various general purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatus to perform the required methods) as:
a learning unit configured to perform learning of a neural network for identifying a class of input data, using a mini batch formed of a learning data group based on a configuration pattern that defines a ratio of each respective classes defined for a plurality of pieces of learning data forming the mini batch [the formed mini batch] to the classes in the mini batch [the formed mini batch] (Par. [0003]: a method for analyzing patterns in a data stream and taking an action based on the analysis…. A volume of data is received, and the data is trained to create training examples. Features are selected that are predictive of different classes of patterns in the data, using the training examples. A set of Kohonen networks is trained using the data, based on the selected features. Then, active nodes are identified and extracted from the set of Kohenen nets that are representative of a class of patterns in the data. Classes are assigned to the extracted active nodes… train multiple Kohonen nets in parallel both during feature selection and classifier construction phases… use Kohonen nets both for dimensionality reduction through feature selection and for building an ensemble of classifiers using single Kohonen neurons… The artificial neural network introduced by Finnish professor Teuvo Kohonen in the 1980s is sometimes called a Kohonen map or network. A Kohonen network is a self-organizing map (SOM) or self-organizing feature map (SOFM) which is a type of artificial neural network (ANN) that is trained using unsupervised learning to produce a low-dimensional (typically two-dimensional), discretized representation of the input space of training samples, called a map; Par. [0026-31]: a good feature for class k should produce good separation between patterns in class k and those not in class k, k=1 . . . kc, and (2) also make the patterns in class k more compact. Based on this idea, a measure called the separability index that can rank features for each class has been proposed.
Suppose dknin is the average distance between patterns within class k for feature n, and dknout the average distance between the patterns in class k and those not in class k for feature n… The separability index of feature n for class k is given by rkn=dknout/dknin. One may use this separability index rkn to rank order features of class k where a higher ratio implies a higher rank. The sense of this measure is that a feature n with a lower dknin makes class k more compact and with a higher dknout increases the separation of class k from the other classes. Thus, the higher the ratio rkn for a feature n, the greater is its ability to separate class k from the other classes and the better the feature … N-dimensional vector x, x=(X1, X2, . . . ,XN) represents an input pattern in the streaming data… Let FPq denote the qth feature subset, q=1 . . . FS, where FS is the total number of feature subsets… as the embodiment processes some more streaming data… keeps count of the number of times an input pattern of a particular class activated a particular neuron (i.e., the neuron was the winning neuron for those input patterns). For example, given there are two classes, A and B, for each active node, the method keeps count of the number of times input patterns from each of these two classes activates the node. Suppose class A patterns activate one such neuron (node) 85 times and class B patterns activate the node 15 times. At this node then, approximately 85% of the activating input patterns belong to class A and 15% belong to class B. Since a significant majority of the activating patterns belong to class A, the method simply assigns this active neuron to class A. Assigning an active neuron to a class simply means that that neuron represents an example of that class. As an example when an active neuron is discarded, suppose class A patterns activate a node 55% of the time and class B patterns activate the node 45% of the time. 
The method discards such an active node because no class has a significant majority and, therefore, it cannot claim the node as a representative point of any particular class. This phase of labeling active nodes ends once the class ratios (percentages) at every active node for all of the Kohonen nets are fairly stable and all active nodes (neurons) can either be assigned to classes or discarded if no class has a significant majority. The embodiment also discards active nodes that have comparatively low absolute count of patterns; Par. [0048-50]: explores different feature spaces given the class-specific feature rankings. In general, the process creates buckets of features and then trains several Kohonen nets of different grid sizes… for the feature spaces contained in the buckets… For the first bucket, select the top ranked feature of each class. For the second bucket, select the top two ranked features of each class and similarly create other buckets. The procedure, therefore, sequentially adds top ranked features of each class to create the buckets. Thus, the ith bucket of features will have j top ranked features from each class… A bucket, therefore, consists of a variety of feature spaces and the method trains a variety of Kohonen nets for each such feature space; Par. [0056-61]: Let Bmax be the maximum number of feature buckets… the first bucket will have 6 features, 3 from each class, and the last bucket will have all 120 features, 60 from each class. And each bucket will always have three feature spaces--one for each class and the third for the combined set of features. And there will be FG Kohonen nets of different grid sizes for each of the three feature spaces in each bucket... Let Inc be the number of features added each time to a bucket for each class. Inc is calculated from the number of top-ranked features to use from each class and Bmax. 
Let FBj be the jth bucket of features… Note that although the active nodes ANkj resulted from Kohonen nets built with the class k feature set in bucket j, these active nodes could belong to (that is, be assigned to) any of the classes k, k=1 . . . kc… kc, be the class count percentage of the mth class at active node ANkji and let CTAkji be the absolute count of input patterns processed at that active node… minimum required percentage of class counts for a class in order to assign an active node to that class… class count percentage of the mth class at active node… initialize FG Kohonen nets for a feature set that includes all of the features from all classes in bucket j… class count percentage of the mth class at active node ANkji, m = 1 . . . kc CTAkji the absolute count of input patterns processed at active node ANkji… Train all KNmax Kohonen nets in parallel using streaming data and selecting appropriate parts of the input pattern for each Kohonen net according to the feature subsets; a learning unit configured to perform learning of a neural network for identifying a class of input data, using a mini batch formed of a learning data group based on a configuration pattern that defines a ratio of each respective classes defined for a plurality of pieces of learning data forming the formed mini batch to the classes in the formed mini batch (e.g. perform learning (i.e. training) of a neural network for identifying a class of input data, including a Kohonen network, which is a type of artificial neural network (ANN) that is trained by selecting appropriate parts of an input pattern (i.e. a learning data group based on a configuration pattern) for each Kohonen net (i.e. neural net) according to feature sets and subsets (i.e. a first, second… Nth mini batch, group, cluster, etc.), including active nodes which are identified and extracted from the set of Kohenen nets (i.e. mini batches, groups, etc.) that are representative of a class of patterns (i.e. 
kn, that ranks features for each class, in which rkn =dknout/dknin , dknin is the average distance between patterns within class k for feature n, and dknout is the average distance between the patterns in class k and those not in class k for feature n, and the separability index of feature n for class k is given by the ratio rkn, for example, and the phase of labeling active nodes ends once the class ratios (percentages) at every active node for all of the Kohonen nets are fairly stable and all active nodes (neurons) (i.e. each of the formed mini batch) can either be assigned to classes or discarded if no class has a significant majority (i.e. each mini batch formed of a learning data group based on a configuration pattern that defines a ratio (i.e. percentage) of each respective classes defined for a plurality of pieces of learning data forming the formed mini batch to the classes in the formed mini batch), as indicated above), for example), but fails to teach the following as further recited in claim 1.
However, Iyengar teaches a determination unit configured to determine a new configuration pattern to be utilized for subsequent learning [determine one configuration pattern, from among a plurality of configuration patterns, as a configuration pattern], based on a first learning result that is a result of learning of the neural network using a first mini batch [a first mini batch different from the mini batch] by the learning unit (Par. [0039]: a machine-learning system for managing shuffling of input training datasets. The machine-learning system includes a training dataset manager configured to shuffle an input dataset received from each of a plurality of electronic devices, and split the input training datasets into a plurality of mini-batches. Each of the mini-batches described herein defines an error surface corresponding to an error function. A learning manager is configured to obtain a cross mini-batch discriminator based on the error function for each of the mini-batches. Further, the learning manager is configured to select a mini-batch configuration associated with a least mini-batch discriminator score from the plurality of mini-batch configurations as optimal mini-batch; a determination unit configured to determine one configuration pattern, from among a plurality of configuration patterns, as a configuration pattern, based on a first learning result that is a result of learning of the neural network using a first mini batch different from the mini batch by the learning unit (e.g. machine-learning system (i.e. neural network) includes a training dataset manager configured to shuffle an input dataset received from each of a plurality of electronic devices, and split the input training datasets into a plurality of mini-batches (i.e. a plurality of different configuration patterns) by obtaining a cross mini-batch discriminator (i.e. 
differentiator) based on the error function for each of the mini-batches to select a mini-batch configuration associated with a least mini-batch discriminator score from the plurality of mini-batch configurations (i.e. a first mini batch different from the mini batch) as optimal mini-batch (i.e. determine one configuration pattern, from among a plurality of configuration patterns, as a configuration pattern, based on a first learning result that is a result of learning of the neural network using a first mini batch different from the mini batch), as indicated above), for example); and
a method to manage shuffling of input training datasets to train a machine-learning system… the method includes splitting, by the training dataset manager, the input training datasets into a plurality of mini-batches. Each of the mini-batches along with the corresponding target values, define an error surface corresponding to an error function… the method includes obtaining, by a learning manager, a cross mini-batch discriminator based on the error function for each of the mini-batches… the method includes selecting a mini-batch associated with a least cross mini-batch discriminator from the plurality of mini-batch configurations as optimal mini-batch configuration… the cross mini-batch discriminator is defined as a function evaluated as the sum of a differences in an error converged on by every mini-batch and the initial error on the error surface formed by the subsequent mini-batch for at least one specific mini-batch configuration from the plurality of mini-batch configurations using a gradient descent based method. The error on the error surface is formed by a subsequent mini-batch from the plurality of mini-batches. The cross mini-batch score is defined as the final score evaluated by the cross mini-batch discriminator for a given mini-batch configuration… the cross mini-batch discriminator is equivalent for minimizing the difference in the error between error surfaces of at least one specific mini-batch configuration and the subsequent mini-batch configuration, which leads to faster convergence; Par. [0041-57]: method can be used to perform a characterization of a discrimination between the error surfaces of various mini batches. Thus, a parameter can be seen as a measure of efficiency of shuffling performed on the input data… the machine-learning system 104 is configured to split the input training datasets into a plurality of mini-batches. Each of the mini-batches defines an error surface corresponding to an error function. 
After splitting the input training datasets into the plurality of mini-batches, the machine-learning system 104 is configured to obtain a cross mini-batch discriminator based on the error function for each of the mini-batches. Further, the machine-learning system 104 is configured to select a mini-batch associated with a least mini-batch discriminator from the plurality of mini-batches as optimal mini-batch… the cross mini-batch discriminator is estimated as a function of a difference in an error converged on at least one specific mini-batch configuration from the plurality of mini-batch configurations…, the cross mini-batch discriminator is computed as sum of differences in error over all mini batches… the cross mini-batch discriminator is defined as a score or value which is calculated as a function of the difference in the error value converged on a specific mini-batch and an initial error value on the error surface formed by a subsequent mini-batch… the least cross mini-batch discriminator is equivalent for minimizing the difference in the error between error surfaces of the at least one specific mini-batch configuration and the subsequent mini-batch configuration, which leads to faster convergence; Par. FIG. 4 illustrates an example scenario in which a hypothetical error surface represents error surfaces of nth mini batch and (n+1)th mini batch in a specific minibatch configuration, according to an embodiment as disclosed herein. The general optimization tasks involve finding a parameter set (w) that minimizes an error function E(w)… Approximations of the gradient of the error surface in the full batch gradient descent method is used to speed up gradient descent process. 
This is also applied to mini-batch gradient descent… When performing the mini-batch gradient descent, it is a standard practice to start with a set of random values for the parameters (wi) and perform a parameter update after every mini batch… Furthermore, the updated parameters after nth mini-batch are taken as the initial values of the parameters for (n+1)th mini batch… each mini batch defines its own error surface and there is a jump in the value of error function when training progresses from one mini batch to the other… Let's define a parameter to measure the efficiency of shuffling… the number can be defined as the cross mini-batch discriminator; a mini batch generation unit configured to generate a second mini batch based on the determined configuration pattern to be utilized for subsequent learning determined by the determination unit so that learning data included in the first mini batch formed of a second learning data group, different from said learning data group, that is effective for leaning in the first learning result is preferentially included in the second mini batch, wherein the learning unit performs second learning of the neural network, using the second mini batch (e.g. machine-learning system (i.e. neural network) includes a training dataset manager configured to shuffle an input dataset received from each of a plurality of electronic devices, and split th) of mini-batches as optimal mini-batch (i.e. generate a second, third… Nth mini batch), based on a cross mini-batch discriminator, which is defined as a score or value (i.e. evaluation score) calculated as a function of the difference in the error value converged on a specific mini-batch (i.e. the mini batch, the first mini batch, etc.) and an initial error value on the error surface formed by a subsequent mini-batch (i.e. 
the determined configuration pattern to be utilized for subsequent learning determined by the determination unit so that learning data included in the first mini batch formed of a second learning data group), including an error surface function represents error surfaces of nth mini batch and (n+1)th mini batch (i.e. performs first, second… Nth learning of the neural network, using the first, second… Nth mini batch) in a specific minibatch configuration (i.e. based on the determined configuration pattern to be utilized for subsequent learning), including a general optimization tasks which involves finding a parameter set (w) that minimizes an error function E(w), and the cross mini-batch discriminator is equivalent for minimizing the difference in the error between error surfaces of at least one specific mini-batch configuration and the subsequent mini-batch configuration (i.e. based on the determined configuration pattern to be utilized for subsequent learning determined), which leads to faster convergence (i.e. learning data included in the first mini batch formed of a second learning data group, different from said learning data group, that is effective for leaning in the first learning result is 
Roy and Iyengar are considered to be analogous art because they pertain to artificial intelligence (i.e. learning machines) using artificial neural networks applied to image processing applications. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to modify the apparatus for analyzing patterns in a data stream and taking an action based on the analysis (as disclosed by Roy) to determine one configuration pattern, from among a plurality of configuration patterns, as a configuration pattern, based on a first learning result that is a result of learning of the neural network using a first mini batch different from the mini batch, and to generate a second mini batch based on the determined configuration pattern to be utilized for subsequent learning so that learning data included in the first mini batch formed of a second learning data group, different from said learning data group, that is effective for learning in the first learning result is preferentially included in the second mini batch, wherein the learning unit performs second learning of the neural network, using the second mini batch (as taught by Iyengar, Abstract, Par. [0010-13, 41-57, 78-90]) by selecting a mini-batch associated with a least cross mini-batch discriminator from the plurality of mini-batches as optimal mini-batch to obtain faster convergence (Iyengar, Abstract, Par. [0013, 46, 57, 88, 102]).
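
For illustration only, the cross mini-batch discriminator relied upon from Iyengar can be sketched as follows. This is a minimal sketch under stated assumptions and is not Iyengar's implementation: the one-parameter model y = w * x, the squared-error function, the learning rate, the use of an absolute difference, and the names mse, fit, discriminator_score and select_configuration are all hypothetical. The sketch sums, over consecutive mini-batches of one configuration, the difference between the error converged on for a mini-batch and the initial error on the error surface of the subsequent mini-batch, and then selects the configuration with the least score.

```python
def mse(w, batch):
    """Mean squared error of the assumed model y = w * x on one mini-batch."""
    return sum((w * x - y) ** 2 for x, y in batch) / len(batch)


def fit(w, batch, lr=0.05, steps=200):
    """Plain gradient descent on a single mini-batch, starting from w."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= lr * grad
    return w


def discriminator_score(config, w0=0.0):
    """Sum of |error converged on mini-batch n - initial error on mini-batch n+1|.

    The absolute value is an assumption; the cited text only says "difference".
    """
    w, score = w0, 0.0
    for current, following in zip(config, config[1:]):
        w = fit(w, current)  # parameters after batch n seed batch n+1
        score += abs(mse(w, current) - mse(w, following))
    return score


def select_configuration(configs):
    """Return the mini-batch configuration with the least discriminator score."""
    return min(configs, key=discriminator_score)


# Two hypothetical ways of splitting the same four (x, y) samples.
segregated = [[(1, 1), (2, 2)], [(1, 3), (2, 6)]]  # each batch holds one slope
mixed = [[(1, 1), (1, 3)], [(2, 2), (2, 6)]]       # each batch mixes both slopes

best = select_configuration([segregated, mixed])
print(best is mixed)
```

In this toy data the mixed split yields the smaller jump in error between consecutive mini-batches, so it is the configuration selected.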

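Regarding claim 2, claim 1 is incorporated and Roy discloses the apparatus, wherein the one or more processors further function as a configuration pattern generation unit configured to generate a plurality of configuration patterns, wherein the determination unit determines a configuration pattern among the generated plurality of configuration patterns, as the configuration pattern [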
the determined configuration pattern] to be utilized for consequent learning, based on the first learning result (Par. [0003]: a method for analyzing patterns in a data stream and taking an action based on the analysis… . A volume of data is received, and the data is trained to create training examples. Features are selected that are predictive of different classes of patterns in the data, using the training examples. A set of Kohonen networks is trained using the data, based on the selected features. Then, active nodes are identified and extracted from the set of Kohenen nets that are representative of a class of patterns in the data. Classes are assigned to the extracted active nodes; Par. [0026-31]: basic feature ranking criteria are that (1) a good feature for class k should produce good separation between patterns in class k and those not in class k, k=1 . . . kc, and (2) also make the patterns in class k more compact. Based on this idea, a measure called the separability index that can rank features for each class has been proposed. Suppose dknin is the average distance between patterns within class k for feature n, and dknout the average distance between the patterns in class k and those not in class k for feature n… The separability index of feature n for class k is given by rkn=dknout/dknin. One may use this separability index rkn to rank order features of class k where a higher ratio implies a higher rank. The sense of this measure is that a feature n with a lower dknin makes class k more compact and with a higher dknout increases the separation of class k from the other classes. Thus, the higher the ratio rkn for a feature n, the greater is its ability to separate class k from the other classes and the better the feature … N-dimensional vector x, x=(X1, X2, . . . ,XN) represents an input pattern in the streaming data… Let FPq denote the qth feature subset, q=1 . . . 
FS, where FS is the total number of feature subsets… as the embodiment processes some more streaming data… keeps count of the number of times an input pattern of a particular class activated a particular neuron (i.e., the neuron was the winning neuron for those input patterns). For example, given there are two classes, A and B, for each active node, the method keeps count of the number of times input patterns from each of these two classes activates the node. Suppose class A patterns activate one such neuron (node) 85 times and class B patterns activate the node 15 times. At this node then, approximately 85% of the activating input patterns belong to class A and 15% belong to class B. Since a significant majority of the activating patterns belong to class A, the method simply assigns this active neuron to class A. Assigning an active neuron to a class simply means that that neuron represents an example of that class. As an example when an active neuron is discarded, suppose class A patterns activate a node 55% of the time and class B patterns activate the node 45% of the time. The method discards such an active node because no class has a significant majority and, therefore, it cannot claim the node as a representative point of any particular class. This phase of labeling active nodes ends once the class ratios (percentages) at every active node for all of the Kohonen nets are fairly stable and all active nodes (neurons) can either be assigned to classes or discarded if no class has a significant majority. The embodiment also discards active nodes that have comparatively low absolute count of patterns; wherein the 
wherein the determination unit determines a configuration pattern among the generated plurality of configuration patterns, as the determined configuration pattern to be utilized for consequent learning, based on the first learning result generate a plurality of configuration patterns, wherein the determination unit determines a configuration pattern among the generated plurality of configuration patterns, as the configuration pattern to be utilized for consequent learning, based on the first learning result (e.g. Kohonen neural network is trained (i.e. the first learning result) using data based on selected features by normalizing input patterns (i.e. a first, second… Nth configuration pattern(s)) to be used during subsequent (i.e. consequent, subsequent, successive, next, etc.) learning processing (i.e. the determined configuration pattern to be utilized for consequent learning), in which appropriate parts of the input pattern for each Kohonen net are selected according to feature subsets, and features are selected that are predictive of different classes of patterns in the data using the training examples input patterns processed (i.e. determines a configuration pattern among the generated plurality of configuration patterns based on the obtained learning result), as indicated above), for example).
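
For illustration only, the separability index quoted from Roy (rkn = dknout/dknin) can be sketched as follows. This is a minimal sketch under stated assumptions and is not Roy's implementation: the absolute difference as the per-feature distance, the function names separability_index and rank_features, and the sample data are all hypothetical. The sketch also assumes at least two patterns in class k and a nonzero within-class distance.

```python
from itertools import combinations
from statistics import mean


def separability_index(patterns, labels, k, n):
    """r_kn = d_out / d_in for feature n and class k (assumed distance: |a - b|)."""
    inside = [p[n] for p, c in zip(patterns, labels) if c == k]
    outside = [p[n] for p, c in zip(patterns, labels) if c != k]
    # Average distance between patterns within class k for feature n.
    d_in = mean(abs(a - b) for a, b in combinations(inside, 2))
    # Average distance between class-k patterns and all other patterns.
    d_out = mean(abs(a - b) for a in inside for b in outside)
    return d_out / d_in


def rank_features(patterns, labels, k):
    """Feature indices for class k, highest separability index first."""
    return sorted(
        range(len(patterns[0])),
        key=lambda n: separability_index(patterns, labels, k, n),
        reverse=True,
    )


# Hypothetical two-feature patterns for two classes.
patterns = [(0.0, 0.0), (1.0, 4.0), (10.0, 1.0), (11.0, 3.0)]
labels = ["A", "A", "B", "B"]

print(rank_features(patterns, labels, "A"))
```

Here feature 0 keeps class A compact and far from class B, so it ranks ahead of feature 1, consistent with the quoted statement that a higher ratio implies a higher rank.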

Regarding claim 7, claim 1 is incorporated and Roy discloses the apparatus, wherein the configuration pattern [the determined configuration pattern] includes an evaluation score (Par. [0026-31]: basic feature ranking criteria are that (1) a good feature for class k should produce good separation between patterns in class k and those not in class k, k=1 . . . kc, and (2) also make the patterns in class k more compact. Based on this idea, a measure called the separability index that can rank features for each class has been proposed. Suppose dknin is the average distance between patterns within class k for feature n, and dknout the average distance between the patterns in class k and those not in class k for feature n… The separability index of feature n for class k is given by rkn=dknout/dknin. One may use this separability index rkn to rank order features of class k where a higher ratio implies a higher rank. The sense of this measure is that a feature n with a lower dknin makes class k more compact and with a higher dknout increases the separation of class k from the other classes. Thus, the higher the ratio rkn for a feature n, the greater is its ability to separate class k from the other classes and the better the feature … N-dimensional vector x, x=(X1, X2, . . . ,XN) represents an input pattern in the streaming data… Let FPq denote the qth feature subset, q=1 . . . FS, where FS is the total number of feature subsets… as the embodiment processes some more streaming data… keeps count of the number of times an input pattern of a particular class activated a particular neuron (i.e., the neuron was the winning neuron for those input patterns). For example, given there are two classes, A and B, for each active node, the method keeps count of the number of times input patterns from each of these two classes activates the node. Suppose class A patterns activate one such neuron (node) 85 times and class B patterns activate the node 15 times. 
At this node then, approximately 85% of the activating input patterns belong to class A and 15% belong to class B. Since a significant majority of the activating patterns belong to class A, the method simply assigns this active neuron to class A. Assigning an active neuron to a class simply means that that neuron represents an example of that class. As an example when an active neuron is discarded, suppose class A patterns activate a node 55% of the time and class B patterns activate the node 45% of the time. The method discards such an active node because no class has a significant majority and, therefore, it cannot claim the node as a representative point of any particular class. This phase of labeling active nodes ends once the class ratios (percentages) at every active node for all of the Kohonen nets are fairly stable and all active nodes (neurons) can either be assigned to classes or discarded if no class has a significant majority. The embodiment also discards active nodes that have comparatively low absolute count of patterns; Par. [0030-34]: method repeats this overall process of computing separability indices a few times by randomly selecting features for each feature partition, according to one embodiment. The method then uses the maximum separability index value of each feature over these repetitions for final ranking of the features… find the approximate maximum and minimum values of each feature. Use the range to normalize streaming input patterns during subsequent processing; Par. [0026-31]: basic feature ranking criteria are that (1) a good feature for class k should produce good separation between patterns in class k and those not in class k, k=1 . . . kc, and (2) also make the patterns in class k more compact. Based on this idea, a measure called the separability index that can rank features for each class has been proposed. 
Suppose dknin is the average distance between patterns within class k for feature n, and dknout the average distance between the patterns in class k and those not in class k for feature n… The separability index of feature n for class k is given by rkn=dknout/dknin. One may use this separability index rkn to rank order features of class k where a higher ratio implies a higher rank. The sense of this measure is that a feature n with a lower dknin makes class k more compact and with a higher dknout increases the separation of class k from the other classes. Thus, the higher the ratio rkn for a feature n, the greater is its ability to separate class k from the other classes and the better the feature … N-dimensional vector x, x=(X1, X2, . . . ,XN) represents an input pattern in the streaming data… Let FPq denote the qth feature subset, q=1 . . . FS, where FS is the total number of feature subsets… as the embodiment processes some more streaming data… keeps count of the number of times an input pattern of a particular class activated a particular neuron (i.e., the neuron was the winning neuron for those input patterns). For example, given there are two classes, A and B, for each active node, the method keeps count of the number of times input patterns from each of these two classes activates the node. Suppose class A patterns activate one such neuron (node) 85 times and class B patterns activate the node 15 times. At this node then, approximately 85% of the activating input patterns belong to class A and 15% belong to class B. Since a significant majority of the activating patterns belong to class A, the method simply assigns this active neuron to class A. Assigning an active neuron to a class simply means that that neuron represents an example of that class. As an example when an active neuron is discarded, suppose class A patterns activate a node 55% of the time and class B patterns activate the node 45% of the time. 
The method discards such an active node because no class has a significant majority and, therefore, it cannot claim the node as a representative point of any particular class. This phase of labeling active nodes ends once the class ratios (percentages) at every active node for all of the Kohonen nets are fairly stable and all active nodes (neurons) can either be assigned to classes or discarded if no class has a significant majority. The embodiment also discards active nodes that have comparatively low absolute count of patterns; Par. [0048-50]: explores different feature spaces given the class-specific feature rankings. In general, the process creates buckets of features and then trains several Kohonen nets of different grid sizes… for the feature spaces contained in the buckets… For the first bucket, select the top ranked feature of each class. For the second bucket, select the top two ranked features of each class and similarly create other buckets. The procedure, therefore, sequentially adds top ranked features of each class to create the buckets. Thus, the ith bucket of features will have j top ranked features from each class… A bucket, therefore, consists of a variety of feature spaces and the method trains a variety of Kohonen nets for each such feature space; Par. [0105-121]: find the approximate maximum and minimum values of each feature. Use the range to normalize streaming input patterns during subsequent processing… Repeat steps 2 through 8 a few times and track the maximum separability index value of each feature… Rank features on the basis of their maximum separability index value; wherein the determined configuration pattern includes an evaluation score (e.g. algorithm uses Kohonen nets as a tool to break up class regions into smaller sub-regions (i.e. sets, subsets, mini-batches, groups, etc.) to 

Regarding claim 8, claim 1 is incorporated and Roy discloses the apparatus, wherein the one or more processors further function as an acquisition unit configured to acquire the class information from the learning data (Par. [0026-31]: basic feature ranking criteria are that (1) a good feature for class k should produce good separation between patterns in class k and those not in class k, k=1 . . . kc, and (2) also make the patterns in class k more compact. Based on this idea, a measure called the separability index that can rank features for each class has been proposed. Suppose dknin is the average distance between patterns within class k for feature n, and dknout the average distance between the patterns in class k and those not in class k for feature n… The separability index of feature n for class k is given by rkn=dknout/dknin. One may use this separability index rkn to rank order features of class k where a higher ratio implies a higher rank. The sense of this measure is that a feature n with a lower dknin makes class k more compact and with a higher dknout increases the separation of class k from the other classes. Thus, the higher the ratio rkn for a feature n, the greater is its ability to separate class k from the other classes and the better the feature … N-dimensional vector x, x=(X1, X2, . . . ,XN) represents an input pattern in the streaming data… Let FPq denote the qth feature subset, q=1 . . . FS, where FS is the total number of feature subsets… as the embodiment processes some more streaming data… keeps count of the number of times an input pattern of a particular class activated a particular neuron (i.e., the neuron was the winning neuron for those input patterns). For example, given there are two classes, A and B, for each active node, the method keeps count of the number of times input patterns from each of these two classes activates the node. 
Suppose class A patterns activate one such neuron (node) 85 times and class B patterns activate the node 15 times. At this node then, approximately 85% of the activating input patterns belong to class A and 15% belong to class B. Since a significant majority of the activating patterns belong to class A, the method simply assigns this active neuron to class A. Assigning an active neuron to a class simply means that that neuron represents an example of that class. As an example when an active neuron is discarded, suppose class A patterns activate a node 55% of the time and class B patterns activate the node 45% of the time. The method discards such an active node because no class has a significant majority and, therefore, it cannot claim the node as a representative point of any particular class. This phase of labeling active nodes ends once the class ratios (percentages) at every active node for all of the Kohonen nets are fairly stable and all active nodes (neurons) can either be assigned to classes or discarded if no class has a significant majority. The embodiment also discards active nodes that have comparatively low absolute count of patterns; acquire the class information from the learning data (e.g. features are selected that are predictive of different classes of patterns in the data, using the training examples, and a set of Kohonen networks is trained using the data, based on the selected features, then, active nodes are identified and extracted from the set of Kohenen nets that are representative of a class of patterns in the data, and classes are assigned to the extracted active nodes (i.e. acquire the class information from the learning data), as indicated above), for example).
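
For illustration only, the node-labeling rule quoted from Roy above (85%/15% assigned, 55%/45% discarded) can be sketched as follows. The sketch is not part of Roy's disclosure. The function name label_active_node and the thresholds min_share and min_total are assumptions, since the quoted passage refers only to a "significant majority" and a "comparatively low absolute count" without stating values.

```python
from collections import Counter


def label_active_node(class_counts, min_share=0.70, min_total=10):
    """Return the class assigned to an active node, or None if it is discarded.

    class_counts maps a class label to the number of input patterns of that
    class that activated the node. min_share and min_total are hypothetical
    thresholds chosen for this sketch, not values taken from Roy.
    """
    total = sum(class_counts.values())
    # Discard nodes with a comparatively low absolute count of patterns.
    if total < min_total:
        return None
    top_class, top_count = Counter(class_counts).most_common(1)[0]
    # Assign the node only when one class holds a significant majority.
    if top_count / total >= min_share:
        return top_class
    return None


if __name__ == "__main__":
    print(label_active_node({"A": 85, "B": 15}))  # node assigned to class A
    print(label_active_node({"A": 55, "B": 45}))  # node discarded
```

With these assumed thresholds, the 85/15 node is assigned to class A and the 55/45 node is discarded, matching the two examples in the quoted passage.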

Regarding claim 9, claim 8 is incorporated and Roy discloses the apparatus, wherein the acquisition unit classifies the learning data into a plurality of clusters, and generates the clusters as class information of each piece of learning data (Par. [0026-31]: basic feature ranking criteria are that (1) a good feature for class k should produce good separation between patterns in class k and those not in class k, k=1 . . . kc, and (2) also make the patterns in class k more compact. Based on this idea, a measure called the separability index that can rank features for each class has been proposed. Suppose dknin is the average distance between patterns within class k for feature n, and dknout the average distance between the patterns in class k and those not in class k for feature n… The separability index of feature n for class k is given by rkn=dknout/dknin. One may use this separability index rkn to rank order features of class k where a higher ratio implies a higher rank. The sense of this measure is that a feature n with a lower dknin makes class k more compact and with a higher dknout increases the separation of class k from the other classes. Thus, the higher the ratio rkn for a feature n, the greater is its ability to separate class k from the other classes and the better the feature … N-dimensional vector x, x=(X1, X2, . . . ,XN) represents an input pattern in the streaming data… Let FPq denote the qth feature subset, q=1 . . . FS, where FS is the total number of feature subsets… as the embodiment processes some more streaming data… keeps count of the number of times an input pattern of a particular class activated a particular neuron (i.e., the neuron was the winning neuron for those input patterns). For example, given there are two classes, A and B, for each active node, the method keeps count of the number of times input patterns from each of these two classes activates the node. 
Suppose class A patterns activate one such neuron (node) 85 times and class B patterns activate the node 15 times. At this node then, approximately 85% of the activating input patterns belong to class A and 15% belong to class B. Since a significant majority of the activating patterns belong to class A, the method simply assigns this active neuron to class A. Assigning an active neuron to a class simply means that that neuron represents an example of that class. As an example when an active neuron is discarded, suppose class A patterns activate a node 55% of the time and class B patterns activate the node 45% of the time. The method discards such an active node because no class has a significant majority and, therefore, it cannot claim the node as a representative point of any particular class. This phase of labeling active nodes ends once the class ratios (percentages) at every active node for all of the Kohonen nets are fairly stable and all active nodes (neurons) can either be assigned to classes or discarded if no class has a significant majority. The embodiment also discards active nodes that have comparatively low absolute count of patterns; Par. [0098-102]: A Kohonen net, as shown in FIG. 3, is generally used for clustering data into separate classes of patterns in a data stream. For classification problems, once it finds clusters, one can then label the nodes based on the majority class at that node and use it as a prediction system… one embodiment identifies Kohonen nodes with a significant presence of the minority class and then uses the training data points at those nodes to train another set of Kohonen nets. The basic idea is to break up the data points at those nodes to find the minority class regions. 
Henceforth, the Kohonen nets for the individual nodes with significant minority class presence are often referred to as Kohonen submodels or subnets… the algorithm uses Kohonen nets as a tool to break up class regions into smaller sub-regions to provide better visibility to the different class regions. It is somewhat similar to decision tree methods. However, one of the powerful features of Kohonen nets is that it breaks up (that is, it groups) data points considering all of the features, unlike decision tree methods that only consider a subset of features to build trees… the algorithm breaks up the regions corresponding to these nodes to gain better visibility to both the majority and minority class subregions; classifies the learning data into a plurality of clusters, and generates the clusters as class information of each piece of learning data (e.g. Kohonen net is used for clustering data into separate classes of patterns in a data stream (i.e. classifies the learning data into a plurality of clusters), and once it finds clusters, one can then label the nodes based on the majority class at that node and use it as a prediction system (i.e. generates the clusters as class information of each piece of learning data), as indicated above), for example).
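
For illustration only, the separability index rkn=dknout/dknin quoted from Roy Par. [0026-31] above can be sketched for a single feature as follows. The sketch is not part of Roy's disclosure. It assumes that "distance" means the absolute difference of one-dimensional feature values and that the averages are taken over all pairs; the function names are hypothetical.

```python
from itertools import combinations


def mean_distance_within(values):
    """Average absolute difference over all pairs inside one class (d_in)."""
    pairs = list(combinations(values, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


def mean_distance_between(inside, outside):
    """Average absolute difference between class members and non-members (d_out)."""
    total = sum(abs(a - b) for a in inside for b in outside)
    return total / (len(inside) * len(outside))


def separability_index(inside, outside):
    """r = d_out / d_in for one feature and one class.

    Assumes at least two distinct values in `inside` so that d_in is nonzero.
    """
    return mean_distance_between(inside, outside) / mean_distance_within(inside)


if __name__ == "__main__":
    # Feature 1: class k is compact and far from the other patterns.
    r1 = separability_index([1.0, 2.0, 3.0], [10.0, 12.0])
    # Feature 2: class k is spread out and overlaps the other patterns.
    r2 = separability_index([1.0, 5.0, 9.0], [4.0, 6.0])
    print(r1, r2)
```

Under these assumptions the first feature receives the higher index and would therefore be ranked above the second for class k, consistent with the statement that a higher ratio implies a higher rank.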

Regarding claim 11, claim 1 is incorporated and Roy discloses the apparatus, wherein the mini batch generation unit generates the second mini batch including a learning data group for learning and a learning data group for evaluation, as the mini batch (Par. [0026-31]: basic feature ranking criteria are that (1) a good feature for class k should produce good separation between patterns in class k and those not in class k, k=1 . . . kc, and (2) also make the patterns in class k more compact. Based on this idea, a measure called the separability index that can rank features for each class has been proposed. Suppose dknin is the average distance between patterns within class k for feature n, and dknout the average distance between the patterns in class k and those not in class k for feature n… The separability index of feature n for class k is given by rkn=dknout/dknin. One may use this separability index rkn to rank order features of class k where a higher ratio implies a higher rank. The sense of this measure is that a feature n with a lower dknin makes class k more compact and with a higher dknout increases the separation of class k from the other classes. Thus, the higher the ratio rkn for a feature n, the greater is its ability to separate class k from the other classes and the better the feature … N-dimensional vector x, x=(X1, X2, . . . ,XN) represents an input pattern in the streaming data… Let FPq denote the qth feature subset, q=1 . . . FS, where FS is the total number of feature subsets… as the embodiment processes some more streaming data… keeps count of the number of times an input pattern of a particular class activated a particular neuron (i.e., the neuron was the winning neuron for those input patterns). For example, given there are two classes, A and B, for each active node, the method keeps count of the number of times input patterns from each of these two classes activates the node. 
Suppose class A patterns activate one such neuron (node) 85 times and class B patterns activate the node 15 times. At this node then, approximately 85% of the activating input patterns belong to class A and 15% belong to class B. Since a significant majority of the activating patterns belong to class A, the method simply assigns this active neuron to class A. Assigning an active neuron to a class simply means that that neuron represents an example of that class. As an example when an active neuron is discarded, suppose class A patterns activate a node 55% of the time and class B patterns activate the node 45% of the time. The method discards such an active node because no class has a significant majority and, therefore, it cannot claim the node as a representative point of any particular class. This phase of labeling active nodes ends once the class ratios (percentages) at every active node for all of the Kohonen nets are fairly stable and all active nodes (neurons) can either be assigned to classes or discarded if no class has a significant majority. The embodiment also discards active nodes that have comparatively low absolute count of patterns; Par. [0048-50]: explores different feature spaces given the class-specific feature rankings. In general, the process creates buckets of features and then trains several Kohonen nets of different grid sizes… for the feature spaces contained in the buckets… For the first bucket, select the top ranked feature of each class. For the second bucket, select the top two ranked features of each class and similarly create other buckets. The procedure, therefore, sequentially adds top ranked features of each class to create the buckets. Thus, the ith bucket of features will have j top ranked features from each class… A bucket, therefore, consists of a variety of feature spaces and the method trains a variety of Kohonen nets for each such feature space; Par. 
[0056-61]: Let Bmax be the maximum number of feature buckets… the first bucket will have 6 features, 3 from each class, and the last bucket will have all 120 features, 60 from each class. And each bucket will always have three feature spaces--one for each class and the third for the combined set of features. And there will be FG Kohonen nets of different grid sizes for each of the three feature spaces in each bucket... Let Inc be the number of features added each time to a bucket for each class. Inc is calculated from the number of top-ranked features to use from each class and Bmax. Let FBj be the jth bucket of features… Note that although the active nodes ANkj resulted from Kohonen nets built with the class k feature set in bucket j, these active nodes could belong to (that is, be assigned to) any of the classes k, k=1 . . . kc… kc, be the class count percentage of the mth class at active node ANkji and let CTAkji be the absolute count of input patterns processed at that active node… minimum required percentage of class counts for a class in order to assign an active node to that class… class count percentage of the mth class at active node… initialize FG Kohonen nets for a feature set that includes all of the features from all classes in bucket j… class count percentage of the mth class at active node ANkji, m = 1 . . . kc CTAkji the absolute count of input patterns processed at active node ANkji… Train all KNmax Kohonen nets in parallel using streaming data and selecting appropriate parts of the input pattern for each Kohonen net according to the feature subsets; wherein th) smaller sub-regions (i.e. generate sets, subsets, mini-batches, groups, etc.) to provide better visibility to the different class regions in order to select appropriate parts of an input pattern (i.e. the mini batch) for each Kohonen net according to feature subsets (i.e. 
a learning data group for learning), including ranking criteria for class k to produce separation between patterns in class k and those not in class k, k=1 . . . kc (i.e. a learning data group), in which a measure (i.e. evaluation), called separability index, ranks (i.e. scores) features for each class to rank order features of class k, where a higher ratio implies a higher rank, as indicated above), for example).
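
For illustration only, the bucket construction quoted from Roy Par. [0048-50] and [0056-61] above can be sketched as follows. The sketch is not part of Roy's disclosure. The function name build_buckets, the dictionary layout, and the removal of duplicate features from the combined feature space are assumptions; the parameters inc and b_max stand in for Roy's Inc and Bmax.

```python
def build_buckets(ranked_by_class, inc, b_max):
    """Build feature buckets from class-specific feature rankings.

    ranked_by_class maps each class to its features, best ranked first.
    Bucket j (1-based) holds the top j * inc features of every class plus a
    combined feature space. Dropping duplicates from the combined space is
    an assumption of this sketch.
    """
    buckets = []
    for j in range(1, b_max + 1):
        top = j * inc
        per_class = {k: feats[:top] for k, feats in ranked_by_class.items()}
        combined = []
        for feats in per_class.values():
            for feature in feats:
                if feature not in combined:
                    combined.append(feature)
        buckets.append({"per_class": per_class, "combined": combined})
    return buckets


if __name__ == "__main__":
    ranked = {"A": ["f3", "f1", "f2"], "B": ["f2", "f3", "f1"]}
    for bucket in build_buckets(ranked, inc=1, b_max=2):
        print(bucket)
```

For two classes, each bucket in this sketch holds three feature spaces, one per class and one combined, which corresponds to the three feature spaces per bucket described in the quoted passage.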

Regarding claim 14, claim 1 is incorporated and Roy discloses the apparatus, wherein the one or more processors further function as a display control unit configured to display information about the configuration pattern [the determined configuration pattern] at a display unit, during or after learning by the learning unit (Par. [0149]: terms such as "processing" or "computing" or "calculating" or "determining" or "displaying" or the like, refer to the action and processes of a computer system or computing platform, or similar electronic computing device(s), that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices; a 

    [Image: media_image1.png]

), for example).

Regarding claim 15, claim 1 is incorporated and Roy discloses the apparatus, wherein the one or more processors further function as an evaluation unit configured to evaluate a learning result obtained by the learning unit, using data different from learning data included in the mini batch [the second mini batch],
a method for analyzing patterns in a data stream and taking an action based on the analysis… . A volume of data is received, and the data is trained to create training examples. Features are selected that are predictive of different classes of patterns in the data, using the training examples. A set of Kohonen networks is trained using the data, based on the selected features. Then, active nodes are identified and extracted from the set of Kohenen nets that are representative of a class of patterns in the data. Classes are assigned to the extracted active nodes; Par. [0027-31]: trains many different Kohonen nets, of different grid sizes, and for different feature subsets… the reason for using different grid sizes for the same feature partition is to get different representative examples to compute the separability indices… some of the active nodes of Kohonen nets trained for different feature partitions serve as representative training examples of different classes and are used to compute the separability indices; Par. [0048-52]: explores different feature spaces given the class-specific feature rankings. In general, the process creates buckets of features and then trains several Kohonen nets of different grid sizes… for the feature spaces contained in the buckets… selecting Kohonen neurons from various Kohonen nets in different feature spaces and of different grid sizes. Note that, at the end of this final phase, the method discards all of the trained Kohonen nets and retains only a selected set of Kohonen neurons to serve as hyperspheres in different hypersphere nets; evaluate a learning result obtained by 

Regarding claim 16, claim 1 is incorporated and Roy discloses the apparatus, wherein the one or more processors further function as a selection unit configured to select learning data corresponding to the determined configuration pattern, based on the learning result,
wherein the learning unit performs the learning, using the mini batch [the first mini batch] including the selected learning data (Par. [0003]: a method for analyzing patterns in a data stream and taking an action based on the analysis… . A volume of data is received, and the data is trained to create training examples. Features are selected that are predictive of different classes of patterns in the data, using the training examples. A set of Kohonen networks is trained using the data, based on the selected features. Then, active nodes are identified and extracted from the set of Kohenen nets that are representative of a class of patterns in the data. Classes are assigned to the extracted active nodes; Par. [0027-31]: trains many different Kohonen nets, of different grid sizes, and for different feature subsets… the reason for using different grid sizes for the same feature partition is to get different representative examples to compute the separability indices… some of the active nodes of Kohonen nets trained for different feature partitions serve as representative training examples of different classes and are used to compute the separability indices; Par. [0048-52]: explores different feature spaces given the class-specific feature rankings. In general, the process creates buckets of features and then trains several Kohonen nets of different grid sizes… for the feature spaces contained in the buckets… selecting Kohonen neurons from various Kohonen nets in different feature spaces and of different grid sizes. Note that, at the end of this final phase, the method discards all of the trained Kohonen nets and retains only a selected set of Kohonen neurons to serve as hyperspheres in different hypersphere nets; select learning data corresponding to the determined configuration pattern, based on the learning result, wherein the learning unit performs the learning, using the first mini batch including the selected learning data (e.g. 
Kohonen neural network is trained using data based on selected features by normalizing input patterns (i.e. configuration patterns) to be used during subsequent 

Regarding claim 19, it is a corresponding method claim and is rejected as applied to apparatus claim 1 above.

Regarding claim 20, it is a corresponding computer readable medium claim and is rejected as applied to apparatus claim 1 above.

Claims 3-4, 12-13, and 17-18 are rejected under 35 U.S.C. 103 as being unpatentable over Roy in view of Iyengar, as applied to claim 1 above, and further in view of Zadeh et al. (US PG Pub. No. 2014/0201126 A1), hereafter referred to as Zadeh.

Regarding claim 3, claim 2 is incorporated and the combination of Roy and Iyengar, as a whole, teaches the apparatus (Roy Par. [0003, 150]), but fails to teach the following as further recited in claim 3.
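However, Zadeh teaches wherein the one or more processors further function as a change unit configured to change a probability that each of the generated plurality of configuration patterns is determined as the configuration pattern [the determined configuration pattern] to be utilized for subsequent learning, the probability being changed based on the first learning result,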
wherein the determination unit determines the configuration pattern [the determined configuration pattern] to be utilized for subsequent learning, based on a probability changed by the change unit for each of the plurality of configuration patterns (Par. [1577-1588]: a fuzzy probability measure (p*) of fuzzy map A*, given probability distribution p(x), is determined… a test score is associated with a proposition or fact (e.g., in form of X is A). In one embodiment, this test score is based on a probability measure of A based on a probability distribution in X… a test score is associated with a proposition or fact (e.g., in form of X is A)… this test score is based on a probability measure of A based on a probability distribution in X. In one embodiment, a fuzzy test score is associated with a proposition or fact (e.g., in form of X is A*), where the test score is based on a fuzzy probability measure of A* and a probability distribution in X… the set of candidate probability distributions is based on one or more parameters associated to a model of probability distribution function, e.g. a family of class of probability distribution functions… the fuzzy logic inference engine uses a pattern matching algorithm in a forward chaining inference; Par. [1617]: inference engine (system), with a pattern matching engine that matches the current data state against the predicate of each rule, to find the ones that should be executed (or fired). Pattern matching module is connected to both processing (or controlling) module and interpreter module, to find the rules and also to change the association threads that find each candidate node for next loop (cycle); Par. [1808]: the unclamped labels do not contribute to the error function, and their related weights are prevented to change during the learning step (e.g., by setting the corresponding learning rate to zero for the related weights and biases). 
In one embodiment, the labels provided for the training are associated with corresponding reliability factors. In one embodiment, such reliability factors (e.g., in range of [0,1]) are used to scale the learning step related to weights and biases of such unit. In one embodiment, the state of unclamped label units are allowed to vary stochastically based on links form other units; Par. [2197]: clustering algorithm, which produces input pattern groups with corresponding cluster centers. To learn fuzzy functions, one can use adaptive vector quantization (AVQ) (using unsupervised AVQ competitive learning) to estimate the local centroids (and covariance matrices) of clusters in the input-output space. From the resulting ellipsoid, one can derive the fuzzy rules (and fuzzy patches)… one can use the Kohonen self-organizing map (SOM)… to change weight vectors for a network (for modeling the features in training samples); change a probability that each of the generated plurality of configuration patterns is determined as the determined configuration pattern to be utilized for subsequent learning, the probability being changed based on the first learning result, wherein the determination unit determines the determined configuration pattern to be utilized for subsequent learning, based on a probability changed by the change unit for each of the plurality of configuration patterns (i.e. transition) unit for each of the plurality of configuration patterns (e.g. fuzzy logic inference engine which uses a pattern matching algorithm in a forward chaining inference, including a fuzzy (i.e. variable, changeable, etc.) probability 
Roy, Iyengar, and Zadeh are considered to be analogous art because they pertain to artificial intelligence (i.e. learning machines) using artificial neural networks applied to image processing applications. Therefore, the combined teachings of Roy, Iyengar, and Zadeh, as a whole, would have rendered obvious the invention recited in claim 3 with a reasonable expectation of success. It would have been obvious to modify the apparatus for analyzing patterns in a data stream and taking an action based on the analysis (as disclosed by Roy) to change a probability that each of the generated plurality of configuration patterns is determined as the configuration pattern to be utilized for subsequent learning, the probability being changed based on the first learning result, wherein the determination unit determines the configuration pattern to be utilized for subsequent learning, based on a probability changed by the change unit for each of the plurality of configuration patterns (as taught by Zadeh, Abstract, Par. [1577-1588, 1617, 1808, 2197]), in order to return a ranked list of classes or categories by using combined classifiers to improve the performance of the combination (Zadeh, Abstract, Par. [1865]).

Regarding claim 4, claim 2 is incorporated and the combination of Roy and Iyengar, as a whole, teaches the apparatus (Roy Par. [0003, 150]), but fails to teach the following as further recited in claim 4.
However, Zadeh teaches the apparatus further comprising a storage unit configured to store the generated plurality of configuration patterns and evaluation scores of the respective configuration patterns in association with each other (Par. [1577-1588]: a fuzzy probability measure (p*) of fuzzy map A*, given probability distribution p(x), is determined… a test score is associated with a proposition or fact (e.g., in form of X is A). In one embodiment, this test score is based on a probability measure of A based on a probability distribution in X… a test score is associated with a proposition or fact (e.g., in form of X is A)… this test score is based on a probability measure of A based on a probability distribution in X. In one embodiment, a fuzzy test score is associated with a proposition or fact (e.g., in form of X is A*), where the test score is based on a fuzzy probability measure of A* and a probability distribution in X… the set of candidate probability distributions is based on one or more parameters associated to a model of probability distribution function, e.g. a family of class of probability distribution functions… the fuzzy logic inference engine uses a pattern matching algorithm in a forward chaining inference; Par. [1617]: inference engine (system), with a pattern matching engine that matches the current data state against the predicate of each rule, to find the ones that should be executed (or fired). Pattern matching module is connected to both processing (or controlling) module and interpreter module, to find the rules and also to change the association threads that find each candidate node for next loop (cycle); Par. [1808]: the unclamped labels do not contribute to the error function, and their related weights are prevented to change during the learning step (e.g., by setting the corresponding learning rate to zero for the related weights and biases). In one embodiment, the labels provided for the training are associated with corresponding reliability factors. In one embodiment, such reliability factors (e.g., in range of [0,1]) are used to scale the learning step related to weights and biases of such unit. 
In one embodiment, the state of unclamped label units are allowed to vary stochastically based on links form other units; Par. [2197]: clustering algorithm, which produces input pattern groups with corresponding cluster centers. To learn fuzzy functions, one can use adaptive vector quantization (AVQ) (using unsupervised AVQ competitive learning) to estimate the local centroids (and covariance matrices) of clusters in the input-output space. From the resulting ellipsoid, one can derive the fuzzy rules (and fuzzy patches)… one can use the Kohonen self-organizing map (SOM)… to change weight vectors for a network (for modeling the features in training samples); further comprising a storage unit configured to store the generated plurality of configuration patterns and evaluation scores of the respective configuration patterns in association with each other (e.g. fuzzy logic inference uses a pattern matching algorithm in a forward chaining inference, including a clustering algorithm, which produces input pattern groups with corresponding cluster centers (i.e. the generated plurality of configuration patterns), and test scores (i.e. evaluation scores) based on a set of candidate probability distributions based on one or more parameters associated to a model of probability distribution function (i.e. evaluation scores of the respective configuration patterns in association with each other), as indicated above), for example).


Regarding claim 12, claim 1 is incorporated and the combination of Roy and Iyengar, as a whole, teaches the apparatus (Roy Par. [0003, 150]), but fails to teach the following as further recited in claim 12.
However, Zadeh teaches wherein the learning unit updates a weight of the neural network by calculating losses of respective pieces of the learning data, and performs back propagation for an average of the losses of the learning data (Par. [1734]: by using a back propagation method based on gradient decent. Since the initial weights of autoencoder were determined by a greedy pre-training of lower RBMs, the back propagation will be efficient… during the back propagation fine tuning, the stochastic binary units are assumed to be deterministic continuous value units adopting the probability value as their state value, to carry out the back propagation… the objective function ( error function) to optimize in back propagation, is the cross entropy error; Par. [1762-1763]: an error function (to be minimized by training) defined over the training sample space (e.g., in a batch processing of an epoch) accounts for data sample reliability by including sample reliability factor as a weight in the contribution of the data sample to the batch error function, e.g., in the summation of the errors contributed from individual data samples… for example, a stochastic approach is used (instead of full epoch batch) to sample one (or several) training data sample(s) while optimizing the sample error function, and the sample error function is weighted by the reliability factor of the data sample… the learning rate (e.g., the factor associated with the step to take in modifying the weights during the training) is modified based on the reliability weight for a given data sample used during the learning (e.g., in stochastic sampling of the data samples); Par. [1808]: the unclamped labels do not contribute to the error function, and their related weights are prevented to change during the learning step (e.g., by setting the corresponding learning rate to zero for the related weights and biases). 
In one embodiment, the labels provided for the training are associated with corresponding reliability factors. In one embodiment, such reliability factors (e.g., in range of [0,1]) are used to scale the learning step related to weights and biases of such unit. In one embodiment, the state of unclamped label units are allowed to vary stochastically based on links form other units; Par. [1869]: we use "Fuzzy c-Means Clustering Method", with a fuzzy pseudopartition or fuzzy c-partition of our set (where c is the number of fuzzy classes in partition), in terms of cluster centers, and using inner product induced norm in our space (representing distances in that space). The performance metrics measures the weighted sum of distances between cluster centers and elements in those clusters; Par. [2216-2222]: for machine learning, we use neural networks, perceptrons, including… back propagation algorithm (including convergence and local minima problem)… or reinforcement learning, which all can be combined with our methods in this disclosure, as a complementary method, for improving the performance or efficiency… active supervised learning (in which we query about the data, actively), active reinforcement learning; learning unit updates a weight of the neural network by 
The same motivation to combine above-mentioned teachings applies, as previously indicated in claim 3.

Regarding claim 13, claim 12 is incorporated and the combination of Roy and Iyengar, as a whole, teaches the apparatus (Roy Par. [0003, 150]), but fails to teach the following as further recited in claim 13.
However, Zadeh teaches wherein the learning unit receives a learning set of learning data of the mini batch [the second mini batch] as an input, and calculates the losses of the respective pieces of the learning data by inputting a final output and supervisory information of the learning set into a loss function (Par. [1734]: by using a back propagation method based on gradient decent. Since the initial weights of autoencoder were determined by a greedy pre-training of lower RBMs, the back propagation will be efficient… during the back propagation fine tuning, the stochastic binary units are assumed to be deterministic continuous value units adopting the probability value as their state value, to carry out the back propagation… the objective function ( error function) to optimize in back propagation, is the cross entropy error; Par. [1762-1763]: an error function (to be minimized by training) defined over the training sample space (e.g., in a batch processing of an epoch) accounts for data sample reliability by including sample reliability factor as a weight in the contribution of the data sample to the batch error function, e.g., in the summation of the errors contributed from individual data samples… for example, a stochastic approach is used (instead of full epoch batch) to sample one (or several) training data sample(s) while optimizing the sample error function, and the sample error function is weighted by the reliability factor of the data sample… the learning rate (e.g., the factor associated with the step to take in modifying the weights during the training) is modified based on the reliability weight for a given data sample used during the learning (e.g., in stochastic sampling of the data samples); par. [1770]: back propagation is used for fine tuning of the weights/biases. 
In one embodiment, the added units and the previous units are used to make association and/or correlation with labeled samples, e.g., during the supervised training; wherein the learning unit receives a learning set of learning data of the second mini batch as an input, and calculates the losses of the respective pieces of the learning data by inputting a final output and supervisory information of the learning set into a loss function (e.g. an error (i.e. loss) function (to be minimized by training) defined over the training sample space (e.g., in a batch processing of an epoch) accounts for data sample reliability by including sample reliability factor as a weight in the contribution of the data sample to the batch error function, in which a stochastic approach is used (instead of full epoch batch) to sample one (or several) training data sample(s) (i.e. mini batches) while optimizing the sample error function, to make association and/or 
The same motivation to combine above-mentioned teachings applies, as previously indicated in claim 3.

Regarding claim 17, claim 16 is incorporated and the combination of Roy and Iyengar, as a whole, teaches the apparatus (Roy Par. [0003, 150]), but fails to teach the following as further recited in claim 17.
However, Zadeh teaches wherein the one or more processors further function as a change unit configured to change a probability of selection for each piece of the learning data by the selection unit, based on the learning result,
wherein the selection unit selects learning data corresponding to a configuration pattern, based on the changed probability for each piece of the learning data (Par. [1577-1588]: a fuzzy probability measure (p*) of fuzzy map A*, given probability distribution p(x), is determined… a test score is associated with a proposition or fact (e.g., in form of X is A). In one embodiment, this test score is based on a probability measure of A based on a probability distribution in X… a test score is associated with a proposition or fact (e.g., in form of X is A)… this test score is based on a probability measure of A based on a probability distribution in X. In one embodiment, a fuzzy test score is associated with a proposition or fact (e.g., in form of X is A*), where the test score is based on a fuzzy probability measure of A* and a probability distribution in X… the set of candidate probability distributions is based on one or more parameters associated to a model of probability distribution function, e.g. a family of class of probability distribution functions… the fuzzy logic inference engine uses a pattern matching algorithm in a forward chaining inference; Par. [1617]: inference engine (system), with a pattern matching engine that matches the current data state against the predicate of each rule, to find the ones that should be executed (or fired). Pattern matching module is connected to both processing (or controlling) module and interpreter module, to find the rules and also to change the association threads that find each candidate node for next loop (cycle); Par. [1808]: the unclamped labels do not contribute to the error function, and their related weights are prevented to change during the learning step (e.g., by setting the corresponding learning rate to zero for the related weights and biases). In one embodiment, the labels provided for the training are associated with corresponding reliability factors. 
In one embodiment, such reliability factors (e.g., in range of [0,1]) are used to scale the learning step related to weights and biases of such unit. In one embodiment, the state of unclamped label units are allowed to vary stochastically based on links form other units; Par. [2197]: clustering algorithm, which produces input pattern groups with corresponding cluster centers. To learn fuzzy functions, one can use adaptive vector quantization (AVQ) (using unsupervised AVQ competitive learning) to estimate the local centroids (and covariance matrices) of clusters in the input-output space. From the resulting ellipsoid, one can derive the fuzzy rules (and fuzzy patches)… one can use the Kohonen self-organizing map (SOM)… to change weight vectors for a network (for modeling the features in training samples); change a probability of selection for each piece of the learning data 
The same motivation to combine above-mentioned teachings applies, as previously indicated in claim 3.

Regarding claim 18, claim 1 is incorporated and Roy discloses the apparatus, but fails to teach the following as further recited in claim 18.
However, Zadeh teaches wherein the learning unit performs reinforcement learning of the neural network (Par. [2216-2222]: for machine learning, we use neural networks, perceptrons, including… back propagation algorithm (including convergence and local minima problem)… or reinforcement learning, which all can be combined with our methods in this disclosure, as a complementary method, for improving the performance or efficiency… active supervised learning (in which we query about the data, actively), active reinforcement learning), and wherein the determination unit determines the determined configuration pattern to be utilized for subsequent learning, based on a plurality of learning results obtained by the learning unit (Par. [1577-1588]:
a fuzzy probability measure (p*) of fuzzy map A*, given probability distribution p(x), is determined… a test score is associated with a proposition or fact (e.g., in form of X is A). In one embodiment, this test score is based on a probability measure of A based on a probability distribution in X… a test score is associated with a proposition or fact (e.g., in form of X is A)… this test score is based on a probability measure of A based on a probability distribution in X. In one embodiment, a fuzzy test score is associated with a proposition or fact (e.g., in form of X is A*), where the test score is based on a fuzzy probability measure of A* and a probability distribution in X… the set of candidate probability distributions is based on one or more parameters associated to a model of probability distribution function, e.g. a family of class of probability distribution functions… the fuzzy logic inference engine uses a pattern matching algorithm in a forward chaining inference; Par. [1617]: inference engine (system), with a pattern matching engine that matches the current data state against the predicate of each rule, to find the ones that should be executed (or fired). Pattern matching module is connected to both processing (or controlling) module and interpreter module, to find the rules and also to change the association threads that find each candidate node for next loop (cycle); Par. [1808]: the unclamped labels do not contribute to the error function, and their related weights are prevented to change during the learning step (e.g., by setting the corresponding learning rate to zero for the related weights and biases). In one embodiment, the labels provided for the training are associated with corresponding reliability factors. In one embodiment, such reliability factors (e.g., in range of [0,1]) are used to scale the learning step related to weights and biases of such unit. 
In one embodiment, the state of unclamped label units are allowed to vary stochastically based on links form other units; and wherein the determination unit determines the determined configuration pattern to be utilized for subsequent learning, based on a plurality of learning results obtained by the learning unit (e.g. fuzzy logic inference engine which uses a pattern matching algorithm in a forward chaining inference, including a fuzzy (i.e. variable, changeable, etc.) probability measure with a given probability distribution, including matching the current data state against the predicate of each rule to find the ones that should be executed (or fired) to find the rules and also to change the association threads that find each candidate node for next loop (cycle) (i.e. determine a configuration pattern to be utilized for next learning), as indicated above), for example).
The same motivation to combine above-mentioned teachings applies, as previously indicated in claim 3.

Conclusion
Applicant’s amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Contact
Any inquiry concerning this communication or earlier communications from the examiner should be directed to GUILLERMO RIVERA-MARTINEZ whose telephone number is 571-272-4979. The examiner can normally be reached on Monday-Friday (8am - 5pm Eastern Time). If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Vu Le, can be reached on 571-272-7332.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/GUILLERMO M RIVERA-MARTINEZ/           Primary Examiner, Art Unit 2668