DETAILED ACTION
Applicant’s amendment of December 2, 2020 overcomes the following:
Specification objections
Claims 1-4, 8, 10, and 14-17 interpretation under 35 U.S.C. 112(f), pre-AIA 35 U.S.C. 112, sixth paragraph

Applicant has amended claims 1-3, 7-8, 10, and 14-20. Claims 5-6 have been cancelled. Claims 1-4 and 7-20 are pending.

Response to Arguments
Applicant’s arguments with respect to claims 1-4 and 7-20 have been considered but are moot in view of the new ground(s) of rejection. The amendments changed the scope and content of the claims; therefore, the grounds of rejection are modified accordingly. It is noted that the previously applied prior art remains in effect.

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –


(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1, 2, 7-11, 14-16, and 19-20 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Roy et al. (US PG Pub. No. 2018/0300631 A1), hereafter referred to as Roy.

Regarding claim 1, Roy discloses an apparatus (Par. [0003]: a method for analyzing patterns in a data stream and taking an action based on the analysis; Par. [0150]: Embodiments of invention also relate to apparatuses for performing the operations herein) comprising: 
one or more processors configured to function (Par. [0149-151]: "processing" or "computing" or "calculating" or "determining" or "displaying" or the like, refer to the action and processes of a computer system or computing platform, or similar electronic computing device(s)… Embodiments of invention also relate to apparatuses for performing the operations herein. Some apparatuses may be specially constructed for the required purposes, or may comprise a general purpose computer(s) selectively activated or configured by a computer program stored in the computer(s)… Various general purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatus to perform the required methods) as:
a learning unit configured to perform learning of a neural network for identifying a class of input data, using a mini batch formed of a learning data group based on a configuration pattern that defines class ratio of learning data (Par. [0003]: a method for analyzing patterns in a data stream and taking an action based on the analysis… . A volume of data is received, and the data is trained to create training examples. Features are selected that are predictive of different classes of patterns in the data, using the training examples. A set of Kohonen networks is trained using the data, based on the selected features. Then, active nodes are identified and extracted from the set of Kohenen nets that are representative of a class of patterns in the data. Classes are assigned to the extracted active nodes; Par. [0010]: The artificial neural network introduced by Finnish professor Teuvo Kohonen in the 1980s is sometimes called a Kohonen map or network. A Kohonen network is a self-organizing map (SOM) or self-organizing feature map (SOFM) which is a type of artificial neural network (ANN) that is trained; Par. [0026-31]: a good feature for class k should produce good separation between patterns in class k and those not in class k, k=1 . . . kc, and (2) also make the patterns in class k more compact. Based on this idea, a measure called the separability index that can rank features for each class has been proposed. Suppose dknin is the average distance between patterns within class k for feature n, and dknout the average distance between the patterns in class k and those not in class k for feature n… The separability index of feature n for class k is given by rkn=dknout/dknin. One may use this separability index rkn to rank order features of class k where a higher ratio implies a higher rank. The sense of this measure is that a feature n with a lower dknin makes class k more compact and with a higher dknout increases the separation of class k from the other classes. 
Thus, the higher the ratio rkn for a feature n, the greater is its ability to separate class k from the other classes and the better the feature … N-dimensional vector x, x=(X1, X2, . . . ,XN) represents an input pattern in the streaming data… Let FPq denote the qth feature subset, q=1 . . . FS, where FS is the total number of feature subsets… as the embodiment processes some more streaming data… keeps count of the number of times an input pattern of a particular class activated a particular neuron (i.e., the neuron was the winning neuron for those input patterns). For example, given there are two classes, A and B, for each active node, the method keeps count of the number of times input patterns from each of these two classes activates the node. Suppose class A patterns activate one such neuron (node) 85 times and class B patterns activate the node 15 times. At this node then, approximately 85% of the activating input patterns belong to class A and 15% belong to class B. Since a significant majority of the activating patterns belong to class A, the method simply assigns this active neuron to class A. Assigning an active neuron to a class simply means that that neuron represents an example of that class. As an example when an active neuron is discarded, suppose class A patterns activate a node 55% of the time and class B patterns activate the node 45% of the time. The method discards such an active node because no class has a significant majority and, therefore, it cannot claim the node as a representative point of any particular class. This phase of labeling active nodes ends once the class ratios (percentages) at every active node for all of the Kohonen nets are fairly stable and all active nodes (neurons) can either be assigned to classes or discarded if no class has a significant majority. The embodiment also discards active nodes that have comparatively low absolute count of patterns; Par. 
[0048-50]: explores different feature spaces given the class-specific feature rankings. In general, the process creates buckets of features and then trains several Kohonen nets of different grid sizes… for the feature spaces contained in the buckets… For the first bucket, select the top ranked feature of each class. For the second bucket, select the top two ranked features of each class and similarly create other buckets. The procedure, therefore, sequentially adds top ranked features of each class to create the buckets. Thus, the ith bucket of features will have j top ranked features from each class… A bucket, therefore, consists of a variety of feature spaces and the method trains a variety of Kohonen nets for each such feature space; Par. [0056-61]: Let Bmax be the maximum number of feature buckets… the first bucket will have 6 features, 3 from each class, and the last bucket will have all 120 features, 60 from each class. And each bucket will always have three feature spaces--one for each class and the third for the combined set of features. And there will be FG Kohonen nets of different grid sizes for each of the three feature spaces in each bucket... Let Inc be the number of features added each time to a bucket for each class. Inc is calculated from the number of top-ranked features to use from each class and Bmax. Let FBj be the jth bucket of features… Note that although the active nodes ANkj resulted from Kohonen nets built with the class k feature set in bucket j, these active nodes could belong to (that is, be assigned to) any of the classes k, k=1 . . . 
kc… kc, be the class count percentage of the mth class at active node ANkji and let CTAkji be the absolute count of input patterns processed at that active node… minimum required percentage of class counts for a class in order to assign an active node to that class… class count percentage of the mth class at active node… initialize FG Kohonen nets for a feature set that includes all of the features from all classes in bucket j… class count percentage of the mth class at active node ANkji, m = 1 . . . kc CTAkji the absolute count of input patterns processed at active node ANkji… Train all KNmax Kohonen nets in parallel using streaming data and selecting appropriate parts of the input pattern for each Kohonen net according to the feature subsets; a learning unit configured to perform learning of a neural network for identifying a class of input data, using a mini batch formed of a learning data group based on a configuration pattern that defines class ratio of learning data (e.g. perform learning (i.e. training) of a neural network for identifying a class of input data, including a Kohonen network, which is a type of artificial neural network (ANN) that is trained by selecting appropriate parts of an input pattern (i.e. a learning data group based on a configuration pattern) for each Kohonen net (i.e. neural net) according to feature sets and subsets (i.e. a first, second… Nth mini batch, group, cluster, etc.), in which each of the neural net active nodes are identified and extracted that are representative of a class (i.e. identifying a class of input data) of patterns in the data (i.e. using a first, second… Nth mini batch formed of a learning data group), and keeps count of the number of times (i.e. ratio, percentage, etc.) an input pattern of a particular class activated a particular neuron (i.e. based on a configuration pattern that defines class ratio of learning data), in which the higher the ratio (i.e. percentage, count, etc.) 
for a feature n, the greater the ability to separate class k from the other classes, as indicated above), for example); and
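a determination unit configured to determine a configuration pattern to be utilized for subsequent learning, based on a first learning result that is a result of learning of the neural network using a first mini batch by the learning unit,
wherein the learning unit performs second learning of the neural network, using a second mini batch including a learning data group based on the configuration pattern to be utilized for the subsequent learning determined by the determination unit based on the first learning result (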
method repeats this overall process of computing separability indices a few times by randomly selecting features for each feature partition, according to one embodiment. The method then uses the maximum separability index value of each feature over these repetitions for final ranking of the features… One embodiment considers only the winning or best neurons of the Kohonen nets to be active nodes. Once the Kohonen nets stabilize during initial training, the embodiment processes some more streaming data to assign class labels to the active nodes… the class count percentage of the most active class at an active node and let PCTmin be the minimum class count percentage for a class to be assigned to an active node… find the approximate maximum and minimum values of each feature. Use the range to normalize streaming input patterns during subsequent processing; Par. [0046-50]: in the next phase, the method constructs classifiers exploiting the class-specific feature rankings produced in this phase… explores different feature spaces given the class-specific feature rankings. In general, the process creates buckets of features and then trains several Kohonen nets of different grid sizes… for the feature spaces contained in the buckets… For the first bucket, select the top ranked feature of each class. For the second bucket, select the top two ranked features of each class and similarly create other buckets. The procedure, therefore, sequentially adds top ranked features of each class to create the buckets. Thus, the ith bucket of features will have j top ranked features from each class… A bucket, therefore, consists of a variety of feature spaces and the method trains a variety of Kohonen nets for each such feature space; Par. [0105-121]: find the approximate maximum and minimum values of each feature. 
Use the range to normalize streaming input patterns during subsequent processing… Repeat steps 2 through 8 a few times and track the maximum separability index value of each feature… Rank features on the basis of their maximum separability index value; determine a configuration pattern to be utilized for subsequent learning, based on a first learning result that is a result of learning of the neural network using a first mini batch by the learning unit, wherein the learning unit performs second learning of the neural network, using a second mini batch including a learning data group based on the configuration pattern to be utilized for the subsequent learning determined by the determination unit based on the first learning result (e.g. Kohonen neural network is trained (i.e. a first learning result) using data based on selected features by normalizing input patterns (i.e. a first, second… Nth configuration pattern(s)) to be used during subsequent (i.e. successive, next, etc.) processing (i.e. determine a configuration pattern to be utilized for subsequent learning, based on a first learning result that is a result of learning), including selecting appropriate parts of an input pattern (i.e. the first, second… Nth configuration pattern(s)) for each Kohonen net (i.e. neural net) according to feature sets and subsets (i.e. performs second (subsequent, successive, next, etc.) learning of the neural network, using first, second… Nth mini batch including a learning data group based on the configuration pattern to be utilized for the subsequent learning determined by the determination unit based on the first learning result), as indicated above), for example).
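For illustration only, the separability index quoted from Roy above (rkn = dknout/dknin) and the resulting class-specific feature ranking may be sketched as follows. This is a minimal sketch and is not asserted to be Roy's implementation: the function names, the use of the absolute difference as the per-feature distance, and the sample data are assumptions introduced here.

```python
from itertools import combinations


def separability_index(values, labels, k):
    """Return r_kn = d_out / d_in for one feature (values) and one class k.

    Assumption: distance between two patterns on a single feature is the
    absolute difference of their feature values.
    """
    inside = [v for v, c in zip(values, labels) if c == k]
    outside = [v for v, c in zip(values, labels) if c != k]
    if len(inside) < 2 or not outside:
        raise ValueError("need two or more patterns in class k and one outside it")
    pairs = list(combinations(inside, 2))
    # Average distance between patterns within class k.
    d_in = sum(abs(a - b) for a, b in pairs) / len(pairs)
    # Average distance between class-k patterns and all other patterns.
    d_out = sum(abs(a - b) for a in inside for b in outside) / (
        len(inside) * len(outside)
    )
    if d_in == 0:
        raise ValueError("within-class distance is zero; index is undefined")
    return d_out / d_in


def rank_features(patterns, labels, k):
    """Rank feature indices for class k, highest separability index first."""
    n_features = len(patterns[0])
    scores = {
        n: separability_index([p[n] for p in patterns], labels, k)
        for n in range(n_features)
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical data: feature 0 separates the classes well, feature 1 does not.
patterns = [(0.0, 0.0), (1.0, 10.0), (10.0, 1.0), (11.0, 11.0)]
labels = ["A", "A", "B", "B"]
print(rank_features(patterns, labels, "A"))
```

In this hypothetical data set, feature 0 keeps class A compact and far from class B, so it receives the higher index and the higher rank, consistent with the quoted statement that a higher ratio implies a higher rank.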

Regarding claim 2, claim 1 is incorporated and Roy discloses the apparatus, wherein the one or more processors further function as a configuration pattern generation unit configured to generate a plurality of configuration patterns,
wherein the determination unit determines a configuration pattern among the generated plurality of configuration patterns, as the configuration pattern to be utilized for consequent learning, based on the first learning result (Par. [0003]: a method for analyzing patterns in a data stream and taking an action based on the analysis… . A volume of data is received, and the data is trained to create training examples. Features are selected that are predictive of different classes of patterns in the data, using the training examples. A set of Kohonen networks is trained using the data, based on the selected features. Then, active nodes are identified and extracted from the set of Kohenen nets that are representative of a class of patterns in the data. Classes are assigned to the extracted active nodes; Par. [0026-31]: basic feature ranking criteria are that (1) a good feature for class k should produce good separation between patterns in class k and those not in class k, k=1 . . . kc, and (2) also make the patterns in class k more compact. Based on this idea, a measure called the separability index that can rank features for each class has been proposed. Suppose dknin is the average distance between patterns within class k for feature n, and dknout the average distance between the patterns in class k and those not in class k for feature n… The separability index of feature n for class k is given by rkn=dknout/dknin. One may use this separability index rkn to rank order features of class k where a higher ratio implies a higher rank. The sense of this measure is that a feature n with a lower dknin makes class k more compact and with a higher dknout increases the separation of class k from the other classes. Thus, the higher the ratio rkn for a feature n, the greater is its ability to separate class k from the other classes and the better the feature … N-dimensional vector x, x=(X1, X2, . . . 
,XN) represents an input pattern in the streaming data… Let FPq denote the qth feature subset, q=1 . . . FS, where FS is the total number of feature subsets… as the embodiment processes some more streaming data… keeps count of the number of times an input pattern of a particular class activated a particular neuron (i.e., the neuron was the winning neuron for those input patterns). For example, given there are two classes, A and B, for each active node, the method keeps count of the number of times input patterns from each of these two classes activates the node. Suppose class A patterns activate one such neuron (node) 85 times and class B patterns activate the node 15 times. At this node then, approximately 85% of the activating input patterns belong to class A and 15% belong to class B. Since a significant majority of the activating patterns belong to class A, the method simply assigns this active neuron to class A. Assigning an active neuron to a class simply means that that neuron represents an example of that class. As an example when an active neuron is discarded, suppose class A patterns activate a node 55% of the time and class B patterns activate the node 45% of the time. The method discards such an active node because no class has a significant majority and, therefore, it cannot claim the node as a representative point of any particular class. This phase of labeling active nodes ends once the class ratios (percentages) at every active node for all of the Kohonen nets are fairly stable and all active nodes (neurons) can either be assigned to classes or discarded if no class has a significant majority. 
The embodiment also discards active nodes that have comparatively low absolute count of patterns; generate a plurality of configuration patterns, wherein the determination unit determines a configuration pattern among the generated plurality of configuration patterns, as the configuration pattern to be utilized for consequent learning, based on the first learning result (e.g. Kohonen neural network is trained (i.e. the first learning result) using data based on selected features by normalizing input patterns (i.e. a first, second… Nth configuration pattern(s)) to be used during subsequent (i.e. consequent, subsequent, successive, next, etc.) learning processing (i.e. the configuration pattern to be utilized for consequent learning), in which appropriate parts of the input pattern for each Kohonen net are selected according to feature subsets, and features are selected that are predictive of different classes of patterns in the data using the training examples input patterns processed (i.e. determines a configuration pattern among the generated plurality of configuration patterns based on the obtained learning result), as indicated above), for example).
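For illustration only, the active-node labeling quoted from Roy above (assign a node to a class when that class has a significant majority of activations, otherwise discard it, and also discard nodes with a low absolute count) may be sketched as follows. This is a minimal sketch under stated assumptions: the function name, the 70% majority threshold, and the minimum count of 10 are hypothetical values chosen here, since the quoted passages do not give numeric thresholds.

```python
def label_active_node(class_counts, pct_min=70.0, count_min=10):
    """Return the class assigned to an active node, or None if it is discarded.

    class_counts maps a class label to the number of input patterns of that
    class for which this node was the winning neuron.
    pct_min and count_min are assumed thresholds, not values taken from Roy.
    """
    total = sum(class_counts.values())
    if total < count_min:
        # Too few patterns were processed at this node to trust its counts.
        return None
    best = max(class_counts, key=class_counts.get)
    pct = 100.0 * class_counts[best] / total
    # Assign only when the leading class has a significant majority.
    return best if pct >= pct_min else None


print(label_active_node({"A": 85, "B": 15}))
print(label_active_node({"A": 55, "B": 45}))
```

With these assumed thresholds, the 85/15 node from the quoted example is assigned to class A, and the 55/45 node is discarded because no class has a significant majority.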

Regarding claim 7, claim 1 is incorporated and Roy discloses the apparatus, wherein the configuration pattern includes an evaluation score (Par. [0026-31]: basic feature ranking criteria are that (1) a good feature for class k should produce good separation between patterns in class k and those not in class k, k=1 . . . kc, and (2) also make the patterns in class k more compact. Based on this idea, a measure called the separability index that can rank features for each class has been proposed. Suppose dknin is the average distance between patterns within class k for feature n, and dknout the average distance between the patterns in class k and those not in class k for feature n… The separability index of feature n for class k is given by rkn=dknout/dknin. One may use this separability index rkn to rank order features of class k where a higher ratio implies a higher rank. The sense of this measure is that a feature n with a lower dknin makes class k more compact and with a higher dknout increases the separation of class k from the other classes. Thus, the higher the ratio rkn for a feature n, the greater is its ability to separate class k from the other classes and the better the feature … N-dimensional vector x, x=(X1, X2, . . . ,XN) represents an input pattern in the streaming data… Let FPq denote the qth feature subset, q=1 . . . FS, where FS is the total number of feature subsets… as the embodiment processes some more streaming data… keeps count of the number of times an input pattern of a particular class activated a particular neuron (i.e., the neuron was the winning neuron for those input patterns). For example, given there are two classes, A and B, for each active node, the method keeps count of the number of times input patterns from each of these two classes activates the node. Suppose class A patterns activate one such neuron (node) 85 times and class B patterns activate the node 15 times. 
At this node then, approximately 85% of the activating input patterns belong to class A and 15% belong to class B. Since a significant majority of the activating patterns belong to class A, the method simply assigns this active neuron to class A. Assigning an active neuron to a class simply means that that neuron represents an example of that class. As an example when an active neuron is discarded, suppose class A patterns activate a node 55% of the time and class B patterns activate the node 45% of the time. The method discards such an active node because no class has a significant majority and, therefore, it cannot claim the node as a representative point of any particular class. This phase of labeling active nodes ends once the class ratios (percentages) at every active node for all of the Kohonen nets are fairly stable and all active nodes (neurons) can either be assigned to classes or discarded if no class has a significant majority. The embodiment also discards active nodes that have comparatively low absolute count of patterns; Par. [0030-34]: method repeats this overall process of computing separability indices a few times by randomly selecting features for each feature partition, according to one embodiment. The method then uses the maximum separability index value of each feature over these repetitions for final ranking of the features… find the approximate maximum and minimum values of each feature. Use the range to normalize streaming input patterns during subsequent processing; Par. [0048-50]: explores different feature spaces given the class-specific feature rankings. In general, the process creates buckets of features and then trains several Kohonen nets of different grid sizes… for the feature spaces contained in the buckets… For the first bucket, select the top ranked feature of each class. For the second bucket, select the top two ranked features of each class and similarly create other buckets. The procedure, therefore, sequentially adds top ranked features of each class to create the buckets. Thus, the ith bucket of features will have j top ranked features from each class… A bucket, therefore, consists of a variety of feature spaces and the method trains a variety of Kohonen nets for each such feature space; Par. [0105-121]: find the approximate maximum and minimum values of each feature. Use the range to normalize streaming input patterns during subsequent processing… Repeat steps 2 through 8 a few times and track the maximum separability index value of each feature… Rank features on the basis of their maximum separability index value; wherein the configuration pattern includes an evaluation score (e.g. algorithm uses Kohonen nets as a tool to break up class regions into smaller sub-regions (i.e. sets, subsets, mini-batches, groups, etc.) to provide better visibility to the different class regions in order to select appropriate parts 

Regarding claim 8, claim 1 is incorporated and Roy discloses the apparatus, wherein the one or more processors further function as an acquisition unit configured to acquire the class information from the learning data (Par. [0026-31]: basic feature ranking criteria are that (1) a good feature for class k should produce good separation between patterns in class k and those not in class k, k=1 . . . kc, and (2) also make the patterns in class k more compact. Based on this idea, a measure called the separability index that can rank features for each class has been proposed. Suppose dknin is the average distance between patterns within class k for feature n, and dknout the average distance between the patterns in class k and those not in class k for feature n… The separability index of feature n for class k is given by rkn=dknout/dknin. One may use this separability index rkn to rank order features of class k where a higher ratio implies a higher rank. The sense of this measure is that a feature n with a lower dknin makes class k more compact and with a higher dknout increases the separation of class k from the other classes. Thus, the higher the ratio rkn for a feature n, the greater is its ability to separate class k from the other classes and the better the feature … N-dimensional vector x, x=(X1, X2, . . . ,XN) represents an input pattern in the streaming data… Let FPq denote the qth feature subset, q=1 . . . FS, where FS is the total number of feature subsets… as the embodiment processes some more streaming data… keeps count of the number of times an input pattern of a particular class activated a particular neuron (i.e., the neuron was the winning neuron for those input patterns). For example, given there are two classes, A and B, for each active node, the method keeps count of the number of times input patterns from each of these two classes activates the node. 
Suppose class A patterns activate one such neuron (node) 85 times and class B patterns activate the node 15 times. At this node then, approximately 85% of the activating input patterns belong to class A and 15% belong to class B. Since a significant majority of the activating patterns belong to class A, the method simply assigns this active neuron to class A. Assigning an active neuron to a class simply means that that neuron represents an example of that class. As an example when an active neuron is discarded, suppose class A patterns activate a node 55% of the time and class B patterns activate the node 45% of the time. The method discards such an active node because no class has a significant majority and, therefore, it cannot claim the node as a representative point of any particular class. This phase of labeling active nodes ends once the class ratios (percentages) at every active node for all of the Kohonen nets are fairly stable and all active nodes (neurons) can either be assigned to classes or discarded if no class has a significant majority. The embodiment also discards active nodes that have comparatively low absolute count of patterns; acquire the class information from the learning data (e.g. features 

Regarding claim 9, claim 8 is incorporated and Roy discloses the apparatus, wherein the acquisition unit classifies the learning data into a plurality of clusters, and generates the clusters as class information of each piece of learning data (Par. [0026-31]: basic feature ranking criteria are that (1) a good feature for class k should produce good separation between patterns in class k and those not in class k, k=1 . . . kc, and (2) also make the patterns in class k more compact. Based on this idea, a measure called the separability index that can rank features for each class has been proposed. Suppose dknin is the average distance between patterns within class k for feature n, and dknout the average distance between the patterns in class k and those not in class k for feature n… The separability index of feature n for class k is given by rkn=dknout/dknin. One may use this separability index rkn to rank order features of class k where a higher ratio implies a higher rank. The sense of this measure is that a feature n with a lower dknin makes class k more compact and with a higher dknout increases the separation of class k from the other classes. Thus, the higher the ratio rkn for a feature n, the greater is its ability to separate class k from the other classes and the better the feature … N-dimensional vector x, x=(X1, X2, . . . ,XN) represents an input pattern in the streaming data… Let FPq denote the qth feature subset, q=1 . . . FS, where FS is the total number of feature subsets… as the embodiment processes some more streaming data… keeps count of the number of times an input pattern of a particular class activated a particular neuron (i.e., the neuron was the winning neuron for those input patterns). For example, given there are two classes, A and B, for each active node, the method keeps count of the number of times input patterns from each of these two classes activates the node. 
Suppose class A patterns activate one such neuron (node) 85 times and class B patterns activate the node 15 times. At this node then, approximately 85% of the activating input patterns belong to class A and 15% belong to class B. Since a significant majority of the activating patterns belong to class A, the method simply assigns this active neuron to class A. Assigning an active neuron to a class simply means that that neuron represents an example of that class. As an example when an active neuron is discarded, suppose class A patterns activate a node 55% of the time and class B patterns activate the node 45% of the time. The method discards such an active node because no class has a significant majority and, therefore, it cannot claim the node as a representative point of any particular class. This phase of labeling active nodes ends once the class ratios (percentages) at every active node for all of the Kohonen nets are fairly stable and all active nodes (neurons) can either be assigned to classes or discarded if no class has a significant majority. The embodiment also discards active nodes that have comparatively low absolute count of patterns; Par. [0098-102]: A Kohonen net, as shown in FIG. 3, is generally used for clustering data into separate classes of patterns in a data stream. For classification problems, once it finds clusters, one can then label the nodes based on the majority class at that node and use it as a prediction system… one embodiment identifies Kohonen nodes with a significant presence of the minority class and then uses the training data points at those nodes to train another set of Kohonen nets. The basic idea is to break up the data points at those nodes to find the minority class regions. 
Henceforth, the Kohonen nets for the individual nodes with significant minority class presence are often referred to as Kohonen submodels or subnets… the algorithm uses Kohonen nets as a tool to break up class regions into smaller sub-regions to provide better visibility to the different class regions. It is somewhat similar to decision tree methods. However, one of the powerful features of Kohonen nets is that it breaks up (that is, it groups) data points considering all of the features, unlike decision tree methods that only consider a subset of features to build trees… the algorithm breaks up the regions corresponding to these nodes to gain better visibility to both the majority and minority class subregions; classifies the learning data into a plurality of clusters, and generates the clusters as class information of each piece of learning data (e.g. Kohonen net is used for clustering data into separate classes of patterns in a data stream (i.e. classifies the learning data into a plurality of clusters), and once it finds clusters, one can then label the nodes based on the majority class at that node and use it as a prediction system (i.e. generates the clusters as class information of each piece of learning data), as indicated above), for example).
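For illustration only, the separability index and the node-labeling rule quoted from Roy above can be sketched in Python as follows. This is a minimal sketch and not Roy's implementation. The function names, the use of absolute difference as the distance for a single feature, and the threshold values min_share=0.7 and min_count=10 are assumptions; the quoted passages give only the 85%/15% and 55%/45% examples and state no numeric thresholds.

```python
from itertools import combinations


def separability_index(values, labels, k):
    """Return r_kn = d_out / d_in for one feature n and one class k.

    values: the feature-n value of each pattern (hypothetical input layout).
    labels: the class label of each pattern.
    """
    inside = [v for v, c in zip(values, labels) if c == k]
    outside = [v for v, c in zip(values, labels) if c != k]
    if len(inside) < 2 or not outside:
        raise ValueError("need at least two patterns in class k and one outside it")

    # d_in: average distance between patterns within class k.
    pairs = list(combinations(inside, 2))
    d_in = sum(abs(a - b) for a, b in pairs) / len(pairs)

    # d_out: average distance between patterns in class k and those not in class k.
    d_out = sum(abs(a - b) for a in inside for b in outside) / (len(inside) * len(outside))

    # A perfectly compact class (d_in == 0) is treated as infinitely separable (assumption).
    return float("inf") if d_in == 0 else d_out / d_in


def label_node(counts, min_share=0.7, min_count=10):
    """Assign an active node to its majority class, or return None to discard it.

    counts: mapping of class label to the number of times patterns of that
    class activated the node. min_share and min_count are assumed thresholds.
    """
    total = sum(counts.values())
    if total < min_count:
        return None  # comparatively low absolute count of patterns
    best = max(counts, key=counts.get)
    return best if counts[best] / total >= min_share else None
```

With these assumed thresholds, a node activated 85 times by class A and 15 times by class B is assigned to class A, while a node with a 55/45 split is discarded, which matches the two examples in the quoted passage.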

Regarding claim 10, claim 1 is incorporated and Roy discloses the apparatus, wherein the one or more processors further function as a mini batch generation unit configured to extract learning data from a group of the learning data and to generate a mini batch based on the extracted learning data (Par. [0003]: a method for analyzing patterns in a data stream and taking an action based on the analysis… A volume of data is received, and the data is trained to create training examples. Features are selected that are predictive of different classes of patterns in the data, using the training examples. A set of Kohonen networks is trained using the data, based on the selected features. Then, active nodes are identified and extracted from the set of Kohenen nets that are representative of a class of patterns in the data. Classes are assigned to the extracted active nodes; Par. [0098-102]: A Kohonen net, as shown in FIG. 3, is generally used for clustering data into separate classes of patterns in a data stream. For classification problems, once it finds clusters, one can then label the nodes based on the majority class at that node and use it as a prediction system… one embodiment identifies Kohonen nodes with a significant presence of the minority class and then uses the training data points at those nodes to train another set of Kohonen nets. The basic idea is to break up the data points at those nodes to find the minority class regions. Henceforth, the Kohonen nets for the individual nodes with significant minority class presence are often referred to as Kohonen submodels or subnets… the algorithm uses Kohonen nets as a tool to break up class regions into smaller sub-regions to provide better visibility to the different class regions. It is somewhat similar to decision tree methods. 
However, one of the powerful features of Kohonen nets is that it breaks up (that is, it groups) data points considering all of the features, unlike decision tree methods that only consider a subset of features to build trees… the algorithm breaks up the regions corresponding to these nodes to gain better visibility to both the majority and minority class subregions; extract learning data from a group of the learning data and to generate a mini batch based on the extracted learning data (e.g. active nodes are identified and extracted from the set of Kohenen nets that are representative of a class of patterns in the data (i.e. extract learning data from a group of the learning data), and classes are assigned to the extracted active nodes, to break up class regions into smaller sub-regions (i.e. sets, subsets, mini-batches, groups, etc.) to provide better visibility to the different class regions in order to select appropriate parts of an input pattern (i.e. the configuration pattern) for each Kohonen net (i.e. neural net) according to feature subsets (i.e. mini-batches) by clustering data into separate classes of patterns, as indicated above), for example ).

Regarding claim 11, claim 10 is incorporated and Roy discloses the apparatus, wherein the mini batch generation unit generates a learning data group for learning and a learning data group for evaluation, as the mini batch (Par. [0026-31]: basic feature ranking criteria are that (1) a good feature for class k should produce good separation between patterns in class k and those not in class k, k=1 . . . kc, and (2) also make the patterns in class k more compact. Based on this idea, a measure called the separability index that can rank features for each class has been proposed. Suppose dknin is the average distance between patterns within class k for feature n, and dknout the average distance between the patterns in class k and those not in class k for feature n… The separability index of feature n for class k is given by rkn=dknout/dknin. One may use this separability index rkn to rank order features of class k where a higher ratio implies a higher rank. The sense of this measure is that a feature n with a lower dknin makes class k more compact and with a higher dknout increases the separation of class k from the other classes. Thus, the higher the ratio rkn for a feature n, the greater is its ability to separate class k from the other classes and the better the feature … N-dimensional vector x, x=(X1, X2, . . . ,XN) represents an input pattern in the streaming data… Let FPq denote the qth feature subset, q=1 . . . FS, where FS is the total number of feature subsets… as the embodiment processes some more streaming data… keeps count of the number of times an input pattern of a particular class activated a particular neuron (i.e., the neuron was the winning neuron for those input patterns). For example, given there are two classes, A and B, for each active node, the method keeps count of the number of times input patterns from each of these two classes activates the node. 
Suppose class A patterns activate one such neuron (node) 85 times and class B patterns activate the node 15 times. At this node then, approximately 85% of the activating input patterns belong to class A and 15% belong to class B. Since a significant majority of the activating patterns belong to class A, the method simply assigns this active neuron to class A. Assigning an active neuron to a class simply means that that neuron represents an example of that class. As an example when an active neuron is discarded, suppose class A patterns activate a node 55% of the time and class B patterns activate the node 45% of the time. The method discards such an active node because no class has a significant majority and, therefore, it cannot claim the node as a representative point of any particular class. This phase of labeling active nodes ends once the class ratios (percentages) at every active node for all of the Kohonen nets are fairly stable and all active nodes (neurons) can either be assigned to classes or discarded if no class has a significant majority. The embodiment also discards active nodes that have comparatively low absolute count of patterns; Par. [0048-50]: explores different feature spaces given the class-specific feature rankings. In general, the process creates buckets of features and then trains several Kohonen nets of different grid sizes… for the feature spaces contained in the buckets… For the first bucket, select the top ranked feature of each class. For the second bucket, select the top two ranked features of each class and similarly create other buckets. The procedure, therefore, sequentially adds top ranked features of each class to create the buckets. Thus, the ith bucket of features will have j top ranked features from each class… A bucket, therefore, consists of a variety of feature spaces and the method trains a variety of Kohonen nets for each such feature space; Par. 
[0056-61]: Let Bmax be the maximum number of feature buckets… the first bucket will have 6 features, 3 from each class, and the last bucket will have all 120 features, 60 from each class. And each bucket will always have three feature spaces--one for each class and the third for the combined set of features. And there will be FG Kohonen nets of different grid sizes for each of the three feature spaces in each bucket... Let Inc be the number of features added each time to a bucket for each class. Inc is calculated from the number of top-ranked features to use from each class and Bmax. Let FBj be the jth bucket of features… Note that although the active nodes ANkj resulted from Kohonen nets built with the class k feature set in bucket j, these active nodes could belong to (that is, be assigned to) any of the classes k, k=1 . . . kc… kc, be the class count percentage of the mth class at active node ANkji and let CTAkji be the absolute count of input patterns processed at that active node… minimum required percentage of class counts for a class in order to assign an active node to that class… class count percentage of the mth class at active node… initialize FG Kohonen nets for a feature set that includes all of the features from all classes in bucket j… class count percentage of the mth class at active node ANkji, m = 1 . . . kc CTAkji the absolute count of input patterns processed at active node ANkji… Train all KNmax Kohonen nets in parallel using streaming data and selecting appropriate parts of the input pattern for each Kohonen net according to the feature subsets; wherein the mini batch generation unit generates a learning data group for learning and a learning data group for evaluation, as the mini batch (e.g. algorithm uses Kohonen nets as a tool to break up class regions into smaller sub-regions (i.e. sets, subsets, mini-batches, groups, etc.) 
to provide better visibility to the different class regions in order to select appropriate parts of an input pattern for each Kohonen net according to feature subsets (i.e. a learning data group for learning), including ranking criteria for class k to produce separation between patterns in class k and those not in class k, k=1 . . . kc (i.e. a learning data group), in which a measure (i.e. evaluation), called separability index, ranks (i.e. scores) 

Regarding claim 14, claim 1 is incorporated and Roy discloses the apparatus, wherein the one or more processors further function as a display control unit configured to display information about the configuration pattern at a display unit, during or after learning by the learning unit (Par. [0149]: terms such as "processing" or "computing" or "calculating" or "determining" or "displaying" or the like, refer to the action and processes of a computer system or computing platform, or similar electronic computing device(s), that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices; a display control unit configured to display information about the configuration pattern at a display unit, during or after learning by the learning unit (e.g. "displaying" or the like, refer to the action and processes of a computer system or computing platform, or similar electronic computing device(s), that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer display devices, including displaying information about the configuration pattern during processing, as shown below:

[Image: media_image1.png, greyscale]

), for example).

Regarding claim 15, claim 1 is incorporated and Roy discloses the apparatus, wherein the one or more processors further function as an evaluation unit configured to evaluate a learning result obtained by the learning unit, using data different from learning data included in the mini batch,
wherein the determination unit determines a configuration pattern to be utilized for next learning, based on an evaluation of the learning result obtained by the evaluation unit (Par. [0003]: a method for analyzing patterns in a data stream and taking an action based on the analysis… . A volume of data is received, and the data is trained to create training examples. Features are selected that are predictive of different classes of patterns in the data, using the training examples. A set of Kohonen networks is trained using the data, based on the selected features. Then, active nodes are identified and extracted from the set of Kohenen nets that are representative of a class of patterns in the data. Classes are assigned to the extracted active nodes; Par. [0027-31]: trains many different Kohonen nets, of different grid sizes, and for different feature subsets… the reason for using different grid sizes for the same feature partition is to get different representative examples to compute the separability indices… some of the active nodes of Kohonen nets trained for different feature partitions serve as representative training examples of different classes and are used to compute the separability indices; Par. [0048-52]: explores different feature spaces given the class-specific feature rankings. In general, the process creates buckets of features and then trains several Kohonen nets of different grid sizes… for the feature spaces contained in the buckets… selecting Kohonen neurons from various Kohonen nets in different feature spaces and of different grid sizes. 
Note that, at the end of this final phase, the method discards all of the trained Kohonen nets and retains only a selected set of Kohonen neurons to serve as hyperspheres in different hypersphere nets; evaluate a learning result obtained by the learning unit, using data different from learning data included in the mini batch, wherein the determination unit determines a configuration pattern to be utilized for next learning, based on an evaluation of the learning result obtained (e.g. Kohonen neural network is trained using data based on selected features by normalizing input patterns (i.e. configuration patterns) to be used during subsequent (i.e. successive, next, etc.) processing (i.e. determine a configuration pattern to be utilized for next learning based on a learning result obtained), including selecting appropriate parts of an input pattern (i.e. a configuration pattern) for each Kohonen net (i.e. neural net) according to feature subsets, including active nodes of Kohonen nets trained for different feature partitions 

Regarding claim 16, claim 1 is incorporated and Roy discloses the apparatus, wherein the one or more processors further function as a selection unit configured to select learning data corresponding to the determined configuration pattern, based on the learning result,
wherein the learning unit performs the learning, using the mini batch including the selected learning data (Par. [0003]: a method for analyzing patterns in a data stream and taking an action based on the analysis… . A volume of data is received, and the data is trained to create training examples. Features are selected that are predictive of different classes of patterns in the data, using the training examples. A set of Kohonen networks is trained using the data, based on the selected features. Then, active nodes are identified and extracted from the set of Kohenen nets that are representative of a class of patterns in the data. Classes are assigned to the extracted active nodes; Par. [0027-31]: trains many different Kohonen nets, of different grid sizes, and for different feature subsets… the reason for using different grid sizes for the same feature partition is to get different representative examples to compute the separability indices… some of the active nodes of Kohonen nets trained for different feature partitions serve as representative training examples of different classes and are used to compute the separability indices; Par. [0048-52]: explores different feature spaces given the class-specific feature rankings. In general, the process creates buckets of features and then trains several Kohonen nets of different grid sizes… for the feature spaces contained in the buckets… selecting Kohonen neurons from various Kohonen nets in different feature spaces and of different grid sizes. Note that, at the end of this final phase, the method discards all of the trained Kohonen nets and retains only a selected set of Kohonen neurons to serve as hyperspheres in different hypersphere nets; select learning data corresponding to the determined configuration pattern, based on the learning result, wherein the learning unit performs the learning, using the mini batch including the selected learning data (e.g. 
Kohonen neural network is trained using data based on selected features by normalizing input patterns (i.e. configuration patterns) to be used during subsequent (i.e. successive, next, etc.) processing (i.e. determine a configuration pattern to be utilized for next learning based on a learning result), including selecting appropriate parts of an input pattern (i.e. a configuration pattern) for each Kohonen net (i.e. neural net) according to feature subsets (i.e. using the mini batch including the selected learning data), as indicated above), for example).

Regarding claim 19, it is a corresponding method claim and is rejected as applied to apparatus claim 1 above.

Regarding claim 20, it is a corresponding computer readable medium claim and is rejected as applied to apparatus claim 1 above.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 3-4, 12-13, and 17-18 are rejected under 35 U.S.C. 103 as being unpatentable over Roy, in view of Zadeh et al. (US PG Pub. No. 2014/0201126 A1), hereafter referred to as Zadeh.

Regarding claim 3, claim 2 is incorporated and Roy discloses the apparatus, but fails to teach the following as further recited in claim 3.
However, Zadeh teaches wherein the one or more processors further function as a change unit configured to change a probability that each of the generated plurality of configuration patterns is determined as the configuration pattern to be utilized for subsequent learning, the probability being changed based on the first learning result,
wherein the determination unit determines the configuration pattern to be utilized for subsequent learning, based on a probability changed by the change unit for each of the plurality of configuration patterns (Par. [1577-1588]: a fuzzy probability measure (p*) of fuzzy map A*, given probability distribution p(x), is determined… a test score is associated with a proposition or fact (e.g., in form of X is A). In one embodiment, this test score is based on a probability measure of A based on a probability distribution in X… a test score is associated with a proposition or fact (e.g., in form of X is A)… this test score is based on a probability measure of A based on a probability distribution in X. In one embodiment, a fuzzy test score is associated with a proposition or fact (e.g., in form of X is A*), where the test score is based on a fuzzy probability measure of A* and a probability distribution in X… the set of candidate probability distributions is based on one or more parameters associated to a model of probability distribution function, e.g. a family of class of probability distribution functions… the fuzzy logic inference engine uses a pattern matching algorithm in a forward chaining inference; Par. [1617]: inference engine (system), with a pattern matching engine that matches the current data state against the predicate of each rule, to find the ones that should be executed (or fired). Pattern matching module is connected to both processing (or controlling) module and interpreter module, to find the rules and also to change the association threads that find each candidate node for next loop (cycle); Par. [1808]: the unclamped labels do not contribute to the error function, and their related weights are prevented to change during the learning step (e.g., by setting the corresponding learning rate to zero for the related weights and biases). In one embodiment, the labels provided for the training are associated with corresponding reliability factors. 
In one embodiment, such reliability factors (e.g., in range of [0,1]) are used to scale the learning step related to weights and biases of such unit. In one embodiment, the state of unclamped label units are allowed to vary stochastically based on links form other units; Par. [2197]: clustering algorithm, which produces input pattern groups with corresponding cluster centers. To learn fuzzy functions, one can use adaptive vector quantization (AVQ) (using unsupervised AVQ competitive learning) to estimate the local centroids (and covariance matrices) of clusters in the input-output space. From the resulting ellipsoid, one can derive the fuzzy rules (and fuzzy patches)… one can use the Kohonen self-organizing map (SOM)… to change weight vectors for a network (for modeling the features in training samples); change a probability that each of the generated plurality of configuration patterns is determined as the configuration pattern to be utilized for subsequent learning, the probability being changed based on the first learning result, wherein the determination unit determines the configuration pattern to be utilized for subsequent learning, based on a probability changed by the change (i.e. transition) unit for each of the plurality of configuration patterns (e.g. fuzzy logic inference engine which uses a pattern matching algorithm in a forward chaining inference, including a fuzzy (i.e. variable, changeable, etc.) probability measure with a given probability distribution (i.e. a probability changed by the change unit), including matching the current data state against the predicate of each rule to find the ones that should be executed (or fired) to find the rules and also to change the association threads that find each candidate node for next loop (cycle) (i.e. determine a configuration pattern to be utilized for next learning), as indicated above), for example).
Roy and Zadeh are considered to be analogous art because they pertain to artificial intelligence (i.e. learning machines) using artificial neural networks. Therefore, it would 

Regarding claim 4, claim 2 is incorporated and Roy discloses the apparatus, but fails to teach the following as further recited in claim 4.
However, Zadeh teaches, further comprising a storage unit configured to store the generated plurality of configuration patterns and evaluation scores of the respective configuration patterns in association with each other (Par. [1577-1588]: a fuzzy probability measure (p*) of fuzzy map A*, given probability distribution p(x), is determined… a test score is associated with a proposition or fact (e.g., in form of X is A). In one embodiment, this test score is based on a probability measure of A based on a probability distribution in X… a test score is associated with a proposition or fact (e.g., in form of X is A)… this test score is based on a probability measure of A based on a probability distribution in X. In one embodiment, a fuzzy test score is associated with a proposition or fact (e.g., in form of X is A*), where the test score is based on a fuzzy probability measure of A* and a probability distribution in X… the set of candidate probability distributions is based on one or more parameters associated to a model of probability distribution function, e.g. a family of class of probability distribution functions… the fuzzy logic inference engine uses a pattern matching algorithm in a forward chaining inference; Par. [1617]: inference engine (system), with a pattern matching engine that matches the current data state against the predicate of each rule, to find the ones that should be executed (or fired). Pattern matching module is connected to both processing (or controlling) module and interpreter module, to find the rules and also to change the association threads that find each candidate node for next loop (cycle); Par. [1808]: the unclamped labels do not contribute to the error function, and their related weights are prevented to change during the learning step (e.g., by setting the corresponding learning rate to zero for the related weights and biases). 
In one embodiment, the labels provided for the training are associated with corresponding reliability factors. In one embodiment, such reliability factors (e.g., in range of [0,1]) are used to scale the learning step related to weights and biases of such unit. In one embodiment, the state of unclamped label units are allowed to vary stochastically based on links form other units; Par. [2197]: clustering algorithm, which produces input pattern groups with corresponding cluster centers. To learn fuzzy functions, one can use adaptive vector quantization (AVQ) (using unsupervised AVQ competitive learning) to estimate the local centroids (and covariance matrices) of clusters in the input-output space. From the resulting ellipsoid, one can derive the fuzzy rules (and fuzzy patches)… one can use the Kohonen self-organizing map (SOM)… to change weight vectors for a network (for modeling the features in training samples); further comprising a storage unit configured to store the generated plurality of configuration patterns and evaluation scores of the respective configuration patterns in association with each other (e.g. fuzzy logic inference uses a pattern matching algorithm in a forward chaining inference, including a clustering algorithm, which produces input pattern groups with corresponding cluster centers (i.e. the generated plurality of configuration patterns), and test scores (i.e. evaluation scores) based on a set of candidate probability distributions based on one or more parameters associated to a model of probability distribution function (i.e. evaluation scores of the respective configuration patterns in association with each other), as indicated above), for example).
The same motivation to combine above-mentioned teachings applies, as previously indicated in claim 3.

Regarding claim 12, claim 1 is incorporated and Roy discloses the apparatus, but fails to teach the following as further recited in claim 12.
However, Zadeh teaches wherein the learning unit updates a weight of the neural network by calculating losses of respective pieces of the learning data, and performs back propagation for an average of the losses of the learning data (Par. [1734]: by using a back propagation method based on gradient decent. Since the initial weights of autoencoder were determined by a greedy pre-training of lower RBMs, the back propagation will be efficient… during the back propagation fine tuning, the stochastic binary units are assumed to be deterministic continuous value units adopting the probability value as their state value, to carry out the back propagation… the objective function ( error function) to optimize in back propagation, is the cross entropy error; Par. [1762-1763]: an error function (to be minimized by training) defined over the training sample space (e.g., in a batch processing of an epoch) accounts for data sample reliability by including sample reliability factor as a weight in the contribution of the data sample to the batch error function, e.g., in the summation of the errors contributed from individual data samples… for example, a stochastic approach is used (instead of full epoch batch) to sample one (or several) training data sample(s) while optimizing the sample error function, and the sample error function is weighted by the reliability factor of the data sample… the learning rate (e.g., the factor associated with the step to take in modifying the weights during the training) is modified based on the reliability weight for a given data sample used during the learning (e.g., in stochastic sampling of the data samples); Par. [1808]: the unclamped labels do not contribute to the error function, and their related weights are prevented to change during the learning step (e.g., by setting the corresponding learning rate to zero for the related weights and biases). 
In one embodiment, the labels provided for the training are associated with corresponding reliability factors. In one embodiment, such reliability factors (e.g., in range of [0,1]) are used to scale the learning step related to weights and biases of such unit. In one embodiment, the state of unclamped label units are allowed to vary stochastically based on links form other units; Par. [1869]: we use "Fuzzy c-Means Clustering Method", with a fuzzy pseudopartition or fuzzy c-partition of our set (where c is the number of fuzzy classes in partition), in terms of cluster centers, and using inner product induced norm in our space (representing distances in that space). The performance metrics measures the weighted sum of distances between cluster centers and elements in those clusters; Par. [2216-2222]: for machine learning, we use neural networks, perceptrons, including… back propagation algorithm (including convergence and local minima problem)… or reinforcement learning, which all can be combined with our methods in this disclosure, as a complementary method, for improving the performance or efficiency… active supervised learning (in which we query about the data, actively), active reinforcement learning; learning unit updates a weight of the neural network by calculating losses of respective pieces of the learning data, and performs back propagation for an average of the losses of the learning data (e.g. an error function (to be minimized by training) defined over the training sample space accounts for data sample reliability by including sample reliability factor as a weight in the contribution of the data sample to the batch error function, and the sample error function is weighted by the reliability factor of the data sample, including using a back propagation algorithm (including convergence and local minima problem), as indicated above), for example).
The same motivation to combine above-mentioned teachings applies, as previously indicated in claim 3.
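For illustration only, the two quantities discussed above, namely the average of per-sample losses recited in claim 12 and the reliability-weighted batch error function described in the quoted Zadeh passages, can be contrasted in a short Python sketch. This is a sketch under stated assumptions and not an implementation from either the application or Zadeh: the function names, the use of cross-entropy over class probabilities as the per-sample loss, and the plain weighted sum are all hypothetical choices made for the example.

```python
import math


def cross_entropy(probs, target_index):
    """Per-sample loss: negative log of the probability given to the true class."""
    return -math.log(probs[target_index])


def mean_loss(batch_probs, targets):
    """Average of the per-sample losses over a mini batch (claim 12 language)."""
    losses = [cross_entropy(p, t) for p, t in zip(batch_probs, targets)]
    return sum(losses) / len(losses)


def reliability_weighted_loss(batch_probs, targets, reliabilities):
    """Batch error in which each sample's error is scaled by its reliability
    factor in [0, 1] before summation (assumed reading of the quoted passage)."""
    return sum(
        r * cross_entropy(p, t)
        for p, t, r in zip(batch_probs, targets, reliabilities)
    )
```

In this sketch a reliability factor of zero removes a sample's contribution to the batch error entirely, whereas the plain average treats every sample in the mini batch equally.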

Regarding claim 13, claim 12 is incorporated and Roy discloses the apparatus, but fails to teach the following as further recited in claim 13.
However, Zadeh teaches wherein the learning unit receives a learning set of learning data of the mini batch as an input, and calculates the losses of the respective pieces of the learning data by inputting a final output and supervisory information of the learning set into a loss function (Par. [1734]: by using a back propagation method based on gradient decent. Since the initial weights of autoencoder were determined by a greedy pre-training of lower RBMs, the back propagation will be efficient… during the back propagation fine tuning, the stochastic binary units are assumed to be deterministic continuous value units adopting the probability value as their state value, to carry out the back propagation… the objective function ( error function) to optimize in back propagation, is the cross entropy error; Par. [1762-1763]: an error function (to be minimized by training) defined over the training sample space (e.g., in a batch processing of an epoch) accounts for data sample reliability by including sample reliability factor as a weight in the contribution of the data sample to the batch error function, e.g., in the summation of the errors contributed from individual data samples… for example, a stochastic approach is used (instead of full epoch batch) to sample one (or several) training data sample(s) while optimizing the sample error function, and the sample error function is weighted by the reliability factor of the data sample… the learning rate (e.g., the factor associated with the step to take in modifying the weights during the training) is modified based on the reliability weight for a given data sample used during the learning (e.g., in stochastic sampling of the data samples); par. back propagation is used for fine tuning of the weights/biases. 
In one embodiment, the added units and the previous units are used to make association and/or correlation with labeled samples, e.g., during the supervised training; wherein the learning unit receives a learning set of learning data of the mini batch as an input, and calculates the losses of the respective pieces of the learning data by inputting a final output and supervisory information of the learning set into a loss function (e.g. an error (i.e. loss) function (to be minimized by training) defined over the training sample space (e.g., in a batch processing of an epoch) accounts for data sample reliability by including sample reliability factor as a weight in the contribution of the data sample to the batch error function, in which a stochastic approach is used (instead of full epoch batch) to sample one (or several) training data sample(s) (i.e. mini batches) while optimizing the sample error function, to make association and/or correlation with labeled samples (i.e. by inputting a final output) during supervised training (i.e. supervisory information), as indicated above), for example).
The same motivation to combine above-mentioned teachings applies, as previously indicated in claim 3.

Regarding claim 17, claim 16 is incorporated and Roy discloses the apparatus, but fails to teach the following as further recited in claim 17.
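However, Zadeh teaches wherein the one or more processors further function as a change unit configured to change a probability of selection for each piece of the learning data by the selection unit, based on the learning result, wherein the selection unit selects learning data corresponding to a configuration pattern, based on the changed probability for each piece of the learning data (Par. [1577-1588]: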
a fuzzy probability measure (p*) of fuzzy map A*, given probability distribution p(x), is determined… a test score is associated with a proposition or fact (e.g., in form of X is A). In one embodiment, this test score is based on a probability measure of A based on a probability distribution in X… a test score is associated with a proposition or fact (e.g., in form of X is A)… this test score is based on a probability measure of A based on a probability distribution in X. In one embodiment, a fuzzy test score is associated with a proposition or fact (e.g., in form of X is A*), where the test score is based on a fuzzy probability measure of A* and a probability distribution in X… the set of candidate probability distributions is based on one or more parameters associated to a model of probability distribution function, e.g. a family of class of probability distribution functions… the fuzzy logic inference engine uses a pattern matching algorithm in a forward chaining inference; Par. [1617]: inference engine (system), with a pattern matching engine that matches the current data state against the predicate of each rule, to find the ones that should be executed (or fired). Pattern matching module is connected to both processing (or controlling) module and interpreter module, to find the rules and also to change the association threads that find each candidate node for next loop (cycle); Par. [1808]: the unclamped labels do not contribute to the error function, and their related weights are prevented to change during the learning step (e.g., by setting the corresponding learning rate to zero for the related weights and biases). In one embodiment, the labels provided for the training are associated with corresponding reliability factors. In one embodiment, such reliability factors (e.g., in range of [0,1]) are used to scale the learning step related to weights and biases of such unit. 
In one embodiment, the state of unclamped label units are allowed to vary stochastically based on links form other units; Par. [2197]: clustering algorithm, which produces input pattern groups with corresponding cluster centers. To learn fuzzy functions, one can use adaptive vector quantization (AVQ) (using unsupervised AVQ competitive learning) to estimate the local centroids (and covariance matrices) of clusters in the input-output space. From the resulting ellipsoid, one can derive the fuzzy rules (and fuzzy patches)… one can use the Kohonen self-organizing map (SOM)… to change weight vectors for a network (for modeling the features in training samples); change a probability of selection for each piece of the learning data by the selection unit, based on the learning result, wherein the selection unit selects learning data corresponding to a configuration pattern, based on the changed probability for each piece of the learning data (e.g. fuzzy logic inference engine which uses a pattern matching algorithm in a forward chaining inference, including a fuzzy (i.e. variable, changeable, etc.) probability measure with a given probability distribution (i.e. the changed probability for each piece of the learning data), including matching the current data state against the predicate of each rule to find the ones that should be executed (or fired) to find the rules and also to change the association threads that find each candidate node for next loop (cycle) (i.e. change a probability of selection for each piece of the learning data), as indicated above), for example).


Regarding claim 18, claim 1 is incorporated and Roy discloses the apparatus, but fails to teach the following as further recited in claim 18.
However, Zadeh teaches wherein the learning unit performs reinforcement learning of the neural network (Par. [2216-2222]: for machine learning, we use neural networks, perceptrons, including… back propagation algorithm (including convergence and local minima problem)… or reinforcement learning, which all can be combined with our methods in this disclosure, as a complementary method, for improving the performance or efficiency… active supervised learning (in which we query about the data, actively), active reinforcement learning), and
wherein the determination unit determines the configuration pattern to be utilized for subsequent learning, based on a plurality of learning results obtained by the learning unit (Par. [1577-1588]: a fuzzy probability measure (p*) of fuzzy map A*, given probability distribution p(x), is determined… a test score is associated with a proposition or fact (e.g., in form of X is A). In one embodiment, this test score is based on a probability measure of A based on a probability distribution in X… a test score is associated with a proposition or fact (e.g., in form of X is A)… this test score is based on a probability measure of A based on a probability distribution in X. In one embodiment, a fuzzy test score is associated with a proposition or fact (e.g., in form of X is A*), where the test score is based on a fuzzy probability measure of A* and a probability distribution in X… the set of candidate probability distributions is based on one or more parameters associated to a model of probability distribution function, e.g. a family of class of probability distribution functions… the fuzzy logic inference engine uses a pattern matching algorithm in a forward chaining inference; Par. [1617]: inference engine (system), with a pattern matching engine that matches the current data state against the predicate of each rule, to find the ones that should be executed (or fired). Pattern matching module is connected to both processing (or controlling) module and interpreter module, to find the rules and also to change the association threads that find each candidate node for next loop (cycle); Par. [1808]: the unclamped labels do not contribute to the error function, and their related weights are prevented to change during the learning step (e.g., by setting the corresponding learning rate to zero for the related weights and biases). In one embodiment, the labels provided for the training are associated with corresponding reliability factors.
In one embodiment, such reliability factors (e.g., in range of [0,1]) are used to scale the learning step related to weights and biases of such unit. In one embodiment, the state of unclamped label units are allowed to vary stochastically based on links form other units; and wherein the determination unit determines the configuration pattern to be utilized for subsequent learning, based on a plurality of learning results obtained by the learning unit (e.g. fuzzy logic inference engine which uses a pattern matching algorithm in a forward chaining inference, including a fuzzy (i.e. variable, changeable, etc.) probability measure with a given probability distribution, including matching the current data state against the predicate of each rule to find the ones that should be executed (or fired) to find the rules 
The same motivation to combine above-mentioned teachings applies, as previously indicated in claim 3.

Conclusion
Applicant’s amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Contact Information
Any inquiry concerning this communication or earlier communications from the examiner should be directed to GUILLERMO M RIVERA-MARTINEZ whose telephone number is (571)272-4979.  The examiner can normally be reached on 9 am to 5 pm.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Vu Le can be reached on 571-272-7332.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/GUILLERMO M RIVERA-MARTINEZ/           Primary Examiner, Art Unit 2668