DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This action is in response to the preliminary amendment filed on 2/16/2021. In the preliminary amendment, none of original claims 1-30 were amended, and no claims were cancelled or added. Claims 1-30 are pending and have been examined.

Priority
Applicant’s claim for the benefit of a prior-filed application under 35 U.S.C. 119(e) or under 35 U.S.C. 120, 121, 365(c), or 386(c) is acknowledged. The present application is a national stage application of PCT application number PCT/US2019/046107 filed on 08/12/2019, which claims priority to U.S. Provisional application number 62/719,433 filed on 08/17/2018.

Information Disclosure Statement
The information disclosure statement (IDS) submitted on 2/16/2021 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement has been considered by the examiner.

Specification
The disclosure is objected to because of the following informalities:
comprise, in the illustrated embodiment, multiple (N) sets of processor cores 304A-N.” Appropriate correction is required.
In paragraph 59, the recitation of “wherein each of the plurality of ensemble members is training according to an associated objective for the ensemble member” is grammatically incorrect. The examiner suggests that two ways to overcome this objection are to amend this recitation in the specification to recite “wherein each of the plurality of ensemble members is trained according to a shared objective” and “each of the plurality of ensemble members is trained 

Claim Objections
Claims 1-30 are objected to because of the following informalities: 
In claims 1 and 16, each recitation of “is training” should read “is trained”. Appropriate correction is required.
Also, claims 2-15 and 17-30, which each depend directly or indirectly from claims 1 and 16, respectively, are objected to based on their respective dependencies from claims 1 and 16.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

Claims 7-8 and 22-23 are rejected under 35 U.S.C. 112(b) as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor regards as the invention.
Claims 7 and 22 each recite “updated values of the learned parameters of each of the plurality of ensemble members for the training data item being a positive example” (see, e.g., lines 5-6 of claim 7). Applicant did not previously introduce any “training data item being a positive example” or any “training data item” that is “a 
Claims 8 and 23 each recite “updated values of the learned parameters of each of the plurality of ensemble members for the training data item being a negative example” and “updated values of the learned parameters of each of the plurality of ensemble members for the training data item being a positive example” (see, e.g., lines 5-6 and 8-9 of claim 8). Applicant did not previously introduce any “training data item being a negative example”, any “training data item being a positive example” or any “training data item” that is “a negative example” or “a positive example” in these claims, their respective intervening claims, claims 5 and 20, or their respective base claims, independent claims 1 and 16. Applicant previously introduced “a training data item from a training data set” in intervening claims 5 and 20 (see, e.g., line 4 of claim 5). For the purposes of determining patent eligibility and comparison with the prior art, the examiner is interpreting the terms “the training data item being a negative example” and “the training data item being a positive example” as the previously introduced “training data item”. Appropriate correction is required.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-14 and 16-29 are rejected under 35 U.S.C. 103 as being unpatentable over Xiong et al. (U.S. Patent Application Pub. No. 2017/0024642 A1, hereinafter “Xiong”) in view of non-patent literature Zhang et al. ("Deep belief networks ensemble with multi-objective optimization for failure diagnosis." 2015 IEEE International .
Regarding claim 1, Xiong discloses the invention as claimed including a computer-implemented method of training an ensemble machine learning system comprising a plurality of ensemble members (see, e.g., paragraphs 5 and 19, “a computer-implemented method for use in training a plurality of neural networks”, “training an ensemble of exponentially many neural networks”), the method comprising:
training, by a computer system, individually the plurality of ensemble members on a training data set (see, e.g., paragraphs 19-20 and 25, “Dropout training facilitates training an ensemble of exponentially many neural networks … through the use of parameter sharing. Ensemble learning using dropout training … ensemble learning implementing variance-adjustable dropout training", “a feedforward neural network training system [i.e., by a computer system/neural network training system] … that controls the variance of the ensemble predictors and generalizes ensemble learning. The hyper-parameter can be smoothly adjusted to vary the behaviour of the method from a single model learning to a family of ensemble learning comprising a plurality of interacting models. This technique can be applied to ensemble learning” [i.e., individually training ensemble members/models/networks], “memory (106) may further store a training set comprising training data.” [i.e., a training data set]), wherein each of the plurality of ensemble members is training according to an associated objective for the ensemble member (as indicated in the claim objections above, “is training” should read “is trained”) (see, e.g., paragraphs 20 and 25-26, ; and
after training the plurality of ensemble members, training, by the computer system, a consolidated machine learning system (see, e.g., paragraphs 18-20 and 52, “Dropout training of a single neural network has the effect of creating an exponentially large ensemble of neural networks with different structures, but with shared parameters”, “a feedforward neural network training system comprising an extra hyper-parameter that controls the variance of the ensemble predictors and generalizes ensemble learning. The hyper-parameter can be smoothly adjusted to vary the behaviour of the method from a single model learning to a family of ensemble learning , wherein:
the consolidated machine learning system comprises the plurality of ensemble members and a joint optimization machine learning system (see, e.g., paragraphs 25-26 and 47, “The hyper-parameter can be smoothly adjusted to vary the behaviour of the method from a single model learning to a family of ensemble learning comprising a plurality of interacting models” [i.e., consolidated family/machine learning system comprising ensemble members/models], “the neural network optimizes weights for each feature detector”, “produce a jointly optimal hyper-parameter setting, using a … Bayesian hyper-parameter optimization” [i.e., a joint optimization machine learning system/neural network]), such that an output from … ensemble members is input to the joint optimization machine learning system (see, e.g., paragraphs 29-39, 47, and 54, “switches (108) are linked to all feature detectors of the hidden layers. In another embodiment, the switches (08) are linked to all feature detectors of the input layers [i.e., input of the machine learning system]. In yet another embodiment, the switches (108) may be linked to all feature detectors in both the hidden and input layers. In yet another embodiment, the switch (108) may be linked to the feature detectors of a subset of the input and hidden layers. In yet further embodiments, the switches may be linked to the connections between neural network units. In another aspect, the switch ,
the joint optimization machine learning system is training according to a shared objective (as indicated in the claim objections above, “is training” should read “is trained”) (see, e.g., paragraphs 34-36, "Dropout training for a neural network can be derived by assuming that each training case processed during the training stage contributes the following cost, which may be combined across a mini batch of training cases or an entire training set", "During dropout training, a joint setting of m is sampled for each presentation of a training case and this corresponds to a randomly sampled element in the sum in the above equation. For each mini-batch of training cases, forward propagation may be used to determine f(x, m, w) and then the error at the output may be computed"; "the ensemble of networks configured by dropout training can be averaged to produce a single prediction" [i.e., trained according to a shared objective/to produce a prediction]); and
each of the ensemble members is training according to both the shared objective and the associated objective for the plurality of ensemble members (as indicated in the claim objections above, “is training” should read “is trained”) (see, e.g., paragraphs 19-20, 39 and 47-52, “Dropout training facilitates training an ensemble of exponentially many neural networks”, “a feedforward neural network training system comprising an extra hyper-parameter that controls the variance of the ensemble predictors and generalizes ensemble learning.” [i.e., each of the ensemble members are trained according to varied, associated objectives], “The mean network, f(x,w), is constructed by performing forward propagation using the mean contribution that the inputs to a unit make, by averaging the effects of their mask variables. This is equivalent to scaling all outgoing weights of every unit by one minus the dropout probability of that unit, as shown in (204). As a result, after scaling the weights, only one forward pass is required to make a prediction” [i.e., a shared objective – the prediction], “By using the mean network approximation described above in equation 5, the new predictor f(x, m, w) adjusts the variance of f(x, m, w) by a factor of α2. Variance-adjustable dropout training thus makes use of an additional hyper-parameter α, which may be set using cross-validation. The search for a [α] may be combined with the search for other hyper-parameters, to produce a jointly optimal hyper-parameter setting”, “the inputs of a mini-batch of training cases are forward propagated using two networks with shared parameters. One of these networks is the mean network described above” [i.e., ensemble members are trained according to varied, associated objectives and a general, shared objective of the mean network]).
Although Xiong substantially discloses the claimed invention, Xiong is not relied on to explicitly disclose that an output from each of the plurality of ensemble members is input to the joint optimization machine learning system.
In the same field, analogous art Zhang teaches that an output from each of the plurality of ensemble members is input to the joint optimization machine learning system (see, e.g., pages 33-34 and Table I, “developing an ensemble of DBNs with multi-objective optimization”, “The hybrid DBN [deep belief network] ensemble model with multi-objective optimization uses weighted sum ensemble scheme with diversity and classification error two objective optimization [i.e., joint optimization machine learning system/hybrid DBN ensemble model]. Given M ensemble members to classify N patterns, Si is the ensemble output di is its corresponding desired output. 
[Equation image media_image1.png not reproduced]
 where rj(i) denotes the ith output of network j for a given input pattern” [i.e., output from each of the M ensemble members forms input Si to hybrid DBN], “Step 7: Use multi-objective ensemble learning to optimize outputs Step 8: Select preferred near optimal solutions Step 9: Evaluate results in diagnosis using trained hybrid model” [i.e., output from ensemble models/members is input to the machine learning system/hybrid DBN ensemble model]).
Alternatively, Zhang also teaches that each of the plurality of ensemble members is training according to both the shared objective and the associated objective for the plurality of ensemble members (as indicated above, “is training” should read “is trained”) (see, e.g., page 34 and Table I, “resampling method generates ensemble members by training individual member[s] on different samples of the original dataset [i.e., training each ensemble member]. The obtained near optimal solutions can be used for the combination of ensemble members. Therefore, a multi-objective ensemble learning is conducted to optimize ensemble outputs. … DBN is trained by pre-training with pre-processed data and followed by back propagation training process on all DBN layers at the same time. The training accuracy is the similarity between model outputs and the predefined failure modes.” [i.e., training according to multiple objectives, both the shared and associated objectives], “Step 5: Train DBN classifiers in the model using training dataset” [i.e., train each ensemble member/DBN classifier]). 
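It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Xiong to incorporate the teachings of Zhang to provide a “hybrid ensemble model for degradation pattern classification” and a “framework combined ensemble of Deep Belief Networks with multi-objective optimization” (See, e.g., Zhang, pages 32-33, Abstract). Doing so would have allowed Xiong to use Zhang’s framework for “diagnosis [and] to handle failure degradation with multivariate sensory data” in order to enable “[e]arly diagnosis that can detect faults from some symptoms accurately” and provide “benefits such as reducing maintenance costs, improving productivity and avoiding serious damages”, as suggested by Zhang (See, e.g., Zhang, page 32, Abstract). This is an example of “use of known technique to improve similar devices (methods, or products) in the same way.” See MPEP 2143.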
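For illustration only, and not as part of the grounds of rejection, the two-stage arrangement recited in claim 1 can be pictured with the minimal Python sketch below. Everything in it is a hypothetical assumption made for the example: the scalar linear members, the per-member targets in `member_target`, the weighted-sum joint network with weights `v`, the squared-error objectives, and the `mix` factor that blends the two gradients. It is not taken from Xiong, Zhang, or the present application.

```python
def make_data():
    # Toy training set (assumption): the shared objective is to fit y = 3 * x.
    return [(i / 10.0, 3.0 * i / 10.0) for i in range(-10, 11)]


def member_target(j, x):
    # Hypothetical associated objective: member j regresses onto (j + 1) * x.
    return (j + 1) * x


def train_members(data, n_members, epochs=100, lr=0.02):
    # Stage 1: each member (a scalar weight) is trained individually
    # on the training set against its own associated objective.
    w = [0.0] * n_members
    for _ in range(epochs):
        for x, _y in data:
            for j in range(n_members):
                err = w[j] * x - member_target(j, x)
                w[j] -= lr * 2.0 * err * x
    return w


def shared_loss(data, w, v):
    # Mean squared error of the joint output against the shared target.
    total = 0.0
    for x, y in data:
        s = sum(vj * wj * x for vj, wj in zip(v, w))
        total += (s - y) ** 2
    return total / len(data)


def train_consolidated(data, w, epochs=200, lr=0.02, mix=0.5):
    # Stage 2: every member output is an input to the joint network (weights v).
    # The joint network follows the shared objective only; each member follows
    # a blend of the shared objective and its own associated objective.
    n = len(w)
    w = list(w)
    v = [1.0 / n] * n
    for _ in range(epochs):
        for x, y in data:
            outs = [wj * x for wj in w]
            s = sum(vj * oj for vj, oj in zip(v, outs))
            d_s = 2.0 * (s - y)  # derivative of the shared loss w.r.t. s
            for j in range(n):
                grad_v = d_s * outs[j]
                grad_shared = d_s * v[j] * x  # back-propagated through v[j]
                grad_own = 2.0 * (outs[j] - member_target(j, x)) * x
                v[j] -= lr * grad_v
                w[j] -= lr * (mix * grad_shared + (1.0 - mix) * grad_own)
    return w, v
```

In this toy setting both objectives can be satisfied at once, so the members stay near their individual targets while the joint weights absorb the shared error; with conflicting objectives the `mix` factor would instead set the trade-off.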

With respect to independent claim 16, Xiong discloses the invention as claimed including a computer system for training an ensemble machine learning system comprising a plurality of ensemble members (see, e.g., paragraphs 7 and 19-20, “a system and method for training neural networks”, “training an ensemble of exponentially many neural networks”, “a feedforward neural network training system comprising an extra hyper-parameter that controls the variance of the ensemble predictors and generalizes ensemble learning”), the computer system comprising:
a processor; and a memory coupled to the processor, the memory storing (see, e.g., paragraphs 15, 23 and 25, “Examples of computer storage media include RAM, ROM, EEPROM, flash memory or other memory technology”, “The neural network is implemented by one or more processors. Each feature detector may be considered as a processing ‘node’ of the network and one or more nodes may be implemented by a processor”, “A memory (106) may be provided for storing activations and learned weights for each feature detector. The memory (106) may further store a :
the plurality of ensemble members (see, e.g., paragraphs 19 and 25, “Dropout training facilitates training an ensemble of exponentially many neural networks”, “a family of ensemble learning comprising a plurality of interacting models”);
a joint optimization machine learning system (see, e.g., paragraphs 26 and 47, “the neural network optimizes weights for each feature detector”, “produce a jointly optimal hyper-parameter setting, using a … Bayesian hyper-parameter optimization” [i.e., a joint optimization machine learning system/neural network]); and
instructions that, when executed by the processor, cause the computer system to (see, e.g., paragraph 15, “server, computer … that executes instructions may include or otherwise have access to computer readable media such as storage media, computer storage media … Computer storage media may include … technology for storage of information, such as computer readable instructions … Any application or module herein described may be implemented using computer readable/executable instructions”):
train individually the plurality of ensemble members on a training data set (see, e.g., paragraphs 19-20 and 25, “Dropout training facilitates training an ensemble of exponentially many neural networks … through the use of parameter sharing. Ensemble learning using dropout training … ensemble learning implementing variance-adjustable dropout training", “a feedforward neural network training system [i.e., training by a computer system/neural network training system] … that controls the variance of the ensemble predictors and generalizes ensemble learning. The hyper-parameter can , wherein each of the plurality of ensemble members is training according to an associated objective for the ensemble member (as indicated above, “is training” should read “is trained”) (see, e.g., paragraphs 20 and 25-26, “training system comprising an extra hyper-parameter that controls the variance of the ensemble predictors and generalizes ensemble learning. The hyper-parameter can be smoothly adjusted to vary the behaviour of the method from a single model learning to a family of ensemble learning comprising a plurality of interacting models. This technique can be applied to ensemble learning with various cost functions, structures and parameter sharing” [i.e., each ensemble member/model of the plurality of interacting models is trained for an associated prediction function/objective], “The training data may, for example, be used for image classification in which case the training data may comprise images with known classifications", “During the training stage, the neural network optimizes weights for each feature detector. 
… Exemplary applications of such a neural network include image classification, machine translation, object recognition, speech recognition and genomic-oriented applications (including, for example, protein binding site prediction and splice site prediction)” [i.e., training according to an associated objective: image classification, translation, object or speech recognition, or genomic predictions]); and
after training the plurality of ensemble members, train a consolidated machine learning system (see, e.g., paragraphs 18-20 and 52, “Dropout training of a single neural network has the effect of creating an exponentially large ensemble of neural networks with different structures, but with shared parameters”, “a feedforward neural network training system comprising an extra hyper-parameter that controls the variance of the ensemble predictors and generalizes ensemble learning. The hyper-parameter can be smoothly adjusted to vary the behaviour of the method from a single model learning to a family of ensemble learning comprising a plurality of interacting models. This technique can be applied to ensemble learning with various cost functions, structures and parameter sharing", “inputs of a mini-batch of training cases are forward propagated using two networks with shared parameters” [i.e., training a generalized, consolidated machine learning system – a family of interacting models with shared parameters]), wherein:
the consolidated machine learning system comprises the plurality of ensemble members and the joint optimization machine learning system (see, e.g., paragraphs 25-26 and 47, “The hyper-parameter can be smoothly adjusted to vary the behaviour of the method from a single model learning to a family of ensemble learning comprising a plurality of interacting models” [i.e., consolidated family/machine learning system comprising ensemble members/models], “the neural network optimizes weights for each feature detector”, “produce a jointly optimal hyper-parameter setting, using a … Bayesian hyper-parameter optimization” [i.e., a joint optimization machine learning system/neural network]), such that an output from … ensemble members is input to the joint optimization machine learning system (see, e.g., paragraphs 29-39, 47, ;
the joint optimization machine learning system is training according to a shared objective (as indicated above, “is training” should read “is trained”) (see, e.g., paragraphs 34-36, "Dropout training for a neural network can be derived by assuming that each training case processed during the training stage contributes the following cost, which may be combined across a mini batch of training cases or an entire training set", "During dropout training, a joint setting of m is sampled for each presentation of a training case and this corresponds to a randomly sampled element in the sum in the above equation. For each mini-batch of training cases, forward propagation may be used to determine f(x, m, w) and then the error at the output may be computed"; "the ensemble of networks configured by dropout training can be averaged to produce a single prediction" [i.e., trained according to a shared objective/to produce a prediction]); and
each of the plurality of ensemble members is training according to both the shared objective and the associated objective for the plurality of ensemble members (as indicated above, “is training” should read “is trained”) (see, e.g., paragraphs 19-20, 39 and 47-52, “Dropout training facilitates training an ensemble of exponentially many neural networks”, “a feedforward neural network training system comprising an extra hyper-parameter that controls the variance of the ensemble predictors and generalizes ensemble learning.” [i.e., each of the ensemble members are trained according to varied, associated objectives], “The mean network, f(x,w), is constructed by performing forward propagation using the mean contribution that the inputs to a unit make, by averaging the effects of their mask variables. This is equivalent to scaling all outgoing weights of every unit by one minus the dropout probability of that unit, as shown in (204). As a result, after scaling the weights, only one forward pass is required to make a prediction” [i.e., a shared objective – the prediction], “By using the mean network approximation described above in equation 5, the new predictor f(x, m, w) adjusts the variance of f(x, m, w) by a factor of α2. Variance-adjustable dropout training thus makes use of an additional hyper-parameter α, which may be set using cross-validation. The search for a [α] may be combined with the search for other hyper-parameters, to produce a jointly optimal hyper-parameter setting”, “the inputs of a mini-batch of training cases are forward propagated using two networks with shared parameters. One of these networks is the mean network described above” [i.e., ensemble members are trained according to varied, associated objectives and a general, shared objective of the mean network]).
Although Xiong substantially discloses the claimed invention, Xiong is not relied on to explicitly disclose that an output from each of the plurality of ensemble members is input to the joint optimization machine learning system.
In the same field, analogous art Zhang teaches that an output from each of the plurality of ensemble members is input to the joint optimization machine learning system (see, e.g., pages 33-34 and Table I, “developing an ensemble of DBNs with multi-objective optimization”, “The hybrid DBN [deep belief network] ensemble model with multi-objective optimization uses weighted sum ensemble scheme with diversity and classification error two objective optimization [i.e., joint optimization machine learning system/hybrid DBN ensemble model]. Given M ensemble members to classify N patterns, Si is the ensemble output di is its corresponding desired output. 
[Equation image media_image1.png not reproduced]
 where rj(i) denotes the ith output of network j for a given input pattern” [i.e., output from each of the M ensemble members forms input Si to hybrid DBN], “Step 7: Use multi-objective ensemble learning to optimize outputs Step 8: Select preferred near optimal solutions Step 9: Evaluate results in diagnosis using trained hybrid model” [i.e., output from ensemble models/members is input to the machine learning system/hybrid DBN ensemble model]).
Alternatively, Zhang also teaches that each of the plurality of ensemble members is training according to both the shared objective and the associated objective for the plurality of ensemble members (as indicated above, “is training” should read “is trained”) (see, e.g., page 34 and Table I, “resampling method generates ensemble members by training individual member[s] on different samples of the original dataset [i.e., training each ensemble member]. The obtained near optimal solutions can be used for the combination of ensemble members. Therefore, a multi-objective ensemble learning is conducted to optimize ensemble outputs. … DBN is trained by pre-training with pre-processed data and followed by back propagation training process on all DBN layers at the same time. The training accuracy is the similarity between model outputs and the predefined failure modes.” [i.e., training according to multiple objectives, both the shared and associated objectives], “Step 5: Train DBN classifiers in the model using training dataset” [i.e., train each ensemble member/DBN classifier]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Xiong to incorporate the teachings of Zhang to provide a “hybrid ensemble model for degradation pattern classification” and a “framework combined ensemble of Deep Belief Networks with multi-objective optimization” (See, e.g., Zhang, pages 32-33, Abstract). Doing so would have allowed Xiong to use Zhang’s framework for “diagnosis [and] to handle failure degradation with multivariate sensory data” in order to enable “[e]arly diagnosis that can detect faults from some symptoms accurately” and provide “benefits such as reducing maintenance costs, improving productivity and avoiding serious damages”, as suggested by Zhang (See, e.g., Zhang, page 32, Abstract). This is an example of “use of known technique to improve similar devices (methods, or products) in the same way.” See MPEP 2143.
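For illustration only, the mean network and variance-adjustable predictor quoted from Xiong above can be pictured with the following Python sketch for a single linear unit. The blending form used here, mean output plus α times the deviation of the dropout output from the mean, is an assumption chosen to be consistent with the quoted statements that the variance is adjusted by a factor of α squared and that the predictor is deterministic when α=0; it is not asserted to be Xiong's equation. All function names are hypothetical.

```python
from itertools import product


def dropout_output(x, w, mask):
    # Output of one sampled dropout network: masked inputs contribute nothing.
    return sum(wi * xi * mi for wi, xi, mi in zip(w, x, mask))


def mean_network_output(x, w, p_drop):
    # Mean network: scale every weight by one minus the dropout probability.
    return sum(wi * (1.0 - p_drop) * xi for wi, xi in zip(w, x))


def adjusted_output(x, w, mask, p_drop, alpha):
    # Assumed variance-adjustable predictor: mean + alpha * (dropout - mean).
    mean = mean_network_output(x, w, p_drop)
    return mean + alpha * (dropout_output(x, w, mask) - mean)


def exact_moments(x, w, p_drop, alpha):
    # Enumerate every dropout mask to get the exact mean and variance
    # of the adjusted predictor (feasible only for a handful of inputs).
    mean = 0.0
    second = 0.0
    for mask in product((0, 1), repeat=len(x)):
        prob = 1.0
        for m in mask:
            prob *= (1.0 - p_drop) if m else p_drop
        out = adjusted_output(x, w, mask, p_drop, alpha)
        mean += prob * out
        second += prob * out * out
    return mean, second - mean * mean
```

For this linear unit the mean network equals the exact average over all masks, so changing α leaves the mean prediction unchanged and only rescales the spread across ensemble members.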

Regarding claims 2 and 17, as discussed above, Xiong in view of Zhang teaches the method of claim 1 and the system of claim 16. 
Xiong further discloses wherein the associated objective for each of the plurality of ensemble members is unique (see, e.g., paragraphs 4, 18, 21 and 42, “different models in an ensemble … trained with different random initializations, different architectures, different hyper-parameter settings and different subsets of data”, “Dropout training of a single neural network has the effect of creating an exponentially large 

Regarding claims 3 and 18, as discussed above, Xiong in view of Zhang teaches the method of claim 1 and the system of claim 16. 
Xiong further discloses wherein:
each of the plurality of ensemble members comprises an output detector node (see, e.g., paragraphs 21-23, “selection of feature detectors”, “a feedforward neural network (100) having a plurality of layers (102) … Each layer comprises one or more feature detectors (104), each of which may be associated with activation functions and weights for each parameter input to the respective feature detector (104) … the output of a feature detector of layer i may be provided as input to one or more feature detector of layer i+1. … the output of a feature detector of layer i could further be provided as input to layer i+n”, “Each feature detector may be considered as a processing ‘node’ of the network” [i.e., each ensemble member comprises an output feature detector node]); and
the associated objective comprises a subset of the training data set as a target for each output detector node (see, e.g., paragraphs 22-27, “the output of a layer i may be compared to a target value in a training dataset” [i.e., objective of each output detector node includes a subset of the training data as a target value of the node], “Each feature detector may be considered as a processing ‘node’ of the network”, “memory (106) may be provided for storing activations and learned weights for each feature detector. The memory (106) may further store a training set comprising training data. The training data may, for example, be used for image classification in which case the training data may comprise images with known classifications.” [i.e., a subset of the training data], “switches (108) are linked to at least a subset of the feature detectors.”).

Regarding claims 4 and 19, as discussed above, Xiong in view of Zhang teaches the method of claim 1 and the system of claim 16. 
Xiong further discloses wherein:
each of the plurality of ensemble members comprises an output detector node (see, e.g., paragraphs 21-23, “selection of feature detectors”, “a feedforward neural network (100) having a plurality of layers (102) … Each layer comprises one or more feature detectors (104), each of which may be associated with activation functions and weights for each parameter input to the respective feature detector (104) … the output of a feature detector of layer i may be provided as input to one or more feature detector of layer i+1. … the output of a feature detector of layer i could further be provided as input to layer i+n”, “Each feature detector may be considered as a ; and
the training data set comprises a first subset and a second subset that is disjoint from the first subset (see, e.g., paragraphs 4, 16-17 and 22, “models trained with … different subsets of data”, “training the neural network on mini-batches of training cases processed using a dropout neural network training process”, “selectively disables a randomly (or pseudorandomly) selected subset of hidden units and/or input units in the neural network, for each training case” [i.e., training data includes different/disjoint subsets, mini-batches of training cases]); and
the associated objective comprises:
a first value for the output detector node when a training data item falls within the first subset of the training data set (see, e.g., paragraphs 22-27, 29, 32-33 and 35, “the output of a layer i may be compared to a target value in a training dataset” [i.e., objective of each output detector node includes a first value/target within a first subset of the training dataset], “Each training case is then processed by the neural network, one or a mini-batch at a time (202). For each such training case, the switch may reconfigure the neural network by selectively disabling each linked feature detector”, “the prediction f(x, m, w) … For each mini-batch of training cases forward propagation may be used to determine f(x, m, w)” [i.e., objective/prediction is a value f of the node when a training case/data items within a first mini-batch/subset of the training data]); and
a second value for the output detector node when the training data item falls within the second subset of the training data set (see, e.g., paragraphs 33-34, 2”, “the inputs of a mini-batch of training cases are forward propagated using two networks with shared parameters … At block (314) f(x, w) is added, providing the new predictor f(x, m, w) [i.e., a second value, new predictor f]. At block (316), f(x, m, w) is then compared to the target … and the training procedure may move on to the next mini-batch.” [i.e., a second, target value for the node when the training case/data item is within a next, second mini-batch/subset of the training data]).
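For illustration only, the limitation of claims 4 and 19 (a first target value for items in a first subset and a second target value for items in a disjoint second subset) can be pictured with the short Python sketch below. The item identifiers, the default target values of 1.0 and 0.0, and the squared-error loss are hypothetical assumptions made for the example and are not drawn from the claims or the cited references.

```python
def associated_target(item_id, first_subset, second_subset,
                      first_value=1.0, second_value=0.0):
    # The target for the output detector node depends only on which of the
    # two disjoint subsets of the training set the item falls in.
    if first_subset & second_subset:
        raise ValueError("subsets must be disjoint")
    if item_id in first_subset:
        return first_value
    if item_id in second_subset:
        return second_value
    raise KeyError(f"item {item_id!r} is in neither subset")


def detector_loss(outputs, first_subset, second_subset):
    # Mean squared error of detector activations (dict: item id -> output)
    # against the subset-dependent targets.
    total = 0.0
    for item_id, out in outputs.items():
        target = associated_target(item_id, first_subset, second_subset)
        total += (out - target) ** 2
    return total / len(outputs)
```

A detector that outputs 1.0 on every first-subset item and 0.0 on every second-subset item has zero loss under this hypothetical objective.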

Regarding claims 5 and 20, as discussed above, Xiong in view of Zhang teaches the method of claim 1 and the system of claim 16. 
Xiong further discloses wherein training the consolidated machine learning system comprises:
computing, by the computer system, feed-forward activations for each of the plurality of ensemble members for a training data item from a training data set (see, e.g., FIG. 4 – depicting step 308 to “Forward propagate with dropout network” and paragraphs 20-25, 34-42, 46-52, 56 and 59-62, “provides a feedforward neural network training system comprising an extra hyper-parameter that controls the variance of the ensemble predictors and generalizes ensemble learning.”, “a feedforward neural network (100) having a plurality of layers (102) … Each layer comprises one or more feature detectors (104), each of which may be associated with activation functions” [i.e., ;
computing, by the computer system, feed-forward activations for the joint optimization machine learning system for the training data item (see, e.g., FIG. 4 – depicting step 304 to “Forward propagate with mean network” and paragraphs 35-39, 47-48 and 52, “During dropout training, a joint setting of m is sampled for each presentation of a training case and this corresponds to a randomly sampled element in the sum in the above equation. For each mini-batch of training cases, forward propagation may be used to determine f(x, m, w)”, “The mean network, f(x,w), is constructed by performing forward propagation using the mean contribution that the inputs to a unit make”, “By using the mean network approximation described above in equation 5, the new predictor f(x, m, w) adjusts the variance of f(x, m, w) by a factor of α2. Variance-adjustable dropout training thus makes use of an additional hyper-parameter α, which may be set using cross-validation … to produce a jointly optimal hyper-parameter setting, using a … Bayesian hyper-parameter optimization” [i.e., a joint optimization machine learning system/neural network], “When α=0, f(x, m, w) is deterministic and the training procedure is effectively the same as a regular feed-;
back propagating, by the computer system, partial derivatives of the shared objective through the joint optimization machine learning system (see, e.g., FIG. 4 – depicting step 318 to “Back propagate” and paragraphs 50, 52, 56 and 62, “normal back-propagation technique can still be applied to compute a surrogate of derivatives” [i.e., back-propagating partial derivatives], “The error is then back-propagated at block (318) through the two networks according to equation 7”, “When training the ensemble, for the i-th model in the ensemble, it may be necessary to backpropagate the error at the output, dl/fi(x, w). … for variance-adjustable ensemble learning depends on all models because the use of the mean prediction [i.e., the joint/mean prediction]. An implementation of backpropagation that asynchronously processes dl/fi(x, w) one at a time … for each pass through all of the n models. For each i, it may be necessary to perform forward and backward propagation through all n models”);
computing, by the computer system, a weighted sum of the partial derivatives of the shared objective and a derivative of the associated objective for each of the plurality of ensemble members (see, e.g., paragraphs 50 and 62, “During back-propagation, the gradient of w may be a weighted sum of the gradient of the mean network and the gradient of the dropout network:
    [Equation image: media_image2.png]
(7)” [i.e., equation 7 for calculating/computing a weighted sum of the derivatives], “backpropagation that asynchronously processes dl/fi(x, w) one at a time … for each pass through all of the n models. For each i, it may be necessary to perform forward and backward propagation through all n models” [i.e., for each of the plurality n of models/ensemble members]);
estimating, by the computer system, an update term for each of the plurality of ensemble members according to the weighted sum (see, e.g., FIG. 4 – depicting step 320 to “update parameters” after step 318 to “Back propagate” and paragraphs 50 and 52, “During back-propagation, the gradient of w may be a weighted sum of the gradient of the mean network and the gradient of the dropout network: 
    [Equation image: media_image2.png]
(7)” [i.e., equation 7 for calculating/computing the weighted sum], “The error is then back-propagated at block (318) through the two networks according to equation 7. After that, the network parameters are updated at block (320) and the training procedure may move on to the next mini-batch.” [i.e., estimate an update/adjustment term for each of the ensemble members according to the weighted sum from equation 7]); and
updating, by the computer system, learned parameters of each of the plurality of ensemble members according to the update term (see, e.g., paragraphs 42, 52 and 62, “generate gradients for updating the network parameters”, “The 
Alternatively, Zhang also teaches computing, by the computer system, a weighted sum of the partial derivatives of the shared objective and a derivative of the associated objective for each of the plurality of ensemble members (see, e.g., page 34, “The hybrid DBN [deep belief network] ensemble model with multi-objective optimization [i.e., including the shared objective and the associated objective] uses weighted sum ensemble scheme with diversity and classification error two objective optimization. Given M ensemble members to classify N patterns, Si is the ensemble output di is its corresponding desired output. 
    [Equation image: media_image1.png]
 where rj(i) denotes the ith output of network j for a given input pattern” [i.e., computing a weighted sum Si of partial derivatives of the shared and associated objectives for each of the M plurality of ensemble members]) and 
estimating, by the computer system, an update term for each of the plurality of ensemble members according to the weighted sum (see, e.g., page 34, “the combination of DBN ensemble for generating ensemble classifiers. MOEA/D can be utilized to adjust the ensemble weights to the best solution of trade-off between 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Xiong to incorporate the teachings of Zhang to provide a “hybrid ensemble model for degradation pattern classification” and a “framework combined ensemble of Deep Belief Networks with multi-objective optimization” (See, e.g., Zhang, pages 32-33, Abstract). Doing so would have allowed Xiong to use Zhang’s framework for “diagnosis [and] to handle failure degradation with multivariate sensory data” in order to enable “[e]arly diagnosis that can detect faults from some symptoms accurately” and provide “benefits such as reducing maintenance costs, improving productivity and avoiding serious damages”, as suggested by Zhang (See, e.g., Zhang, page 32, Abstract). 
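For illustration only, and not as a reproduction of Zhang's equation (which appears in this action only as an embedded image), the “weighted sum ensemble scheme” quoted above can be sketched as follows. The form S_i = Σ_j w_j · r_j(i), the function name `weighted_sum_ensemble`, and the example values are assumptions made for this sketch.

```python
import numpy as np

def weighted_sum_ensemble(member_outputs, weights):
    """Combine M member outputs over N patterns into one ensemble output.

    member_outputs: array of shape (M, N); entry [j, i] plays the role of r_j(i).
    weights: array of shape (M,), one ensemble weight per member (assumed form).
    Returns an array S of shape (N,), where S[i] is the ensemble output for pattern i.
    """
    member_outputs = np.asarray(member_outputs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return weights @ member_outputs

# Hypothetical example: M = 3 members, N = 2 patterns.
outputs = [[1.0, 0.0],
           [0.0, 1.0],
           [1.0, 1.0]]
weights = [0.5, 0.25, 0.25]
S = weighted_sum_ensemble(outputs, weights)

# Squared error against hypothetical desired outputs d_i.
d = np.array([1.0, 0.0])
error = float(np.sum((S - d) ** 2))
```

In this sketch the ensemble weights are fixed inputs; the quoted passage indicates that Zhang adjusts them with MOEA/D, which is not modeled here.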

Regarding claims 6 and 21, as discussed above, Xiong in view of Zhang teaches the method of claim 5 and the system of claim 20. 
Xiong further discloses wherein estimating the update term comprises:
back propagating, by the computer system, a derivative of the weighted sum through each of the plurality of ensemble members (see, e.g., paragraphs 34-42, 46-52, 56, 59 and 63-64, “During back-propagation, the gradient of w may be a weighted sum of the gradient of the mean network and the gradient of the dropout network”, “the normal back-propagation technique can still be applied to compute a 

Regarding claims 7 and 22, as discussed above, Xiong in view of Zhang teaches the method of claim 5 and the system of claim 20. 
Xiong further discloses wherein estimating the update term comprises:
storing, by the computer system, current values of learned parameters of each of the plurality of ensemble members as stored values (see, e.g., paragraphs 25, 42-43, 46-52 and 59-62, “memory (106) may be provided for storing activations and learned weights for each feature detector” [i.e., storing current values of learned weights/parameters for feature detector nodes of each ensemble member], “generate gradients for updating the network parameters”, “The parameters may be adjusted using the gradient error function, evaluated using the desired output and the variance-adjusted output. The parameters may further be adjusted using the parameters” [i.e., current, updated values of learned parameters], “Then, parameters of the models may ;
determining, by the computer system, updated values of the learned parameters of each of the plurality of ensemble members for the training data item being a positive example (as indicated above, “the training data item being a positive example” has been interpreted as the previously introduced “training data item”) (see, e.g., paragraphs 42-43, 46-52 and 62-64, “generate gradients for updating the network parameters”, “the inputs of a mini-batch of training cases are forward propagated using two networks with shared parameters. … After that, the network parameters are updated at block (320) and the training procedure may move on to the next mini-batch [i.e., determine updated parameter values for an example training case/data item in a mini-batch of training cases/data items]. The parameters may be adjusted using the gradient error function, evaluated using the desired output and the variance-adjusted output. The parameters may further be adjusted using the parameters are adjusted using gradient descent, stochastic gradient descent, momentum, Nesterov's accelerated momentum, Ada Grad, RMS Prop, conjugate gradient, or a combination of these.” [i.e., determining updated parameter values of the learned parameters], “Then, parameters of the models may be updated based on the comparison of the adjusted prediction with the target” [i.e., updating learned parameters of each of the ensemble members]);
adding, by the computer system, a difference between the stored values and the updated values of the learned parameters to an accumulated gradient estimate for the training data set (see, e.g., paragraphs 16, 34, 50 and 63-64, “a ; and
resetting, by the computer system, the learned parameters to the stored values (see, e.g., paragraphs 25, 47-52, 56 and 59-64, “memory (106) may be provided for storing … learned weights for each feature detector. … memory (106) may further store a validation set” [i.e., stored, learned weights and values in a stored validation set], “Variance-adjustable dropout training thus makes use of an additional hyper-parameter α, which may be set using cross-validation. The search for a [α] may be combined with the search for other hyper-parameters, to produce a jointly optimal hyper-parameter setting … the parameter a [α] may be determined by examining a plurality of values for a [α], computing the outputs for data items in a held out validation set” [i.e., reset learned parameter α to a stored value in the stored validation set], “the 

Regarding claims 8 and 23, as discussed above, Xiong in view of Zhang teaches the method of claim 5 and the system of claim 20. 
Xiong further discloses wherein estimating the update term comprises:
storing, by the computer system, current values of learned parameters of each of the plurality of ensemble members as stored values (see, e.g., paragraphs 25, 42-43, 46-52 and 59-62, “memory (106) may be provided for storing activations and learned weights for each feature detector” [i.e., storing current values of learned weights/parameters for feature detector nodes of each ensemble member], “generate gradients for updating the network parameters”, “The parameters may be adjusted using the gradient error function, evaluated using the desired output and the variance-adjusted output. The parameters may further be adjusted using the parameters” [i.e., current, updated values of learned parameters], “Then, parameters of the models may be updated based on the comparison of the adjusted prediction with the target” [i.e., current/updated values of learned parameters of each of the ensemble members]);
determining, by the computer system, first updated values of the learned parameters of each of the plurality of ensemble members for the training data item being a negative example (as indicated above, “the training data item being a ;
resetting, by the computer system, the learned parameters to the stored values (see, e.g., paragraphs 25, 47-52, 56 and 59-64, “memory (106) may be provided for storing … learned weights for each feature detector. … memory (106) may further store a validation set” [i.e., stored, learned weights and values in a stored validation set], “Variance-adjustable dropout training thus makes use of an additional hyper-parameter α, which may be set using cross-validation. The search for a [α] may be combined with the search for other hyper-parameters, to produce a jointly optimal hyper-parameter setting … the parameter a [α] may be determined by examining a ;
determining, by the computer system, second updated values of the learned parameters of each of the plurality of ensemble members for the training data item being a positive example (as indicated above, “the training data item being a positive example” has been interpreted as the previously introduced “training data item”) (see, e.g., paragraphs 42-43, 46-52 and 62-64, “generate gradients for updating the network parameters”, “the inputs of a mini-batch of training cases are forward propagated using two networks with shared parameters. … After that, the network parameters are updated at block (320) and the training procedure may move on to the next mini-batch [i.e., determine second updated parameter values for a second example training case/data item in a next, second mini-batch of training cases/data items]. The parameters may be adjusted using the gradient error function, evaluated using the desired output and the variance-adjusted output. The parameters may further be adjusted using the parameters are adjusted using gradient descent, stochastic gradient descent, momentum, Nesterov's accelerated momentum, Ada Grad, RMS Prop, conjugate gradient, or a combination of these.” [i.e., determining second updated parameter values of the learned parameters], “Then, parameters of the models may be ;
adding, by the computer system, an averaged difference between the first updated values and the second updated values of the learned parameters to an accumulated gradient estimate for the training data set (see, e.g., paragraphs 16, 34, 46-52, 56 and 59-64, “a stochastic gradient descent process may be applied for training the neural network on mini-batches of training cases processed using a dropout neural network training process” [i.e., a gradient estimate for the training cases/mini-batches/data set], “Dropout training for a neural network can be derived by assuming that each training case processed during the training stage contributes the following cost, which may be combined across a mini batch of training cases or an entire training set, when computing gradients used to update weights” [i.e., an accumulated/combined gradient estimate for the entire training data set], “the gradient of w may be a weighted sum of the gradient of the mean network and the gradient of the dropout network” [i.e., adding/summing to an accumulated gradient estimate], “When testing the neural network after training, the average of the new predictor may be used, which can be approximated by the mean network”, “At block (310), the prediction f(x, m, w) is subtracted from the prediction f(x, w). At block (312), the difference between f(x, m, w) and f(x, w) is adjusted by the hyper-parameter α. At block (314) f(x, w) is added, providing the new predictor f(x, m, w) … The difference between the variance-adjusted output and the desired output may be computed”, “optimizing the average of the cost function of models in the ensemble … where fi (x, wi) is the i-th model with parameters wi. When testing the ensemble, predictions from the models in the ensemble are ; and
resetting, by the computer system, the learned parameters to the stored values (see, e.g., paragraphs 25, 47-52, 56 and 59-64, “memory (106) may be provided for storing … learned weights for each feature detector. … memory (106) may further store a validation set” [i.e., stored, learned weights and values in a stored validation set], “Variance-adjustable dropout training thus makes use of an additional hyper-parameter α, which may be set using cross-validation. The search for a [α] may be combined with the search for other hyper-parameters, to produce a jointly optimal hyper-parameter setting … the parameter a [α] may be determined by examining a plurality of values for a [α], computing the outputs for data items in a held out validation set” [i.e., reset learned parameter α to a stored value in the stored validation set], “the inputs of a mini-batch of training cases are forward propagated using two networks with shared parameters. … After that, the network parameters are updated at block (320) and the training procedure may move on to the next mini-batch. ... The parameters may further be adjusted using the parameters” [i.e., adjust/reset the learned parameters to the parameters]).
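As a reading aid only, the operations quoted above from blocks (310), (312) and (314) of Xiong can be sketched as follows. The sign convention (so that α = 1 returns the dropout prediction and α = 0 returns the mean-network prediction), the single linear unit, and all variable names are assumptions of this sketch and are not taken from Xiong.

```python
import numpy as np

def variance_adjusted(f_mean, f_drop, alpha):
    """Combine mean-network and dropout-network predictions.

    Assumed sign convention: alpha = 1 returns the dropout prediction,
    alpha = 0 returns the mean-network prediction.
    """
    diff = f_drop - f_mean   # cf. block (310): difference of the two predictions
    scaled = alpha * diff    # cf. block (312): adjust the difference by alpha
    return f_mean + scaled   # cf. block (314): add back the mean prediction

# Hypothetical single linear unit with dropout applied to its inputs.
rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.25, 0.1])
keep_prob = 0.5

f_mean = float(np.dot(x * keep_prob, w))    # mean network: expected input contribution
masks = rng.random((1000, 3)) < keep_prob   # one sampled mask m per presentation
f_drop = (masks * x) @ w                    # dropout predictions f(x, m, w)

alpha = 0.5
f_new = variance_adjusted(f_mean, f_drop, alpha)
```

Under these assumptions the adjusted predictions have the same centre as the mean network while their variance is scaled by α², which is consistent with the quoted statement that the new predictor adjusts the variance “by a factor of α2”.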

Regarding claims 9 and 24, as discussed above, Xiong in view of Zhang teaches the method of claim 5 and the system of claim 20. 
Xiong further discloses wherein the weighted sum comprises a weight applied to the partial derivatives of the shared objective relative to the derivative of the associated objective for each of the plurality of ensemble members (see, e.g., paragraphs 34-42, 46-52, 56 and 59-64, “During back-propagation, the gradient of w may be a weighted sum of the gradient of the mean network and the gradient of the dropout network: 
    [Equation image: media_image2.png]
(7)” [i.e., equation 7 for calculating/computing a weighted sum applied to derivatives], “backpropagation that asynchronously processes dl/fi(x, w) one at a time … for each pass through all of the n models. For each i, it may be necessary to perform forward and backward propagation through all n models” [i.e., for each of the plurality n of models/ensemble members], “normal back-propagation technique can still be applied to compute a surrogate of derivatives used for learning”, “For a general method of variance-adjustable ensemble learning, predictions of models in the ensemble may be adjusted relative to the mean prediction of the ensemble, referred to as an aggregate output [i.e., relative to the shared objective/prediction of the ensemble]. Then, parameters of the models may be updated based on the comparison of the adjusted prediction with the target” [i.e., relative to derivatives of the associated objectives/predictions of models/members in the ensemble]).
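As a reading aid only, and assuming the variance-adjusted predictor takes the form described at blocks (310)-(314) of Xiong, the weighted-sum structure of the gradient follows from the chain rule. The symbol \tilde{f} and the sign convention are assumptions of this sketch; the expressions below are not a reproduction of Xiong's equation 7, which appears in this action only as an embedded image.

```latex
% Assumed form of the variance-adjusted predictor
\tilde{f}(x, m, w) = f(x, w) + \alpha \bigl( f(x, m, w) - f(x, w) \bigr)

% Differentiating with respect to the shared parameters w
\frac{\partial \tilde{f}(x, m, w)}{\partial w}
  = (1 - \alpha)\, \frac{\partial f(x, w)}{\partial w}
  + \alpha\, \frac{\partial f(x, m, w)}{\partial w}
```

Under this assumption the weight on the mean-network gradient is (1 − α) and the weight on the dropout-network gradient is α, so a single hyper-parameter sets the relative contribution of the two terms.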

Regarding claims 10 and 25, as discussed above, Xiong in view of Zhang teaches the method of claim 9 and the system of claim 24. 
Xiong further discloses controlling, by the computer system, the weight according to a training progress of each of the plurality of ensemble members (see, e.g., paragraphs 26, 33-34 and 59-63, “During the training stage, the neural network optimizes weights for each feature detector. After learning, the optimized weight configuration can then be applied to test data”, “Once the training set has been learned by the neural network, the switch may enable all feature detectors and normalize their outgoing weights”, “each training case processed during the training stage contributes the following cost, which may be combined across a mini batch of training cases or an entire training set, when computing gradients used to update weights” [i.e., controlling/updating/optimizing/normalizing the weight during training stages and after training – according to a training progress]).

Regarding claims 11 and 26, as discussed above, Xiong in view of Zhang teaches the method of claim 10 and the system of claim 25. 
Xiong further discloses wherein controlling the weight according to the training progress of each of the plurality of ensemble members comprises:
reducing, by the computer system, the weight as each of the plurality of ensemble members reaches convergence (see, e.g., paragraphs 33-42, 46-52, 56 and 59-64, “Once the training set has been learned by the neural network, the switch may enable all feature detectors and normalize their outgoing weights (204). Normalization comprises reducing the outgoing weights of each feature detector” [i.e., reducing the weight of each feature detector node of the ensemble members/networks] “The mean network, f(x,w), is constructed by performing forward propagation using the 

Regarding claims 12 and 27, as discussed above, Xiong in view of Zhang teaches the method of claim 1 and the system of claim 16. 
Xiong further discloses wherein the plurality of ensemble members comprises a plurality of different machine learning system types (see, e.g., paragraphs 19-24 and 55-59, “Dropout training facilitates training an ensemble of exponentially many neural networks”, “The hyper-parameter can be smoothly adjusted to vary the behaviour of the method from a single model learning to a family of ensemble learning comprising a plurality of interacting models. This technique can be applied to ensemble learning with various cost functions, structures and parameter sharing.” “In one aspect, different neural networks in the plurality of neural networks differ”, “the type of neural network implemented is not limited merely to feedforward neural networks but can also be applied to any neural networks, including convolutional neural networks, recurrent neural networks, auto-encoders and Boltzmann machines. Further, the neural networks may comprise linear regression models, logistic regression models, neural network models with at least one layer of hidden units, or a combination thereof. In addition, this method is generally applicable to supervised machine learning methods that are not generally regarded as neural networks, such as regression trees, 

Regarding claims 13 and 28, as discussed above, Xiong in view of Zhang teaches the method of claim 1 and the system of claim 16. 
Xiong further discloses wherein the plurality of ensemble members comprises a single machine learning system type (see, e.g., paragraphs 19-24, “The following provides a feedforward neural network training system comprising an extra hyper-parameter that controls the variance of the ensemble predictors and generalizes ensemble learning”, “an illustrative feedforward network is described, the type of neural network implemented”, “Referring now to FIG. 1, a feedforward neural network (100) having a plurality of layers (102) is shown. Each layer comprises one or more feature detectors (104)” [i.e., the plurality of ensemble members/networks can comprise a single feedforward machine learning method/neural network type]).

Regarding claims 14 and 29, as discussed above, Xiong in view of Zhang teaches the method of claim 13 and the system of claim 28. 
Xiong further discloses wherein the single machine learning system type comprises a neural network (see, e.g., paragraphs 19-24, “The following provides a feedforward neural network training system comprising an extra hyper-parameter that controls the variance of the ensemble predictors and generalizes ensemble learning” [i.e., the machine learning system type comprises a feedforward neural network]). 

Claims 15 and 30 are rejected under 35 U.S.C. 103 as being unpatentable over Xiong in view of Zhang as applied to claims 1, 13-14, 16, and 28-29 above, and further in view of Majumdar (U.S. Patent Application Pub. No. 2018/0137155 A1, hereinafter “Majumdar”). Majumdar was filed on October 20, 2017 and claims priority to PCT Application No. PCT/US2016/024060 filed on March 24, 2016, and both of these dates are before the effective filing date of this application, i.e., August 17, 2018. Therefore, Majumdar constitutes prior art under 35 U.S.C. 102(a)(2).
With regard to claims 15 and 30, as discussed above, Xiong in view of Zhang teaches the method of claim 14 and the system of claim 29.
Xiong further discloses wherein each neural network comprises a same number of layers, a same number of nodes within each of the layers, and a same arrangement of … connections between the nodes (see, e.g., paragraphs 19-24, 29 and 49, “Dropout training facilitates training an ensemble of exponentially many neural networks, almost as efficiently as a single neural network, through the use of parameter sharing” [i.e., as efficiently as a single neural network with the same number of layers, nodes, and connections between nodes], “In one aspect, different neural networks in the plurality of neural networks differ only in that during the forward pass, feature detectors are selectively disabled randomly, pseudorandomly or using a fixed or predetermined pattern, in the fashion of the Dropout procedure, and feature detectors to be deactivated is not the same in different neural networks” [i.e., each neural network differs only in activation/deactivation of feature detector nodes, but has the same number of layers and detectors/nodes within each layer], “switches (108) may be linked to all feature detectors in both the hidden and input layers … the switches may be linked to 
Although Xiong in view of Zhang substantially teaches the claimed invention, Xiong in view of Zhang is not relied on to teach wherein each neural network comprises … a same arrangement of directed arc connections between the nodes.
In the same field, analogous art Majumdar teaches wherein each neural network comprises … a same arrangement of directed arc connections between the nodes (see, e.g., paragraphs 50, 227 and 233, “directed relationships from one direction to the other for each edge” and “two property values on the network are computed called respectively T (for ‘Topological’) and G (for ‘Geometrical’) [i.e., topology/structure of the network in terms of connections between nodes] … Examples of functions for nodes, for example in web-graphs, include the in-degree and out-degree of directed edges to a node, and an example of a node weight is the ratio of input degree to output degree.” [i.e., neural network includes a same arrangement of directed edges to/arc connections between the nodes]).
Xiong, Zhang, and Majumdar are analogous art because they are each directed to systems and methods for machine learning (see, e.g., Majumdar, paragraphs 50-59).


Conclusion
The prior art made of record, listed on form PTO-892, and not relied upon, is considered pertinent to applicant's disclosure.
The examiner requests, in response to this office action, support be shown for language added to any original claims on amendment and any new claims. That is, indicate support for newly added claim language by specifically pointing to page(s) and line no(s) in the specification and/or drawing figure(s). This will assist the examiner in prosecuting the application.
When responding to this office action, Applicant is advised to clearly point out the patentable novelty which he or she thinks the claims present, in view of the state of the art disclosed by the references cited or the objections made. He or she must also show how the amendments avoid such references or objections. See 37 CFR 1.111(c).

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kamran Afshar, can be reached at 571-272-7796. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/R.K.B./Examiner, Art Unit 2125 

/KAMRAN AFSHAR/Supervisory Patent Examiner, Art Unit 2125