DETAILED ACTION
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on February 23, 2021 has been entered.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This action is in response to the claims and remarks filed on February 23, 2021. Claims 102-126 are pending and have been examined. 

Response to Amendment
The amendment filed on February 23, 2021 has been entered. However, as detailed below, the amendment to the claims filed February 23, 2021 does not comply with the requirements of 37 CFR 1.121(c)(2) because each claim has not been provided with the proper status identifier, and as such, the individual status of each claim cannot be identified. In particular, the status identifier for claim 121 indicates that the claim is “Currently Amended”. However, claim 121 was not amended in the amendment filed on February 23, 2021. Claims 102, 113, 114 and 125 were amended, and claim 126 was added.
The previous rejections of claims 102-125 under 35 U.S.C. 112(b) are withdrawn in view of the February 23, 2021 amendment.

Response to Arguments
Applicant's arguments filed February 23, 2021 with respect to the previous rejections of claims 102-125 under 35 U.S.C. 112(b) have been fully considered and are persuasive.
Applicant's arguments filed February 23, 2021 with respect to the rejections of claims 102-125 under 35 U.S.C. 103 have been carefully and fully considered but are moot because the arguments do not apply to the combination of references used in the current rejections. Applicant’s amendments have necessitated the claim rejections under 35 U.S.C. 103 discussed below.
With reference to amended independent claims 102 and 114, applicant states: “The cited prior art references do not teach or suggest adding an additional cost term to a cost function for training a neural network, where ‘the additional cost term rewards the first and second nodes upon a determination that the activation values for the first and second nodes for the error-prone training data example are not in agreement.’” (applicant’s remarks, page 9). 
With continued reference to amended claims 102 and 114, applicant asserts that “neither Misra nor Li disclose the steps of ‘detecting ... an error-prone training data example’ and ‘adding ... an additional cost term to the cost function for the first and 
With continued reference to the Baker reference and amended claims 102 and 114, applicant further asserts that “there is no need, or even a suggestion, to determine in the AuriLab Application's [i.e., Baker] corrective training whether the two nodes are in agreement or not for the error-prone training data example and, upon a determination that the two nodes are not in agreement, adding an additional cost term to a cost function for training a neural network, where ‘the additional cost term rewards the first and second nodes ....’” (applicant’s remarks, page 10). 
Applicant then generally alleges that “The other references cited in the Office Action do not cure the deficiencies of Misra, Lin [sic – Li] and the AuriLab Application [i.e., Baker] relative to independent claims 102 and 114, as amended. Therefore, claims 102 and 114, as well as their respective dependent claims, would not have been obvious in light of the cited references.” (applicant’s remarks, page 10).
Accordingly, applicant appears to argue that the newly presented claim limitation that was added to each of claims 102 and 114 in the amendment filed on February 23, 2021 is not disclosed or taught in the portions of the Misra, Li and Baker references applied to claims 102 and 114 in the previous Office Action.

Second, as additionally discussed below, paragraph 277 of applicant’s specification discloses “When, in box 1603, the computer system detects a case of such error-prone data, in box 1604, the computer system applies a special regularization cost term to the node pairs created or selected in box 1602. The special regularization cost function penalizes the two nodes for agreeing with each other and/or rewards them for being different. Note that the penalty for agreeing is only applied on data examples in which two or more ensemble members or merged former ensemble members make the same mistake.” Therefore, adding “an additional cost term to the cost function for the first and second nodes for the error-prone training data example”, under the BRI is adding a penalty term to make scores worse (or better) when ensemble members/nodes agree/correlate (or disagree/vary) with each other for the error-prone training data example.
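For illustration only, the interpretation above may be sketched in Python. This is a minimal sketch under stated assumptions, not applicant's disclosed implementation and not the method of any cited reference: the function names (is_error_prone, counter_tying_term, total_cost), the squared-difference form of the term, and the strength parameter are all hypothetical. The sketch treats a training data example as error-prone when two or more ensemble members give the same wrong label, and only then adds a term that lowers the cost as the two activation values move apart (equivalently, leaves the cost higher when they agree).

```python
from typing import Sequence


def is_error_prone(predictions: Sequence[int], label: int) -> bool:
    """Return True when two or more ensemble members make the same mistake."""
    wrong = [p for p in predictions if p != label]
    return any(wrong.count(p) >= 2 for p in set(wrong))


def counter_tying_term(act_a: float, act_b: float, strength: float = 1.0) -> float:
    """Hypothetical extra cost term for a node pair.

    The term is zero when the activations agree and becomes more negative
    (a larger reward under minimization) as they differ.
    """
    return -strength * (act_a - act_b) ** 2


def total_cost(base_cost: float, act_a: float, act_b: float,
               predictions: Sequence[int], label: int,
               strength: float = 1.0) -> float:
    """Add the extra term only for an error-prone training data example."""
    if is_error_prone(predictions, label):
        return base_cost + counter_tying_term(act_a, act_b, strength)
    return base_cost
```

Under these assumptions, an example on which only one member errs leaves the cost unchanged, while an example on which two members share a mistake yields a lower cost the further apart the two activation values are.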
Regarding applicant’s argument that the newly presented claim limitation that was added to each of claims 102 and 114 in the amendment filed on February 23, 2021, i.e., “wherein the additional cost term rewards the first and second nodes upon a determination that the activation values for the first and second nodes for the error-
Regarding the feature “wherein the additional cost term rewards the first and second nodes upon a determination that the activation values for the first and second nodes for the error-prone training data example are not in agreement” added to amended claims 102 and 114, the examiner points to paragraphs 111 and 202 of Baker, which disclose that “the parameters of the models are adjusted to help correct the errors by improving the scores of the correct label when there is an error” and “That is, it will systematically improve the scores of the incorrect answers” [i.e., the additional cost term rewards the nodes by improving their model scores]. 
Further regarding the above-noted limitation added to amended claims 102 and 114, the examiner points to pages 105 and 107-108 of Alhamdoosh, which disclose “select[ing] ensemble component networks with maximum disagreement among their outputs” [i.e., determine/select network nodes with output disagreement/activation values that are not in agreement], “Negative correlation learning (NCL) [14] amends the cost function with a penalty term that weakens the relationship with other individuals and controls the … variance … in the ensemble learning” [i.e., amend cost function with an additional penalty term upon determining disagreement/variance], “the learning error of the ith base model, given in Eq. (8), was modified to include a decorrelation penalty 
With reference to newly-added claim 126, Applicant asserts that “New claim 126 also would not have been obvious because the cited references do not teach or suggest adding an additional cost term that ‘penalizes the first and second nodes upon a determination that the activation values for the first and second nodes for the error-prone training data example are in agreement.’” (applicant’s remarks, page 11). 
The examiner disagrees in view of the newly-cited Alhamdoosh reference and points applicant to the below discussion of Baker and Alhamdoosh.
Regarding the recitation of “wherein the additional cost term penalizes the first and second nodes upon a determination that the activation values for the first and second nodes for the error-prone training data example are in agreement” in new claim 126, the examiner points to paragraphs 111, 199 and 202 of Baker, which disclose that “the parameters of the models are adjusted to help correct the errors … by making the incorrect best-scoring label get a worse score”, “the worse score of the hypothesis that contains the /r/ will be judged as making the score worse for the correct answer rather than for an incorrect close call” and “That is, it will … systematically degrade the scores of the correct answer” [i.e., the additional cost term penalizes the nodes by making their scores worse]. The examiner further points to pages 104-105 and 107-108 of Cov(xn) explicitly helps in controlling the disagreement among ensemble components’ outputs and hence producing better generalized ensemble model.” [i.e., additional cost term that penalizes individuals/nodes in the ensemble when they are in agreement to control disagreement].
As discussed in detail below, the combination of Misra, Li, Baker and Alhamdoosh (i.e., Misra in view of Li and Baker and further in view of Alhamdoosh) teaches all of the limitations of pending claims 102-103, 114-115 and 126, the combination of Misra, Li, Baker, Alhamdoosh and Knittel (i.e., Misra in view of Li, Baker and Alhamdoosh, and further in view of Knittel) teaches the limitations of dependent claims 104-109, 112-113, 116-121 and 124-125, and the combination of Misra, Li, Baker, Alhamdoosh and Ghorpade (i.e., Misra in view of Li, Baker and Alhamdoosh, and further in view of Ghorpade) teaches the limitations of dependent claims 110-111 and 122-123. Applicant’s amendments have necessitated the claim rejections under 35 U.S.C. 103 discussed below.

Claim Objections
The amendment to the claims filed February 23, 2021 does not comply with the requirements of 37 CFR 1.121(c)(2) because each claim has not been provided with the proper status identifier, and as such, the individual status of each claim cannot be identified. In particular, in the amendment to the claims filed February 23, 2021, the status identifier for claim 121 indicates “Currently Amended” (see page 6 of the response). However, as noted above, claim 121 was not amended in the amendment filed February 23, 2021. Thus, the status identifier for claim 121 should be “Previously presented” (see 37 CFR 1.121(c), which states “the status of every claim must be indicated after its claim number by using one of the following identifiers in a parenthetical expression: (Original), (Currently amended), (Canceled), (Withdrawn), (Previously presented), (New), and (Not entered)” and see 37 CFR 1.121(c)(2), which states “All claims being currently amended in an amendment paper shall be presented in the claim listing, indicate a status of ‘currently amended,’ and be submitted with markings to indicate the changes that have been made relative to the immediate prior version of the claims. The text of any added subject matter must be shown by underlining the added text.”). Appropriate correction is required.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to 
Claims 102, 103, 114, 115 and 126 are rejected under 35 U.S.C. 103 as being unpatentable over non-patent literature Misra et al. ("Cross-stitch Networks for Multi-task Learning", 2016, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3994-4003, hereinafter “Misra”) in view of non-patent literature Li et al. ("Cost-sensitive sequential three-way decision modeling using a deep neural network", 2017, International Journal of Approximate Reasoning 85, pp. 68–78, hereinafter “Li”) and Baker (U.S. Patent Application Pub. No. 2008/0069437 A1, hereinafter “Baker”), and further in view of non-patent literature Alhamdoosh et al. ("Fast decorrelated neural network ensembles with random weights." Information Sciences 264 (2014): 104-117, hereinafter “Alhamdoosh”). 
With respect to independent claim 102, Misra discloses the invention as claimed including a method for training a nodal network (see, e.g., FIG. 4 – depicting a neural network [i.e., a nodal network] and page 3999, “We can initialize networks [i.e., nodal networks] A and B by networks that were trained on these tasks separately, or have the same initialization and train them jointly”), comprising:
back-propagating, by a computer system (see, e.g., pages 3995, 3997 and 3999-4000, Section 6, “Multi-task learning [5, 48] has a rich history in machine learning”, “we focus only on multi-task learning in the context of ConvNets used in computer vision” [i.e., a computer system is used for machine learning/training networks], “Backpropagation through cross-stitch units” [i.e., back-propagating], “We train these two networks jointly, using end-to-end learning” [i.e., training neural networks on large , partial derivatives of a cost function for an objective for the nodal network through the nodal network (paragraphs 277 and 281 of applicant’s specification disclose “selected node pairs are trained with the normal back-propagation of the main objective alone” and “the computer system, at box 1701, can select the reflexive and/or backward connections that are most important to the task objective.” The plain meaning of objective is something that one's efforts or actions are intended to attain or accomplish; purpose; goal; target. See https://www.dictionary.com/browse/objective. Therefore, “an objective for the nodal network”, under the broadest reasonable interpretation (BRI) is any objective, purpose, goal or target of a task performed by the nodal network) (see, e.g., page 3997, “Backpropagating through cross-stitch units. Since cross-stitch units are modeled as linear combination, their partial derivatives for loss L with tasks A, B are computed as

[Equation image media_image1.png: partial derivatives of the loss L for the cross-stitch units]
” [i.e., backpropagating partial derivatives of a loss/cost function for an objective of task A performed by the network]).
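The following is a hedged sketch of the chain-rule reasoning behind such partial derivatives; it is not a reproduction of the equation quoted from Misra. It assumes a cross-stitch unit that forms each task's output as a linear combination of both tasks' input activations, and the symbols x_A, x_B (inputs), \tilde{x}_A, \tilde{x}_B (outputs) and the placement of the \alpha coefficients are assumptions made for illustration.

```latex
% Assumed forward form of a cross-stitch unit (illustrative notation):
\begin{bmatrix} \tilde{x}_A \\ \tilde{x}_B \end{bmatrix}
=
\begin{bmatrix} \alpha_{AA} & \alpha_{AB} \\ \alpha_{BA} & \alpha_{BB} \end{bmatrix}
\begin{bmatrix} x_A \\ x_B \end{bmatrix}

% Chain rule with respect to the inputs (the coefficient matrix is transposed):
\begin{bmatrix} \dfrac{\partial L}{\partial x_A} \\[2mm] \dfrac{\partial L}{\partial x_B} \end{bmatrix}
=
\begin{bmatrix} \alpha_{AA} & \alpha_{BA} \\ \alpha_{AB} & \alpha_{BB} \end{bmatrix}
\begin{bmatrix} \dfrac{\partial L}{\partial \tilde{x}_A} \\[2mm] \dfrac{\partial L}{\partial \tilde{x}_B} \end{bmatrix}

% Chain rule with respect to two of the coefficients:
\frac{\partial L}{\partial \alpha_{AA}} = \frac{\partial L}{\partial \tilde{x}_A}\, x_A ,
\qquad
\frac{\partial L}{\partial \alpha_{AB}} = \frac{\partial L}{\partial \tilde{x}_A}\, x_B
```

Under the assumed forward form, each input receives gradient from both task outputs, which is why the loss for task A can influence the activations used for task B.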
Although Misra substantially discloses the claimed invention, Misra is not relied on to explicitly disclose back-propagating, by a computer system … wherein back-propagating the partial derivatives comprises counter-tying activation values of first and second nodes of the nodal network and back-propagating the partial derivatives through the nodal network for the error-prone training data example.
In the same field, analogous art Li teaches back-propagating, by a computer system (see, e.g., pages 70 and 74, section 5, “method for computing the partial derivative for each layer is the back-propagation (BP) algorithm [17], which is used widely in neural networks” [i.e., back-propagating], “[a]ll of the experiments were performed on a computer with an Intel i7-4790k (4G processor) and 16 GB RAM, and the method was programmed in MATLAB (version R2014a)” [i.e., by a computer system]),
wherein back-propagating the partial derivatives comprises counter-tying activation values of first and second nodes of the nodal network (paragraph 274 of applicant’s specification discloses “a procedure, called counter-tying, for improving the performance of any ensemble or merged ensemble. Counter-tying can be applied to the output nodes in a conventional ensemble in the form of an extra cost function” and “counter-tying can be applied to any two existing nodes in the network, to two nodes that are created just for the purpose, or to two nodes that are created by node splitting”. The examiner notes that these are the sole mentions of “counter-tying” in applicant’s specification. Therefore, “counter-tying activation values”, under the BRI is applying a cost/loss function to activations of any two nodes of a network) (see, e.g., pages 70-71, “Let us denote a^(l)_i as the activation of unit i in layer l and an input vector x ∈ R^(s_1) can be denoted as a vector a^(1) [i.e., activation values of units/nodes in layers of the network]. Let z^(l+1) = W^(l) a^(l) + b^(l), where W^(l) and b^(l) are a matrix and a vector comprising W^(l)_ij and b^(l)_i, respectively. The feed-forward propagation can be described as: a^(l+1) = f(z^(l+1)), where f(·) is an element-wise activation function”, “loss function J(W, b) is defined in (2) to measure the dispersion of the actual output and ideal output … [t]o minimize the loss J(W, b), an optimization algorithm using batch Gradient Descent (GD) is employed, which minimizes the errors between the output x̂ and target x. The crucial feature of GD is computing the partial derivatives, i.e., the descent direction. An efficient method for computing the partial derivative for each layer is the back-propagation (BP) algorithm” [i.e., partial derivatives of a loss/cost function are calculated and back-propagated by applying a loss/cost function to activations of nodes of the network]) and
back-propagating the partial derivatives through the nodal network for the error-prone training data example (paragraphs 276-277 of applicant’s specification disclose “data on which an error has been made or an error on similar data is likely because two or more members of the ensemble have both made the same mistake” and “the computer system detects a case of such error-prone data”. Therefore, “the error-prone training data example”, under the BRI is a training data example that has led to an error or mistake by members/models of an ensemble) (see, e.g., pages 71-73, “To minimize the loss function J(W, b), an optimization algorithm using batch Gradient Descent (GD) is employed, which minimizes the errors between the output ˆx and target x … An efficient method for computing the partial derivative for each layer is the back-propagation (BP) algorithm” [i.e., back-propagating the partial derivatives through the nodal network], “In the overall training process, the training error of the identity mapping hW,b(x) ≈x decreases in each training loop”, “In example-dependent problems, the cost is determined by the examples” [i.e., a training error resulting from an error-prone training data example]).
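As an illustration of the feed-forward and back-propagation steps quoted above from Li, the following is a minimal single-layer sketch in Python. It is not Li's MATLAB implementation: the sigmoid activation, the squared-error form of the loss, the learning rate, and all function names (forward, loss, backprop, gradient_descent_step) are assumptions chosen for illustration.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def forward(W: Matrix, b: Vector, a: Vector) -> Vector:
    """One feed-forward step: a_next = f(W a + b), with element-wise sigmoid f."""
    return [sigmoid(sum(w * x for w, x in zip(row, a)) + b_i)
            for row, b_i in zip(W, b)]


def loss(W: Matrix, b: Vector, a: Vector, target: Vector) -> float:
    """Assumed squared-error loss between the output and the target."""
    out = forward(W, b, a)
    return 0.5 * sum((o - t) ** 2 for o, t in zip(out, target))


def backprop(W: Matrix, b: Vector, a: Vector,
             target: Vector) -> Tuple[Matrix, Vector]:
    """Partial derivatives of the loss with respect to W and b (chain rule)."""
    out = forward(W, b, a)
    # delta_i = dJ/dz_i = (out_i - target_i) * f'(z_i), with f' = out * (1 - out)
    delta = [(o - t) * o * (1.0 - o) for o, t in zip(out, target)]
    grad_W = [[d * x for x in a] for d in delta]
    grad_b = list(delta)
    return grad_W, grad_b


def gradient_descent_step(W: Matrix, b: Vector, a: Vector, target: Vector,
                          lr: float = 0.5) -> Tuple[Matrix, Vector]:
    """Move W and b one step along the negative gradient (the descent direction)."""
    grad_W, grad_b = backprop(W, b, a, target)
    new_W = [[w - lr * g for w, g in zip(row, g_row)]
             for row, g_row in zip(W, grad_W)]
    new_b = [b_i - lr * g for b_i, g in zip(b, grad_b)]
    return new_W, new_b
```

In a multi-layer network the same delta values would be propagated backward through each layer's weights; the single layer is used here only to keep the sketch short.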
Misra and Li are analogous art because they are directed to using neural networks (i.e., nodal networks) for image recognition.

Although Misra in view of Li substantially teaches the claimed invention, Misra in view of Li is not relied on to teach wherein counter-tying the activation values of the first and second nodes comprises:
detecting by the computer system an error-prone training data example in a set of training data for the nodal network; and
adding by the computer system an additional cost term to the cost function for the first and second nodes for the error-prone training data example.
In the same field, analogous art Baker teaches wherein counter-tying the activation values of the first and second nodes comprises:
detecting by the computer system an error-prone training data example in a set of training data for the nodal network (paragraphs 276-277 of applicant’s specification disclose “the computer system detects data on which an error has been ; and
adding by the computer system an additional cost term to the cost function for the first and second nodes for the error-prone training data example (paragraph 277 of applicant’s specification discloses “When, in box 1603, the computer system detects a case of such error-prone data, in box 1604, the computer system applies a special regularization cost term to the node pairs created or selected in box 1602. The special regularization cost function penalizes the two nodes for agreeing with each other and/or rewards them for being different. Note that the penalty for agreeing is only applied on data examples in which two or more ensemble members or merged former ensemble members make the same mistake.” Therefore, adding “an additional cost term to the cost function for the first and second nodes for the error-prone training , wherein the additional cost term rewards the first and second nodes (see, e.g., paragraphs 111 and 202, “the parameters of the models are adjusted to help correct the errors by improving the scores of the correct label when there is an error”, “That is, it will systematically improve the scores of the incorrect answers” [i.e., the additional cost term rewards the nodes by improving their model scores]).
Misra, Li and Baker are analogous art because they are directed to using neural networks (i.e., nodal networks) for image recognition. In particular, Baker teaches pattern recognition with a plurality of models (i.e., an ensemble) within a classifier and using a model to determine labels associated with a plurality of links. 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Misra in view of Li to incorporate the teachings of Baker to provide “a collection of cooperating recognition systems … managed as a population of systems, continually evolving and improving” and a “system … designed to correct its own errors” by operating “on training data that has been labeled automatically by running the recognition process” and performing “Delayed-decision training … on this designated training data with feedback of validated or 
Although Misra in view of Li and Baker substantially teaches the claimed invention, Misra in view of Li and Baker is not relied on to teach the additional cost term rewards the first and second nodes upon a determination that the activation values for the first and second nodes for the error-prone training data example are not in agreement.
In the same field, analogous art Alhamdoosh teaches the additional cost term rewards the first and second nodes upon a determination that the activation values for the first and second nodes for the error-prone training data example are not in agreement (see, e.g., pages 105 and 107-108, “select ensemble component networks with maximum disagreement among their outputs” [i.e., determine/select network nodes with output disagreement/activation values that are not in agreement], “Negative correlation learning (NCL) [14] amends the cost function with a penalty term that weakens the relationship with other individuals and controls the … variance … in the ensemble learning” [i.e., amend cost function with an additional penalty term upon determining disagreement/variance], “the learning error of the ith base model, given in Eq. (8), was modified to include a decorrelation penalty term pi … The penalty term pi can be designed in different ways depending on whether the ensemble networks are trained sequentially or parallelly.” [i.e., the additional term can reward nodes that are not 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Misra in view of Li and Baker to incorporate the teachings of Alhamdoosh to provide an ensemble learning scheme implemented by using error back-propagation algorithms/neural networks with back-propagation learning algorithms (BPNNs) and employing random vector functional link (RVFL) networks as base components, and incorporating these components with a negative correlation learning strategy (NCL) for building neural network ensembles where a cost function is defined for the NCL (See, e.g., Alhamdoosh, Abstract and pages 104-105). Doing so would have allowed Misra in view of Li and Baker to produce an ensemble (i.e., an ensemble of nodal/neural networks) with sound generalization capabilities through controlling disagreement among base learners’ outputs by amending a cost function with a penalty term that weakens the relationship with other individuals and controls the trade-off among the bias, variance and covariance in the ensemble, as suggested by Alhamdoosh (See, e.g., Alhamdoosh, Abstract and pages 104-105).
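For illustration, a decorrelation penalty of the general kind described above can be sketched in Python. This sketch uses a widely used formulation from the negative correlation learning literature and is an assumption: it does not reproduce Eq. (8) or the specific penalty term of Alhamdoosh, and the names ncl_penalty, ncl_error and the weight lam are hypothetical.

```python
from typing import Sequence


def ensemble_mean(outputs: Sequence[float]) -> float:
    """Simple average of the base models' outputs for one example."""
    return sum(outputs) / len(outputs)


def ncl_penalty(outputs: Sequence[float], i: int) -> float:
    """Assumed penalty: p_i = (f_i - mean) * sum over j != i of (f_j - mean)."""
    mean = ensemble_mean(outputs)
    others = sum(outputs[j] - mean for j in range(len(outputs)) if j != i)
    return (outputs[i] - mean) * others


def ncl_error(outputs: Sequence[float], i: int, target: float,
              lam: float = 0.5) -> float:
    """Error of base model i: squared error plus the weighted penalty term."""
    return 0.5 * (outputs[i] - target) ** 2 + lam * ncl_penalty(outputs, i)
```

Because the deviations from the ensemble mean sum to zero, this assumed penalty equals the negative squared deviation of model i from the mean. It is therefore zero when all base models agree and becomes more negative (lowering the minimized error) as model i's output departs from the others.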

With respect to independent claim 114, Misra discloses the invention as claimed including a computer system (see, e.g., pages 3995 and 3999-4000, Section 6, “Multi-task learning [5, 48] has a rich history in machine learning”, “we focus only on multi-task learning in the context of ConvNets used in computer vision” [i.e., a computer system is comprising: 
one or more processor cores; and a memory in communication with the one or more processor cores, wherein the memory stores software that, when executed by the one or more processor cores (see, e.g., page 3999 Sections 6 and 6.1, “Fast-RCNN is trained” - discloses training neural networks, such as complex Fast-RCNNs, on large image datasets, which reasonably discloses that a computer system including a processor, memory, and software is used, “such work, with publicly available code [i.e., software], formulates multi-task learning in an optimization framework that requires all data points in memory” [i.e., memory storage for data]) to train a nodal network by back-propagating (see, e.g., pages 3997 and 3999-4000, section 6, “Backpropagation through cross-stitch units” [i.e., back-propagating], “We train these two networks jointly, using end-to-end learning” [i.e., training nodal/neural networks]) partial derivatives of a cost function for an objective for the nodal network through the nodal network (as indicated above, “an objective for the nodal network”, under the BRI is any objective, purpose, goal or target of a task performed by the nodal network) (see, e.g., page 3997, “Backpropagating through cross-stitch units. Since cross-stitch units are modeled as linear combination, their partial derivatives for loss L with tasks A, B are computed as

[Equation image media_image1.png: partial derivatives of the loss L for the cross-stitch units]
” [i.e., backpropagating partial derivatives of a loss/cost function for an objective of task A performed by the network]).
Although Misra substantially discloses the claimed invention, Misra is not relied on to explicitly disclose back-propagating partial derivatives of a cost function ... by counter-tying activation values of first and second nodes of the nodal network … wherein the memory stores software that, when executed by the one or more processor cores, causes the one or more processor cores to counter-tie the activation values of the first and second nodes and back-propagating the partial derivatives through the nodal network for the error-prone training data example.
In the same field, analogous art Li teaches back-propagating partial derivatives of a cost function (see, e.g., page 70, “loss function J(W, b) is defined in (2) to measure the dispersion of the actual output and ideal output … [t]o minimize the loss function J(W, b), an optimization algorithm using batch Gradient Descent (GD) is employed, which minimizes the errors between the output x̂ and target x. The crucial feature of GD is computing the partial derivatives, i.e., the descent direction. An efficient method for computing the partial derivative for each layer is the back-propagation (BP) algorithm” [i.e., back-propagating partial derivatives of a loss/cost function]) ... by counter-tying activation values of first and second nodes of the nodal network (as indicated above, “counter-tying activation values”, under the BRI is applying a cost/loss function to activations of any two nodes of a network) (see, e.g., pages 70-71, “Let us denote a^(l)_i as the activation of unit i in layer l and an input vector x ∈ R^(s_1) can be denoted as a vector a^(1) [i.e., activation values of units/nodes in layers of the network]. Let z^(l+1) = W^(l) a^(l) + b^(l), where W^(l) and b^(l) are a matrix and a vector comprising W^(l)_ij and b^(l)_i, respectively. The feed-forward propagation can be described as: a^(l+1) = f(z^(l+1)), where f(·) is an element-wise activation function”, “loss function J(W, b) is defined in (2) to measure the dispersion of the actual output and ideal output … [t]o minimize the loss function J(W, b), an optimization algorithm using batch Gradient Descent (GD) is employed, which minimizes the errors between the output x̂ and target x. The crucial feature of GD is computing the partial derivatives, i.e., the descent direction. An efficient method for computing the partial derivative for each layer is the back-propagation (BP) algorithm” [i.e., partial derivatives of a loss/cost function are calculated and back-propagated by applying a loss/cost function to activations of nodes of the network]),
wherein the memory stores software that, when executed by the one or more processor cores, causes the one or more processor cores to counter-tie the activation values of the first and second nodes (paragraph 274 of applicant’s specification discloses “a procedure, called counter-tying, for improving the performance of any ensemble or merged ensemble. Counter-tying can be applied to the output nodes in a conventional ensemble in the form of an extra cost function” and “counter-tying can be applied to any two existing nodes in the network, to two nodes that are created just for the purpose, or to two nodes that are created by node splitting”. The examiner notes that these are the sole mentions of “counter-tying” or “counter-tie” in applicant’s specification. Therefore, “counter-tie the activation values”, under the BRI is applying a cost/loss function to activations of any two nodes of a network) (see, e.g., pages 70-71, “Let us denote a^(l)_i as the activation of unit i in layer l and an input vector x ∈ R^(s_1) can be denoted as a vector a^(1) [i.e., activation values of units/nodes in layers of the network]. Let z^(l+1) = W^(l) a^(l) + b^(l), where W^(l) and b^(l) are a matrix and a vector comprising W^(l)_ij and b^(l)_i, respectively. The feed-forward propagation can be described as: a^(l+1) = f(z^(l+1)), where f(·) is an element-wise activation function”, “loss function J(W, b) is defined in (2) to measure the dispersion of the actual output and ideal output … [t]o minimize the loss function J(W, b), an optimization algorithm using batch Gradient Descent (GD) is employed, which minimizes the errors between the output x̂ and target x. The crucial feature of GD is computing the partial derivatives, i.e., the descent direction. An efficient method for computing the partial derivative for each layer is the back-propagation (BP) algorithm” [i.e., partial derivatives of a loss/cost function are calculated and back-propagated by applying a loss/cost function to activations of nodes of the network], “[a]ll of the experiments were performed on a computer with an Intel i7-4790k (4G processor) and 16 GB RAM, and the method was programmed in MATLAB (version R2014a)” [i.e., a computer system comprising a processor core and a RAM/memory storing software/MATLAB programs that cause the processor to perform operations]) and
back-propagating the partial derivatives through the nodal network for the error-prone training data example (paragraphs 276-277 of applicant’s specification disclose “data on which an error has been made or an error on similar data is likely because two or more members of the ensemble have both made the same mistake” and “the computer system detects a case of such error-prone data”. Therefore, “the error-prone training data example”, under the BRI is a training data example that has J(W, b), an optimization algorithm using batch Gradient Descent (GD) is employed, which minimizes the errors between the output ˆx and target x … An efficient method for computing the partial derivative for each layer is the back-propagation (BP) algorithm” [i.e., back-propagating the partial derivatives through the nodal network], “In the overall training process, the training error of the identity mapping hW,b(x) ≈x decreases in each training loop”, “In example-dependent problems, the cost is determined by the examples” [i.e., a training error resulting from an error-prone training data example]). 
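For illustration only, the feed-forward relation and gradient-descent step quoted from Li above can be sketched as follows. This is a minimal sketch, not Li's MATLAB implementation: the sigmoid activation, the squared-error loss, the learning rate, the single-layer shape, and all function and variable names are assumptions chosen for the example.

```python
import math

def sigmoid(z):
    """Element-wise activation f (assumed sigmoid for this sketch)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(W, b, a):
    """Compute a^(l+1) = f(W^(l) a^(l) + b^(l)) for one layer."""
    z = [sum(w_ij * a_j for w_ij, a_j in zip(row, a)) + b_i
         for row, b_i in zip(W, b)]
    return [sigmoid(z_i) for z_i in z]

def loss(output, target):
    """Assumed squared-error loss between output and target."""
    return 0.5 * sum((o - t) ** 2 for o, t in zip(output, target))

def gradient_step(W, b, a, target, lr=0.5):
    """One gradient-descent update using the partial derivatives of the loss."""
    out = forward(W, b, a)
    # delta_i = dJ/dz_i = (out_i - t_i) * f'(z_i), with f' = out * (1 - out)
    delta = [(o - t) * o * (1.0 - o) for o, t in zip(out, target)]
    new_W = [[w_ij - lr * d_i * a_j for w_ij, a_j in zip(row, a)]
             for row, d_i in zip(W, delta)]
    new_b = [b_i - lr * d_i for b_i, d_i in zip(b, delta)]
    return new_W, new_b

# Hypothetical weights and input; the target equals the input (identity mapping).
W = [[0.1, -0.2], [0.4, 0.3]]
b = [0.0, 0.0]
x = [1.0, 0.5]
loss_before = loss(forward(W, b, x), x)
W_new, b_new = gradient_step(W, b, x, x)
loss_after = loss(forward(W_new, b_new, x), x)
```

In this sketch the training error of the identity mapping decreases after the update, which matches the behavior described in the quoted passage.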
Alternatively, Li also teaches a computer system comprising: 
one or more processor cores; and a memory in communication with the one or more processor cores, wherein the memory stores software that, when executed by the one or more processor cores (see, e.g., page 74, section 5, “[a]ll of the experiments were performed on a computer with an Intel i7-4790k (4G processor) and 16 GB RAM, and the method was programmed in MATLAB (version R2014a)” [i.e., a computer system comprising a processor core and a RAM/memory storing software/MATLAB programs]).
Misra and Li are analogous art because they are directed to using neural networks (i.e., nodal networks) for image recognition.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Misra to incorporate the teachings of Li to provide “a DNN [deep neural network]-based sequential granular feature extraction method” that uses “a cost-sensitive sequential 3WD [three-way decision] 
Although Misra in view of Li substantially teaches the claimed invention, Misra in view of Li is not relied on to teach counter-tie the activation values of the first and second nodes by:
detecting by the computer system an error-prone training data example in a set of training data for the nodal network; and
adding by the computer system an additional cost term to the cost function for the first and second nodes for the error-prone training data example, wherein the additional cost term rewards the first and second nodes.
In the same field, analogous art Baker teaches counter-tie the activation values of the first and second nodes by:
detecting by the computer system an error-prone training data example in a set of training data for the nodal network (paragraphs 276-277 of applicant’s specification disclose “the computer system detects data on which an error has been made or an error on similar data is likely because two or more members of the ensemble have both made the same mistake” and “the computer system detects a case of such error-prone data”. Therefore, detecting “an error-prone training data example”, ; and
adding by the computer system an additional cost term to the cost function for the first and second nodes for the error-prone training data example (paragraph 277 of applicant’s specification discloses “When, in box 1603, the computer system detects a case of such error-prone data, in box 1604, the computer system applies a special regularization cost term to the node pairs created or selected in box 1602. The special regularization cost function penalizes the two nodes for agreeing with each other and/or rewards them for being different. Note that the penalty for agreeing is only applied on data examples in which two or more ensemble members or merged former ensemble members make the same mistake.” Therefore, adding “an additional cost term to the cost function for the first and second nodes for the error-prone training data example”, under the BRI is adding a penalty term to make scores worse (or better) when ensemble members/nodes agree/correlate (or disagree/vary) with each other for the error-prone training data example) (see, e.g., paragraph 111, “in corrective training, , 
wherein the additional cost term rewards the first and second nodes (see, e.g., paragraphs 111 and 202, “the parameters of the models are adjusted to help correct the errors by improving the scores of the correct label when there is an error”, “That is, it will systematically improve the scores of the incorrect answers” [i.e., the additional cost term rewards the nodes by improving their model scores]).
Misra, Li and Baker are analogous art because they are directed to using neural networks (i.e., nodal networks) for image recognition. In particular, Baker teaches pattern recognition with a plurality of models (i.e., an ensemble) within a classifier and using a model to determine labels associated with a plurality of links. 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Misra in view of Li to incorporate the teachings of Baker to provide “a collection of cooperating recognition systems … managed as a population of systems, continually evolving and improving” and a “system … designed to correct its own errors” by operating “on training data that has been labeled automatically by running the recognition process” and performing “Delayed-decision training … on this designated training data with feedback of validated or corrected labels.” (See, e.g., Baker, paragraphs 16-17). Doing so would have allowed Misra in view of Li to use “the validated or corrected labels … as the final, improved recognition output”, as suggested by Baker (See, e.g., Baker, paragraph 17). This is an 
Although Misra in view of Li and Baker substantially teaches the claimed invention, Misra in view of Li and Baker is not relied on to teach the additional cost term rewards the first and second nodes upon a determination that the activation values for the first and second nodes for the error-prone training data example are not in agreement.
In the same field, analogous art Alhamdoosh teaches the additional cost term rewards the first and second nodes upon a determination that the activation values for the first and second nodes for the error-prone training data example are not in agreement (see, e.g., pages 105 and 107-108, “select ensemble component networks with maximum disagreement among their outputs” [i.e., determine/select network nodes with output disagreement/activation values that are not in agreement], “Negative correlation learning (NCL) [14] amends the cost function with a penalty term that weakens the relationship with other individuals and controls the … variance … in the ensemble learning” [i.e., amend cost function with an additional penalty term upon determining disagreement/variance], “the learning error of the ith base model, given in Eq. (8), was modified to include a decorrelation penalty term pi … The penalty term pi can be designed in different ways depending on whether the ensemble networks are trained sequentially or parallelly.” [i.e., the additional term can reward nodes that are not in agreement/de-correlated], “the penalty term in Eq. (14) reduces the correlation mutually among all ensemble individuals” [i.e., the additional term rewards individuals/nodes in the ensemble that do not correlate/agree with each other]).
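For illustration only, a decorrelation penalty of the kind described in the quoted passages can be sketched as follows. The sketch assumes the commonly used negative correlation learning form p_i = (f_i - f_mean) * sum over j != i of (f_j - f_mean); the exact form of Alhamdoosh's Eq. (8) and Eq. (14) is not reproduced in this action, so the formula, the weighting factor `lam`, and all names below are assumptions.

```python
def ncl_penalty(outputs, i):
    """Assumed decorrelation penalty p_i for ensemble member i."""
    mean = sum(outputs) / len(outputs)
    others = sum(f_j - mean for j, f_j in enumerate(outputs) if j != i)
    return (outputs[i] - mean) * others

def ncl_cost(outputs, i, target, lam=0.5):
    """Member i's error amended with the penalty term (lam is hypothetical)."""
    return 0.5 * (outputs[i] - target) ** 2 + lam * ncl_penalty(outputs, i)

# Two members whose outputs agree: the penalty term contributes nothing.
agree = [0.8, 0.8]
# Two members whose outputs disagree: the penalty term is negative,
# lowering the cost (i.e., rewarding the disagreement).
disagree = [0.6, 1.0]
penalty_agree = ncl_penalty(agree, 0)
penalty_disagree = ncl_penalty(disagree, 0)
```

Under this assumed form, the term lowers the cost of members whose outputs move away from the ensemble mean, which is one way to read "rewards the first and second nodes" when their activation values are not in agreement.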


Regarding claims 103 and 115, as discussed above, Misra in view of Li, Baker and Alhamdoosh teaches the method of claim 102 and the system of claim 114. 
Misra further discloses wherein counter-tying the activation values of the first and second nodes comprises tying the activation values of the first and second nodes such that the first and second nodes are trained together to perform a machine learning task instead of being trained to optimize individual performance of the first and second nodes (as indicated above, “counter-tying the 

Regarding new claim 126, as discussed above, Misra in view of Li, Baker and Alhamdoosh teaches the system of claim 114.
Although Misra in view of Li substantially teaches the claimed invention, Misra in view of Li is not relied on to teach wherein the additional cost term penalizes the first and second nodes.
In the same field, analogous art Baker teaches wherein the additional cost term penalizes the first and second nodes (see, e.g., paragraphs 111, 199 and 202, “the parameters of the models are adjusted to help correct the errors … by making the incorrect best-scoring label get a worse score”, “the worse score of the hypothesis that contains the /r/ will be judged as making the score worse for the correct answer rather 
Misra, Li and Baker are analogous art because they are directed to using neural networks (i.e., nodal networks) for image recognition. In particular, Baker teaches pattern recognition with a plurality of models (i.e., an ensemble) within a classifier and using a model to determine labels associated with a plurality of links. 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Misra in view of Li to incorporate the teachings of Baker to provide “a collection of cooperating recognition systems … managed as a population of systems, continually evolving and improving” and a “system … designed to correct its own errors” by operating “on training data that has been labeled automatically by running the recognition process” and performing “Delayed-decision training … on this designated training data with feedback of validated or corrected labels.” (See, e.g., Baker, paragraphs 16-17). Doing so would have allowed Misra in view of Li to use “the validated or corrected labels … as the final, improved recognition output”, as suggested by Baker (See, e.g., Baker, paragraph 17). This is an example of “use of known technique to improve similar devices (methods, or products) in the same way.” See MPEP 2143.
In the same field, analogous art Alhamdoosh teaches the additional cost term penalizes the first and second nodes upon a determination that the activation values for the first and second nodes for the error-prone training data example are in agreement (see, e.g., Abstract and pages 104-105 and 107-108, “Negative correlation learning (NCL) aims to produce ensembles with sound generalization capability through controlling the disagreement among base learners’ outputs” [i.e., control disagreement among the nodes], “Negative correlation learning (NCL) [14] amends the cost function with a penalty term that weakens the relationship with other individuals and controls the … covariance in the ensemble learning”, “the learning error of the ith base model … was modified to include a decorrelation penalty term pi” [i.e., amend cost function with an additional cost/penalty term that penalizes the individuals/nodes in the ensemble when there is covariance/correlation/agreement], “managing the covariance term Cov(xn) explicitly helps in controlling the disagreement among ensemble components’ outputs and hence producing better generalized ensemble model.” [i.e., additional cost term that penalizes individuals/nodes in the ensemble when they are in agreement to control disagreement]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Misra in view of Li and Baker to incorporate the teachings of Alhamdoosh to provide an ensemble learning scheme implemented by using error back-propagation algorithms/neural networks with back-propagation learning algorithms (BPNNs) and employing random vector functional link (RVFL) networks as base components, and incorporating these components with a negative correlation learning strategy (NCL) for building neural network ensembles.
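For illustration only, the interpretation applied above (an additional cost term that is applied only for an error-prone training data example and that penalizes two nodes for agreeing) can be sketched as follows. The sketch is not taken from applicant's specification or from any cited reference: the agreement measure, the assumption that activations lie in [0, 1], the `strength` parameter, and all names are hypothetical.

```python
def is_error_prone(pred_a, pred_b, label):
    """True when both ensemble members make the same mistake on an example."""
    return pred_a == pred_b and pred_a != label

def counter_tie_term(act_a, act_b, pred_a, pred_b, label, strength=1.0):
    """Extra cost for two nodes; zero unless the example is error-prone.

    Assumes activations in [0, 1]. The term is largest when the two
    activations agree and shrinks as they move apart.
    """
    if not is_error_prone(pred_a, pred_b, label):
        return 0.0
    return strength * (1.0 - abs(act_a - act_b))

# Both members predict class 1 but the label is 0: error-prone example.
agree_on_error = counter_tie_term(0.8, 0.8, pred_a=1, pred_b=1, label=0)
differ_on_error = counter_tie_term(0.9, 0.1, pred_a=1, pred_b=1, label=0)
# Both members are correct: no additional cost is applied.
correct_example = counter_tie_term(0.8, 0.8, pred_a=1, pred_b=1, label=1)
```

The gating function is kept separate from the penalty so that the detection step and the cost-term step correspond to distinct claim elements.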

Claims 104-109, 112, 113, 116-121, 124 and 125 are rejected under 35 U.S.C. 103 as being unpatentable over Misra in view of Li, Baker and Alhamdoosh as applied to claims 102 and 114 above, and further in view of Knittel (U.S. Patent Application Pub. No. 2018/0174051 A1, hereinafter “Knittel”). Knittel was filed on December 15, 2017 and claims foreign priority to AU application No. 2016277542, filed on December 15, 2016, and both of these dates are before the effective filing date of this application, i.e., January 30, 2018. Therefore, Knittel constitutes prior art under 35 U.S.C. 102(a)(2).
Regarding claims 104 and 116, as discussed above, Misra in view of Li, Baker and Alhamdoosh teaches the method of claim 102 and the system of claim 114. 
Although Misra in view of Li substantially teaches the claimed invention, Misra in view of Li is not relied on to teach detecting the error-prone training data example.
Baker teaches detecting the error-prone training data example (as indicated above, detecting the “error-prone training data example”, under the BRI is detecting, identifying or determining that a training data example has led to an error or mistake by 
Misra, Li and Baker are analogous art because they are directed to using neural networks (i.e., nodal networks) for image recognition. In particular, Baker teaches pattern recognition with a plurality of models (i.e., an ensemble) within a classifier and using a model to determine labels associated with a plurality of links. 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Misra in view of Li to incorporate the teachings of Baker to provide “a collection of cooperating recognition systems … managed as a population of systems, continually evolving and improving” and a “system … designed to correct its own errors” by operating “on training data that has been labeled automatically by running the recognition process” and performing “Delayed-decision training … on this designated training data with feedback of validated or corrected labels.” (See, e.g., Baker, paragraphs 16-17). Doing so would have allowed Misra in view of Li to use “the validated or corrected labels … as the final, improved recognition output”, as suggested by Baker (See, e.g., Baker, paragraph 17). This is an 
Although Misra in view of Li, Baker and Alhamdoosh substantially teaches the claimed invention, Misra in view of Li, Baker and Alhamdoosh is not relied on to teach determining whether the activation values for the first and second nodes are equal for a training data example.
In the same field, analogous art Knittel teaches determining whether the activation values for the first and second nodes are equal for a training data example (see, e.g., FIG. 3 – elements 311 and 314 showing that the “Activation value for node #1” and the activation value for node #4 are equal, 0.8, for “Training data instance #1 (301)” [i.e., determining that activation values for first and second nodes are equal for a training data example] and paragraphs 45-46 and 49, “FIG. 3 depicts an example arrangement 300 of activation values of an artificial neural network, in response to a set of training data instances … each row 301, 302 … corresponds to a training data instance, and each column 311 … 314 corresponds to a node in the artificial neural network” [i.e., first and second nodes of the neural/nodal network], “the response of the set of nodes #1 (column 311), … #4 (column 314) to receiving training data instance corresponding to row 301, as input to an artificial neural network … The activation values are [0.8, … 0.8].” [i.e., determining whether the activation values are equal, 0.8, for first and second nodes for the training example 301]). 
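For illustration only, the comparison described above can be sketched with a table of activation values in which each row is a training data instance and each column is a node. Only the 0.8 values for nodes #1 and #4 on instance #1 come from the cited passage; every other value, the tolerance, and the function name are hypothetical.

```python
def activations_equal(activations, instance, node_a, node_b, tol=1e-9):
    """Return True when two nodes have equal activation on one instance."""
    row = activations[instance]
    return abs(row[node_a] - row[node_b]) <= tol

# Rows: training data instances; columns: nodes #1 to #4 (zero-indexed here).
activations = [
    [0.8, 0.1, 0.3, 0.8],  # instance #1: nodes #1 and #4 both 0.8; middle values hypothetical
    [0.2, 0.9, 0.4, 0.6],  # instance #2: all values hypothetical
]

equal_on_first = activations_equal(activations, instance=0, node_a=0, node_b=3)
equal_on_second = activations_equal(activations, instance=1, node_a=0, node_b=3)
```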
Misra, Li, Baker, Alhamdoosh and Knittel are analogous art because they are directed to using neural networks (i.e., nodal networks) for objectives such as image recognition.


Regarding claims 105 and 117, as discussed above, Misra in view of Li, Baker, Alhamdoosh and Knittel teaches the method of claim 104 and the system of claim 116.
Misra further discloses adjusting, by a learning coach computer system, the … cost term (paragraphs 66 and 78 of applicant’s specification disclose “a second machine learning system, called a ‘learning coach.’ The learning coach does not learn the same thing that the first machine learning system is trying to learn … the learning coach learns to recognize situations where the progress of learning by the first learning system is slower than it should be and thereby can guide the first learning system to take actions that accelerate the learning process” and “A learning coach is a second machine learning system that learns knowledge about the learning process in order to coach a first learning system to have more effective learning so as to achieve better performance.” Therefore “a learning coach computer system” under the BRI is any machine learning computer system that learns something different (i.e., a sub-task) than another learning system is trying to learn (i.e., a task) or improves performance of another learning system) (see, e.g., FIG. 4 depicting a sub-classification task that shares representations with a primary classification task to improve performance of a first learning system by learning “representations that can help with both tasks A and B” and using a “sub-network that gets direct supervision from task A as network A” [i.e., a first learning system gets supervision from a learning coach system] “and the other as network B” and pages 3995 and 3997, “This paper proposes cross-stitch units … automatically learns an optimal combination of shared and task-specific representations … such a cross-stitched network can achieve better performance than the networks found by brute-force enumeration and search” [i.e., learning coach system improves performance], “Since cross-stitch units are modeled as linear combination, their partial derivatives for loss L with tasks A, B are computed as
    [equation image: media_image1.png]
 We denote αAB, αBA by αD and call them the different task values because they weigh the activations of another task” [i.e., adjusting the loss L/cost term]). 
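For illustration only, the linear combination performed by a cross-stitch unit, as characterized in the quoted passages, can be sketched as follows. The 2-by-2 arrangement of the α values, the element-wise application to two activation lists, and all names and numeric values are assumptions made for the example and are not quoted from Misra.

```python
def cross_stitch(x_a, x_b, alpha):
    """Linearly combine activations of networks A and B.

    alpha = [[a_AA, a_AB], [a_BA, a_BB]]; a_AB and a_BA weigh the
    activations of the other task (the "different task" values).
    """
    (a_aa, a_ab), (a_ba, a_bb) = alpha
    out_a = [a_aa * xa + a_ab * xb for xa, xb in zip(x_a, x_b)]
    out_b = [a_ba * xa + a_bb * xb for xa, xb in zip(x_a, x_b)]
    return out_a, out_b

x_a = [1.0, 2.0]   # hypothetical activations from network A
x_b = [3.0, 4.0]   # hypothetical activations from network B

# Different-task values set to zero: each network keeps its own activations.
separate_a, separate_b = cross_stitch(x_a, x_b, [[1.0, 0.0], [0.0, 1.0]])
# Non-zero different-task values: the representations are partly shared.
shared_a, shared_b = cross_stitch(x_a, x_b, [[0.9, 0.1], [0.1, 0.9]])
```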
Although Misra in view of Li substantially teaches the claimed invention, Misra in view of Li is not relied on to teach adjusting … the additional cost term.
In the same field, analogous art Baker teaches adjusting … the additional cost term (paragraph 277 of applicant’s specification discloses “When … the computer system detects a case of such error-prone data, … the computer system applies a special regularization cost term to the node pairs created or selected in box 1602. The special regularization cost function penalizes the two nodes for agreeing with each other”. Therefore, adjusting “the additional cost term”, under the BRI is penalizing nodes or making node scores worse when a training data example leads to an error or mistake) (see, e.g., paragraph 111, “in corrective training, a recognition process is run and the parameters of the models are adjusted to help correct the errors by improving the scores of the correct label when there is an error or by making the incorrect best-scoring label get a worse score.” [i.e., adjusting the additional cost term/adjusting scores]).
Misra, Li and Baker are analogous art because they are directed to using neural networks (i.e., nodal networks) for image recognition. In particular, Baker teaches 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Misra in view of Li to incorporate the teachings of Baker to provide “a collection of cooperating recognition systems … managed as a population of systems, continually evolving and improving” and a “system … designed to correct its own errors” by operating “on training data that has been labeled automatically by running the recognition process” and performing “Delayed-decision training … on this designated training data with feedback of validated or corrected labels.” (See, e.g., Baker, paragraphs 16-17). Doing so would have allowed Misra in view of Li to use “the validated or corrected labels … as the final, improved recognition output”, as suggested by Baker (See, e.g., Baker, paragraph 17). This is an example of “use of known technique to improve similar devices (methods, or products) in the same way.” See MPEP 2143.
 
Regarding claims 106 and 118, as discussed above, Misra in view of Li, Baker, Alhamdoosh and Knittel teaches the method of claim 104 and the system of claim 116.
Misra further discloses wherein the nodal network comprises an ensemble of classifier networks (see, e.g., page 3999, “Attribute prediction … is a multi-label classification problem … our approach cross-stitches two networks and therefore uses 2× parameters, we also consider an ensemble of two one-task networks (denoted by ‘Ensemble’) … the ensemble baseline uses ∼ 2× the cross-stitch parameters” [i.e., a nodal/neural network comprising an ensemble of two classifier/classification networks]).

Regarding claims 107 and 119, as discussed above, Misra in view of Li, Baker, Alhamdoosh and Knittel teaches the method of claim 106 and the system of claim 118.
Misra further discloses a first classifier network in the ensemble of classifier networks; and … a second classifier network in the ensemble of classifier networks (see, e.g., page 3999, “we also consider an ensemble of two one-task networks (denoted by ‘Ensemble’).” [i.e., first and second classifier networks in the ensemble of networks]).
Although Misra in view of Li, Baker and Alhamdoosh substantially teaches the claimed invention, Misra in view of Li, Baker and Alhamdoosh is not relied on to teach wherein the first node is an output node of a first classifier network … ; and
the second node is an output node of a second classifier network … of classifier networks.
In the same field, analogous art Knittel teaches wherein the first node is an output node of a first classifier network (see, e.g., paragraph 38, “For each instance of the training data the activation value of output nodes of the artificial neural network is determined” [i.e., first node is an output node of a first neural/classifier network]) … ; and
the second node is an output node of a second classifier network … of classifier networks (see, e.g., paragraphs 90 and 124, “the artificial neural network 600 is a 2 layer network, consisting of … a set of second layer nodes, also known as output nodes” [i.e., second node is an output node of a second neural/classifier network 
Misra, Li, Baker and Knittel are analogous art because they are directed to using neural networks (i.e., nodal networks) for image recognition.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Misra in view of Li, Baker and Alhamdoosh to incorporate the teachings of Knittel to provide a system and method for “training an artificial neural network” (i.e., a nodal network) by “determining an activation value for each node in a set of nodes of the artificial neural network, the activation values being determined by applying training data to the artificial neural network” (i.e., determining activation values for first and second nodes for training data examples), “and scaling the determined activation values for each of a plurality of the nodes in a portion of the artificial neural network.” (See, e.g., Knittel, Abstract). Doing so would have allowed Misra in view of Li, Baker and Alhamdoosh to use a sparsity penalty value (i.e., a cost term in a cost function) to address problems with training artificial neural networks such as overfitting, and the large amount of processing time in order to achieve a required accuracy and to use scaling factors applied to the activation values of a set of nodes, based on the relative rank of those activation values to improve the training of artificial neural networks by adjusting the edge weights “in a way that allows a small number nodes to have large activation values, and the remaining nodes to have small activation values, and thus the artificial neural network converges to a sparsely responding network”, as suggested by Knittel (See, e.g., Knittel, paragraph 124).


Although Misra in view of Li substantially teaches the claimed invention, Misra in view of Li is not relied on to teach detecting the error-prone training data example.
In the same field, analogous art Baker teaches detecting the error-prone training data example (as indicated above, detecting the “error-prone training data example”, under the BRI is detecting, identifying or determining that a training data example has led to an error or mistake by members/models of an ensemble) (see, e.g., paragraphs 50-51 and 96, “creating a set of linked model sets … based on training said … recognition system on the sample of data wherein each model in the set of linked models is created by training on the given sample with a training label” [i.e., a training data example/sample in a set of training data], “collecting … estimates of a degree to which errors made by each two of the linked models are diverse” [i.e., detecting errors by members/models of an ensemble], “Delayed-decision training is designed to … make it more tolerant of labeling errors in the training data” [i.e., identify error-prone training data examples]).
Misra, Li and Baker are analogous art because they are directed to using neural networks (i.e., nodal networks) for image recognition. In particular, Baker teaches pattern recognition with a plurality of models (i.e., an ensemble) within a classifier and using a model to determine labels associated with a plurality of links. 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Misra in view of Li to incorporate the teachings of Baker to provide “a collection of cooperating recognition systems … 
Although Misra in view of Li, Baker and Alhamdoosh substantially teaches the claimed invention, Misra in view of Li, Baker and Alhamdoosh is not relied on to teach determining whether the ensemble made an error in a classification task for the ensemble on a training data example. 
In the same field, analogous art Knittel teaches determining whether the ensemble made an error in a classification task for the ensemble on a training data example (see, e.g., paragraph 97, “determining step 503 executes to produce a value representing the error between the output produced by the artificial neural network 600 and the target … error value between the network activation … and the training target input represents a relationship between the input data and the training target” [i.e., determining that the neural network ensemble made an error in a classification task on a training data example]).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Misra in view of Li, Baker and Alhamdoosh to incorporate the teachings of Knittel to provide a system and method for “training an artificial neural network” (i.e., a nodal network) by “determining an activation value for each node in a set of nodes of the artificial neural network, the activation values being determined by applying training data to the artificial neural network” (i.e., determining activation values for first and second nodes for training data examples), “and scaling the determined activation values for each of a plurality of the nodes in a portion of the artificial neural network.” (See, e.g., Knittel, Abstract). Doing so would have allowed Misra in view of Li, Baker and Alhamdoosh to use a sparsity penalty value (i.e., a cost term in a cost function) to address “problems with training artificial neural networks such as overfitting, and the large amount of processing time in order to achieve a required accuracy” and to use “scaling factors applied to the activation values of a set of nodes, based on the relative rank of those activation values” to improve the training of artificial neural networks by adjusting the edge weights “in a way that allows a small number nodes to have large activation values, and the remaining nodes to have small activation values, and thus the artificial neural network converges to a sparsely responding network”, as suggested by Knittel (See, e.g., Knittel, paragraph 124). This is an example of “use of known technique to improve similar devices (methods, or products) in the same way.” See MPEP 2143.

Regarding claims 109 and 121, as discussed above, Misra in view of Li, Baker, Alhamdoosh and Knittel teaches the method of claim 108 and the system of claim 120.
Misra further discloses determining, by a learning coach computer system, the … cost term (as indicated above, “a learning coach computer system” under the BRI is any machine learning computer system that learns something different (i.e., a sub-task) than another learning system is trying to learn (i.e., a task) or improves performance of another learning system) (see, e.g., FIG. 4 depicting a sub-classification task that shares representations with a primary classification task to improve performance of a first learning system by learning “representations that can help with both tasks A and B” and using a “sub-network that gets direct supervision from task A as network A” [i.e., a first learning system gets supervision from a learning coach system] “and the other as network B” and pages 3995 and 3997, “This paper proposes cross-stitch units … automatically learns an optimal combination of shared and task-specific representations … such a cross-stitched network can achieve better performance than the networks found by brute-force enumeration and search” [i.e., learning coach system improves performance], “Since cross-stitch units are modeled as linear combination, their partial derivatives for loss L with tasks A, B are computed as
    [equation image: media_image1.png]
 We denote αAB, αBA by αD and call 
Although Misra in view of Li substantially teaches the claimed invention, Misra in view of Li is not relied on to teach determining … the additional cost term.
In the same field, analogous art Baker teaches determining … the additional cost term (paragraph 277 of applicant’s specification discloses “When … the computer system detects a case of such error-prone data, … the computer system applies a special regularization cost term to the node pairs created or selected in box 1602. The special regularization cost function penalizes the two nodes for agreeing with each other”. Therefore, determining “the additional cost term”, under the BRI is determining a reward or penalty for nodes or determining how to improve or worsen node scores) (see, e.g., paragraph 111, “in corrective training, a recognition process is run and the parameters of the models are adjusted to help correct the errors by improving the scores of the correct label when there is an error or by making the incorrect best-scoring label get a worse score.” [i.e., determining the additional cost term/for improving or worsening scores]).
Misra, Li and Baker are analogous art because they are directed to using neural networks (i.e., nodal networks) for image recognition. In particular, Baker teaches pattern recognition with a plurality of models (i.e., an ensemble) within a classifier and using a model to determine labels associated with a plurality of links. 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Misra in view of Li to incorporate the teachings of Baker to provide “a collection of cooperating recognition systems … 

Regarding claims 112 and 124, as discussed above, Misra in view of Li, Baker, Alhamdoosh and Knittel teaches the method of claim 104 and the system of claim 116.
Misra further discloses prior to counter-tying activation values of first and second nodes of the nodal network:
adding, by the computer system, the first and second nodes to the nodal network (see, e.g., pages 3996-3997 Section 3.3, “The network can decide to make certain layers task specific by setting αAB or αBA to zero, or choose a more shared representation by assigning a higher value to them”, “cross-stitch units are modeled as linear combination” [i.e., adding/combining cross-stitched/merged units/nodes], “We call the sub-network that gets direct supervision from task A as network A, and correspondingly the other as B. Cross-stitch units help regularize both tasks by learning and enforcing shared representations by combining activation (feature) maps.” [i.e., adding/merging the first and second nodes from the sub-network to the nodal network]).
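For illustration only, the linear combination that the cited passage attributes to a cross-stitch unit can be sketched as follows. This is a minimal sketch, not code from Misra; the function name `cross_stitch` and the layout of the `alpha` argument are assumptions made for the example.

```python
def cross_stitch(x_a, x_b, alpha):
    """Linearly combine two activation maps (flattened to lists).

    alpha is ((a_AA, a_AB), (a_BA, a_BB)); this layout is an assumption
    chosen for the sketch.
    """
    (a_aa, a_ab), (a_ba, a_bb) = alpha
    out_a = [a_aa * xa + a_ab * xb for xa, xb in zip(x_a, x_b)]
    out_b = [a_ba * xa + a_bb * xb for xa, xb in zip(x_a, x_b)]
    return out_a, out_b


# Setting a_AB and a_BA to zero keeps the layer task specific.
task_specific = cross_stitch([1.0, 2.0], [3.0, 4.0], ((1.0, 0.0), (0.0, 1.0)))

# Higher off-diagonal values give a more shared representation.
shared = cross_stitch([1.0, 2.0], [3.0, 4.0], ((0.5, 0.5), (0.5, 0.5)))
```

With the identity setting each network passes its own activations through unchanged, while equal weights make both outputs the average of the two inputs.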
Although Misra in view of Li, Baker and Alhamdoosh substantially teaches the claimed invention, Misra in view of Li, Baker and Alhamdoosh is not relied on to teach training, by the computer system, the nodal network with the first and second nodes, wherein training the nodal network comprises:
splitting the training data into multiple subsets, comprising a first subset and a second subset;
back-propagating information to the first node only from the first subset of training data; and
back-propagating information to the second node only from the second subset of training data.
In the same field, analogous art Knittel teaches splitting the training data into multiple subsets, comprising a first subset and a second subset (see, e.g., FIG. 3 showing elements 301 and 302 - training data instances #1 and #2 [i.e., splitting training data into first and second subsets] and paragraph 45, “In FIG. 3, each row 301, 302, 303, 304, 305 corresponds to a training data instance” [i.e., training data is split into multiple subsets 301, 302]);
back-propagating information to the first node only from the first subset of training data (see, e.g., paragraph 78, “the network is trained using backpropagation of a set of derivative values. The set of derivative values is determined from the combination of the derivatives of an error value determined by the difference between the activation value of the output nodes [i.e., including the first node] and the target for each training instance or training example” [i.e., back-propagating information to the first node from the first instance #1/subset 301 of training data]); and
back-propagating information to the second node only from the second subset of training data (see, e.g., paragraph 78, “the network is trained using backpropagation of a set of derivative values. The set of derivative values is determined from the combination of the derivatives of an error value determined by the difference between the activation value of the output nodes [i.e., including the second node] and the target for each training instance or training example” [i.e., back-propagating information to the second node from the second instance #2/subset 302 of training data]).
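For illustration only, the three recited limitations (splitting the training data and back-propagating to each node only from its own subset) can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Knittel or from applicant's specification: each node is modeled as a single linear unit with a squared-error objective, and the names `train_step` and `subset_of` are hypothetical.

```python
def train_step(w1, w2, data, subset_of, lr=0.1):
    """One update in which each node receives gradient only from its subset.

    data: list of (x, target) pairs.
    subset_of: list of 1 or 2, giving the subset of each example.
    Each node is a linear unit y = w * x with squared error (an assumption).
    """
    g1 = g2 = 0.0
    for (x, t), s in zip(data, subset_of):
        if s == 1:
            g1 += 2.0 * (w1 * x - t) * x  # first subset reaches node 1 only
        else:
            g2 += 2.0 * (w2 * x - t) * x  # second subset reaches node 2 only
    return w1 - lr * g1, w2 - lr * g2


data = [(1.0, 2.0), (1.0, -2.0)]
w1, w2 = train_step(0.0, 0.0, data, subset_of=[1, 2])
```

Starting from identical weights, the two nodes move in opposite directions because each sees only its own subset.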
Misra, Li, Baker, Alhamdoosh and Knittel are analogous art because they are directed to using neural networks (i.e., nodal networks) for objectives such as image recognition.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Misra in view of Li, Baker and Alhamdoosh to incorporate the teachings of Knittel to provide a system and method for “training an artificial neural network” (i.e., a nodal network) by “determining an activation value for each node in a set of nodes of the artificial neural network, the activation values being determined by applying training data to the artificial neural network” (i.e., determining activation values for first and second nodes for training data examples), “and scaling the determined activation values for each of a plurality of the nodes in a portion of the artificial neural network.” (See, e.g., Knittel, Abstract). Doing so would have allowed Misra in view of Li, Baker and Alhamdoosh to use a sparsity penalty value (i.e., a cost term in a cost function) to address problems with training artificial neural 

Regarding claims 113 and 125, as discussed above, Misra in view of Li, Baker and Alhamdoosh teaches the method of claim 112 and the system of claim 124.
Although Misra in view of Li, Baker and Alhamdoosh substantially teaches the claimed invention, Misra in view of Li, Baker and Alhamdoosh is not relied on to teach wherein splitting the training data comprises splitting, by the computer system, the training data based on a sign of a derivative of the objective of the nodal network with respect to an activation of an original node of the nodal network.
In the same field, analogous art Knittel teaches wherein splitting the training data comprises splitting, by the computer system, the training data (see, e.g., FIG. 3 showing elements 301 and 302 - training data instances #1 and #2 [i.e., splitting training data into first and second subsets] and paragraphs 45 and 51-52, “In FIG. 3, each row 301, 302, 303, 304, 305 corresponds to a training data instance”, “an artificial neural network implemented on a computer system … computer system 800, upon which the various arrangements described can be practiced” [i.e., splitting the training data by the computer system]) based on a sign of a derivative of the objective of the nodal network with respect to an activation of an original node of the nodal network (as indicated above, the “objective of the nodal network” has been interpreted as any objective, purpose, goal or target of a task performed by the nodal network) (see, e.g., paragraph 78, “the network is trained using backpropagation of a set of derivative values. The set of derivative values is determined from the combination of the derivatives of an error value determined by the difference between the activation value of the output nodes and the target for each training instance or training example [i.e., a derivative of the target/the objective of the nodal network], and the derivatives of one or more sparsity penalty values determined from the distribution of activation values of nodes in the network. The network 600 represents an artificial neural network consisting of sets of nodes” [i.e., based on a derivative of a classification objective of the network with respect to an activation of an original node in neural/nodal network 600]).
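For illustration only, the claim language "splitting the training data based on a sign of a derivative of the objective ... with respect to an activation of an original node" can be sketched as follows. This is a minimal sketch under stated assumptions, not code from Knittel or from applicant's specification: the objective is assumed to be a squared error on the node's activation, a zero derivative is assigned to the first subset, and all names are hypothetical.

```python
def objective_derivative(activation, target):
    """Derivative of (activation - target)**2 with respect to activation."""
    return 2.0 * (activation - target)


def split_by_derivative_sign(indices, derivs):
    """Partition example indices by the sign of the per-example derivative."""
    first = [i for i, d in zip(indices, derivs) if d >= 0]  # zero goes here (assumption)
    second = [i for i, d in zip(indices, derivs) if d < 0]
    return first, second


# (activation of the original node, target) for four training examples
examples = [(0.9, 0.0), (0.2, 1.0), (0.7, 1.0), (0.6, 0.0)]
derivs = [objective_derivative(a, t) for a, t in examples]
first_subset, second_subset = split_by_derivative_sign(range(len(examples)), derivs)
```

Examples whose activation exceeds the target land in the first subset, and the remaining examples land in the second.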

Claims 110, 111, 122 and 123 are rejected under 35 U.S.C. 103 as being unpatentable over Misra in view of Li and Baker as applied to claims 102 and 114 above, and in further view of non-patent literature Ghorpade et al. ("Neural Networks for Face Recognition Using SOM", Dec. 2010, IJCST Vol. 1, Issue 2, pp. 65-67, hereinafter “Ghorpade”).
Regarding claims 110 and 122, as discussed above, Misra in view of Li and Baker teaches the method of claim 102 and the system of claim 114.
Misra further discloses wherein the nodal network comprises a merger of an ensemble of … partially ordered networks (see, e.g., FIG. 2 depicting “train[ing] a variety of multi-task (two-task) architectures by splitting at different layers” [i.e., layered, 
Although Misra in view of Li and Baker substantially teaches the claimed invention, Misra in view of Li and Baker is not relied on to teach self-organizing partially ordered networks.
In the same field, analogous art Ghorpade teaches self-organizing partially ordered networks (see, e.g., page 67 Section VIII: “SOM is sheet-like artificial neural network, the cells of which become specially tuned to various input signal patterns or classes through an unsupervised learning process ... SOM reduce dimensions and display similarities. Self-Organizing Maps are topologically ordered, which leads to good extracting feature ability” [i.e., self-organizing partially ordered network]). 
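For illustration only, the unsupervised tuning of map cells to input patterns described in the quoted passage can be sketched with a generic self-organizing map update. This is a minimal sketch of a textbook one-dimensional SOM step, not Ghorpade's implementation; the function name `som_step` and the parameters `lr` and `radius` are assumptions.

```python
def som_step(weights, x, lr=0.5, radius=1):
    """One unsupervised update of a 1-D self-organizing map.

    weights: list of cell weight vectors; x: one input vector.
    Returns the index of the best-matching cell and the updated weights.
    """
    dists = [sum((wi - xi) ** 2 for wi, xi in zip(w, x)) for w in weights]
    bmu = dists.index(min(dists))  # best-matching unit
    updated = []
    for i, w in enumerate(weights):
        if abs(i - bmu) <= radius:
            # Cells within the neighborhood move toward the input.
            w = [wi + lr * (xi - wi) for wi, xi in zip(w, x)]
        updated.append(w)
    return bmu, updated


weights = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
bmu, weights = som_step(weights, [1.0, 2.0], lr=0.5, radius=0)
```

With a radius of zero only the best-matching cell is updated; a larger radius also pulls neighboring cells toward the input, which is what produces the topological ordering.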
Misra, Li, Baker and Ghorpade are analogous art because they are directed to neural networks for image recognition.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Misra in view of Li and Baker to incorporate the teachings of Ghorpade to provide a self-organizing partially ordered nodal network (See, e.g., Ghorpade, page 67 Section VIII). Doing so would have 

Regarding claims 111 and 123, as discussed above, Misra in view of Li and Baker teaches the method of claim 102 and the system of claim 114.
Although Misra in view of Li and Baker substantially teaches the claimed invention, Misra in view of Li and Baker is not relied on to teach wherein the nodal network comprises a single self-organizing partially ordered network.
In the same field, analogous art Ghorpade teaches wherein the nodal network comprises a single self-organizing partially ordered network (see, e.g., page 67 Section VIII: “SOM is sheet-like artificial neural network, the cells of which become specially tuned to various input signal patterns or classes through an unsupervised learning process ... SOM reduce dimensions and display similarities. Self-Organizing Maps are topologically ordered, which leads to good extracting feature ability” [i.e., the neural/nodal network comprises a self-organizing partially ordered network]). 
Misra, Li, Baker and Ghorpade are analogous art because they are directed to neural networks for image recognition.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Misra in view of Li and Baker to incorporate the teachings of Ghorpade to provide a self-organizing partially ordered nodal network (See, e.g., Ghorpade, page 67 Section VIII). Doing so would have 

Conclusion
The prior art made of record, listed on form PTO-892, and not relied upon, is considered pertinent to applicant's disclosure. 
The examiner requests, in response to this office action, support be shown for language added to any original claims on amendment and any new claims. That is, indicate support for newly added claim language by specifically pointing to page(s) and line no(s) in the specification and/or drawing figure(s). This will assist the examiner in prosecuting the application.
When responding to this office action, Applicant is advised to clearly point out the patentable novelty which he or she thinks the claims present, in view of the state of the art disclosed by the references cited or the objections made. He or she must also show how the amendments avoid such references or objections. See 37 CFR 1.111(c).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to RANDY K BALDWIN whose telephone number is (571)270-5222. The examiner can normally be reached on Mon - Fri 9:00-6:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an 
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kamran Afshar, can be reached on 571-272-7796. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/R.K.B./Examiner, Art Unit 2125

/KAMRAN AFSHAR/Supervisory Patent Examiner, Art Unit 2125