DETAILED ACTION
This Office action is in response to the amendment filed on 03/31/2022. Claims 1, 4, 9, 10, 12, 18, 19, and 20 were amended. No claims were added. Claim 7 was canceled.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claims 1-20 are presented for examination.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 09/23/2020 has been entered.
Priority
The following claimed benefit is acknowledged: the instant application, filed 01/18/2019, claims priority from provisional application 62363652, filed 06/09/2016.

Response to Argument
Applicant’s argument regarding rejections under 35 U.S.C. § 112, Second Paragraph:
Applicant’s Argument:
Applicant has amended independent claims 1, 18, and 19 to recite "wherein adjusting the first values of the parameters comprises adjusting the first values of the parameters to optimize an objective function that depends in part on a penalty term that is based on the determined measures of importance of the plurality of parameters to the first machine learning task." Therefore, reconsideration and withdrawal of the rejection are respectfully requested.
Examiner’s Response:
The 112(b) rejection is withdrawn in view of the claim amendment filed on 03/31/2022.
Applicant’s argument regarding rejections under 35 U.S.C. § 103:
Applicant’s Argument: 
Applicant respectfully disagrees. The cited art does not teach or suggest "determining, for each of the plurality of parameters, a respective measure of an importance of the parameter to the first machine learning task" as recited by amended claim 1. 
As Marcheret merely assigns a weight to each data item of labeled training data, Marcheret does not teach or suggest determining for each parameter of a machine learning model a respective measure of an importance of the parameter to the first machine learning task, as recited by amended claim 1.
Examiner’s Response:
Examiner respectfully disagrees with Applicant's argument because Marcheret teaches "determining, for each of the plurality of parameters, a respective measure of an importance of the parameter to the first machine learning task," as can be seen at Marcheret, [Par.0063-0064], “Using this notation we may write the joint distribution, which represents the probability that the two GMMs being used to model the training data and the test data match the actual combined training and test data, given the parameters of both GMMs and the classifier's parameters… the GMM parameters are adjusted in order to maximize the log likelihood calculated in the E step. The E step is then performed again, then the M step, and so on until the log likelihood converges. At this stage the maximally likely GMM parameters and classifier parameters have been determined.” Examiner’s note: Marcheret determines the values of the GMM parameters and the classifier parameters from a probability calculation, namely the probability that the model matches the data during training on the labeled data and the unlabeled data. The GMM parameter values and classifier parameter values obtained at the state in which the model matches the data are therefore considered the importance values of the parameters, and the probability calculation that yields those values is considered a measure of the importance of each parameter. Moreover, the claim does not specify how the importance of a parameter is measured. Therefore, the argument regarding the limitation "determining, for each of the plurality of parameters, a respective measure of an importance of the parameter to the first machine learning task" is not persuasive, and the rejection is maintained.
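For illustration only, the alternating E step and M step described in the quoted passage of Marcheret can be sketched for a plain one-dimensional Gaussian mixture. This is a minimal sketch under stated assumptions and is not Marcheret's implementation: it omits the classifier parameters and the joint training/test model, and the function names (`normal_pdf`, `em_gmm_1d`), the tolerance, the variance floor, and the sample data are hypothetical choices introduced here.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, means, variances, weights, tol=1e-8, max_iter=500):
    """Fit a 1-D Gaussian mixture by alternating E and M steps (illustrative)."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    prev_ll = -math.inf
    history = []  # log likelihood computed in each E step
    for _ in range(max_iter):
        # E step: responsibilities and log likelihood under the current parameters.
        resp = []
        ll = 0.0
        for x in data:
            joint = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(joint)
            ll += math.log(total)
            resp.append([p / total for p in joint])
        history.append(ll)
        # Stop once the log likelihood has converged.
        if ll - prev_ll < tol:
            break
        prev_ll = ll
        # M step: re-estimate the parameters to increase the log likelihood.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # assumed floor to avoid a zero variance
    return means, variances, weights, history


# Hypothetical sample data: two well-separated clusters.
sample = [-2.1, -1.9, -2.0, -1.8, -2.2, 1.9, 2.1, 2.0, 2.2, 1.8]
fitted_means, fitted_vars, fitted_weights, ll_history = em_gmm_1d(
    sample, means=[-1.0, 1.0], variances=[1.0, 1.0], weights=[0.5, 0.5]
)
```

The returned values are maximum-likelihood parameter estimates for the mixture; the sketch computes no separate per-parameter quantity beyond those estimates and the log likelihood history.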
Applicant’s Argument:
Further, without conceding to the position of the Office, and in an effort to advance prosecution of the instant application, the Applicant has amended claim 1 to recite "wherein adjusting the first values of the parameters comprises adjusting the first values of the parameters to optimize an objective function that depends in part on a penalty term that is based on the determined measures of importance of the plurality of parameters to the first machine learning task." 
Applicant respectfully submits that the cited references, alone or in combination, do not teach the above newly added features of amended claim 1. 
Examiner’s Response: 
This argument includes the newly added limitations as the claim is presented.  It has been fully considered but is moot in view of the new grounds of rejection presented below necessitated by the amendment. 
Applicant’s Argument:
In particular, the Office acknowledges that Marcheret and Rousset do not teach an objective function that includes "a second term that imposes a penalty for parameter values deviating from the first parameter values, wherein the second term penalizes deviations from the first values more for parameters that were more important to the first machine learning task than for parameters that were less important to the first machine learning task" as recited by the previous claim 1, but cites Cao as teaching this feature. Office Action at p. 15. Applicant respectfully disagrees. As cited by the Office, section 4.1 of Cao states:
Examiner’s Response:
The argument is directed to the limitation "a second term that imposes a penalty for parameter values deviating from the first parameter values, wherein the second term penalizes deviations from the first values more for parameters that were more important to the first machine learning task than for parameters that were less important to the first machine learning task." This limitation is not recited in claim 1 as filed on 03/31/2022; therefore, the argument is not persuasive.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:


A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1, 2, 4, 9, 15, 17, 18, 19 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Marcheret et al. (Pub. No. US 2013/0254153 – hereinafter, Marcheret) in view of Jamaluddin et al. (NPL: Effect of Penalty Function Parameter in Objective Function of System Identification - Accelerating the world’s research - hereinafter, Jamaluddin) and further in view of Rousset et al. (NPL: Neural networks with a self-refreshing memory: knowledge transfer in sequential learning tasks without catastrophic... - Connection science - hereinafter, Rousset).
Regarding claim 1, Marcheret teaches a computer-implemented method of training a machine learning model having a plurality of parameters, wherein the machine learning model has been trained on a first machine learning task using first training data to determine first values of the parameters of the machine learning model (Marcheret, [Par.0046-0047], “The inputs to method 400 comprise a classification model 410, labeled training data 420 that was used to build classification model 410… Labeled training data 420 may be associated with a set of weights that may be used to specify the relative contribution of each item of labeled training data to the classification model 410.”),
and wherein the method comprises: determining, for each of the plurality of parameters, a respective measure of an importance of the parameter on the first machine learning task, comprising: determining an approximation of a probability that the first value of the parameter after the training on the first machine learning task is a correct value of the parameter given the first training data used to train the machine learning model on the first machine learning task (Marcheret, [Par.0063-0064], “Using this notation we may write the joint distribution, which represents the probability that the two GMMs being used to model the training data and the test data match the actual combined training and test data, given the parameters of both GMMs and the classifier's parameters… the GMM parameters are adjusted in order to maximize the log likelihood calculated in the E step. The E step is then performed again, then the M step, and so on until the log likelihood converges. At this stage the maximally likely GMM parameters and classifier parameters have been determined.” Examiner’s note: Marcheret determines the values of the GMM parameters and the classifier parameters from a probability calculation, namely the probability that the model matches the data during training on the labeled data and the unlabeled data. The GMM parameter values and classifier parameter values obtained at the state in which the model matches the data are therefore considered the importance values of the parameters, and the probability calculation that yields those values is considered a measure of the importance of each parameter. Moreover, the claim does not specify how the importance of a parameter is measured.);
obtaining second training data for training the machine learning model on a second, different machine learning task (Marcheret, [Par.0024], “In accordance with some embodiments, an improved classification model may be generated by identifying similarities between unlabeled test data and labeled training data. For example, by understanding features of the unlabeled test data that are similar to features in the labeled training data, the labeled training data may be reweighted and the classification model may be retrained to improve performance of the classification model for the particular use case with the distribution of unlabeled test data. The labeled training data may be reweighted, for example, by modifying one or more weight values associated with the items of labeled training data. The classification model may be retrained in an unsupervised fashion to achieve a desired level of performance (i.e., without requiring supervised labeling of the test data), and the performance of the retrained classification model on the unlabeled test data may closely match that of the model on the reweighted labeled training data.” Examiner’s note: the reweighted labeled training data is considered the second training data.);
and training the machine learning model on the second machine learning task by training the machine learning model on the second training data to adjust the first values of the parameters to optimize performance of the machine learning model on the second machine learning task (Marcheret, [Par.0024], “In accordance with some embodiments, an improved classification model may be generated by identifying similarities between unlabeled test data and labeled training data. For example, by understanding features of the unlabeled test data that are similar to features in the labeled training data, the labeled training data may be reweighted and the classification model may be retrained to improve performance of the classification model for the particular use case with the distribution of unlabeled test data. The labeled training data may be reweighted, for example, by modifying one or more weight values associated with the items of labeled training data. The classification model may be retrained in an unsupervised fashion to achieve a desired level of performance” Examiner’s note: in the second machine learning task, the machine learning model is retrained using the reweighted labeled training data to improve the performance of the classification model.)
[…],
wherein adjusting the first values of the parameters comprises adjusting the first values of the parameters to optimize an objective function […] to the first machine learning task (Marcheret, [Par.0029], “In some other use cases, the similarity model may be used to improve the performance of a classification model for a particular distribution of unlabeled test data. For example, the similarity model may be used to reweight the labeled training data and the classification model trained using the labeled training data may be retrained using the reweighted training data to improve performance of the classification model for the environment having the distribution of unlabeled test data.” Examiner’s note: the classification model is retrained using the labeled training data with the new weight values to achieve the desired performance. Achieving the desired performance is considered optimizing an objective function.)
However, Marcheret does not teach optimizing an objective function that depends in part on a penalty term that is based on the determined measures of importance of the plurality of parameters.
On the other hand, Jamaluddin teaches optimizing an objective function that depends in part on a penalty term that is based on the determined measures of importance of the plurality of parameters (Jamaluddin, [page 941-942], “
[Image: media_image1.png]
…where | aj | represents the absolute value of the parameter for term j and penalty is a fixed value termed penalty function parameter. The penalty function penalizes terms with the absolute values of the estimated parameter less than the penalty. This is applied so that models that are more parsimonious may be selected over the those that are more accurate but contain many terms.
In Jamaluddin et al. (2007), a trial-and-error approach was adopted in the selection of the penalty function parameter value based on the knowledge that as the value increases, model structures with fewer terms have lower OF. This is true as model structures with more terms, given that ill-conditioning does not occur, have lower residual values but many parameters that are small and considered insignificant to the model’s predictive accuracy, as based on the parameter in Equation (3c).” Examiner’s note: the value of the OF (objective function) changes based on the determined penalty parameter value. The estimated parameters whose absolute values are less than the penalty are considered the important parameters that can optimize (lower) the objective function (OF)).
Marcheret and Jamaluddin are analogous art because they are in the same field of endeavor of using machine learning to achieve a desired level of performance.
Accordingly, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the adjusting of the parameter values to optimize the objective function taught by Marcheret, in view of Jamaluddin, by optimizing an objective function that depends in part on a penalty term that is based on the determined measures of importance of the plurality of parameters. The modification would have been obvious because one of ordinary skill in the art would have been motivated to optimize the objective function during training of the machine learning model (Jamaluddin, [Page 940, introduction section], “System identification is a method of recognizing the characteristics of a system, thus producing a quantitative input-output relationship that explains or resembles the system’s dynamics. The procedure involves the interpretation of observed or measure data into a physical relationship, often and easily interpreted in the form of mathematical models (Johansson, 1993). Besides other stages in system identification (i.e. data acquisition, parameter estimation and model validation), model structure selection requires a loss function, also called an objective function (OF), that evaluates the optimality of the model. Hereinafter, only the term objective function will be used such that a lower OF indicates better optimality.”).
However, Marcheret and Jamaluddin do not clearly teach optimizing performance of the machine learning model on the second machine learning task while protecting performance of the machine learning model on the first machine learning task.
On the other hand, Rousset teaches optimizing performance of the machine learning model on the second machine learning task while protecting performance of the machine learning model on the first machine learning task (Rousset, [Abstract], “We explore a dual-network architecture with self-refreshing memory (Ans and Rousset 1997) which overcomes catastrophic forgetting in sequential learning tasks… With a self-refreshing memory network knowledge can be saved for a long time and therefore reused in subsequent acquisitions” Examiner’s note: the dual-network architecture with self-refreshing memory overcomes catastrophic forgetting in sequential learning tasks, i.e., the network is trained on a subsequent (second) machine learning task without forgetting what the machine learning model learned on the first machine learning task. The self-refreshing memory is able to save/protect the knowledge learned from the previous learning task, as can be seen at Rousset, [page 2, the third paragraph], “To describe the basic principles of this mechanism, consider a series of item sets which have to be learned sequentially (set A, next set B, next set C, . . . , etc.) by a feedforward multilayer network using a gradient descent learning algorithm. Each set contains a number (that can be reduced to only one) of input–target items which have to be associated after learning. Once the learning of the first set A of associative pairs is completed and before the learning of the second set B starts, the network is stimulated by random input patterns, each generating a corresponding output pattern. These input–output pairs are successively stored in a pseudopopulation which is then considered as having captured something reflecting the set A structure. During the learning of the second set B, the network is concurrently trained on the input–output pairs previously stored in the pseudopopulation. These last pairs are seen as pseudoassociations reflecting the old knowledge.
Learning the second set is considered as being completed when a learning criterion is reached for all set B input–output pairs (the pseudo input–output are not subject to a learning criterion). The same process applies again for the learning of the third set C: before learning set C a pseudopopulation has to be built up, hence capturing some representation of the A–B structure, and then the new set is trained in conjunction with the refreshed A–B pseudo-knowledge. The other sequentially learned new sets will then be processed in the same way.”).
Marcheret, Jamaluddin and Rousset are analogous art because they are in the same field of endeavor of using machine learning to train on multiple machine learning tasks.
Accordingly, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Marcheret and Jamaluddin, in view of Rousset, by optimizing performance of the machine learning model on the second machine learning task while protecting performance of the machine learning model on the first machine learning task. The modification would have been obvious because one of ordinary skill in the art would have been motivated to improve continuous learning without forgetting the old task (Rousset, [page 16, the last paragraph], “This result was obtained in the framework of arithmetical metaphors which were taken as representative examples of structured sets of items. It is worth noting that the efficiency of sequential learning stems from the fact that the self-refreshing memory process makes it possible to maintain previously learned knowledge, hence improving transfer during subsequent learning of related tasks. What a network (with self-refreshing memory) knows about something will be saved for a long time and therefore possibly reused in subsequent acquisitions of other things. This contrasts with sequential learning without pseudorehearsal, where old knowledge is likely to be destroyed as a network is faced with new acquisitions. In this case, since previously learnt knowledge is lost it cannot be obviously reused.”).
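For illustration only, the pseudopopulation mechanism described in the quoted passage of Rousset (page 2) can be sketched as follows. This is a minimal sketch under stated assumptions, not the dual-network architecture of the reference: it uses a single-layer sigmoid network instead of a multilayer network, a fixed number of epochs instead of a learning criterion, and hypothetical names (`forward`, `sgd_step`, `build_pseudopopulation`, `train`), item sets, and hyperparameters introduced here.

```python
import math
import random


def forward(weights, x):
    """Single-layer sigmoid network: one output unit per weight row."""
    return [1.0 / (1.0 + math.exp(-sum(w * xi for w, xi in zip(row, x))))
            for row in weights]


def sgd_step(weights, x, target, lr):
    """One gradient-descent step on squared error for one input-target pair."""
    out = forward(weights, x)
    for row, o, t in zip(weights, out, target):
        delta = (o - t) * o * (1.0 - o)
        for i, xi in enumerate(x):
            row[i] -= lr * delta * xi


def build_pseudopopulation(weights, n_items, n_inputs, rng):
    """Stimulate the trained network with random inputs and store its outputs."""
    pseudo = []
    for _ in range(n_items):
        x = [float(rng.randint(0, 1)) for _ in range(n_inputs)]
        pseudo.append((x, forward(weights, x)))
    return pseudo


def train(weights, items, pseudo=(), epochs=200, lr=0.5):
    """Train on the new items, interleaved with pseudo-items when provided."""
    for _ in range(epochs):
        for x, t in list(items) + list(pseudo):
            sgd_step(weights, x, t, lr)
    return weights


# Hypothetical item sets A and B (one output unit, four input units).
rng = random.Random(0)
net = [[0.0, 0.0, 0.0, 0.0]]
set_a = [([1.0, 0.0, 0.0, 0.0], [1.0]), ([0.0, 1.0, 0.0, 0.0], [0.0])]
set_b = [([0.0, 0.0, 1.0, 0.0], [1.0]), ([0.0, 0.0, 0.0, 1.0], [0.0])]

train(net, set_a)                                    # learn set A
pseudo_a = build_pseudopopulation(net, 8, 4, rng)    # capture A's structure
train(net, set_b, pseudo_a)                          # learn B with rehearsal
```

In this sketch the old knowledge is protected by rehearsing stored input-output pairs alongside the new items; the sketch contains no penalty term and no per-parameter measure of importance.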
Regarding claim 2, Marcheret teaches the method of claim 1, wherein the first machine learning task and the second machine learning task are different supervised learning tasks (Marcheret, [Par.0005], “A classification model may be constructed using supervised training, in which inputs with labels identifying their known classes are used to train the classification model. The classification model is thereby able to learn how to correctly assign classes based on the labeled training data, and may then be used to determine the classes of unlabeled input for which the class is unknown.”).
Regarding claim 18, the claim is rejected for the same reasons as claim 1.
Additionally, Marcheret further discloses a system comprising one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform the operations (Marcheret, [Par.0007], “In a further embodiment, there is provided a tangible computer-recordable medium having a plurality of instruction embodied therein, wherein the plurality of instructions, when executed by a processor, cause a machine to perform a method of processing a first classification model that classifies an input into one of a plurality of classes”).
Regarding claim 19, the claim is rejected for the same reasons as claim 1.
Additionally, Marcheret discloses a computer storage medium encoded with instructions that, when executed by one or more computers, cause the one or more computers to perform the operations (Marcheret, [Par.0140], “In this respect, it should be appreciated that one implementation of embodiments of the present invention comprises at least one computer-recordable medium (e.g., a computer memory, a floppy disk, a compact disk, a magnetic tape, or other tangible, non-transitory computer-recordable medium) encoded with a computer program (i.e., a plurality of instructions) which, when executed on one or more processors, performs the above-discussed functions of one or more embodiments of the present invention”).
Regarding claim 4, Marcheret teaches the method of claim 1, wherein the objective function includes: (i) a first term that measures a performance of the machine learning model on the second machine learning task (Marcheret, [Par.0024], “In accordance with some embodiments, an improved classification model may be generated by identifying similarities between unlabeled test data and labeled training data. For example, by understanding features of the unlabeled test data that are similar to features in the labeled training data, the labeled training data may be reweighted and the classification model may be retrained to improve performance of the classification model for the particular use case with the distribution of unlabeled test data. The labeled training data may be reweighted, for example, by modifying one or more weight values associated with the items of labeled training data. The classification model may be retrained in an unsupervised fashion to achieve a desired level of performance” Examiner’s note: in the second machine learning task, the machine learning model is retrained using the reweighted labeled training data to improve the performance of the classification model.).
However, Marcheret does not teach (ii) the penalty term that imposes a penalty for parameter values deviating from the first parameter values, wherein the penalty term penalizes deviations from the first values more for parameters that were more important to the first machine learning task than for parameters that were less important to the first machine learning task.
On the other hand, Jamaluddin teaches (ii) the penalty term that imposes a penalty for parameter values deviating from the first parameter values, wherein the penalty term penalizes deviations from the first values more for parameters that were more important to the first machine learning task than for parameters that were less important to the first machine learning task (Jamaluddin, [Page 942], “where |aj| represents the absolute value of the parameter for term j and penalty is a fixed value termed penalty function parameter. The penalty function penalizes terms with the absolute values of the estimated parameter less than the penalty. This is applied so that models that are more parsimonious may be selected over the those that are more accurate but contain many terms. In Jamaluddin et al. (2007), a trial-and-error approach was adopted in the selection of the penalty function parameter value based on the knowledge that as the value increases, model structures with fewer terms have lower OF. This is true as model structures with more terms, given that ill-conditioning does not occur, have lower residual values but many parameters that are small and considered insignificant to the model’s predictive accuracy, as based on the parameter in Equation (3c).” Examiner’s note: a parameter whose absolute value is less than the penalty is considered a more important parameter, as compared with the remaining, less important parameters.).
Marcheret and Jamaluddin are analogous art because they are in the same field of endeavor of using machine learning to achieve a desired level of performance.
Accordingly, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the adjusting of the parameter values to optimize the objective function taught by Marcheret, in view of Jamaluddin, by having the penalty term impose a penalty for parameter values deviating from the first parameter values, wherein the penalty term penalizes deviations from the first values more for parameters that were more important to the first machine learning task than for parameters that were less important to the first machine learning task. The modification would have been obvious because one of ordinary skill in the art would have been motivated to optimize the objective function during training of the machine learning model (Jamaluddin, [Page 940, introduction section], “System identification is a method of recognizing the characteristics of a system, thus producing a quantitative input-output relationship that explains or resembles the system’s dynamics. The procedure involves the interpretation of observed or measure data into a physical relationship, often and easily interpreted in the form of mathematical models (Johansson, 1993). Besides other stages in system identification (i.e. data acquisition, parameter estimation and model validation), model structure selection requires a loss function, also called an objective function (OF), that evaluates the optimality of the model. Hereinafter, only the term objective function will be used such that a lower OF indicates better optimality.”).
Regarding claim 20, the claim is rejected for the same reasons as claim 4.
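For illustration only, the two-term objective function recited in claim 4 (a first term measuring performance on the second task, plus a penalty term that weights deviations from the first values by importance) can be sketched as follows. This is a minimal sketch under stated assumptions: the quadratic form of the penalty, the `strength` coefficient, and all names (`penalized_objective`, `task_loss`, `importance`) are hypothetical choices introduced here and are not taken from the claims or from the cited references.

```python
def penalized_objective(params, first_values, importance, task_loss, strength=1.0):
    """Second-task loss plus an importance-weighted penalty on parameter drift.

    The quadratic penalty is an assumed form: a deviation from a first value
    costs more when the corresponding importance is larger.
    """
    penalty = sum(
        f * (p - p0) ** 2
        for f, p, p0 in zip(importance, params, first_values)
    )
    return task_loss(params) + 0.5 * strength * penalty


# Hypothetical values: parameter 0 is far more important to the first task.
first_values = [1.0, -2.0]
importance = [10.0, 0.1]


def no_task_loss(params):
    """Placeholder second-task loss, fixed at zero to isolate the penalty."""
    return 0.0


# The same deviation of 0.5 is applied to each parameter in turn.
cost_important = penalized_objective([1.5, -2.0], first_values, importance, no_task_loss)
cost_unimportant = penalized_objective([1.0, -1.5], first_values, importance, no_task_loss)
```

With these hypothetical numbers, the same deviation costs 1.25 for the important parameter and 0.0125 for the unimportant one, which is the asymmetry the claim language describes.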

Regarding claim 9, Marcheret teaches the method of claim 1, further comprising: after training the machine learning model on the second machine learning task to determine second values of the parameters of the machine learning model (Marcheret, [Par.0006, lines 7-10], “obtaining unlabeled input for the first classification model; using the unlabeled input to reweight the labeled training data to have a second set of weights that is different from the first set of weights;” ):
[…]
However, Marcheret does not teach wherein adjusting the second values of the parameters comprises adjusting the second values of the parameters to optimize a second objective function that depends in part on a second penalty term that is based on the determined measures of importance of the plurality of parameters to the first machine learning task and on measures of importance of the plurality of parameters to the second machine learning task.
On the other hand, Jamaluddin teaches wherein adjusting the second values of the parameters comprises adjusting the second values of the parameters to optimize a second objective function that depends in part on a second penalty term that is based on the determined measures of importance of the plurality of parameters to the first machine learning task and on measures of importance of the plurality of parameters to the second machine learning task (Jamaluddin, [Page 950-951, Fig. 3], “With small penalty parameter values, the number of significant regressors is always higher until a certain penalty parameter value, hereby denoted switchover penalty, is reached. Beyond this point, the number of insignificant regressors rises. Within a certain range of this point, the number of true regressors, i.e. correct regressors in the simulated model, with parameters bigger than or equal to the penalty parameter value reaches an agreement with the identified number of significant regressors… 
[Image: media_image2.png]
” Exminer’s note, the plurality of the penalty term such as the penalty 1 and penalty 2. penalty parameter is increasing that having a lower objective function as it can be seeen at [page 942, Objective Function section], “In Jamaluddin et al. (2007), a trial-and-error approach was adopted in the selection of the penalty function parameter value based on the knowledge that as the value increases, model structures with fewer terms have lower OF.”).
Marcheret and Jamaluddin are analogous in arts because they have the same filed of endeavor of using a machine learning to achieve a performance desired level.
Accordingly, it would have been prima facie obvious to one of the ordinary skills in the art before the effective filing date of the claimed invention to have modified the adjusting the parameter values to optimize the objective function taught by Marcheret and  further in view of Jamaluddin by adjusting the second values of the parameters comprises adjusting the second values of the parameters to optimize a second objective function that depends in part on a second penalty term that is based on the determined measures of importance of the plurality of parameters to the first machine learning task and on measures of importance of the plurality of parameters to the second machine learning task. The modification would have been obvious because one of the ordinary skills in art would be motivate to optimizing the objective function during training of the machine learning model (Jamaluddin, [Page 940, introduction section], “System identification is a method of recognizing the characteristics of a system, thus producing a quantitative input-output relationship that explains or resembles the system’s dynamics. The procedure involves the interpretation of observed or measure data into a physical relationship, often and easily interpreted in the form of mathematical models (Johansson, 1993). Besides other stages in system identification (i.e. data acquisition, parameter estimation and model validation), model structure selection requires a loss function, also called an objective function (OF), that evaluates the optimality of the model. Hereinafter, only the term objective function will be used such that a lower OF indicates better optimality.”).
However, Marcheret and Jamaluddin do not teach obtaining third training data for training the machine learning model on a third, different machine learning task; and training the machine learning model on the third machine learning task by training the machine learning model on the third training data to adjust the second values of the parameters to optimize performance of the machine learning model on the third machine learning task while protecting performance of the machine learning model on the first machine learning task and the second machine learning task,
On the other hand, Rousett teaches obtaining third training data for training the machine learning model on a third, different machine learning task; and training the machine learning model on the third machine learning task by training the machine learning model on the third training data to adjust the second values of the parameters to optimize performance of the machine learning model on the third machine learning task while protecting performance of the machine learning model on the first machine learning task and the second machine learning task (Rousett, [Page 2, last paragraph], “During the learning of the second set B, the network is concurrently trained on the input–output pairs previously stored in the pseudopopulation. These last pairs are seen as pseudoassociations reflecting the old knowledge. Learning the second set is considered as being completed when a learning criterion is reached for all set B input–output pairs (the pseudo input–output are not subject to a learning criterion). The same process applies again for the learning of the third set C: before learning set C a pseudopopulation has to be built up, hence capturing some representation of the A –B structure, and then the new set is trained in conjunction with the refreshed A –B pseudo-knowledge. The other sequentially learned new sets will then be processed in the same way.”).
Marcheret, Jamaluddin and Rousset are analogous art because they are in the same field of endeavor of using machine learning to train on multiple machine learning tasks.
Accordingly, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Marcheret and Jamaluddin, further in view of Rousset, by protecting the performance of the machine learning model while training on the other machine learning task. The modification would have been obvious because one of ordinary skill in the art would have been motivated to improve continuous learning without forgetting the old task (Rousset, [page 16, the last paragraph], “This result was obtained in the framework of arithmetical metaphors which were taken as representative examples of structured sets of items. It is worth noting that the efficiency of sequential learning stems from the fact that the self-refreshing memory process makes it possible to maintain previously learned knowledge, hence improving transfer during subsequent learning of related tasks. What a network (with self-refreshing memory) knows about something will be saved for a long time and therefore possibly reused in subsequent acquisitions of other things. This contrasts with sequential learning without pseudorehearsal, where old knowledge is likely to be destroyed as a network is faced with new acquisitions. In this case, since previously learnt knowledge is lost it cannot be obviously reused.”).
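For illustration only, the pseudopopulation procedure described in the quoted Rousset passage can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from the reference: the helper names `build_pseudopopulation` and `interleave`, the random binary inputs, and the stand-in `old_network` function are all hypothetical, and the actual network training step is omitted.

```python
import random


def build_pseudopopulation(network, n_items, input_size, rng):
    """Record what the already-trained network outputs for random inputs.

    Each (input, output) pair is a pseudo-association that reflects the
    old knowledge without storing the original training data.
    """
    pairs = []
    for _ in range(n_items):
        x = [rng.choice([0.0, 1.0]) for _ in range(input_size)]
        pairs.append((x, network(x)))
    return pairs


def interleave(new_pairs, pseudo_pairs, rng):
    """Mix the new set with the pseudo pairs so both are trained together."""
    mixed = list(new_pairs) + list(pseudo_pairs)
    rng.shuffle(mixed)
    return mixed


def old_network(x):
    """Hypothetical stand-in for a network already trained on set A."""
    return [sum(x) / len(x)]


rng = random.Random(0)
pseudo = build_pseudopopulation(old_network, n_items=4, input_size=3, rng=rng)
set_b = [([1.0, 0.0, 1.0], [1.0])]  # hypothetical new input-output pair
training_stream = interleave(set_b, pseudo, rng)
```

In this sketch the learning criterion would be checked only on the `set_b` pairs, consistent with the quoted statement that the pseudo input–output pairs are not subject to a learning criterion.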
Regarding claim 15, Marcheret teaches the method of claim 1, the method further comprising providing the trained machine learning model for use in processing data after training the machine learning model on the second machine learning task (Marcheret, [Par. 0023], “Unlabeled test data may differ in some way from the labeled training data used to train the classification model. For example, a classification model may be trained on labeled data created by an ASR system processing the speech of one or more users. The classification model could then be used to classify unlabeled input comprising speech from one or more new users. The new user(s) may have speech characteristics and/or may be speaking in one or more environments that differ from the speech characteristics and/or speaking environments of the one or more users used to train the classification model, which may result in the classification model performing differently than expected on the test data.”)
Regarding claim 17, Marcheret teaches the method of claim 1, wherein the first and second machine learning tasks each comprise a classification task, and wherein the classification task is processing data to classify the data (Marcheret, [Par.0022-Par.0023], “The inventor has recognized and appreciated that the distribution of labeled training data (sometimes referred to simply as 'training data') used to train a classification model and the distribution of unlabeled input data … Unlabeled test data may differ in some way from the labeled training data used to train the classification model. For example, a classification model may be trained on labeled data created by an ASR system processing the speech of one or more users…”)
Claims 6, 10, 11, and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Marcheret et al. (Pub. No. US 2013/0254153 – hereinafter, Marcheret) in view of Jamaluddin et al. (NPL: Effect of Penalty Function Parameter in Objective Function of System Identification- Accelerating the world’s research-hereinafter, Jamaluddin) and further in view of Rousett et al. (NPL: Neural networks with a self-refreshing memory: knowledge transfer in sequential learning tasks without catastrophic...-Connection science-hereinafter, Rousett) and further in view of Cao et al. (NPL: A Practical Transfer Learning Algorithm for Face Verification- hereinafter, Cao).
Regarding claim 6, Marcheret teaches the method of claim 4, wherein the second term depends on, for each of the plurality of parameters (Marcheret, [Par.0047, lines 1-4], “Labeled training data 420 may be associated with a set of weights that may be used to specify the relative contribution of each item of labeled training data to the classification model 410.”),
a product of the respective measure of importance of the parameter and a difference between the current value of the parameter and the first value of the parameter (Marcheret, [Par.0063-0064], “where θ represents the set of parameters associated with the underlying Gaussian components of the two GMMs used to model the training data and the test data (i.e., the mixture weights and Gaussian densities of the Gaussian components as shown in Equations 3 and 4, and the mean and covariance of each Gaussian component), and w are parameters which describe the function being used to model the classifier p(y_u | x_u) (i.e., parameters which parameterize the classifier used to determine the probability of a class for a given data point, which include weight values… consequently also represents the probability that the model matches the combined training and test data for given values of the GMM parameters θ and classifier parameters w. In the M step, the GMM parameters are adjusted in order to maximize the log likelihood calculated in the E step. The E step is then performed again, then the M step, and so on until the log likelihood converges. At this stage the maximally likely GMM parameters and classifier parameters have been determined.” Examiner’s note: the calculation of the parameter values for the training data set and the test set is considered as the comparison between the current value of the parameter and the first value of the parameter).
Claim 11 is rejected for the same reasons as claim 6.
Regarding claim 12, Marcheret teaches a product of (i) the respective measure of importance of the parameter to the second machine learning task (Marcheret, [Par.0047, lines 1-4], “Labeled training data 420 may be associated with a set of weights that may be used to specify the relative contribution of each item of labeled training data to the classification model 410.”), 
and (ii) a difference between the current value of the parameter and the second value of the parameter (Marcheret, [Par.0063-0064], “where θ represents the set of parameters associated with the underlying Gaussian components of the two GMMs used to model the training data and the test data (i.e., the mixture weights and Gaussian densities of the Gaussian components as shown in Equations 3 and 4, and the mean and covariance of each Gaussian component), and w are parameters which describe the function being used to model the classifier p(y_u | x_u) (i.e., parameters which parameterize the classifier used to determine the probability of a class for a given data point, which include weight values… consequently also represents the probability that the model matches the combined training and test data for given values of the GMM parameters θ and classifier parameters w. In the M step, the GMM parameters are adjusted in order to maximize the log likelihood calculated in the E step. The E step is then performed again, then the M step, and so on until the log likelihood converges. At this stage the maximally likely GMM parameters and classifier parameters have been determined.” Examiner’s note: the calculation of the parameter values for the training data set and the test set is considered as the comparison between the current value of the parameter and the first value of the parameter).
However, Marcheret does not teach wherein the second penalty term depends on, for each of the plurality of parameters.
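For illustration only, the per-parameter term recited in claims 6 and 11 (a product of the measure of importance and a difference between the current value and the first value of the parameter) can be sketched as follows. The sketch is not drawn from Marcheret or any other cited reference; the function name `importance_weighted_penalty`, the squaring of the difference, and the sample numbers are assumptions made for this example.

```python
def importance_weighted_penalty(current, first, importance):
    """Sum over parameters of importance * (current - first) ** 2.

    Squaring the difference is an assumption of this sketch; the claim
    language only recites a product of the importance measure and a
    difference between the current value and the first value.
    """
    if not (len(current) == len(first) == len(importance)):
        raise ValueError("all inputs must have one entry per parameter")
    return sum(f * (c - a) ** 2 for c, a, f in zip(current, first, importance))


# Hypothetical values: the first parameter is important to the first task,
# the second parameter is not.
first_values = [1.0, -2.0]
importance = [10.0, 0.1]

# Moving the important parameter by 0.5 costs more than moving the
# unimportant parameter by the same amount.
cost_important = importance_weighted_penalty([1.5, -2.0], first_values, importance)
cost_unimportant = importance_weighted_penalty([1.0, -1.5], first_values, importance)
```

With these hypothetical numbers, the same deviation is penalized more heavily for the parameter with the larger importance measure.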
On the other hand, Jamaluddin teaches wherein the second penalty term depends on, for each of the plurality of parameters (Jamaluddin, [Page 950-0951, Fig. 3], “With small penalty parameter values, the number of significant regressors is always higher until a certain penalty parameter value, hereby denoted switchover penalty, is reached. Beyond this point, the number of insignificant regressors rises. Within a certain range of this point, the number of true regressors, i.e. correct regressors in the simulated model, with parameters bigger than or equal to the penalty parameter value reaches an agreement with the identified number of significant regressors… 
[Image: media_image2.png (greyscale)]
” Examiner’s note: the plurality of penalty terms, such as penalty 1 and penalty 2, is generated based on the given penalty parameter values.),
Marcheret and Jamaluddin are analogous art because they are in the same field of endeavor of using machine learning to achieve a desired level of performance.
Accordingly, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the adjusting of the parameter values to optimize the objective function taught by Marcheret, in view of Jamaluddin, by having the second penalty term depend on, for each of the plurality of parameters. The modification would have been obvious because one of ordinary skill in the art would have been motivated to optimize the objective function during training of the machine learning model (Jamaluddin, [Page 940, Introduction section], “System identification is a method of recognizing the characteristics of a system, thus producing a quantitative input-output relationship that explains or resembles the system’s dynamics. The procedure involves the interpretation of observed or measure data into a physical relationship, often and easily interpreted in the form of mathematical models (Johansson, 1993). Besides other stages in system identification (i.e. data acquisition, parameter estimation and model validation), model structure selection requires a loss function, also called an objective function (OF), that evaluates the optimality of the model. Hereinafter, only the term objective function will be used such that a lower OF indicates better optimality.”).
Regarding claim 10, Marcheret teaches the method of claim 9, further comprising: determining, for each of the plurality of parameters, a respective measure of an importance of the parameter to the machine learning model achieving acceptable performance on the second machine learning task, comprising: determining an approximation of a probability that the second value of the parameter after the training on the second machine learning task is a correct value of the parameter given the second training data used to train the machine learning model (Marcheret, [Par.0063-0064], “Using this notation we may write the joint distribution, which represents the probability that the two GMMs being used to model the training data and the test data match the actual combined training and test data, given the parameters of both GMMs and the classifier's parameters… the GMM parameters are adjusted in order to maximize the log likelihood calculated in the E step. The E step is then performed again, then the M step, and so on until the log likelihood converges. At this stage the maximally likely GMM parameters and classifier parameters have been determined.” Examiner’s note: by this method of calculating/maximizing, the GMM parameters and classifier parameters are determined, wherein the determined classifier parameter is considered as the given parameter.);
However, Marcheret does not teach and wherein the second objective function that includes: (i) a first term that measures a performance of the machine learning model on the third machine learning task and (ii) a second term that imposes a penalty for parameter values deviating from the first parameter values, wherein the second term penalizes deviations from the first values more for parameters that were more important in achieving acceptable performance on the first machine learning task than for parameters that were less important in achieving acceptable performance on the first machine learning task, and (iii) the second penalty term that imposes a penalty for parameter values deviating from the second parameter values, wherein the third term penalizes deviations from the second values more for parameters that were more important in achieving acceptable performance on the second machine learning task than for parameters that were less important in achieving acceptable performance on the second machine learning task.
On the other hand, Jamaluddin teaches […] and (iii) the second penalty term that imposes a penalty for parameter values deviating from the second parameter values, wherein the third term penalizes deviations from the second values more for parameters that were more important in achieving acceptable performance on the second machine learning task than for parameters that were less important in achieving acceptable performance on the second machine learning task (Jamaluddin, [Page 942], “where |aj| represents the absolute value of the parameter for term j and penalty is a fixed value termed penalty function parameter. The penalty function penalizes terms with the absolute values of the estimated parameter less than the penalty. This is applied so that models that are more parsimonious may be selected over the those that are more accurate but contain many terms. In Jamaluddin et al. (2007), a trial-and-error approach was adopted in the selection of the penalty function parameter value based on the knowledge that as the value increases, model structures with fewer terms have lower OF. This is true as model structures with more terms, given that ill-conditioning does not occur, have lower residual values but many parameters that are small and considered insignificant to the model’s predictive accuracy, as based on the parameter in Equation (3c).” Examiner’s note: comparing a parameter against the penalty is considered as distinguishing parameters that are more important from parameters that are less important.).
Marcheret and Jamaluddin are analogous art because they are in the same field of endeavor of using machine learning to achieve a desired level of performance.
Accordingly, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the adjusting of the parameter values to optimize the objective function taught by Marcheret, in view of Jamaluddin, by having the second penalty term that imposes a penalty for parameter values deviating from the second parameter values, wherein the third term penalizes deviations from the second values more for parameters that were more important in achieving acceptable performance on the second machine learning task than for parameters that were less important in achieving acceptable performance on the second machine learning task. The modification would have been obvious because one of ordinary skill in the art would have been motivated to optimize the objective function during training of the machine learning model (Jamaluddin, [Page 940, Introduction section], “System identification is a method of recognizing the characteristics of a system, thus producing a quantitative input-output relationship that explains or resembles the system’s dynamics. The procedure involves the interpretation of observed or measure data into a physical relationship, often and easily interpreted in the form of mathematical models (Johansson, 1993). Besides other stages in system identification (i.e. data acquisition, parameter estimation and model validation), model structure selection requires a loss function, also called an objective function (OF), that evaluates the optimality of the model. Hereinafter, only the term objective function will be used such that a lower OF indicates better optimality.”).
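For illustration only, the penalty function described in the quoted Jamaluddin passage (terms whose estimated parameter has an absolute value less than the penalty are penalized, so that more parsimonious models can have a lower OF) can be sketched as follows. The exact form of the objective function is not reproduced in the quoted text, so adding the penalty value once per small term to a sum of squared residuals is an assumption of this sketch, as are the function name `penalized_objective` and the sample numbers.

```python
def penalized_objective(residuals, parameters, penalty):
    """Sum of squared residuals plus a fixed penalty per small term.

    A term counts as small when the absolute value of its estimated
    parameter is less than the penalty function parameter. Charging
    exactly `penalty` per small term is an assumption of this sketch.
    """
    sum_squared_residuals = sum(r * r for r in residuals)
    n_small_terms = sum(1 for a in parameters if abs(a) < penalty)
    return sum_squared_residuals + penalty * n_small_terms


penalty = 0.1

# Hypothetical model with many terms: lower residuals, two tiny parameters.
of_many_terms = penalized_objective([0.1, -0.2], [0.9, 0.05, -0.01], penalty)

# Hypothetical parsimonious model: slightly larger residuals, one term.
of_few_terms = penalized_objective([0.15, -0.25], [0.9], penalty)
```

With these hypothetical numbers the model with fewer terms has the lower OF even though its residuals are larger, which is the behavior the quoted passage describes.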
Marcheret and Jamaluddin do not teach and wherein the second objective function that includes: (i) a first term that measures a performance of the machine learning model on the third machine learning task and (ii) a second term that imposes a penalty for parameter values deviating from the first parameter values, wherein the second term penalizes deviations from the first values more for parameters that were more important in achieving acceptable performance on the first machine learning task than for parameters were less important in achieving acceptable performance on the first machine learning task,
On the other hand, Rousett teaches and wherein the second objective function that includes: (i) a first term that measures a performance of the machine learning model on the third machine learning task (Rousett, [Page 2, last paragraph], “During the learning of the second set B, the network is concurrently trained on the input–output pairs previously stored in the pseudopopulation. These last pairs are seen as pseudoassociations reflecting the old knowledge. Learning the second set is considered as being completed when a learning criterion is reached for all set B input–output pairs (the pseudo input–output are not subject to a learning criterion). The same process applies again for the learning of the third set C: before learning set C a pseudopopulation has to be built up, hence capturing some representation of the A –B structure, and then the new set is trained in conjunction with the refreshed A –B pseudo-knowledge. The other sequentially learned new sets will then be processed in the same way.”).
Marcheret, Jamaluddin and Rousset are analogous art because they are in the same field of endeavor of using machine learning to train on multiple machine learning tasks.
Accordingly, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Marcheret and Jamaluddin, further in view of Rousset, by having the second objective function include: (i) a first term that measures a performance of the machine learning model on the third machine learning task. The modification would have been obvious because one of ordinary skill in the art would have been motivated to improve continuous learning without forgetting the old task (Rousset, [page 16, the last paragraph], “This result was obtained in the framework of arithmetical metaphors which were taken as representative examples of structured sets of items. It is worth noting that the efficiency of sequential learning stems from the fact that the self-refreshing memory process makes it possible to maintain previously learned knowledge, hence improving transfer during subsequent learning of related tasks. What a network (with self-refreshing memory) knows about something will be saved for a long time and therefore possibly reused in subsequent acquisitions of other things. This contrasts with sequential learning without pseudorehearsal, where old knowledge is likely to be destroyed as a network is faced with new acquisitions. In this case, since previously learnt knowledge is lost it cannot be obviously reused.”).
However, Marcheret, Jamaluddin and Rousset do not teach and (ii) a second term that imposes a penalty for parameter values deviating from the first parameter values, wherein the second term penalizes deviations from the first values more for parameters that were more important in achieving acceptable performance on the first machine learning task than for parameters that were less important in achieving acceptable performance on the first machine learning task.
On the other hand, Cao teaches and (ii) a second term that imposes a penalty for parameter values deviating from the first parameter values, wherein the second term penalizes deviations from the first values more for parameters that were more important in achieving acceptable performance on the first machine learning task than for parameters that were less important in achieving acceptable performance on the first machine learning task (Cao, [Section 4.1, first paragraph], “Given the basic Joint Bayesian model with parameters Θ𝑠 = {𝑆𝜇, 𝑆𝜖} fitted to source-domain data, and a handful of labeled target-domain data 𝒳, the underlying goal is to learn a new model with analogous parameters Θ𝑡 = {𝑇𝜇, 𝑇𝜖} that adequately reflects both domains, and in particular, generalizes to new target-domain data. In the absence of source-domain data, the unknown parameter Θ𝑡 could be estimated by optimizing log𝑝(𝒳∣Θ𝑡) over 𝒳, where the likelihood model for 𝒳 is simply analogous to the Joint Bayesian one. Of course when the available training samples are limited, over-fitting is likely and generalization performance on unseen data will be poor. When additional source-domain data are accessible, however, we may ameliorate the risk of over-fitting by including an additional regularizer, or prior, that penalizes deviations from the distribution of source data. From an information-theoretic perspective, the KL divergence, which quantifies the information lost when we approximate the target-domain distribution with the source-domain distribution, represents a useful candidate regularizer for this task. 
After combining with the log-likelihood term, this results in the optimization problem…where the parameter 𝜆 balances the relative importance between the new observations and the prior knowledge.” Examiner’s note: the model generalizes to new target-domain data for a specific learning task by using a term that penalizes deviations, and the parameter 𝜆 balances the importance of the new observations against the prior knowledge.).
Marcheret, Jamaluddin, Rousett and Cao are analogous art because they are in the same field of endeavor of using machine learning to train on multiple machine learning tasks.
Accordingly, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Marcheret, Jamaluddin and Rousett, further in view of Cao, by having a second term that imposes a penalty for parameter values deviating from the first parameter values, wherein the second term penalizes deviations from the first values more for parameters that were more important in achieving acceptable performance on the first machine learning task than for parameters that were less important in achieving acceptable performance on the first machine learning task. The modification would have been obvious because one of ordinary skill in the art would have been motivated by related transfer learning purposes (Cao, [Section 4.1, first paragraph], “When additional source-domain data are accessible, however, we may ameliorate the risk of over-fitting by including an additional regularizer, or prior, that penalizes deviations from the distribution of source data. From an information-theoretic perspective, the KL divergence, which quantifies the information lost when we approximate the target-domain distribution with the source-domain distribution, represents a useful candidate regularizer for this task. After combining with the log-likelihood term, this results in the optimization problem…where the parameter 𝜆 balances the relative importance between the new observations and the prior knowledge…The KL divergence, as well as alternative penalties based on Bregman divergences and maximum mean discrepancy, have been motivated for related transfer learning purposes [7, 23, 16, 20, 24, 9], although not in combination with a likelihood function as we have done here”).
Claims 3 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Marcheret et al. (Pub. No. US 2013/0254153 – hereinafter, Marcheret) in view of Jamaluddin et al. (NPL: Effect of Penalty Function Parameter in Objective Function of System Identification- Accelerating the world’s research-hereinafter, Jamaluddin) and further in view of Rousett et al. (NPL: Neural networks with a self-refreshing memory: knowledge transfer in sequential learning tasks without catastrophic...-Connection science-hereinafter, Rousett) and further in view of Aslan et al. (Pub. No. US20170132528– hereinafter, Aslan).
Regarding claim 3, Marcheret as modified in view of Jamaluddin, Rousset and Aslan teaches the method of claim 1, wherein the first machine learning task and the second machine learning tasks are different reinforcement learning tasks (Aslan, [Par.0024, lines 9-14], “However, the training data 104 may be unlabeled in some implementations, such that the machine learning models 100 and/or 102 can be trained using any suitable learning technique, such as supervised learning, unsupervised learning, semi-supervised learning, reinforcement learning, and so on.”  And [Par.0025, lines 3-7], “The task learned by the first model 100 can be the same task as the task learned by the second model 102, or each model 100 and 102 can learn related (or complimentary) tasks, meaning that the tasks can differ slightly between the models 100 and 102” Examiner’s note: the different machine learning models learn different tasks, and the training uses a learning technique such as supervised learning or reinforcement learning. Therefore, the first and second machine learning tasks are different.).
Marcheret, Jamaluddin, Rousett and Aslan are analogous art because they are in the same field of endeavor of using machine learning to train on multiple machine learning tasks.
Accordingly, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Marcheret, Jamaluddin and Rousett, further in view of Aslan, by having the first machine learning task and the second machine learning task be different reinforcement learning tasks. The modification would have been obvious because one of ordinary skill in the art would have been motivated to train on multiple machine learning tasks (Aslan, [Par.0024, lines 1-14], “The training data 104 can be stored in a database or repository of any suitable data, such as image data, speech data, text data, video data, or any other suitable type of data that can be processed by the machine learning models 100 and 102. For example, the training data 104 can comprise a repository of images that are to be classified or labeled by the machine learning models 100 and/or 102. The training data 104 can further include at least two additional components: features and labels. However, the training data 104 may be unlabeled in some implementations, such that the machine learning models 100 and/or 102 can be trained using any suitable learning technique, such as supervised learning, unsupervised learning, semi-supervised learning, reinforcement learning, and so on.”)
Regarding claim 14, Marcheret as modified in view of Jamaluddin, Rousset and Aslan teaches the method of claim 13, wherein identifying when switching from one machine learning task to another comprises inferring which task is being performed from one or more models (Aslan, [Par.0059], “At 602, a set of multiple machine learning models, such as the first model 100 and the second model 102 of FIG. 1, can be provided. Each of the machine learning models in the set can be capable of learning a task, such as a classification task (binary or multi-label), a regression task to infer a set of probabilities based on unknown input data, or any other suitable machine learning task.” Examiner’s note, the machine learning task is trained on one or more machine models such as first and second machine learning models.).
Marcheret, Jamaluddin, Rousett and Aslan are analogous art because they are in the same field of endeavor of using machine learning to train on multiple machine learning tasks.
Accordingly, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Marcheret, Jamaluddin and Rousett, further in view of Aslan, by having the task being performed inferred from one or more models. The modification would have been obvious because one of ordinary skill in the art would have been motivated to train a machine learning task performed on one or more models (Aslan, [Par.0059], “At 602, a set of multiple machine learning models, such as the first model 100 and the second model 102 of FIG. 1, can be provided. Each of the machine learning models in the set can be capable of learning a task, such as a classification task (binary or multi-label), a regression task to infer a set of probabilities based on unknown input data, or any other suitable machine learning task.”).
Claims 5, 8, 13 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Marcheret et al. (Pub. No. US 2013/0254153 – hereinafter, Marcheret) in view of Jamaluddin et al. (NPL: Effect of Penalty Function Parameter in Objective Function of System Identification- Accelerating the world’s research-hereinafter, Jamaluddin) and further in view of Rousett et al. (NPL: Neural networks with a self-refreshing memory: knowledge transfer in sequential learning tasks without catastrophic...-Connection science-hereinafter, Rousett) and further in view of Sinyavskiy et al. (Patent No.: US9146546-hereinafter, Sinyavskiy).
Regarding claim 5, Marcheret teaches the method of claim 4, wherein training the machine learning model on the training data comprises, for each training example in the training data: processing the training example using the machine learning model in accordance with current values of parameters of the machine learning model to determine a model output (Marcheret, [Par.0047, lines 8-21], “The weighting of the labeled training data may have been performed in any suitable way, as aspects of the present invention described herein are not limited to use with models that are built using training data that is weighted in any particular way and can, for example, be used with models wherein the training data is all evenly weighted. For example, each item of labeled training data may be associated with one weight value such that each weight value indicates each item's relative contribution to the output of the classification model. As an alternate example, each item of labeled training data may be associated with a plurality of weight values such that each feature of each item of labeled training data has an indicated relative contribution to the output of the classification model.”).
However, Marcheret, Jamaluddin and Rousset do not teach determining a gradient of the objective function using the model output, a target output for the training example, the current values of the parameters of the machine learning model, and the first values of the parameters of the machine learning model; and adjusting the current values of the parameters using the gradient to optimize the objective function.
On the other hand, Sinyavskiy teaches determining a gradient of the objective function using the model output, a target output for the training example, the current values of the parameters of the machine learning model, and the first values of the parameters of the machine learning model (Sinyavskiy, [Column 3, lines 20-26], “Some existing learning rules for the supervised learning may rely on the gradient of the performance function. The gradient for reinforcement learning part may be implemented through the use of the adaptive critic; the gradient for supervised learning may be implemented by taking a difference between the supervisor signal and the actual output of the controller… Additional analytic derivation of the learning rules may be needed when the loss function between supervised and actual output signal is redefined.” and [Column 3, lines 49-54], “Some of the existing approaches of taking a derivative of a performance function without analytic calculations may include a "brute force" finite difference estimator of the gradient. However, these estimators may be impractical for use with large spiking networks comprising many (typically in excess of hundreds) parameters.” Examiner’s note: the gradient of the performance function (i.e., the gradient of the objective function) is determined based on the supervisor signal, the actual output, and the parameter values. Furthermore, see [Column 4, lines 37-39], “One common approach is to describe the task in terms of optimization of some function and then use gradient approaches in the parameter space of the spiking neuron.” Examiner’s note: the gradient is determined for each learning task; therefore, the first values of the parameters of the machine learning model are used in the training.);
and adjusting the current values of the parameters using the gradient to optimize the objective function (Sinyavskiy, [Column 16, lines 29-38], “parameters including connection efficacy, firing threshold, resting potential of the neuron, and/or other parameters. The analytical relationship of Eqn. 1 may be selected such that the gradient of ln[p(y|x,w)] with respect to the system parameter w exists and can be calculated. The framework shown in FIG. 3 may be configured to estimate rules for changing the system parameters (e.g., learning rules) so that the performance function F(x,y,r) is minimized for the current set of inputs and outputs and system dynamics S.” Furthermore, see [Column 35, lines 34-39], “At step 816 learning parameter w update may be determined by the Parameter Adjustment block (e.g., block 426b of FIG. 4) using the performance function F and the gradient g, determined at steps 812, 814 respectively. In some implementations, the learning parameter update may be implemented according to Eqns. 22-31.” Examiner’s note: during the training, the specific parameter (the current parameter) is updated. As is well known in the art, the parameter is updated to optimize the objective function (i.e., to improve the performance behavior).).
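For illustration only, the two claim 5 limitations discussed above (determining a gradient from the model output, the target output, the current parameter values, and the first parameter values, then adjusting the current values) can be sketched as follows. This is a minimal sketch and is not taken from Sinyavskiy or from the application: the linear model, the squared-error loss, the quadratic form of the penalty, and the names `penalized_gradient_step`, `lam`, and `lr` are all assumptions made for the example.

```python
def model_output(params, x):
    # Hypothetical linear model: output is the dot product of parameters and inputs.
    return sum(w * xi for w, xi in zip(params, x))


def penalized_gradient_step(params, first_values, importance, x, target,
                            lam=1.0, lr=0.1):
    """One gradient step on the assumed objective
    0.5 * (output - target)**2 + (lam / 2) * sum_i importance[i] * (params[i] - first_values[i])**2.
    """
    error = model_output(params, x) - target
    new_params = []
    for theta, theta_first, weight, xi in zip(params, first_values, importance, x):
        task_grad = error * xi                            # uses model output and target output
        penalty_grad = lam * weight * (theta - theta_first)  # uses current and first values
        new_params.append(theta - lr * (task_grad + penalty_grad))
    return new_params


if __name__ == "__main__":
    # The first parameter is marked important, so it is pulled back toward its first value.
    print(penalized_gradient_step([2.0, 2.0], [1.0, 0.0], [10.0, 0.0],
                                  x=[0.0, 0.0], target=0.0))
```

In this sketch a parameter with a larger importance value is held closer to its first value, while a parameter with zero importance is updated by the task gradient alone.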
Marcheret, Jamaluddin, Rousett and Sinyavskiy are analogous art because they are in the same field of endeavor, namely using machine learning to train on multiple machine learning tasks.
Accordingly, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Marcheret, Jamaluddin and Rousset, further in view of Sinyavskiy, by using the model output to determine a gradient of the objective function. The modification would have been obvious because one of ordinary skill in the art would be motivated to improve classification accuracy and the flexibility of training with different learning rules (Sinyavskiy, [Abstract], “framework may be used to enable adaptive spiking neuron signal processing system to flexibly combine different learning rules (supervised, unsupervised, reinforcement learning) with different methods (online or batch learning). The generalized learning framework may employ time-averaged performance function as the learning measure thereby enabling modular architecture where learning tasks are separated from control tasks, so that changes in one of the modules do not necessitate changes within the other. Separation of learning tasks from the control tasks implementations may allow dynamic reconfiguration of the learning block in response to a task change or learning method change in real time.”).
Regarding claim 8, Marcheret, as modified in view of Jamaluddin, Rousset and Sinyavskiy, teaches the method of claim 1, wherein determining, for each of the plurality of parameters, a respective measure of an importance of the parameter to the machine learning model achieving acceptable performance on the first machine learning task comprises: determining a Fisher Information Matrix (FIM) of the plurality of parameters of the machine learning model with respect to the first machine learning task, wherein, for each of the plurality of parameters, the respective measure of the importance of the parameter is a corresponding value on a diagonal of the FIM (Sinyavskiy, [Columns 28-29, lines 63-33], “In some implementations, the gradient signal g, determined by the PD block 422v of FIG. 4 may be subsequently modified according to another gradient algorithm, as described in detail below. In some implementations, these modifications may comprise determining natural gradient, as follows:

[Equation image: media_image3.png]

Examiner’s note: training on the machine learning task involves the gradient signal and the Fisher information matrix, and the calculation of the Fisher information matrix involves the parameters; therefore, the measure of each parameter corresponds to a value of the FIM.).
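For illustration only, the claim 8 limitation that each parameter's measure of importance is the corresponding value on the diagonal of the FIM can be sketched as follows. This is a minimal sketch and is not taken from Sinyavskiy or from the application: it assumes a logistic regression model and the empirical Fisher estimate (the average of squared per-example gradients of the log-likelihood, using the observed labels), and the name `fisher_diagonal` is hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fisher_diagonal(weights, examples):
    """Empirical diagonal of the Fisher Information Matrix for logistic regression.

    examples is a list of (x, y) pairs, where x is a list of floats and y is 0 or 1.
    Entry i is the mean over examples of (d/dw_i log p(y | x, w))**2.
    """
    diag = [0.0] * len(weights)
    for x, y in examples:
        p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
        for i, xi in enumerate(x):
            score = (y - p) * xi  # gradient of log p(y | x, w) with respect to w_i
            diag[i] += score * score
    return [d / len(examples) for d in diag]


if __name__ == "__main__":
    data = [([1.0, 0.0], 1), ([1.0, 0.0], 0)]
    # The second input is always zero, so the second weight has zero importance.
    print(fisher_diagonal([0.0, 0.0], data))
```

Under these assumptions, a weight whose input never varies receives a diagonal value of zero, which is one way to read "importance" as a per-parameter quantity.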
Marcheret, Jamaluddin, Rousett and Sinyavskiy are analogous art because they are in the same field of endeavor, namely using machine learning to train on multiple machine learning tasks.
Accordingly, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Marcheret, Jamaluddin and Rousset, further in view of Sinyavskiy, by using the FIM value to determine the importance of the parameter. The modification would have been obvious because one of ordinary skill in the art would be motivated to improve classification accuracy and the flexibility of training with different learning rules (Sinyavskiy, [Abstract], “framework may be used to enable adaptive spiking neuron signal processing system to flexibly combine different learning rules (supervised, unsupervised, reinforcement learning) with different methods (online or batch learning). The generalized learning framework may employ time-averaged performance function as the learning measure thereby enabling modular architecture where learning tasks are separated from control tasks, so that changes in one of the modules do not necessitate changes within the other. Separation of learning tasks from the control tasks implementations may allow dynamic reconfiguration of the learning block in response to a task change or learning method change in real time.”).
Regarding claim 13, Marcheret, as modified in view of Jamaluddin, Rousset, and Sinyavskiy, teaches the method of claim 4 when dependent upon claim 4, further comprising identifying when switching from one machine learning task to another and updating the second term of the objective function in response (Sinyavskiy, [Column 3, lines 44-48], “Moreover, analytic determination of a performance function F derivative may require additional operations (often performed manually) for individual new formulated tasks that are not suitable for dynamic switching and reconfiguration of the tasks described before”; furthermore, see [Column 25, lines 52-57], “The PD block 475 output may be determined based the output signal 418, the learning signals 476 comprising the reinforcement component r(t) and the desired output (teaching) component yd(t) and on the input signal 412, that determines the context for switching between supervised and reinforcement task function” Examiner’s note: when a different machine learning task is trained, the objective function changes automatically; therefore, when the machine learning task is switched, the objective function is updated.).
Marcheret, Jamaluddin, Rousett and Sinyavskiy are analogous art because they are in the same field of endeavor, namely using machine learning to train on multiple machine learning tasks.
Accordingly, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Marcheret, Jamaluddin and Rousset, further in view of Sinyavskiy, by switching from one machine learning task to another and updating the second term of the objective function in response. The modification would have been obvious because one of ordinary skill in the art would be motivated to improve classification accuracy and the flexibility of training with different learning rules (Sinyavskiy, [Abstract], “framework may be used to enable adaptive spiking neuron signal processing system to flexibly combine different learning rules (supervised, unsupervised, reinforcement learning) with different methods (online or batch learning). The generalized learning framework may employ time-averaged performance function as the learning measure thereby enabling modular architecture where learning tasks are separated from control tasks, so that changes in one of the modules do not necessitate changes within the other. Separation of learning tasks from the control tasks implementations may allow dynamic reconfiguration of the learning block in response to a task change or learning method change in real time.”).
Regarding claim 16, Marcheret, as modified in view of Jamaluddin, Rousset and Sinyavskiy, teaches the method of claim 1, wherein the first and second machine learning tasks each comprise a reinforcement learning task, and wherein the reinforcement learning task is controlling an agent to interact with an environment to achieve a goal (Sinyavskiy, [Column 44, lines 55-63], “One or more implementations of reinforcement learning may require solving adaptive control task (e.g., AUV/UAV navigation) without having detailed prior information about the dynamics of the controlled plant (e.g., the plant 514 in FIG. 5). The reinforcement signal (e.g., the signal 504 in FIG. 5) is typically used to specify to the adaptive controller (e.g., the controller 520 of FIG. 5) whether prior behavior led to "desired" or "undesired" results.” and [Columns 45-46, lines 65-3], “Even when existing learning approaches employ neural networks as the computational engine, each learning task is typically performed by a separate network (or network partition) that operate task specific (e.g., adaptive control, classification, recognition, prediction rules, etc.) set of learning rules (e.g., supervised, unsupervised, reinforcement).” Examiner’s note: the machine learning model is therefore trained on multiple machine learning tasks, and the reinforcement learning controls a classification behavior in order to reach a desired result.).
Marcheret, Jamaluddin, Rousett and Sinyavskiy are analogous art because they are in the same field of endeavor, namely using machine learning to train on multiple machine learning tasks.
Accordingly, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Marcheret, Jamaluddin and Rousset, further in view of Sinyavskiy, by having the first and second machine learning tasks each comprise a reinforcement learning task that controls an agent to interact with an environment to achieve a goal. The modification would have been obvious because one of ordinary skill in the art would be motivated to improve classification accuracy and the flexibility of training with different learning rules (Sinyavskiy, [Abstract], “framework may be used to enable adaptive spiking neuron signal processing system to flexibly combine different learning rules (supervised, unsupervised, reinforcement learning) with different methods (online or batch learning). The generalized learning framework may employ time-averaged performance function as the learning measure thereby enabling modular architecture where learning tasks are separated from control tasks, so that changes in one of the modules do not necessitate changes within the other. Separation of learning tasks from the control tasks implementations may allow dynamic reconfiguration of the learning block in response to a task change or learning method change in real time.”).
Conclusion
The prior art made of record and not relied upon that is considered pertinent to applicant’s disclosure is provided below.
Doya et al. (Multiple Model Based Reinforcement Learning - Human Information Science Laboratories, ATR International, 2-2-2 Hikaridai, Seiku, Kyoto 619-0288, Japan - hereinafter, Doya et al.) teaches using reinforcement learning to train multiple models.
Ruvolo et al. (ELLA: An Efficient Lifelong Learning Algorithm - Bryn Mawr College, Computer Science Department, 101 North Merion Avenue, Bryn Mawr, PA 19010 USA - hereinafter, Ruvolo et al.) teaches training machine learning models on multiple learning tasks.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to EM N TRIEU whose telephone number is (571)272-5747.  The examiner can normally be reached on 7:30 - 5:00 M-TH.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Omar Fernandez Rivas can be reached on (571) 272-2589.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/E.T./Examiner, Art Unit 2128      

/OMAR F FERNANDEZ RIVAS/Supervisory Patent Examiner, Art Unit 2128