DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant's arguments filed 11/3/2020 have been fully considered but they are not persuasive. Applicant’s first argument is as follows:
“That is, as explained in paragraph [0007], the effect of using the change in complexity is to encourage the reinforcement learning agent to learn behaviors that increase knowledge. The effect of using the complexity of the change is to encourage the reinforcement learning agent to learn behaviors that increase understanding. In particular, paragraph [0007] explains "encouraging large changes in the complexity of the learned model will encourage the agent to maximize its knowledge" and "encouraging more complex changes in the learned model will encourage the agent to challenge its own understanding of the data.”
In response to applicant's argument that the references fail to show certain features of applicant’s invention, it is noted that the features upon which applicant relies (i.e., the effect of using the change in complexity versus the effect of using the complexity of the change) are not recited in the rejected claim(s).  Although the claims are interpreted in light of the specification, limitations from the specification are not read into the claims.  See In re Van Geuns, 988 F.2d 1181, 26 USPQ2d 1057 (Fed. Cir. 1993).  
¶6 of the 9/22/2020 Non-Final Rejection requested that Applicant clarify the difference in scope between “a measure of a change in complexity of the model” and “a measure of the complexity of the change in the model” (hereinafter “1st measure” and “2nd measure” respectively).  The overall effect of the 1st and 2nd measures does not provide sufficient detail for one of ordinary skill in the art to calculate said 1st and 2nd measures or to determine whether the 1st and 2nd measures are mutually exclusive in scope.  [0007] does not overcome this deficiency since “encouraging large changes” and “encouraging more complex changes” do not have to be mutually exclusive.  It is requested that a distinction be provided in terms of how the measurements are calculated and how said distinction is supported by Applicant’s disclosure.
Applicant’s second argument is as follows:
“To illustrate the difference in knowledge and understanding by example, consider that knowledge may be increased by simple memorization of additional objects in an already known class (like phone numbers in a phone book) while understanding may involve learning about previously unknown classes of objects (like learning about cars when all we've ever seen before were books). A person of ordinary skill in the art understands that the effects are different and, similarly, that the claim limitations are different. Although the two measures are described somewhat similarly, the two measures are not described identically. The examiner's broad interpretation "the complexity of the model" misses the critical element of measuring a change, as required by the claims. The difference between the two measures is the choice of the particular quantity whose change is measured, which produce different effects.”
The example and rationale provided is not found in Applicant’s disclosure and therefore cannot be relied upon to clarify the difference in scope between the 1st measure and the 2nd measure.  More specifically, “choice of the particular quantity whose change is measured” as cited above is not supported by Applicant’s disclosure.
Applicant’s third argument is as follows:

“Applicant does not admit Garvani is prior art and reserve the right to swear behind Garvani at a later point in time.”

Garnavi was filed on 3/15/2016, which precedes Applicant’s provisional filing date of 6/17/2016.  Therefore, Garnavi qualifies as prior art.
Applicant’s fourth argument is as follows:
“The prior art fails to disclose, inter alia, "select an action to maximize an expected future value of a reward function, wherein the reward function depends at least partly on at least one of: a measure of a change in complexity of the model or a measure of the complexity of the change in the model," as required by claim 1. Similar language is included in claim 16. 
As explained above, the presently pending claims are not about reducing the complexity in the model. Instead, the claims relate to using the change in the complexity, not about minimizing the complexity proper. As explained above, paragraph [0007] of the present application teaches that "encouraging large changes in the complexity of the learned model will encourage the agent to maximize its knowledge", which is a maximization strategy in contrast to the complexity minimization strategy taught by prior art.
The effect of minimizing the complexity would be to mitigate overfitting. This is not the same as the effect of basing the reward on the change in complexity of the model or the complexity of the change of the model. The effect of using the change in complexity of the model, as claimed, is to encourage the reinforcement learning agent to learn behaviors that increase knowledge. The effect of using the complexity of the change of the model, as claimed, is to encourage the reinforcement learning agent to learn behaviors that increase understanding. Weight decay, as described in the prior art, produces neither of these effects.”

In response to Applicant's argument that the references fail to show certain features of applicant’s invention, it is noted that the features upon which applicant relies (i.e., minimization strategy versus maximization strategy) are not recited in the rejected claim(s).  Although the claims are interpreted in light of the specification, limitations from the specification are not read into the claims.  See In re Van Geuns, 988 F.2d 1181, 26 USPQ2d 1057 (Fed. Cir. 1993).  
Claims 1 and 16 require “a measure of a change in complexity of the model” and “a measure of the complexity of the change in the model” which under broadest 
Election/Restrictions
Applicant’s election without traverse of claims 1, 7, 9, 12-16, 22, 24 and 27-30 in the reply filed on 7/9/2020 is acknowledged.
Claim Objections
Claims 2-6, 8, 10-11, 17-21, 23 and 25-26 are objected to because of the following informalities:  “(Original)” should be replaced with “(Withdrawn)” since an election without traverse was made on 7/9/2020.  Appropriate correction is required.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1, 7, 9, 12-16, 22, 24 and 27-30 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.  
It is not understood what the difference in scope is between “a measure of a change in complexity of the model” and “a measure of the complexity of the change in the model” as required by all the independent claims.  
Applicant’s Specification teaches that both measures “may be based on the change in description length or, equivalently, the change in negative log likelihood, of the first part of a two-part code describing one or more sequences of received data and actions” or, alternatively, “may be based on the change in description length or, equivalently, the change in negative log likelihood, of a statistical distribution modelling one or more sequences of received data and actions” ([0009]-[0010]).  
Since both measures are described identically, for the purposes of examination, “a measure of a change in complexity of the model” and “a measure of the complexity of the change in the model” will be given the same broadest reasonable interpretation of “the complexity of the model.”
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Mnih (US 2015/0100530) in view of Krogh et al (NPL: “A Simple Weight Decay Can Improve Generalization”).
For claim 1, Mnih teaches a reinforcement learning system (Figures 5a and 5b), comprising: 
one or more processors (within 122, Figure 5b); and 
one or more programs residing on a memory and executable by the one or more processors (124, 126, 128 and working memory within 122), the one or more programs configured to: 
perform actions (output of 106) from a set of available actions ([0082]); 
receive data in sequence from one or more sequential data sources (output of 102 and 104 of Figure 5a, [0021]-[0022]); 
generate a model (150, Figures 3b, 4 and 5a) that models sequences of the received data and the performed actions ([0079]-[0083]); and 
select an action to maximize an expected future value of a reward function ([0082] teaches selecting the action with the maximum Q-value).
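For illustration only, the greedy selection described in [0082] of Mnih (selecting the action with the maximum Q-value) can be sketched as follows. This is a minimal sketch and not Mnih's implementation; the function name select_greedy_action, the action names and the Q-values are hypothetical.

```python
from typing import Dict, Hashable


def select_greedy_action(q_values: Dict[Hashable, float]) -> Hashable:
    """Return the action whose estimated Q-value is largest.

    q_values maps each available action to its estimated Q-value
    (hypothetical numbers; a trained network would supply them).
    """
    if not q_values:
        raise ValueError("no available actions")
    # max over the keys, ranked by their associated Q-value
    return max(q_values, key=lambda action: q_values[action])


# Hypothetical Q-value estimates for three available actions.
q = {"left": 0.2, "right": 1.5, "noop": -0.3}
best = select_greedy_action(q)
```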
Mnih does not distinctly disclose:
wherein the reward function depends at least partly on at least one of: a measure of a change in complexity of the model or a measure of the complexity of the change in the model.
It is noted that Mnih teaches that 
“the reward/cost may be defined by parameters of the system or engineering problem to be solved” ([0017]);
However, Krogh teaches in §1 that: 
“the generalization ability of a neural network (or any other 'learning machine') depends on a balance between the information in the training examples and the complexity of the network… Bad generalization occurs…if the network is very complex and there is little information in the training set… A different way to constrain a network, and thus decrease its complexity, is to limit the growth of the weights through some kind of weight decay. It should prevent the weights from growing too large unless it is really necessary. It can be realized by adding a term to the cost function that penalizes large weights.”
Before the effective filing date of the invention it would have been obvious to one of ordinary skill in the art to calculate the reward/cost used in Mnih’s system by using a weight decay term as taught by Krogh in order to reduce complexity of the model by “[suppressing] any irrelevant components of the weight vector” (§7).
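For illustration only, the weight decay term quoted from Krogh (a term added to the cost function that penalizes large weights) can be sketched as a quadratic penalty. The sketch assumes the common form of base cost plus (decay / 2) times the sum of squared weights; the function names, the decay coefficient and the numeric values are hypothetical and are not taken from Mnih or Krogh.

```python
from typing import List, Sequence


def cost_with_weight_decay(base_cost: float,
                           weights: Sequence[float],
                           decay: float = 0.01) -> float:
    """Add a quadratic penalty on the weights to an existing cost value."""
    penalty = 0.5 * decay * sum(w * w for w in weights)
    return base_cost + penalty


def decay_gradient_step(weights: Sequence[float],
                        base_grad: Sequence[float],
                        lr: float = 0.1,
                        decay: float = 0.01) -> List[float]:
    """One gradient-descent step on the penalized cost.

    The gradient of the penalty with respect to each weight is decay * w,
    so every step shrinks each weight toward zero unless the gradient of
    the base cost pushes it the other way.
    """
    return [w - lr * (g + decay * w) for w, g in zip(weights, base_grad)]


penalized = cost_with_weight_decay(1.0, [3.0, 4.0], decay=0.1)
shrunk = decay_gradient_step([1.0], [0.0], lr=0.1, decay=0.5)
```

With a zero base gradient, the step in the last line moves the single weight from 1.0 toward zero, which is the sense in which the penalty limits the growth of the weights.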
For claim 16, Mnih teaches a method for reinforcement learning (Abstract, Figures 5a and 5b), comprising the steps of: 
receiving data in sequence from the one or more sequential data sources (output of 102 and 104 of Figure 5a, [0021]-[0022]); 
generating a model (150, Figures 3b, 4 and 5a), wherein the model is configured to model sequences of the received data and actions ([0079]-[0083]); and 
selecting an action maximizing the expected future value of a reward function ([0082] teaches selecting the action with the maximum Q-value). 
Mnih does not distinctly disclose:
wherein the reward function depends at least partly on at least one of: a measure of the change in complexity of the model, or a measure of the complexity of the change in the model.
It is noted that Mnih teaches that 
“the reward/cost may be defined by parameters of the system or engineering problem to be solved” ([0017]);
However, Krogh teaches in §1 that: 
“the generalization ability of a neural network (or any other 'learning machine') depends on a balance between the information in the training examples and the complexity of the network… Bad generalization occurs…if the network is very complex and there is little information in the training set… A different way to constrain a network, and thus decrease its complexity, is to limit the growth of the weights through some kind of weight decay. It should prevent the weights from growing too large unless it is really necessary. It can be realized by adding a term to the cost function that penalizes large weights.”
Before the effective filing date of the invention it would have been obvious to one of ordinary skill in the art to calculate the reward/cost used in Mnih’s system by using a weight decay term as taught by Krogh in order to reduce complexity of the model by “[suppressing] any irrelevant components of the weight vector” (§7).
Claims 7, 9, 22 and 24 are rejected under 35 U.S.C. 103 as being unpatentable over Mnih in view of Krogh and Garnavi et al (US 2017/0270653).
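For claim 7, Mnih as modified by Krogh teaches the limitations of claim 1 as cited above but fails to teach a negative log likelihood as claimed.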
It is noted that Mnih teaches:
“The second neural network is trained on the modulus difference between the target generated from the first neural network and the action-value parameter at step j output from the second neural network, adjusting the weights of the second neural network by (stochastic) gradient descent” ([0019]); and
“Preferably the first and second neural networks are deep neural networks and include a front end portion (an input portion receiving state data) which is locally or sparsely connected, for example, to implement a convolutional neural network” ([0024]).
However, Garnavi teaches in [0054] that:
“Training of the CNN may be performed as follows. In one embodiment, the system and/or method of the present disclosure may use negative log-likelihood as the loss function and perform Stochastic Gradient Descent (SGD).”
Before the effective filing date of the invention it would have been obvious to one of ordinary skill in the art to train Mnih’s convolutional neural network by using negative-log likelihood as the loss function since the particular known technique (performing stochastic gradient descent to train the weights of a convolutional neural network using negative log-likelihood as the loss function) was recognized as part of the ordinary capabilities of one skilled in the art.  
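For illustration only, the known technique referenced above (gradient descent on a negative log-likelihood loss) can be sketched on a toy softmax output. This is a minimal sketch under stated assumptions: the trainable parameters are taken to be the logits themselves rather than convolutional network weights, and the function names, learning rate and values are hypothetical and are not taken from Mnih or Garnavi.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logits: Sequence[float], target: int) -> float:
    """Negative log-likelihood of the target class under the softmax."""
    return -math.log(softmax(logits)[target])


def sgd_step(logits: Sequence[float], target: int, lr: float = 0.5) -> List[float]:
    """One gradient step on the NLL for a single sample.

    For a softmax output, d(NLL)/d(logit_i) = p_i - 1[i == target].
    """
    probs = softmax(logits)
    grad = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    return [z - lr * g for z, g in zip(logits, grad)]


before = nll([0.0, 0.0, 0.0], target=1)
after = nll(sgd_step([0.0, 0.0, 0.0], target=1), target=1)
```

The loss on the sampled example decreases after the step, which is the behavior the loss function and update rule are chosen to produce.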
Mnih as modified by Krogh and Garnavi as cited above teaches:
the measure of the change in complexity of the model is based on a change in negative log likelihood of a statistical distribution modelling one or more sequences of received data and actions (as understood by the use of negative log-likelihood as Mnih’s neural network loss function).
For claim 9, Mnih as modified by Krogh teaches the limitations of claim 1 as cited above but fails to teach a negative log likelihood as claimed.
It is noted that Mnih teaches:
“The second neural network is trained on the modulus difference between the target generated from the first neural network and the action-value parameter at step j output from the second neural network, adjusting the weights of the second neural network by (stochastic) gradient descent” ([0019]); and
“Preferably the first and second neural networks are deep neural networks and include a front end portion (an input portion receiving state data) which is locally or sparsely connected, for example, to implement a convolutional neural network” ([0024]).
However, Garnavi teaches in [0054] that:
“Training of the CNN may be performed as follows. In one embodiment, the system and/or method of the present disclosure may use negative log-likelihood as the loss function and perform Stochastic Gradient Descent (SGD).”
Before the effective filing date of the invention it would have been obvious to one of ordinary skill in the art to train Mnih’s convolutional neural network by using negative-log likelihood as the loss function since the particular known technique (performing stochastic gradient descent to train the weights of a convolutional neural network using negative log-likelihood as the loss function) was recognized as part of the ordinary capabilities of one skilled in the art.  
Mnih as modified by Krogh and Garnavi as cited above teaches:
the measure of the complexity of the change in the model is based on a negative log likelihood of a change in a statistical distribution modelling one or more sequences of received data and actions (as understood by the use of negative log-likelihood as Mnih’s neural network loss function).
For claim 22, Mnih as modified by Krogh teaches the limitations of claim 16 as cited above but fails to teach a negative log likelihood as claimed.
It is noted that Mnih teaches:
“The second neural network is trained on the modulus difference between the target generated from the first neural network and the action-value parameter at step j output from the second neural network, adjusting the weights of the second neural network by (stochastic) gradient descent” ([0019]); and
“Preferably the first and second neural networks are deep neural networks and include a front end portion (an input portion receiving state data) which is locally or sparsely connected, for example, to implement a convolutional neural network” ([0024]).
However, Garnavi teaches in [0054] that:
“Training of the CNN may be performed as follows. In one embodiment, the system and/or method of the present disclosure may use negative log-likelihood as the loss function and perform Stochastic Gradient Descent (SGD).”
Before the effective filing date of the invention it would have been obvious to one of ordinary skill in the art to train Mnih’s convolutional neural network by using negative-log likelihood as the loss function since the particular known technique (performing stochastic gradient descent to train the weights of a convolutional neural network using negative log-likelihood as the loss function) was recognized as part of the ordinary capabilities of one skilled in the art.  
Mnih as modified by Krogh and Garnavi as cited above teaches:
the measure of the change in complexity of the model is based on a change in negative log likelihood of a statistical distribution modelling one or more sequences of received data and actions (as understood by the use of negative log-likelihood as Mnih’s neural network loss function).
For claim 24, Mnih as modified by Krogh teaches the limitations of claim 16 as cited above but fails to teach a negative log likelihood as claimed.
It is noted that Mnih teaches:
“The second neural network is trained on the modulus difference between the target generated from the first neural network and the action-value parameter at step j output from the second neural network, adjusting the weights of the second neural network by (stochastic) gradient descent” ([0019]); and
“Preferably the first and second neural networks are deep neural networks and include a front end portion (an input portion receiving state data) which is locally or sparsely connected, for example, to implement a convolutional neural network” ([0024]).
However, Garnavi teaches in [0054] that:
“Training of the CNN may be performed as follows. In one embodiment, the system and/or method of the present disclosure may use negative log-likelihood as the loss function and perform Stochastic Gradient Descent (SGD).”
Before the effective filing date of the invention it would have been obvious to one of ordinary skill in the art to train Mnih’s convolutional neural network by using negative-log likelihood as the loss function since the particular known technique (performing stochastic gradient descent to train the weights of a convolutional neural network using negative log-likelihood as the loss function) was recognized as part of the ordinary capabilities of one skilled in the art.  
Mnih as modified by Krogh and Garnavi as cited above teaches:
the measure of the complexity of the change in the model is based on a negative log likelihood of a change in a statistical distribution modelling one or more sequences of received data and actions (as understood by the use of negative log-likelihood as Mnih’s neural network loss function).
Claims 12-15 and 27-30 are rejected under 35 U.S.C. 103 as being unpatentable over Mnih in view of Krogh and Zhang et al (“Clique-based Cooperative Multiagent Reinforcement Learning Using Factor Graphs”).
For claim 12, Mnih as modified by Krogh teaches the limitations of claim 1 as cited above but fails to teach potential functions over cliques on a factor graph as claimed.
However, Zhang teaches clique-based decomposition (e.g., Figure 3) of a global Q-value function into the sum of several simpler local Q-value functions expressed by a factor graph and exploited by the general max-plus algorithm to obtain a greedy joint action (Abstract).  
Before the effective filing date of the invention it would have been obvious to one of ordinary skill in the art to update Mnih’s Q-value function using Zhang’s method in order to reduce learning time and improve the quality of learned strategies (¶1 of §IV).
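For illustration only, the decomposition described in Zhang's Abstract (a global Q-value function expressed as the sum of simpler local Q-value functions, each defined over a subset of agents) can be sketched as follows. This is a minimal sketch under stated assumptions: exhaustive enumeration of joint actions is swapped in for the max-plus algorithm, and the agent count, action set, local functions and names are hypothetical and are not taken from Zhang or Mnih.

```python
from itertools import product
from typing import Callable, Sequence, Tuple

# A local Q-function: the indices of the agents it depends on (its scope)
# and a function of those agents' actions.
LocalQ = Tuple[Tuple[int, ...], Callable[..., float]]


def global_q(joint_action: Sequence[int], local_qs: Sequence[LocalQ]) -> float:
    """Global Q-value as the sum of local Q-values over their scopes."""
    return sum(f(*(joint_action[i] for i in scope)) for scope, f in local_qs)


def greedy_joint_action(n_agents: int,
                        actions: Sequence[int],
                        local_qs: Sequence[LocalQ]) -> Tuple[int, ...]:
    """Brute-force search for the joint action maximizing the global Q-value.

    Enumeration is exponential in the number of agents; it stands in here
    for max-plus message passing, which is not implemented in this sketch.
    """
    return max(product(actions, repeat=n_agents),
               key=lambda a: global_q(a, local_qs))


# Hypothetical example: three agents, binary actions, three local terms.
local_qs = [
    ((0, 1), lambda a0, a1: 1.0 if a0 == a1 else 0.0),
    ((1, 2), lambda a1, a2: 2.0 if a1 != a2 else 0.0),
    ((0,), lambda a0: 0.5 * a0),
]
best_joint = greedy_joint_action(3, [0, 1], local_qs)
```

Each local term touches only the agents in its scope, which corresponds to the edges of the factor graph connecting an action node to a local Q-value node.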
The combination of Mnih, Krogh and Zhang as defined above teaches:
the model is represented as a statistical distribution factorized into potential functions (each output of Mnih’s Figure 3b) over cliques on a factor graph containing nodes corresponding to the one or more sequential data sources as well as zero or more additional nodes corresponding to auxiliary variables (e.g., Figures 2 and 3 of Zhang).
For claim 13, Mnih as modified by Krogh and Zhang teaches the limitations of claim 12 as cited above and Mnih further teaches:
wherein the potential functions comprise at least a first potential function (first neural network, [0019]) and a second potential function (second neural network, [0019]), and wherein the first potential function is similar to the second potential function when variables in the first potential function are substituted for variables in the second potential function (the second neural network being trained using values from the first neural network, [0014]-[0015]), and the complexity of the model is reduced by the second potential function referencing the first potential function or by the first and second potential functions referencing a common function (by processing actions in parallel, [0015]).
For claim 14, Mnih as modified by Krogh and Zhang teaches the limitations of claim 12 as cited above and Zhang further teaches:
wherein one or more of the potential functions over a clique is conditioned on one or more conditioning values at each node in the clique, where the conditioning value at each node is one of: a data value received from a data source associated with the node (“an edge connects node (ai) and node (Qj) if and only if ai is an argument of Qj”, IV-C).
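For claim 15, Mnih as modified by Krogh and Zhang teaches the limitations of claim 14 as cited above and Zhang further teaches: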
the potential functions are further conditioned on one or more conditioning values from a time that is prior to a time the model is generated (via updates to the Q-value function, §III-C and §IV-C).
For claim 27, Mnih as modified by Krogh teaches the limitations of claim 1 as cited above but fails to teach potential functions over cliques on a factor graph as claimed.
However, Zhang teaches clique-based decomposition (e.g., Figure 3) of a global Q-value function into the sum of several simpler local Q-value functions expressed by a factor graph and exploited by the general max-plus algorithm to obtain a greedy joint action (Abstract).  
Before the effective filing date of the invention it would have been obvious to one of ordinary skill in the art to update Mnih’s Q-value function using Zhang’s method in order to reduce learning time and improve the quality of learned strategies (¶1 of §IV, Zhang).
The combination of Mnih, Krogh and Zhang as defined above teaches:
the model is represented as a statistical distribution factorized into potential functions (each output of Mnih’s Figure 3b) over cliques on a factor graph containing nodes corresponding to the one or more sequential data sources as well as zero or more additional nodes corresponding to auxiliary variables (e.g., Figures 2 and 3 of Zhang).
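For claim 28, Mnih as modified by Krogh and Zhang teaches the limitations of claim 27 as cited above and Mnih further teaches: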
wherein the potential functions comprise at least a first potential function (first neural network, [0019]) and a second potential function (second neural network, [0019]), and wherein the first potential function is similar to the second potential function when variables in the first potential function are substituted for variables in the second potential function (the second neural network being trained using values from the first neural network, [0014]-[0015]), and the complexity of the model is reduced by the second potential function referencing the first potential function or by the first and second potential functions referencing a common function (by processing actions in parallel, [0015]).
For claim 29, Mnih as modified by Krogh and Zhang teaches the limitations of claim 27 as cited above and Zhang further teaches:
wherein one or more of the potential functions over a clique are conditioned on one or more conditioning values at each node in the clique, where the one or more conditioning values at each node is one of: a data value received from a data source associated with the node (“an edge connects node (ai) and node (Qj) if and only if ai is an argument of Qj”, IV-C).
For claim 30, Mnih as modified by Krogh and Zhang teaches the limitations of claim 29 as cited above and Zhang further teaches:
one or more of the potential functions over the clique are further conditioned on one or more conditioning values from a time that is prior to a time the model is generated (via updates to the Q-value function, §III-C and §IV-C).
Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DANIEL CALRISSIAN PUENTES whose telephone number is (571)270-5070.  The examiner can normally be reached on M-F 9-6:30 (flex).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/DANIEL C PUENTES/Primary Examiner, Art Unit 2123