DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant’s arguments with respect to claim(s) 1-3, 6-7, 10-18, 21-22 and 25-40 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 16, 18, 22, 26, 32, 34, 36, 38 and 40 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by McGovern et al (NPL: “Automatic Discovery of Subgoals in Reinforcement Learning using Diverse Density”).
For claim 16, McGovern teaches a method for reinforcement learning (§2 and Table 1), comprising the steps of: 
performing actions from a set of available actions that affect an environment (a_t of §2, “interact with environment” of Table 1); 
receiving data in sequence from the one or more sequential data sources that relate to the environment (s_t … s_{t+n} of §2, “learn using RL” of Table 1); 
generating a model (policy of §2 and Table 1), wherein the model is configured to model sequences of the received data and actions (as understood by §2 and Table 1), wherein the model is a parametric model defined according to a set of parameters (diverse density, positive bags, negative bags, §4 and Table 1); and 
selecting an action maximizing the expected future value of a reward function for reinforcement learning (according to the most diversely dense region in feature space corresponding to the most positive bags with least negative bags, §4 and Table 1), wherein the reward function is a measure of the change in complexity of the model (maximum diverse density of §4, depicted as average log likelihood in Figure 2, “create new option… of reaching concept c” of Table 1), wherein the measure of complexity measures the complexity of the set of parameters defining the model to reward a learning agent for discovering more complexity (maximum diverse density corresponds to multiple successful trajectories of an agent through a bottleneck region, §4, Figure 2 and Table 1).
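By way of illustration only, the diverse density measure relied upon above may be sketched as follows. The sketch assumes the commonly used noisy-or formulation with a Gaussian-style instance probability; whether this matches §4 of McGovern term for term is an assumption. The function names, the `scale` parameter, and the sample trajectories are hypothetical and are not taken from the reference.

```python
import math

def instance_prob(x, c, scale=1.0):
    """Assumed Gaussian-style probability that instance x belongs to concept c."""
    d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return math.exp(-scale * d2)

def diverse_density(c, positive_bags, negative_bags, scale=1.0):
    """Unnormalized noisy-or diverse density of a candidate concept c."""
    dd = 1.0
    for bag in positive_bags:
        # A positive bag supports c if at least one of its instances is near c.
        miss = 1.0
        for x in bag:
            miss *= 1.0 - instance_prob(x, c, scale)
        dd *= 1.0 - miss
    for bag in negative_bags:
        # A negative bag supports c only if every one of its instances is far from c.
        for x in bag:
            dd *= 1.0 - instance_prob(x, c, scale)
    return dd

# Hypothetical trajectories: both successful ones pass through state (5, 3).
positive_bags = [[(1, 1), (5, 3), (9, 4)], [(2, 6), (5, 3), (8, 8)]]
negative_bags = [[(1, 1), (2, 2), (3, 1)]]
candidates = [(1, 1), (5, 3), (9, 4)]

# The candidate shared by the positive bags and absent from the negative bag wins.
best = max(candidates, key=lambda c: diverse_density(c, positive_bags, negative_bags))
```

In this toy data the shared state (5, 3) receives the highest score, which corresponds to the bottleneck region referred to in the mapping above.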
For claim 18, McGovern further teaches:
the measure of the change in complexity of the model is based on a change in negative log likelihood of the first part of a two-part code (Table 1) describing one or more sequences of received data and actions (§6.1 and Figure 2).
For claim 22, McGovern further teaches:
the measure of the change in complexity of the model is based on a change in negative log likelihood of a statistical distribution modelling one or more sequences of received data and actions (§6.1 and Figure 2).
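For illustration only, a change in negative log likelihood of a statistical distribution modelling a sequence of received data and actions may be sketched as below. The Laplace-smoothed categorical model, the class and function names, and the sample sequence are assumptions made for this sketch; they are drawn neither from the claims nor from McGovern. The sketch reports the signed change in bits and leaves open which sign would be treated as a reward.

```python
import math
from collections import Counter

class CategoricalModel:
    """Hypothetical Laplace-smoothed distribution over (observation, action) symbols."""

    def __init__(self, alphabet):
        self.alphabet = list(alphabet)
        self.counts = Counter()

    def update(self, symbols):
        # Model parameters are the symbol counts.
        self.counts.update(symbols)

    def prob(self, symbol):
        total = sum(self.counts.values()) + len(self.alphabet)
        return (self.counts[symbol] + 1) / total

    def nll_bits(self, symbols):
        # Negative log likelihood of the sequence, in bits.
        return -sum(math.log2(self.prob(s)) for s in symbols)

def nll_change(model, history, new_symbols):
    """Return NLL(history) after the update minus NLL(history) before it."""
    before = model.nll_bits(history)
    model.update(new_symbols)
    after = model.nll_bits(history)
    return after - before

left, right = ("s0", "left"), ("s0", "right")
history = [right, right, right, left]
model = CategoricalModel([left, right])
change = nll_change(model, history, history)  # negative: the update shortens the code
```

Before the update the model is uniform and the four symbols cost 4 bits; after the update the same sequence costs about 3.34 bits, so `change` is about -0.66.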
For claim 26, McGovern further teaches:
the model is represented as a replay memory that stores sequences of received data and actions, and an action is at least sometimes selected according to an action-value function learned via Q Learning with experience replay (Table 1 and 2nd to last ¶ of §3).
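For illustration only, Q Learning with experience replay as recited in this limitation may be sketched in tabular form as follows. This is a generic textbook-style sketch, not the procedure of Table 1 of McGovern; the class name, hyperparameters, and the two-transition toy environment are hypothetical.

```python
import random
from collections import defaultdict, deque

class ReplayQLearner:
    """Tabular Q-learning agent that learns from a replay memory of transitions."""

    def __init__(self, actions, alpha=0.5, gamma=0.9, capacity=1000, seed=0):
        self.actions = list(actions)
        self.alpha, self.gamma = alpha, gamma
        self.memory = deque(maxlen=capacity)  # stored (s, a, r, s_next, done) tuples
        self.q = defaultdict(float)           # action-value table keyed by (s, a)
        self.rng = random.Random(seed)

    def store(self, s, a, r, s_next, done):
        self.memory.append((s, a, r, s_next, done))

    def replay(self, batch_size=8):
        # Sample stored transitions and apply the one-step Q-learning update to each.
        k = min(batch_size, len(self.memory))
        for s, a, r, s_next, done in self.rng.sample(list(self.memory), k):
            best_next = max(self.q[(s_next, b)] for b in self.actions)
            target = r if done else r + self.gamma * best_next
            self.q[(s, a)] += self.alpha * (target - self.q[(s, a)])

    def select(self, s, epsilon=0.1):
        # Epsilon-greedy: the learned action-value function is used most of the time.
        if self.rng.random() < epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(s, a)])

agent = ReplayQLearner(["left", "right"])
agent.store(0, "right", 1.0, 1, True)   # moving right ends the episode with reward 1
agent.store(0, "left", 0.0, 0, False)   # moving left stays in place with no reward
for _ in range(50):
    agent.replay()
greedy_action = agent.select(0, epsilon=0.0)
```

After repeated replay of the two stored transitions, the greedy choice in state 0 is "right".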
For claim 32, McGovern further teaches:
the data sources comprise one or more of pixel values from an image sensor, audio sample values from a microphone, characters in a text stream, telemetry values from a motor system with feedback, image or text media output from a computer application, media data obtained from the internet (pixel values, Figure 1A).
For claim 34, McGovern further teaches:
the selected action comprises one or more of outputting signals to drive a motor, outputting audio samples to a speaker, outputting pixel values to a display, outputting text characters to drive a speech synthesis device, inputting commands to a computer application, and retrieving a media file from the internet (outputting pixel values to a display via a histogram, Figure 1B and 1C).
For claim 36, McGovern further teaches:
the one or more programs are further configured to perform the selected action, the selected action having an effect on an environment (§2).
For claim 38, McGovern further teaches:
the one or more programs are further configured to update the model periodically or whenever new data is received (§5 and Table 1).
For claim 40, McGovern further teaches:
the one or more programs are further configured to receive a reward for a change in complexity of the model based on a measure of the complexity of the model before an update compared to a measure of the complexity of the model after an update (learning acceleration once automatically created options were added at approximately trial 20 in Figure 3 and §6.1).
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1, 3, 7, 11, 31, 33, 35, 37 and 39 are rejected under 35 U.S.C. 103 as being unpatentable over McGovern et al (NPL: “Automatic Discovery of Subgoals in Reinforcement Learning using Diverse Density”) in view of Mnih et al (US 2015/0100530).
For claim 1, McGovern teaches a reinforcement learning system (§2 and Table 1) for performing a method, comprising: 
perform actions from a set of available actions (a_t of §2, “interact with environment” of Table 1) that affect an environment (§2, Table 1); 
receive data in sequence from one or more sequential data sources that relate to the environment (s_t … s_{t+n} of §2, “learn using RL” of Table 1); 
generate a model (policy of §2 and Table 1) that models sequences of the received data and the performed actions (as understood by §2 and Table 1), wherein the model is a parametric model defined according to a set of parameters (diverse density, positive bags, negative bags, §4 and Table 1); and 
select an action to maximize an expected future value of a reward function for reinforcement learning (according to the most diversely dense region in feature space corresponding to the most positive bags with least negative bags, §4 and Table 1), wherein the reward function is a measure of a change in complexity of the model (maximum diverse density of §4, depicted as average log likelihood in Figure 2, “create new option… of reaching concept c” of Table 1), 
wherein the measure of complexity measures complexity of the set of parameters to reward a learning agent for discovering more complexity (maximum diverse density corresponds to multiple successful trajectories of an agent through a bottleneck region, §4, Figure 2 and Table 1).
McGovern fails to distinctly disclose:
one or more processors; and 
one or more programs residing on a memory and executable by the one or more processors, the one or more programs configured to perform a reinforcement learning method.
However, Mnih teaches a reinforcement learning system (Figures 5a and 5b) comprising:
one or more processors (within 122, Figure 5b); and 
one or more programs residing on a memory and executable by the one or more processors (124, 126, 128 and working memory within 122), the one or more programs configured to perform a reinforcement learning method (Figure 2).
Before the effective filing date of the invention it would have been obvious to one of ordinary skill in the art to implement McGovern’s reinforcement learning method using one or more processors and memory since the particular known technique was recognized as part of the ordinary capabilities of one skilled in the art.
For claim 3, McGovern as modified by Mnih as cited above teaches the limitations of claim 1 and McGovern further teaches:
the measure of the change in complexity of the model is based on a change in negative log likelihood of the first part of a two-part code (Table 1) describing one or more sequences of received data and actions (§6.1 and Figure 2).
For claim 7, McGovern as modified by Mnih as cited above teaches the limitations of claim 1 and McGovern further teaches:
the measure of the change in complexity of the model is based on a change in negative log likelihood of a statistical distribution modelling one or more sequences of received data and actions (§6.1 and Figure 2).
For claim 11, McGovern as modified by Mnih as cited above teaches the limitations of claim 1 and McGovern further teaches:
the model is represented as a replay memory that stores sequences of received data and actions, and an action is at least sometimes selected according to an action-value function learned via Q Learning with experience replay (Table 1 and 2nd to last ¶ of §3).
For claim 31, McGovern as modified by Mnih as cited above teaches the limitations of claim 1 and McGovern further teaches:
the data sources comprise one or more of pixel values from an image sensor, audio sample values from a microphone, characters in a text stream, telemetry values from a motor system with feedback, image or text media output from a computer application, media data obtained from the internet (pixel values, Figure 1A).
For claim 33, McGovern as modified by Mnih as cited above teaches the limitations of claim 1 and McGovern further teaches:
the selected action comprises one or more of outputting signals to drive a motor, outputting audio samples to a speaker, outputting pixel values to a display, outputting text characters to drive a speech synthesis device, inputting commands to a computer application, and retrieving a media file from the internet (outputting pixel values to a display via a histogram, Figure 1B and 1C).
For claim 35, McGovern as modified by Mnih as cited above teaches the limitations of claim 1 and McGovern further teaches:
the one or more programs are further configured to perform the selected action, the selected action having an effect on an environment (§2).
For claim 37, McGovern as modified by Mnih as cited above teaches the limitations of claim 1 and McGovern further teaches:
the one or more programs are further configured to update the model periodically or whenever new data is received (§5 and Table 1).
For claim 39, McGovern as modified by Mnih as cited above teaches the limitations of claim 1 and McGovern further teaches:
the one or more programs are further configured to receive a reward for a change in complexity of the model based on a measure of the complexity of the model before an update compared to a measure of the complexity of the model after an update (learning acceleration once automatically created options were added at approximately trial 20 in Figure 3 and §6.1).
Allowable Subject Matter
Claims 2, 6, 10, 12-15, 17, 21, 25 and 27-30 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DANIEL CALRISSIAN PUENTES whose telephone number is (571)270-5070. The examiner can normally be reached M-F 9-6:30 (flex).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Menatoallah Yousseff can be reached on 571-270-3684. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/DANIEL C PUENTES/
Primary Examiner, Art Unit 2849