DETAILED ACTION


Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.



Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.



Claims 1-6, 8, and 12-20 are rejected under 35 U.S.C. 103 as being unpatentable over Arel et al., US PGPUB No. 20170213150 A1, hereinafter Arel, in view of van Hasselt et al., US PGPUB No. 20170076201 A1, hereinafter van-Hasselt.

	
Regarding claim 1, Arel teaches a method performed by one or more data processing apparatus for estimating an outcome associated with an environment being interacted with by an agent to perform a task by aggregating reward and value predictions over a sequence of planning steps (Arel; a method [¶ 0004-0005] performed by one or more data processing apparatus [¶ 0020-0021 and ¶ 0087] for estimating an outcome associated with an environment being interacted with by an agent to perform a task [¶ 0022-0024] by aggregating reward and value predictions over a sequence of planning steps (i.e. selectable actions) [¶ 0033-0034, ¶ 0039-0040, and ¶ 0043], as illustrated within Figs. 2 and 3; moreover, selecting actions [¶ 0046-0047] in relation with a sequence of state representations [¶ 0054-0055]; and moreover, predicting next state(s) [¶ 0028]), the method comprising: 
receiving one or more observations characterizing states of the environment being interacted with by the agent (Arel; the method, as addressed above, comprises receiving one or more observations characterizing states of the environment being interacted with by the agent [¶ 0031, ¶ 0033, and ¶ 0038], as illustrated within Fig. 1; moreover, observation with one or more recent observations to generate a state representation [¶ 0027]); 
processing the one or more observations using a state representation neural network to generate an internal state representation for a first planning step of the sequence of planning steps (Arel; the method, as addressed above, comprises processing the one or more observations using a learning model/system (i.e. state representation neural network) [¶ 0021 and ¶ 0024] to generate an internal state representation for a 1st planning step (i.e. selected action) of the sequence of planning steps (i.e. selectable actions) [¶ 0028-0031 and ¶ 0038-0040]; moreover, the space of possible combinations of state representations and actions, for each action, the system combines the current state representation with the action and determines the partition to which the state representation-action combination belongs and then uses the supervised learning model that corresponds to the partition to generate the respective current value function estimate for the state representation-action combination [¶ 0041-0043]); 
for each planning step in the sequence of planning steps, processing an internal state representation for the planning step using a prediction neural network to generate (Arel; processing an internal state representation for the planning step (i.e. selected action) [¶ 0031-0033] using a prediction neural network to generate results [¶ 0055-0056] for each planning step (i.e. selected action) in the sequence of planning steps (i.e. selectable actions) [¶ 0029-0031]; moreover, predicting next state(s) [¶ 0028]):
(i) an internal state representation for a next planning step (Arel; processing an internal state representation for the planning step (i.e. selected action) using a prediction neural network, as addressed above, to generate an internal state representation for a next planning step (i.e. selected action) [¶ 0024-0025, ¶ 0055-0056, and ¶ 0083]; wherein, for each action in the set of actions, identifies a respective partition of the learning input state space [¶ 0031]; and wherein, the respective partition identified for the action to generate a respective value function estimate for the action and uses the value function estimates to select the action to be performed by the agent and selecting an action to be performed [¶ 0031 and ¶ 0054-0056]), and (ii) a predicted reward for the next planning step (Arel; a predicted reward for the next planning step (i.e. selected action) [¶ 0031, ¶ 0033-0034, and ¶ 0055-0056]); 
for each of one or more planning steps in the sequence of planning steps (Arel; for each of one or more planning steps (i.e. selectable actions) in the sequence of planning steps (i.e. selectable actions) [¶ 0031-0033]), processing the internal state representation for the planning step using a value prediction neural network to generate a value prediction that is an estimate of a future cumulative discounted reward received after the planning step (Arel; processing the internal state representation for the planning step (i.e. selected action) [¶ 0031-0033] using a value prediction neural network to generate a value prediction that is an estimate of an implicit future cumulative discounted reward received after the planning step (i.e. selected actions) [¶ 0031, ¶ 0033-0034, and ¶ 0055-0056]; moreover, a discounted sum of the future rewards [¶ 0019]); and 
determining an estimate of the outcome associated with the environment based on the predicted rewards and the value predictions for the planning steps (Arel; determining an estimate of the outcome associated with the environment based on the predicted rewards and the value predictions for the planning steps (i.e. selectable actions) [¶ 0055-0057, ¶ 0059, and ¶ 0076]; moreover, determining unacceptable/acceptable performance [¶ 0061-0064]).  
Arel fails to explicitly disclose an estimate of a future cumulative discounted reward.
However, van-Hasselt teaches processing the internal state representation for the planning step using a value prediction neural network to generate a value prediction that is an estimate of a future cumulative discounted reward received after the planning step (van-Hasselt; processing the implicit internal state representation for the planning step (i.e. selected action) using a value prediction neural network (i.e. Q network) to generate a value prediction (i.e. future value) that is an estimate of a future cumulative discounted reward [¶ 0019-0020 and ¶ 0022] received after the planning step (i.e. selected action) [¶ 0036-0038 and ¶ 0040]; moreover, generating a respective estimated future cumulative reward for each action in the set of actions [¶ 0036-0038 and ¶ 0045-0047]; wherein, the next selected action [¶ 0046] is processed using a Q network [¶ 0047-0049]).
Arel and van-Hasselt are considered to be analogous art because both pertain to generating and/or managing data in relation to providing a learning environment, wherein one or more computerized units are utilized in order to produce interaction in a real-world environment by a mechanical agent.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Arel to incorporate processing the internal state representation for the planning step using a value prediction neural network to generate a value prediction that is an estimate of a future cumulative discounted reward received after the planning step (as taught by van-Hasselt), in order to provide effective action selection through improved performance of reinforcement learning (van-Hasselt; [¶ 0005-0006]).
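For illustration only, the data flow recited in claim 1 (observations, internal state representation, per-step predicted rewards and value predictions, aggregated outcome estimate) can be sketched as below. This is a minimal sketch under stated assumptions: the three functions are toy linear stand-ins for the recited neural networks, and the names `state_representation`, `prediction`, `value_prediction`, and `estimate_outcome`, the fixed discount `GAMMA`, and the particular aggregation rule are hypothetical. None of them is taken from Arel, van-Hasselt, or the application.

```python
from typing import List, Tuple

GAMMA = 0.9  # assumed fixed discount; claim 1 itself does not recite one


def state_representation(observations: List[float]) -> List[float]:
    """Toy stand-in: encode the observations as a 2-element internal state."""
    mean = sum(observations) / len(observations)
    return [mean, 1.0]


def prediction(state: List[float]) -> Tuple[List[float], float]:
    """Toy stand-in: return (next internal state, predicted reward)."""
    next_state = [0.5 * state[0], state[1]]
    reward = 0.1 * state[0]
    return next_state, reward


def value_prediction(state: List[float]) -> float:
    """Toy stand-in: estimate the future cumulative discounted reward."""
    return 0.2 * state[0]


def estimate_outcome(observations: List[float], num_steps: int) -> float:
    """Roll the internal state forward and aggregate rewards and values."""
    state = state_representation(observations)
    rewards, values = [], []
    for _ in range(num_steps):
        state, reward = prediction(state)
        rewards.append(reward)
        values.append(value_prediction(state))
    # Assumed aggregation: discounted predicted rewards plus the
    # discounted value prediction for the last planning step.
    outcome = sum(GAMMA ** k * r for k, r in enumerate(rewards))
    outcome += GAMMA ** num_steps * values[-1]
    return outcome
```

Note that every step after the first operates on an internal state representation rather than on a fresh observation, which is the point on which the claim language and the cited action-selection passages are being compared.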
	
Regarding claim 2, Arel in view of van-Hasselt further discloses the method of claim 1, wherein the agent is a robotic agent interacting with a real-world environment (Arel; the agent is a robotic agent interacting with a real-world environment [¶ 0015-0017 and ¶ 0022]).  

Regarding claim 3, Arel in view of van-Hasselt further discloses the method of claim 1, wherein the outcome associated with the environment characterizes an effectiveness of the agent in performing the task (Arel; the outcome associated with the environment characterizes an effectiveness (i.e. unacceptable/acceptable performance) of the agent in performing the task [¶ 0055-0057, ¶ 0059, and ¶ 0076]; moreover, determining unacceptable/acceptable performance [¶ 0061-0064]).  

Regarding claim 4, Arel in view of van-Hasselt further discloses the method of claim 1, wherein each observation characterizing a state of the environment being interacted with by the agent comprises a respective image of the environment (Arel; each observation characterizing a state of the environment being interacted with by the agent [¶ 0021, ¶ 0027, and ¶ 0031] comprises a respective image (i.e. data) of the environment [¶ 0017]; moreover, data received by the reinforcement learning that partially or fully characterizes a state of an environment will be referred to in this specification as an observation [¶ 0017]).  

Regarding claim 5, Arel in view of van-Hasselt further discloses the method of claim 1, wherein for each planning step in the sequence of planning steps, the prediction neural network further generates a predicted discount factor for the next planning step (van-Hasselt; the prediction neural network (i.e. Q network) further generates a predicted discount factor for the next planning step (i.e. selected action) [¶ 0019-0020 and ¶ 0022] for each planning step (i.e. selected action) in the sequence of planning steps (i.e. selectable actions) [¶ 0036-0038 and ¶ 0045-0047]; additionally, estimated error [¶ 0048-0049]), and wherein determining the estimate of the outcome associated with the environment (van-Hasselt; determining the estimate of the outcome associated with the environment [¶ 0046-0049]) further comprises: 
determining the estimate of the outcome associated with the environment based on the predicted discount factors for the planning steps in addition to the predicted rewards and the value predictions for the planning steps (van-Hasselt; determining the estimate of the outcome associated with the environment, as addressed above, further comprises determining the estimate of the outcome associated with the environment based on the predicted discount factors for the planning steps (i.e. selectable actions) in addition to the predicted rewards and the value predictions for the planning steps (i.e. selectable actions) [¶ 0046-0049]).  
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Arel, as modified by van-Hasselt, to incorporate the features wherein, for each planning step in the sequence of planning steps, the prediction neural network further generates a predicted discount factor for the next planning step, and wherein determining the estimate of the outcome associated with the environment further comprises: determining the estimate of the outcome associated with the environment based on the predicted discount factors for the planning steps in addition to the predicted rewards and the value predictions for the planning steps (as taught by van-Hasselt), in order to provide effective action selection through improved performance of reinforcement learning (van-Hasselt; [¶ 0005-0006]).

Regarding claim 6, Arel in view of van-Hasselt further discloses the method of claim 5, wherein determining the estimate of the outcome associated with the environment (Arel; determining the estimate of the outcome associated with the environment, as addressed within the parent claim(s)) further comprises combining:
(i) the predicted reward and the predicted discount factor for each planning step (van-Hasselt; determining the estimate of the outcome associated with the environment comprises combining the predicted reward and the predicted discount factor for each planning step (i.e. selected action) [¶ 0018, ¶ 0022, and ¶ 0046-0049]; moreover, generating a tuple [¶ 0041 and ¶ 0043-0045]), and (ii) a value prediction for a last planning step (Arel; determining the estimate of the outcome associated with the environment comprises combining a value prediction (i.e. future value) for a last planning step (i.e. calculated future selectable action) [¶ 0020 and ¶ 0048-0049]; wherein, generating an estimated future cumulative reward from the input in accordance with a set of parameters [¶ 0036 and ¶ 0053]).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Arel, as modified by van-Hasselt, to incorporate combining: (i) the predicted reward and the predicted discount factor for each planning step, and (ii) a value prediction for a last planning step (as taught by van-Hasselt), in order to provide effective action selection through improved performance of reinforcement learning (van-Hasselt; [¶ 0005-0006]).
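For illustration only, one way to read the combination recited in claim 6 is as a backward fold that starts from the value prediction for the last planning step and applies each step's predicted reward and predicted discount factor in turn. The sketch below assumes that reading; the function name `combine_k_step` and its argument layout are hypothetical, and the formula is not quoted from Arel, van-Hasselt, or the application.

```python
from typing import Sequence


def combine_k_step(rewards: Sequence[float],
                   discounts: Sequence[float],
                   last_value: float) -> float:
    """Fold rewards and discounts backward from the last value prediction.

    Computes r1 + g1 * (r2 + g2 * (... + gK * last_value)), where ri and gi
    are the predicted reward and predicted discount factor for step i.
    """
    if len(rewards) != len(discounts):
        raise ValueError("need one discount factor per predicted reward")
    estimate = last_value
    for reward, discount in zip(reversed(rewards), reversed(discounts)):
        estimate = reward + discount * estimate
    return estimate
```

For example, `combine_k_step([1.0, 2.0], [0.5, 0.5], 4.0)` evaluates `1.0 + 0.5 * (2.0 + 0.5 * 4.0)`. Under this reading the discount factors are per-step network outputs rather than a single fixed constant.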
  	
Regarding claim 8, Arel in view of van-Hasselt further discloses the method of claim 5, further comprising, for each planning step in the sequence of planning steps, processing the internal state representation for the planning step using a lambda neural network to generate a lambda factor for the next planning step (van-Hasselt; processing the internal state representation for the planning step (i.e. selected action) using a lambda neural network (i.e. Q network) to generate a lambda factor for the next planning step (i.e. selected action) for each planning step (i.e. selected action) in the sequence of planning steps (i.e. selectable actions) [¶ 0022, ¶ 0028, and ¶ 0030]; moreover, next selected action [¶ 0046-0048]), and wherein determining the estimate of the outcome associated with the environment (van-Hasselt; determining the estimate of the outcome associated with the environment [¶ 0046-0049]) comprises: 
determining the estimate of the outcome based on the lambda factors for the planning steps in addition to the predicted discount factors, the predicted rewards, and the value predictions for the planning steps (van-Hasselt; determining the estimate of the outcome based on the lambda factors for the planning steps (i.e. selectable actions) in addition to the predicted discount factors, the predicted rewards, and the value predictions for the planning steps (i.e. selectable actions) [¶ 0020 and ¶ 0048-0049]).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Arel, as modified by van-Hasselt, to incorporate, for each planning step in the sequence of planning steps, processing the internal state representation for the planning step using a lambda neural network to generate a lambda factor for the next planning step, and wherein determining the estimate of the outcome associated with the environment comprises: determining the estimate of the outcome based on the lambda factors for the planning steps in addition to the predicted discount factors, the predicted rewards, and the value predictions for the planning steps (as taught by van-Hasselt), in order to provide effective action selection through improved performance of reinforcement learning (van-Hasselt; [¶ 0005-0006]).
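For illustration only, one conventional way to use per-step lambda factors together with predicted discount factors, predicted rewards, and value predictions is a backward recursion that blends, at each planning step, that step's own value prediction with a bootstrapped estimate from the following step. The sketch below assumes that recursion; it is not stated in the record, and the function name `combine_lambda` and its indexing convention are hypothetical.

```python
from typing import Sequence


def combine_lambda(values: Sequence[float],
                   rewards: Sequence[float],
                   discounts: Sequence[float],
                   lambdas: Sequence[float]) -> float:
    """Blend value predictions and bootstrapped returns, last step first.

    values has K + 1 entries (planning steps 0..K); rewards, discounts,
    and lambdas each have K entries (one per transition k -> k + 1).
    """
    num_steps = len(rewards)
    if len(values) != num_steps + 1:
        raise ValueError("values must have one more entry than rewards")
    if len(discounts) != num_steps or len(lambdas) != num_steps:
        raise ValueError("need one discount and one lambda per reward")
    estimate = values[-1]  # start from the last planning step
    for k in reversed(range(num_steps)):
        bootstrapped = rewards[k] + discounts[k] * estimate
        # lambda = 0 trusts the value prediction at step k;
        # lambda = 1 defers entirely to the later planning steps.
        estimate = (1.0 - lambdas[k]) * values[k] + lambdas[k] * bootstrapped
    return estimate
```

With every lambda factor set to 1 this reduces to folding the rewards and discounts back from the last value prediction, and with the first lambda factor set to 0 it returns the value prediction for the first planning step unchanged.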

Regarding claim 12, Arel in view of van-Hasselt further discloses the method of claim 1, wherein the state representation neural network comprises a feedforward neural network (Arel; the state representation neural network comprises a feedforward neural network [¶ 0024-0025 and ¶ 0028]; wherein, input data is implicitly fed forward through a neural network [¶ 0024 and ¶ 0028]).  

Regarding claim 13, Arel in view of van-Hasselt further discloses the method of claim 1, wherein the prediction neural network comprises a recurrent neural network (Arel; the prediction neural network comprises a recurrent neural network [¶ 0028]).  

Regarding claim 14, Arel in view of van-Hasselt further discloses the method of claim 1, wherein the prediction neural network comprises a feedforward neural network that has different parameter values at each planning step (van-Hasselt; the prediction neural network comprises a feedforward neural network that has different parameter values at each planning step [¶ 0036 and ¶ 0047-0048]). 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Arel, as modified by van-Hasselt, to incorporate a prediction neural network that comprises a feedforward neural network having different parameter values at each planning step (as taught by van-Hasselt), in order to provide effective action selection through improved performance of reinforcement learning (van-Hasselt; [¶ 0005-0006]).

Regarding claim 15, the rejection of claim 15 is addressed within the rejection of claim 1, due to the similarities claim 15 and claim 1 share, therefore refer to the rejection of claim 1 regarding the rejection of claim 15; however, the subject matter/limitations not addressed by claim 1 is/are addressed below.
Arel discloses a system (Arel; a system [¶ 0020-0021, ¶ 0085, and ¶ 0087]) comprising: 
one or more computers (Arel; the system, as addressed above, comprises one or more computers [¶ 0020-0021, ¶ 0084-0085, and ¶ 0087]); and
one or more storage devices communicatively coupled to the one or more computers (Arel; the system, as addressed above, comprises one or more storage devices communicatively coupled to the one or more computers [¶ 0084-0085, ¶ 0087, and ¶ 0089]), wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform operations (Arel; the one or more storage devices store instructions that cause the one or more computers to perform operations when executed by the one or more computers [¶ 0084-0085, ¶ 0087, and ¶ 0089]).
(further refer to the rejection of claim 1)

Regarding claim 16, the rejection of claim 16 is addressed within the rejection of claim 2, due to the similarities claim 16 and claim 2 share, therefore refer to the rejection of claim 2 regarding the rejection of claim 16.

Regarding claim 17, the rejection of claim 17 is addressed within the rejection of claim 3, due to the similarities claim 17 and claim 3 share, therefore refer to the rejection of claim 3 regarding the rejection of claim 17.

Regarding claim 18, the rejection of claim 18 is addressed within the rejection of claim 4, due to the similarities claim 18 and claim 4 share, therefore refer to the rejection of claim 4 regarding the rejection of claim 18.

Regarding claim 19, the rejection of claim 19 is addressed within the rejection of claim 5, due to the similarities claim 19 and claim 5 share, therefore refer to the rejection of claim 5 regarding the rejection of claim 19.

Regarding claim 20, the rejection of claim 20 is addressed within the rejection of claim 1, due to the similarities claim 20 and claim 1 share, therefore refer to the rejection of claim 1 regarding the rejection of claim 20; however, the subject matter/limitations not addressed by claim 1 is/are addressed below.
Arel discloses one or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations (Arel; one or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations [¶ 0084-0086]).
(further refer to the rejection of claim 1)
	



Allowable Subject Matter

Claims 7 and 9-11 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.



Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Refer to PTO-892, Notice of References Cited, for a listing of analogous art.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Charles Lloyd Beard whose telephone number is (571)272-5735. The examiner can normally be reached Monday - Friday, 8:00 AM - 5:00 PM, alternate Fridays EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kee Tung, can be reached at (571)270-7794. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

CHARLES LLOYD BEARD
Primary Examiner
Art Unit 2616



/CHARLES L BEARD/Primary Examiner, Art Unit 2616