DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 09/04/2018 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Claim Objections
Claims 11 and 18 are objected to because of the following informalities: 
In claim 11, line 19, “as an in intermediate reward” should be “as an intermediate reward”.
In claim 18, line 23, “an in intermediate reward” should be “as an intermediate reward”.
Appropriate correction is required.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):

(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claim 12 is rejected under 35 U.S.C. 112(b) or pre-AIA 35 U.S.C. 112, second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor, or for pre-AIA the applicant, regards as the invention.
Claim 12 recites the limitation “a good initialization state”, which renders the claim indefinite. The term “good” is a relative term that is not defined by the claim, and the specification does not provide a clear explanation of it. There are therefore no criteria by which to determine or measure how “good” a state must be. Because the term is subjective and the examiner cannot arrive at a clear and unique definition for it, the claim is rendered indefinite.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale or otherwise available to the public before the effective filing date of the claimed invention.

Claims 11, 13, 14 and 16 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Hirkoawa ("Coaching Robots: Online Behavior Learning from Human Subjective Feedback").
In regard to claim 11, Hirkoawa teaches: A method of learning to achieve a goal configuration of an artificial agent, comprising: defining the goal configuration for the agent;  (Hirkoawa, p. 39 "We propose a new learning framework which allows human trainers to give subjective and simple feedback to the agent asynchronously and in real time, to lead the agent towards learning the desired behavior [e.g. learning to achieve a goal configuration]."; p. 43 "The first term, rinit, refers to the initial reward function, which defines the goal state of the task [e.g. a goal configuration] given at the beginning of the learning algorithm...")
providing positive examples via an interface to the agent when the agent has achieved the goal configuration; (Hirkoawa, p. 37 "We demonstrated that the agent is capable of learning the desired behavior by receiving simple and subjective instructions such as positive [positive examples] and negative."; p. 39 "Moreover, we use abstract and primitive binary values as the evaluation quantity of Coaching, namely positive or negative."; p. 39 "Fig. 1. Coaching is an interactive learning methodology in which a human trainer can assist the agent in an intuitive manner by giving subjective feedback in real time in addition to the learning mechanism implemented in the agent"; p. 40 "Fig. 2. Learning model of the proposed method. Human trainer can intervene the learning process of the agent by updating the reward function based on Coaching as a learning assistance."; p. 48 "Give a 'positive' or 'negative' evaluation if you think that the agent did or didn't do an action that brings it closer to achieving this task. [achieving the goal configuration]"; p. 50 "... we proposed a novel methodology for behavior learning, called Coaching. This method allows a human trainer to intervene in the learning process of an agent by giving subjective evaluations such as positive and negative.")
providing negative examples via the interface to the agent when the agent fails to achieve the goal configuration; (Hirkoawa, p.37 "We demonstrated that the agent is capable of learning the desired behavior by receiving simple and subjective instructions such as positive and negative. [negative examples]"; p. 39 "Moreover, we use abstract and primitive binary values as the evaluation quantity of Coaching, namely positive or negative."; p.39 "Fig. 1. Coaching is an interactive learning methodology in which a human trainer can assist the agent in an intuitive manner by giving subjective feedback in real time in addition to the learning mechanism implemented in the agent"; p. 40 "Fig. 2. Learning model of the proposed method. Human trainer can intervene the learning process of the agent by updating the reward function based on Coaching as a learning assistance."; p. 48 "Give a 'positive' or 'negative' evaluation if you think that the agent did or didn't do an action that brings it closer to achieving this task. [not achieving the goal configuration]"; p. 50 "... we proposed a novel methodology for behavior learning, called Coaching. This method allows a human trainer to intervene in the learning process of an agent by giving subjective evaluations such as positive and negative.")

converting state features into a distance function to determine how far the agent is from the goal configuration; (Hirkoawa, p. 45 "… pendulum angle θ [where the agent is / state features] … the target posture range of π ± ε[rad], where ε is a small value"; π [the goal configuration]; 
 is a distance function to determine how far the agent is from the goal configuration. )

using the distance function as an in intermediate reward for the agent; and (Hirkoawa, p. 45 "Three typical examples of reward functions are considered from the view point of different opportunities for obtaining the reward while learning. 1. Reward is given all over the state space (r1)"; r1 is the intermediate reward, which is based on the distance cos(θ - π).)

using goal detection as a final reward. (Hirkoawa, p. 45 "3. Reward is given solely at a part of the goal state (r3)"; r3 is the final reward.)
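For illustration only, the reading applied above to the two reward terms can be sketched as follows. This is a minimal sketch under stated assumptions, not the reference's implementation: the function names, the tolerance value `EPSILON`, and the 0/1 scale of the final reward are hypothetical; only the pendulum angle θ, the target posture range π ± ε, and the term cos(θ - π) are taken from the passages cited above.

```python
import math

EPSILON = 0.1  # assumed tolerance in radians; the reference only says "a small value"


def angular_error(theta: float) -> float:
    """Smallest absolute angle between theta and the target posture pi, in [0, pi]."""
    diff = theta - math.pi
    return abs(math.atan2(math.sin(diff), math.cos(diff)))


def intermediate_reward(theta: float) -> float:
    """Dense reward available over the whole state space (the r1 reading).

    cos(theta - pi) equals 1 at the target posture and -1 at the opposite posture.
    """
    return math.cos(theta - math.pi)


def final_reward(theta: float) -> float:
    """Sparse reward given only when the goal is detected (the r3 reading)."""
    return 1.0 if angular_error(theta) <= EPSILON else 0.0
```

Under these assumptions the intermediate term grows as the angle approaches π, while the final term is nonzero only inside the target range.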
In regard to claim 13, Hirkoawa teaches: The method of claim 11, further comprising self-practice by the agent. (Hirkoawa, p. 38 "the agent explores the state space [self-practice by the agent] randomly until it discovers the first reward, and then it starts learning based on that reward random number aiming at exploring the state space")
In regard to claim 14, Hirkoawa teaches: The method of claim 13, wherein a selected goal configuration for self-practice is selected based on at least one of a random determination, which goal configuration needs the most improvement, which goal configuration is most likely to improve, which goal configuration has been used least recently, and which goal configuration is most used. (Hirkoawa, p. 38 "the agent explores the state space randomly until it discovers the first reward")

In regard to claim 16, Hirkoawa teaches: The method of claim 11, further comprising updating a policy for achieving a goal configuration based on performance of the agent. (Hirkoawa, p. 42 "The state value V(xt) and the action output u(xt) can be expressed as: The agent updates its own state value and action output [updating a policy (states → actions)] by repeating the following steps."; Policies map states (or observations) to actions. States and actions [policies] are updated based on the transition from V(xt-1) to V(xt), i.e. V(xt) - V(xt-1) [performance of the agent] indicates if the transition is helpful or not helpful for achieving the goal.)
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1, 6-9, 17 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Hirkoawa ("Coaching Robots: Online Behavior Learning from Human Subjective Feedback") in view of Gruebler ("Coaching robot behavior using continuous physiological affective feedback").
In regard to claims 1 and 17, Hirkoawa teaches: A method for training an artificial intelligent agent, comprising: (Hirkoawa, p. 37 "Abstract. This chapter describes a novel methodology for behavior learning of an agent, called Coaching [training an artificial intelligent agent]. The proposed method is an interactive and iterative learning method which allows a human trainer to give a subjective evaluation to the robotic agent in real time, and the agent can update the reward function dynamically based on this evaluation simultaneously.")
defining a goal configuration for the agent; (Hirkoawa, p. 39 "We propose a new learning framework which allows human trainers to give subjective and simple feedback to the agent asynchronously and in real time, to lead the agent towards learning the desired behavior [e.g. a goal configuration]."; p. 43 "The first term, rinit, refers to the initial reward function, which defines the goal state of the task [e.g. a goal configuration] given at the beginning of the learning algorithm...")
providing positive examples via an interface to the agent when the agent has achieved the goal configuration; (Hirkoawa, p. 37 "We demonstrated that the agent is capable of learning the desired behavior by receiving simple and subjective instructions such as positive [positive examples] and negative."; p. 39 "Moreover, we use abstract and primitive binary values as the evaluation quantity of Coaching, namely positive or negative."; p. 39 "Fig. 1. Coaching is an interactive learning methodology in which a human trainer can assist the agent in an intuitive manner by giving subjective feedback in real time in addition to the learning mechanism implemented in the agent"; p. 40 "Fig. 2. Learning model of the proposed method. Human trainer can intervene the learning process of the agent by updating the reward function based on Coaching as a learning assistance."; p. 48 "Give a 'positive' or 'negative' evaluation if you think that the agent did or didn't do an action that brings it closer to achieving this task. [achieving the goal configuration]"; p. 50 "... we proposed a novel methodology for behavior learning, called Coaching. This method allows a human trainer to intervene in the learning process of an agent by giving subjective evaluations such as positive and negative.")
providing negative examples via the interface to the agent when the agent fails to achieve the goal configuration; and (Hirkoawa, p. 37 "We demonstrated that the agent is capable of learning the desired behavior by receiving simple and subjective instructions such as positive and negative. [negative examples]"; p. 39 "Moreover, we use abstract and primitive binary values as the evaluation quantity of Coaching, namely positive or negative."; p. 39 "Fig. 1. Coaching is an interactive learning methodology in which a human trainer can assist the agent in an intuitive manner by giving subjective feedback in real time in addition to the learning mechanism implemented in the agent"; p. 40 "Fig. 2. Learning model of the proposed method. Human trainer can intervene the learning process of the agent by updating the reward function based on Coaching as a learning assistance."; p. 48 "Give a 'positive' or 'negative' evaluation if you think that the agent did or didn't do an action that brings it closer to achieving this task. [not achieving the goal configuration]"; p. 50 "... we proposed a novel methodology for behavior learning, called Coaching. This method allows a human trainer to intervene in the learning process of an agent by giving subjective evaluations such as positive and negative.")
Hirkoawa fails to teach, but Gruebler teaches: extracting key state features to determine what features are important during receipt of positive examples to the agent.  (Gruebler, p. 469 "An example of robot behavior and human feedback can be seen in Figure 5. The robot detects the object, grasps it and hands it to the human. Because the robot is performing the appropriate action (give red ball to the human), the trainer is giving positive feedback by smiling.... The effect of feedback on robot behavior can be seen in Figure 7. The robot collects [extracting] experiences that result in changes to the value of certainty [important] after an object of a certain color has been grasped [e.g. key state features] based on the affective feedback given by the trainer [e.g. agent receiving positive feedback] in the form of facial expressions. The robot was able to perform the correct action for all objects [e.g. state features] based on the affective feedback. The final red and green objects for which no emotional feedback was given were treated using the correct action by the robot.")
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Hirkoawa to incorporate the teachings of Gruebler by including social robotics performing actions when confronted with an object during coaching. Doing so would make the robot quickly learn the appropriate actions for different situations from the trainer. (Gruebler, p. 466 "We show how a robot can be coached to perform a certain action when confronted with an object by using the continuous physiological affective feedback from the human trainer. We also show that the robot is able to quickly learn the appropriate actions for different situations from the trainer in 
Claim 17 recites substantially the same limitations as claim 1; therefore, the rejection applied to claim 1 also applies to claim 17. In addition, Gruebler teaches: A non-transitory computer-readable storage medium with an executable program stored thereon, wherein the program instructs one or more processors to perform the following steps: (Gruebler, p. 468 "The humanoid robot (Nao, Aldebran Robotics)"; Gruebler teaches a humanoid robot, in which the memory, the program and the processors are inherent.)

In regard to claim 6, Hirkoawa and Gruebler teach: The method of claim 1, wherein the key state features are weighted according to a predetermined weight value. (Hirkoawa, p. 42 "Xt is the observed state variable at time t… The state value V(xt)… can be expressed as: …  Wk and Vk are weight variables")

In regard to claim 7, Hirkoawa and Gruebler teach: The method of claim 1, further comprising converting state features into a distance function to determine how far the agent is from the goal configuration. (Hirkoawa, p. 45 "… pendulum angle θ [where the agent is / state features] … the target posture range of π ± ε[rad], where ε is a small value"; π [the goal configuration];  is a distance function to determine how far the agent is from the goal configuration)
In regard to claim 8, Hirkoawa and Gruebler teach: The method of claim 1, further comprising using goal detection as a final reward. (Hirkoawa, p. 45 "3. Reward is given solely at a part of the goal state (r3)"; r3 is the final reward)


In regard to claim 9, Hirkoawa and Gruebler teach: The method of claim 1, further comprising using a goal distance as an intermediate reward. (Hirkoawa, p. 45 "Three typical examples of reward functions are considered from the view point of different opportunities for obtaining the reward while learning. 1. Reward is given all over the state space (r1)"; r1 is the intermediate reward, which is based on the distance cos(θ - π).)
In regard to claim 18, Hirkoawa and Gruebler teach: The non-transitory computer-readable storage medium of claim 17, wherein the program instructs one or more processors to perform the following steps: (Gruebler, p. 468 "The humanoid robot (Nao, Aldebran Robotics)"; Gruebler teaches a humanoid robot, in which the memory, the program and the processors are inherent.)

converting state features into a distance function to determine how far the agent is from the goal configuration; (Hirkoawa, p. 45 "… pendulum angle θ [where the agent is / state features] … the target posture range of π ± ε[rad], where ε is a small value"; π [the goal configuration]; 
is a distance function to determine how far the agent is from the goal configuration)
using the distance function an in intermediate reward for the agent; and (Hirkoawa, p. 45 "Three typical examples of reward functions are considered from the view point of different opportunities for obtaining the reward while learning. 1. Reward is given all over the state space (r1)"; r1 is the intermediate reward, which is based on the distance cos(θ - π).)

using goal detection as a final reward. (Hirkoawa, p. 45 "3. Reward is given solely at a part of the goal state (r3)"; r3 is the final reward)

It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Hirkoawa to incorporate the teachings of Gruebler by including social robotics performing actions when confronted with an object during coaching. Doing so would make the robot quickly learn the appropriate actions for different situations from the trainer.
Claims 2-4, 10 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Hirkoawa in view of Gruebler in further view of Grüneberg ("An Approach to Subjective Computing: A Robot That Learns From Interaction With Humans").
In regard to claim 2, Hirkoawa and Gruebler fail to teach, but Grüneberg teaches: The method of claim 1, wherein the interface includes at least one of a spoken word received by the agent, an electronic signal received from a computing device, and a physical button on the agent. (Grüneberg, p. 10 "Simulated Robotic Arm: In the simulated environment, feedback was provided by means of a graphical user interface with two buttons: one for positive and one for negative feedback.")
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Hirkoawa and Gruebler to incorporate the teachings of Grüneberg by including subjective intelligence in a relational approach. Doing so would make it fully tractable and therefore implementable in artificial agents. (Grüneberg, p. 5 "we show by means of a relational approach how subjective intelligence can be implemented in terms of the reciprocity of 
In regard to claim 3, Hirkoawa, Gruebler and Grüneberg teach: The method of claim 1, wherein the step of extracting key state features includes looking for similarity in each of the positive and negative examples. (Grüneberg, p. 8 "Consistency detection: The identification of this target behavior (the behavior evaluated by the trainer) is, secondly, complemented by a consistency (error) detection, i.e., checking to what extent a given evaluation corresponds to current and previous evaluations [each of the positive and negative examples] of the same or similar behavior [similarity] . If the history of feedback is inconsistent (contradictory) with the current feedback, the latter is regarded as irrelevant and the behavior will not be modified. Thus, consistency detection allows the agent to evaluate the coherency of feedback")
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Hirkoawa and Gruebler to incorporate the teachings of Grüneberg by including subjective intelligence in a relational approach. Doing so would make it fully tractable and therefore implementable in artificial agents.
In regard to claim 4, Hirkoawa, Gruebler and Grüneberg teach: The method of claim 3, further comprising increasing a confidence of the agent as the positive and negative examples are received by the agent. (Grüneberg, p. 11 "Consistency detection is implemented in terms of certainty Ci [e.g. a confidence of the agent] of a specific object action pair: positive feedback increases Ci so that the actual object action pair is confirmed, whereas negative feedback decreases Ci.")
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Hirkoawa and Gruebler to incorporate the teachings of Grüneberg by including subjective intelligence in a relational approach. Doing so would make it fully tractable and therefore implementable in artificial agents.
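For illustration only, the certainty mechanism quoted from Grüneberg (positive feedback increases Ci for an object action pair, negative feedback decreases it) can be sketched as follows. This is a minimal sketch, not the reference's implementation: the class name, the step size `STEP`, the initial value, and the clamping of certainty to [0, 1] are all assumptions, since the cited passage gives no numeric values.

```python
from collections import defaultdict

STEP = 0.1  # assumed increment; the cited passage gives no numeric value


class CertaintyTable:
    """Tracks a certainty value per (object, action) pair, clamped to [0, 1]."""

    def __init__(self, initial: float = 0.5):
        # Unseen pairs start at the assumed initial certainty.
        self._c = defaultdict(lambda: initial)

    def update(self, obj: str, action: str, positive: bool) -> float:
        """Raise certainty on positive feedback, lower it on negative feedback."""
        key = (obj, action)
        delta = STEP if positive else -STEP
        self._c[key] = min(1.0, max(0.0, self._c[key] + delta))
        return self._c[key]

    def certainty(self, obj: str, action: str) -> float:
        return self._c[(obj, action)]
```

With these assumed values, repeated positive feedback on the same pair drives its certainty toward 1, and repeated negative feedback drives it toward 0.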

In regard to claim 10, Hirkoawa, Gruebler and Grüneberg teach: The method of claim 1, further comprising asking, by the agent, for human feedback regarding whether the agent is in a goal state. (Grüneberg, p.11 See Fig. 3. Interaction with positive and negative feedback; "The interaction between agent and trainer, i.e., the allocation of a binary feedback, is facilitated by means of detecting the facial expressions of the trainer. For that purpose, facial expression was measured by a wearable device that recognizes smiling and frowning [41], which can be related to either confirmation (smiling) or correction (frowning), respectively. The corresponding signals function as a binary feedback that enabled the robot to modify its behavior continuously while conducting the task (see Fig. 3). No further specification of the signal is necessary. In this way, the robotic agent is open to human affective feedback (nonverbal social cues) in direct interaction."; p. 12 "Interaction: A subjective agent is supposed to interact with other robotic or human agents. Hence, its behavior can be perceived as responsive."; the robot asks a human if giving a red / green ball is a good goal. ) 

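It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Hirkoawa and Gruebler to incorporate the teachings of Grüneberg by including subjective intelligence in a relational approach. Doing so would make it fully tractable and therefore implementable in artificial agents.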
In regard to claim 19, Hirkoawa, Gruebler and Grüneberg teach: The non-transitory computer-readable storage medium of claim 17, wherein the step of extracting key state features includes looking for similarity in each of the positive and negative examples. (Grüneberg, p. 8 "Consistency detection: The identification of this target behavior (the behavior evaluated by the trainer) is, secondly, complemented by a consistency (error) detection, i.e., checking to what extent a given evaluation corresponds to current and previous evaluations [each of the positive and negative examples] of the same or similar behavior [similarity] . If the history of feedback is inconsistent (contradictory) with the current feedback, the latter is regarded as irrelevant and the behavior will not be modified. Thus, consistency detection allows the agent to evaluate the coherency of feedback")
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Hirkoawa and Gruebler to incorporate the teachings of Grüneberg by including subjective intelligence in a relational approach. Doing so would make it fully tractable and therefore implementable in artificial agents.
Claims 5 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Hirkoawa in view of Gruebler in view of Grüneberg in further view of Yang (US 10289910 B1).
In regard to claim 5, Hirkoawa, Gruebler and Grüneberg fail to teach, but Yang teaches: The method of claim 4, wherein the agent takes an action upon reaching a predetermined level of confidence. (Yang, "During operation in the field and while in recognition mode, if the autonomous robot recognizes an explosive device that exceeds a predetermined confidence threshold [a predetermined level of confidence] (e.g., probability greater than 0.50), then the autonomous robot is actuated to grasp the object (e.g., with a robotic arm) [the agent takes an action] and then navigate to an explosive containment vessel, where the autonomous robot then releases and disposes of the object.")
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Hirkoawa, Gruebler and Grüneberg to incorporate the teachings of Yang by including an autonomous mobile platform. Doing so would make the robot be automatically actuated when the system recognizes a particular object. (Yang, "the mobile platform can be automatically actuated or otherwise caused to perform a physical action or operation when the system recognizes a particular object.")
In regard to claim 20, Hirkoawa, Gruebler, Grüneberg and Yang teach: The non-transitory computer-readable storage medium of claim 17, wherein the program instructs one or more processors to perform the following steps: increasing a confidence of the agent as additional ones of the positive and negative examples are received by the agent; and (Grüneberg, p. 11 "Consistency detection is implemented in terms of certainty Ci [e.g. a confidence of the agent] of a specific object action pair: positive feedback increases Ci so that the actual object action pair is confirmed, whereas negative feedback decreases Ci.")
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Hirkoawa and Gruebler to incorporate the teachings of Grüneberg by including subjective intelligence in a relational approach. Doing so would make it fully tractable and therefore implementable in artificial agents. 
taking an action by the agent upon reaching a predetermined level of confidence.  (Yang, "During operation in the field and while in recognition mode, if the autonomous robot recognizes an explosive device that exceeds a predetermined confidence threshold [a predetermined level of confidence] (e.g., probability greater than 0.50), then the autonomous robot is actuated to grasp the object (e.g., with a robotic arm) [the agent takes an action] and then navigate to an explosive containment vessel, where the autonomous robot then releases and disposes of the object.")
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Hirkoawa, Gruebler and Grüneberg to incorporate the teachings of Yang by including an autonomous mobile platform. Doing so would make the robot be automatically actuated when the system recognizes a particular object.
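For illustration only, the threshold behavior relied upon from Yang (acting once recognition confidence exceeds a predetermined threshold) can be sketched as follows. This is a minimal sketch under stated assumptions: the function name and the action labels are hypothetical; only the example threshold of 0.50 is taken from the quoted passage.

```python
CONFIDENCE_THRESHOLD = 0.50  # example value quoted from Yang ("probability greater than 0.50")


def choose_action(confidence: float, threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Return a hypothetical action label; act only when confidence exceeds the threshold."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be a probability in [0, 1]")
    return "grasp_object" if confidence > threshold else "keep_observing"
```

The comparison is strict, matching the quoted "greater than 0.50", so a confidence of exactly 0.50 does not trigger the action in this sketch.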
Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Hirkoawa in view of Takahiro (US 20150290798 A1).
In regard to claim 12, Hirkoawa fails to teach, but Takahiro teaches: The method of claim 11, further comprising recognizing whether the agent is in a good initialization state. (Takahiro, [0270] "When the processing of moving the robot 50 starts, the axis position state determination unit 28 determines whether or not the position of one or more predetermined axes is the position of the state satisfying the predetermined positional relationship condition (step SA1). Then, the axis position state determination unit 28 is used to determine whether to perform the control of the position and/or the posture of the tip 58 of the robot 50 on the orthogonal coordinate system at each control cycle or to perform the control of the position of desired each axis at each control cycle.")
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Hirkoawa to incorporate the teachings of Takahiro by including the axis position state determination unit. Doing so would allow the control mode to be set in the first or second mode. (Takahiro, [0271] "When the axis position state determination unit 28 determines that the position of the one or more predetermined axes is not the position of the state satisfying the predetermined positional relationship condition, the control mode is set to the first control mode (step SA2)... the control mode is set to the second control mode (step SA3).")
Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Hirkoawa in view of Kober ("Reinforcement learning in robotics: A survey").
In regard to claim 15, Hirkoawa fails to teach, but Kober teaches: The method of claim 13, further comprising biasing action choices based on which actions are important to achieving the goal configuration. (Kober, p. 1242, "This setting has the problem that it cannot distinguish between policies that initially gain a transient of large rewards and those that do not. This transient phase, also called prefix, is dominated by the rewards obtained in the long run. If a policy accomplishes both an optimal prefix as well as an optimal long-term behavior [important actions], it is called bias optimal [e.g. biasing action choices] (Lewis and Puterman, 2001). An example in robotics would be the transient phase during the start of a rhythmic movement, where many policies will accomplish the same long-term reward but differ substantially in the transient (e.g. there are many ways of starting the same gait in dynamic legged locomotion) allowing for room for improvement in practical application."; p. 1258 "Using demonstrations to initialize reinforcement learning provides multiple benefits. Perhaps the most obvious benefits that it provides supervised training data of what actions to perform in states that are encountered. Such data may be helpful when used to bias policy action selection [e.g. biasing action choices].")
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Hirkoawa to incorporate the teachings of Kober by including bias policy action selection. Doing so would make the agent know what actions to perform in states that are encountered. (Kober, "Perhaps the most obvious benefits that it provides supervised training data of what actions to perform in states that are encountered")
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SU-TING CHUANG whose telephone number is (408)918-7519.  The examiner can normally be reached on Monday - Thursday 8-5 PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki, can be reached on (571)272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/S.C./Examiner, Art Unit 2122
/LUIS A SITIRICHE/Primary Examiner, Art Unit 2126