DETAILED ACTION
This action is in response to the communications filed on 07/01/2021, in which claims 1, 3, 11, 12, 17, 18 and 19 are amended; claims 1-20 are therefore pending.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 05/20/2021 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Claim Objections
Claim 12 is objected to because of the following informalities:
In claim 12, “whether the agent is in a initialization state” should be “whether the agent is in an initialization state”.
Appropriate correction is required.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
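A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.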

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1, 2, 3, 17 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Grollman ("Robot Learning from Failed Demonstrations") in view of Nicolescu ("Natural Methods for Robot Task Learning: Instructive Demonstrations, Generalization and Practice").
In regard to claims 1 and 17, Grollman teaches: A method for training an artificial intelligent agent, comprising: (Grollman: p. 331, 1 Motivation "The standard Robot Learning from Demonstration (RLfD) scenario has an end-user who wants to adapt a robot [training an artificial intelligent agent] to perform a new task, perform an old task in a new way, or operate in a new environment."; p. 332 "... such demonstrations are easily obtainable from nearly any human teacher with minimal overhead and can immediately be used for training.")
defining a goal configuration for the agent;  (Grollman: p. 331, 1 Motivation "The standard Robot Learning from Demonstration (RLfD) scenario has an end-user who wants to adapt a robot to perform a new task, perform an old task in a new way, or operate in a new environment [e.g. a goal configuration].")
providing positive examples... to the agent to demonstrate the agent achieving the goal configuration; (Grollman: p. 332 "Almost all current RLfD approaches start with a successful [positive examples] (perhaps suboptimal) human demonstration of the task [a goal configuration]. For relatively simple tasks, such as pick-and-place [17], point to point motion [12], or washing a surface [8]..."; p. 333 "When the training data is from successful demonstrations, this approach can reproduce the desired task.", 2.2 Learning from Success "When initialized with successful demonstrations, the learned GMM can deterministically generate fθ (ξ ) by taking the expectation of the conditional distribution in Eq. (2)...")
providing negative examples... to the agent to demonstrate the agent failing to achieve the goal configuration; and (Grollman: p. 332 "In this article we develop and examine RLfD approaches in an attempt to replicate that ability in a robot, based on the idea that failed demonstrations [negative examples] have educational worth in three respects: Firstly, they are examples of what not to do, so replication should be avoided... From this point of view we attempt to perform Learning from Failure (LfF)."; p. 333, 3 Learning from Failure "Consider the failed demonstrations in Fig. 2. Rather than only producing the mean trajectory (solid line), we wish to also generate exploratory trajectories (dotted lines).")

Grollman does not teach, but Nicolescu teaches: … via the interface… (Nicolescu: p. 244 "To achieve this, we add verbal instruction to the existing demonstration capabilities of our system. With this, the teacher can provide the following types of information:"; p. 245 "For the voice commands and feedback we used an off-the-shelf Logitech cordless headset [interface], and the IBM ViaVoice software recognition engine.")
extracting key state features to determine what feature categories are important during receipt of positive examples to the agent. (Nicolescu: p. 242 "During demonstrations [during receipt of positive examples / successful demonstrations] humans almost always make use of additional simple cues and instructions that facilitate the learning process and bias the learner’s attention to the important aspects [important] of the demonstration (e.g. 'watch this', 'lift that', etc.) [positive examples to the objects / feature categories / state features]. Although simple, these cues have a large impact on the robot’s learning performance: by relating them with the state of the environment [extracting / obtaining key state features] at the moment when they are received, the learner is provided with information that may otherwise be impossible or extremely hard to obtain only from the observed data. For example, while teaching a robot to go and pick up the mail, the robot can detect numerous other aspects along its path (e.g., passing by a chair, meeting another robot, etc.) [unimportant objects]. These observations are irrelevant for getting the mail, and simple cues from the teacher could easily indicate that. [mail is the important object for the task]") (Based on spec. [0056] [0059] [0060], feature categories are one example of state features. Another example: p. 244 - 247, see Figures 7, 8 and 9, "The robot has a behavior set that allows it to track cylindrical colored targets, to pick up, and drop small colored objects: PickUp(ColorOfObject) - the robot picks up an object of the color ColorOfObject. The goal state is achieved when the robot senses and has the object in the closed gripper... both paths contain actual behaviors. For example, Figure 8(c) encodes the fact that both going to the Green or to the Light Green targets [e.g. feature categories, object positioning] is acceptable for the task."; MT(Green) or PICKUP3 (Orange) are important objects / feature categories for the object transport task.)
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Grollman to incorporate the teachings of Nicolescu by including instructive demonstrations. Doing so would enable a Pioneer 2DX mobile robot to learn tasks from multiple demonstrations and teacher feedback. (Nicolescu: p. 241 Abstract "Thus, instructive demonstrations, generalization over multiple demonstrations and practice trials are essential features for a successful human-robot teaching approach. We implemented a system that enables all these capabilities and validated these concepts with a Pioneer 2DX mobile robot learning tasks from multiple demonstrations and teacher feedback.")

Claim 17 recites substantially the same limitations as claim 1; therefore, the rejection applied to claim 1 also applies to claim 17. In addition, Nicolescu teaches: A non-transitory computer-readable storage medium with an executable program stored thereon, wherein the program instructs one or more processors to perform the following steps: (Nicolescu: p. 245 "We implemented and tested our concepts on a Pioneer 2-DX mobile robot, equipped with two rings of sonars (8 front and 8 rear), a SICK laser range-finder, a pan-tilt-zoom color camera, a gripper, and on-board computation on a PC104 stack. We performed the experiments in a 5.4m x 6.6m arena. The robot was programmed using AYLLU [19], an extension of C for development of distributed control systems for mobile robots."; Nicolescu teaches a mobile robot, in which the memory and processors are inherent.)

In regard to claim 2, Grollman and Nicolescu teach: The method of claim 1, wherein the interface includes at least one of a spoken word received by the agent, an electronic signal received from a computing device, and a physical button on the agent. (Nicolescu: p. 244 "To achieve this, we add verbal instruction to the existing demonstration capabilities of our system. With this, the teacher can provide the following types of information:"; p. 245 "For the voice commands and feedback we used an off-the-shelf Logitech cordless headset [interface], and the IBM ViaVoice software recognition engine.")
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Grollman to incorporate the teachings of Nicolescu by including instructive demonstrations. Doing so would enable a Pioneer 2DX mobile robot to learn tasks from multiple demonstrations and teacher feedback.
In regard to claim 3, Grollman and Nicolescu teach: The method of claim 1, wherein the step of extracting key state features includes looking for similarity in state features in each of the positive and negative examples. (Nicolescu: p. 245 "Each new demonstration [each of the positive and negative examples] is compared with the existing task structure, while computing their similarity [similarity in state features] in the form of their longest common sequence. Common nodes are then merged, while the others appear as alternate execution paths."; p. 243 see Figure 3, states A, B and F are common states across multiple demonstrations and are therefore merged; p. 246 see Figure 8 (c) and (e), PICKUP3 (Orange) is an example of common / similar state features.)
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Grollman to incorporate the teachings of Nicolescu by including instructive demonstrations. Doing so would enable a Pioneer 2DX mobile robot to learn tasks from multiple demonstrations and teacher feedback.
In regard to claim 19, Grollman and Nicolescu teach: The non-transitory computer-readable storage medium of claim 17, wherein the step of extracting key state features includes looking for similarity of the key state features in each of the positive and negative examples. (Nicolescu: p. 245 "Each new demonstration [each of the positive and negative examples] is compared with the existing task structure, while computing their similarity [similarity in key state features] in the form of their longest common sequence. Common nodes are then merged, while the others appear as alternate execution paths."; p. 243 see Figure 3, states A, B and F are common states across multiple demonstrations and are therefore merged; p. 246 see Figure 8 (c) and (e), PICKUP3 (Orange) is an example of common / similar state features.)
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Grollman to incorporate the teachings of Nicolescu by including instructive demonstrations. Doing so would enable a Pioneer 2DX mobile robot to learn tasks from multiple demonstrations and teacher feedback.

Claims 4 and 10 are rejected under 35 U.S.C. 103 as being unpatentable over Grollman in view of Nicolescu in further view of Grüneberg ("An Approach to Subjective Computing: A Robot That Learns From Interaction With Humans").
In regard to claim 4, Grollman and Nicolescu do not teach, but Grüneberg teaches: The method of claim 3, further comprising increasing a confidence of the agent as the positive and negative examples are received by the agent. (Grüneberg, p. 11 "Consistency detection is implemented in terms of certainty Ci [e.g. a confidence of the agent] of a specific object action pair: positive feedback increases Ci so that the actual object action pair is confirmed, whereas negative feedback decreases Ci.")
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Grollman and Nicolescu to incorporate the teachings of Grüneberg by including subjective intelligence in a relational approach. Doing so would make subjective intelligence fully tractable and therefore implementable in artificial agents. (Grüneberg, p. 5 "we show by means of a relational approach how subjective intelligence can be implemented in terms of the reciprocity of autonomous self-referentiality and direct world-coupling... In sum, we conclude that subjective intelligence in relational terms is fully tractable and therefore implementable in artificial agents.")

[Image: media_image1.png]
In regard to claim 10, Grollman, Nicolescu and Grüneberg teach: The method of claim 1, further comprising asking, by the agent, for human feedback regarding whether the agent is in a goal state. (Grüneberg, p.11 See Fig. 3. Interaction with positive and negative feedback; "The interaction between agent and trainer, i.e., the allocation of a binary feedback, is facilitated by means of detecting the facial expressions of the trainer. For that purpose, facial expression was measured by a wearable device that recognizes smiling and frowning [41], which can be related to either confirmation (smiling) or correction (frowning), respectively. The corresponding signals function as a binary feedback that enabled the robot to modify its behavior continuously while conducting the task (see Fig. 3). No further specification of the signal is necessary. In this way, the robotic agent is open to human affective feedback (nonverbal social cues) in direct interaction."; p. 12 "Interaction: A subjective agent is supposed to interact with other robotic or human agents. Hence, its behavior can be perceived as responsive."; the robot asks a human if giving a red / green ball is a good goal. ) 

It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Grollman and Nicolescu to incorporate the teachings of .

Claims 5 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Grollman in view of Nicolescu in view of Grüneberg in further view of Yang (US 10289910 B1).
In regard to claim 5, Grollman, Nicolescu and Grüneberg do not teach, but Yang teaches: The method of claim 4, wherein the agent takes an action upon reaching a predetermined level of confidence. (Yang, "During operation in the field and while in recognition mode, if the autonomous robot recognizes an explosive device that exceeds a predetermined confidence threshold [a predetermined level of confidence] (e.g., probability greater than 0.50), then the autonomous robot is actuated to grasp the object (e.g., with a robotic arm) [the agent takes an action] and then navigate to an explosive containment vessel, where the autonomous robot then releases and disposes of the object.")
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Grollman, Nicolescu and Grüneberg to incorporate the teachings of Yang by including an autonomous mobile platform. Doing so would allow the robot to be automatically actuated when the system recognizes a particular object. (Yang, "the mobile platform can be automatically actuated or otherwise caused to perform a physical action or operation when the system recognizes a particular object.")

In regard to claim 20, Grollman, Nicolescu, Grüneberg and Yang teach: The non-transitory computer-readable storage medium of claim 17, wherein the program instructs one or more processors to perform the following steps: increasing a confidence of the agent as additional ones of the positive and negative examples are received by the agent; and (Grüneberg, p. 11 "Consistency detection is implemented in terms of certainty Ci [e.g. a confidence of the agent] of a specific object action pair: positive feedback increases Ci so that the actual object action pair is confirmed, whereas negative feedback decreases Ci.")
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Grollman and Nicolescu to incorporate the teachings of Grüneberg by including subjective intelligence in a relational approach. Doing so would make subjective intelligence fully tractable and therefore implementable in artificial agents.
taking an action by the agent upon reaching a predetermined level of confidence.  (Yang, "During operation in the field and while in recognition mode, if the autonomous robot recognizes an explosive device that exceeds a predetermined confidence threshold [a predetermined level of confidence] (e.g., probability greater than 0.50), then the autonomous robot is actuated to grasp the object (e.g., with a robotic arm) [the agent takes an action] and then navigate to an explosive containment vessel, where the autonomous robot then releases and disposes of the object.")
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Grollman, Nicolescu and Grüneberg to incorporate the teachings of Yang by including an autonomous mobile platform. Doing so would allow the robot to be automatically actuated when the system recognizes a particular object.

Claims 6-9 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Grollman in view of Nicolescu in further view of Hirkoawa ("Coaching Robots: Online Behavior Learning from Human Subjective Feedback").

[Image: media_image2.png]
In regard to claim 6, Grollman and Nicolescu do not teach, but Hirkoawa teaches: The method of claim 1, wherein the key state features are weighted according to a predetermined weight value. (Hirkoawa, p. 42 "Xt is the observed state variable at time t… The state value V(xt)… can be expressed as: Wk and Vk are weight variables")
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Grollman and Nicolescu to incorporate the teachings of Hirkoawa by including the Coaching learning method and its reward function. Doing so would enable the agent to learn the desired behavior by receiving simple and subjective instructions. (Hirkoawa, p.37 "This chapter describes a novel methodology for behavior learning of an agent, called Coaching. The proposed method is an interactive and iterative learning method which allows a human trainer to give a subjective evaluation to the robotic agent in real time, and the agent can update the reward function dynamically based on this evaluation simultaneously. We demonstrated that the agent is capable of learning the desired behavior by receiving simple and subjective instructions such as positive and negative.")


In regard to claim 7, Grollman, Nicolescu and Hirkoawa teach: The method of claim 1, further comprising converting state features into a distance function to determine how far the agent is from the goal configuration. (Hirkoawa, p. 45 "… pendulum angle θ [where the agent is / state features] … the target posture range of π ± ε[rad], where ε is a small value"; π [the goal configuration]; [equation image: media_image3.png] is a distance function to determine how far the agent is from the goal configuration)
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Grollman and Nicolescu to incorporate the teachings of Hirkoawa by including the Coaching learning method and its reward function. Doing so would enable the agent to learn the desired behavior by receiving simple and subjective instructions.
In regard to claim 8, Grollman, Nicolescu and Hirkoawa teach: The method of claim 1, further comprising using goal detection as a final reward. (Hirkoawa, p. 45 "3. Reward is given solely at a part of the goal state (r3)" [equation image: media_image4.png]; r3 is the final reward)
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Grollman and Nicolescu to incorporate the teachings of Hirkoawa by including the Coaching learning method and its reward function. Doing so would enable the agent to learn the desired behavior by receiving simple and subjective instructions.


In regard to claim 9, Grollman, Nicolescu and Hirkoawa teach: The method of claim 1, further comprising using a goal distance as an intermediate reward. (Hirkoawa, p. 45 "Three typical examples of reward functions are considered from the view point of different opportunities for obtaining the reward while learning. 1. Reward is given all over the state space (r1)" [equation image: media_image5.png]; r1 is the intermediate reward, which is based on the distance cos(θ - π).)
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Grollman and Nicolescu to incorporate the teachings of Hirkoawa by including the Coaching learning method and its reward function. Doing so would enable the agent to learn the desired behavior by receiving simple and subjective instructions.
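For illustration only, the distinction drawn above between an intermediate reward available over the whole state space and a final reward given only at the goal state can be sketched as follows. The sketch is not reproduced from Hirkoawa: the function names, the unit reward magnitude, and the default tolerance `eps` are assumptions, and only the cos(θ - π) form and the π ± ε target range come from the passages cited above.

```python
import math


def intermediate_reward(theta: float) -> float:
    """Dense reward defined over the whole state space.

    Assumed form: cos(theta - pi), largest when the pendulum angle
    theta equals the goal configuration pi.
    """
    return math.cos(theta - math.pi)


def final_reward(theta: float, eps: float = 0.05) -> float:
    """Sparse reward given only inside the target range pi +/- eps.

    The reward magnitude 1.0 and the default eps are hypothetical.
    """
    # Wrap the angular error into [-pi, pi) so that theta and
    # theta + 2*pi are treated as the same posture.
    error = (theta % (2 * math.pi)) - math.pi
    return 1.0 if abs(error) <= eps else 0.0
```

Under these assumptions the intermediate reward grows as the angle approaches π, whereas the final reward is nonzero only once the angle lies within ε of π.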

In regard to claim 18, Grollman, Nicolescu and Hirkoawa teach: The non-transitory computer-readable storage medium of claim 17, wherein the program instructs one or more processors to perform the following steps: (Nicolescu: p. 245 "We implemented and tested our concepts on a Pioneer 2-DX mobile robot, equipped with two rings of sonars (8 front and 8 rear), a SICK laser range-finder, a pan-tilt-zoom color camera, a gripper, and on-board computation on a PC104 stack. We performed the experiments in a 5.4m x 6.6m arena. The robot was programmed using AYLLU [19], an extension of C for development of distributed control systems for mobile robots."; Nicolescu teaches mobile robot, where the memory and processors are inherent.)
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Grollman to incorporate the teachings of Nicolescu by including instructive demonstrations. Doing so would enable a Pioneer 2DX mobile robot to learn tasks from multiple demonstrations and teacher feedback.

converting state features into a distance function to determine how far the agent is from the goal configuration; (Hirkoawa, p. 45 "… pendulum angle θ [where the agent is / state features] … the target posture range of π ± ε[rad], where ε is a small value"; π [the goal configuration]; [equation image: media_image3.png] is a distance function to determine how far the agent is from the goal configuration)
using the distance function as an intermediate reward for the agent; and (Hirkoawa, p. 45 "Three typical examples of reward functions are considered from the view point of different opportunities for obtaining the reward while learning. 1. Reward is given all over the state space (r1)" [equation image: media_image5.png]; r1 is the intermediate reward, which is based on the distance cos(θ - π).)

using goal detection as a final reward. (Hirkoawa, p. 45 "3. Reward is given solely at a part of the goal state (r3)" [equation image: media_image4.png]; r3 is the final reward)


Claims 11, 13, 14 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Grollman in view of Hirkoawa.
In regard to claim 11, Grollman teaches: A method of learning to achieve a goal configuration of an artificial agent, comprising: defining the goal configuration for the agent; (Grollman: p. 331, 1 Motivation "The standard Robot Learning from Demonstration (RLfD) scenario has an end-user who wants to adapt a robot to perform a new task, perform an old task in a new way, or operate in a new environment [e.g. a goal configuration].")
providing positive examples… to the agent to demonstrate the agent has achieving the goal configuration; (Grollman: p. 332 "Almost all current RLfD approaches start with a successful [positive examples] (perhaps suboptimal) human demonstration of the task [a goal configuration]. For relatively simple tasks, such as pick-and-place [17], point to point motion [12], or washing a surface [8]..."; p. 333 "When the training data is from successful demonstrations, this approach can reproduce the desired task.", 2.2 Learning from Success "When initialized with successful demonstrations, the learned GMM can deterministically generate fθ (ξ ) by taking the expectation of the conditional distribution in Eq. (2)...")
providing negative examples… to the agent to demonstrate the agent failing to achieve the goal configuration; (Grollman: p. 332 "In this article we develop and examine RLfD approaches in an attempt to replicate that ability in a robot, based on the idea that failed demonstrations [negative examples] have educational worth in three respects: Firstly, they are examples of what not to do, so replication should be avoided... From this point of view we attempt to perform Learning from Failure (LfF)."; p. 333, 3 Learning from Failure "Consider the failed demonstrations in Fig. 2. Rather than only producing the mean trajectory (solid line), we wish to also generate exploratory trajectories (dotted lines).")
Grollman does not teach, but Hirkoawa teaches:
converting state features into a distance function to determine how far the agent is from the goal configuration; (Hirkoawa, p. 45 "… pendulum angle θ [where the agent is / state features] … the target posture range of π ± ε[rad], where ε is a small value"; π [the goal configuration]; [equation image: media_image3.png] is a distance function to determine how far the agent is from the goal configuration.)
using the distance function as an intermediate reward for the agent; and (Hirkoawa, p. 45 "Three typical examples of reward functions are considered from the view point of different opportunities for obtaining the reward while learning. 1. Reward is given all over the state space (r1)" [equation image: media_image5.png]; r1 is the intermediate reward, which is based on the distance cos(θ - π).)
using goal detection as a final reward. (Hirkoawa, p. 45 "3. Reward is given solely at a part of the goal state (r3)" [equation image: media_image4.png]; r3 is the final reward)
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Grollman to incorporate the teachings of Hirkoawa by including 

In regard to claim 13, Grollman and Hirkoawa teach: The method of claim 11, further comprising self-practice by the agent. (Hirkoawa, p. 38 "the agent explores the state space [self-practice by the agent] randomly until it discovers the first reward, and then it starts learning based on that reward random number aiming at exploring the state space")
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Grollman to incorporate the teachings of Hirkoawa by including the Coaching learning method and its reward function. Doing so would enable the agent to learn the desired behavior by receiving simple and subjective instructions.

In regard to claim 14, Grollman and Hirkoawa teach: The method of claim 13, wherein a selected goal configuration for self-practice is selected based on at least one of a random determination, which goal configuration needs the most improvement, which goal configuration is most likely to improve, which goal configuration has been used least recently, and which goal configuration is most used. (Hirkoawa, p. 38 "the agent explores the state space randomly until it discovers the first reward")
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Grollman to incorporate the teachings of Hirkoawa by including the Coaching learning method and its reward function. Doing so would enable the agent to learn the desired behavior by receiving simple and subjective instructions.


In regard to claim 16, Grollman and Hirkoawa teach: The method of claim 11, further comprising updating a policy for achieving a goal configuration based on performance of the agent. (Hirkoawa, p. 42 "The state value V(xt) and the action output u(xt) can be expressed as: [equation images: media_image6.png, media_image7.png] The agent updates its own state value and action output [updating a policy (states → actions)] by repeating the following steps."; Policies map states (or observations) to actions. States and actions [policies] are updated based on the transition from V(xt-1) to V(xt), i.e. V(xt) - V(xt-1) [performance of the agent] indicates whether the transition is helpful or not helpful for achieving the goal.)
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Grollman to incorporate the teachings of Hirkoawa by including the Coaching learning method and its reward function. Doing so would enable the agent to learn the desired behavior by receiving simple and subjective instructions.

Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Grollman in view of Hirkoawa in further view of Takahiro (US 20150290798 A1).
In regard to claim 12, Grollman and Hirkoawa do not teach, but Takahiro teaches: The method of claim 11, further comprising recognizing whether the agent is in a initialization state. (Takahiro, [0270] "When the processing of moving the robot 50 starts, the axis position state determination unit 28 determines whether or not the position of one or more predetermined axes is the position of the state satisfying the predetermined positional relationship condition (step SA1). Then, the axis position state determination unit 28 is used to determine whether to perform the control of the position and/or the posture of the tip 58 of the robot 50 on the orthogonal coordinate system at each control cycle or to perform the control of the position of desired each axis at each control cycle.")
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Grollman and Hirkoawa to incorporate the teachings of Takahiro by including the axis position state determination unit. Doing so would allow the control mode to be set in the first or second mode. (Takahiro, [0271] "When the axis position state determination unit 28 determines that the position of the one or more predetermined axes is not the position of the state satisfying the predetermined positional relationship condition, the control mode is set to the first control mode (step SA2)... the control mode is set to the second control mode (step SA3).")

Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Grollman in view of Hirkoawa in further view of Kober ("Reinforcement learning in robotics: A survey").
In regard to claim 15, Grollman and Hirkoawa do not teach, but Kober teaches: The method of claim 13, further comprising biasing action choices based on which actions are important to achieving the goal configuration. (Kober, p. 1242, "This setting has the problem that it cannot distinguish between policies that initially gain a transient of large rewards and those that do not. This transient phase, also called prefix, is dominated by the rewards obtained in the long run. If a policy accomplishes both an optimal prefix as well as an optimal long-term behavior [important actions], it is called bias optimal [e.g. biasing action choices] (Lewis and Puterman, 2001). An example in robotics would be the transient phase during the start of a rhythmic movement, where many policies will accomplish the same long-term reward but differ substantially in the transient (e.g. there are many ways of starting the same gait in dynamic legged locomotion) allowing for room for improvement in practical application."; p. 1258, "what actions to perform in states that are encountered. Such data may be helpful when used to bias policy action selection [e.g. biasing action choices].")
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Grollman and Hirkoawa to incorporate the teachings of Kober by including bias policy action selection. Doing so would inform the agent of what actions to perform in the states that are encountered. (Kober, "Perhaps the most obvious benefits that it provides supervised training data of what actions to perform in states that are encountered")

Response to Arguments
Applicant's amendments with respect to objections of claims 11 and 18 have been fully considered and are sufficient to overcome the objections. The objections to claims 11 and 18 have been withdrawn.

Applicant's amendments with respect to the rejection of the claims under 35 U.S.C. 112(b) have been fully considered and are sufficient to overcome the rejection. The rejection of the claims under 35 U.S.C. 112(b) has been withdrawn.

Applicant's arguments with respect to the rejection of the claims under 35 U.S.C. 102 have been fully considered but they are moot:
(a) Applicant argues: (see p. 8 top, claim 11, and p. 11 top claim 1): “… Independent claim 11 is amended to clarify that the step of providing positive examples via an interface to the agent is to actual goal configuration.” 
(b) Examiner answers: The arguments do not apply to the references (Grollman) being used in the current rejection.
(a) Applicant argues: (see p. 11 middle claim 1): “… claims 1 and 17 are amended to describe the step of extracting key state features to determine what feature categories are important during receipt of positive examples to the agent. The feature categories include, as described in the specification, objects, pose, location, and the like.” 
(b) Examiner answers: The arguments do not apply to the references (Nicolescu) being used in the current rejection.

Applicant's arguments with respect to the rejection of the claims under 35 U.S.C. 102 have been fully considered but they are not persuasive:
(a) Applicant argues: (see p. 9 middle, claim 11, p. 12 middle, claim 1): “…There is no teaching that the agent converts the state function that it is provided into a distance function to determine how far the agent is from the goal configuration. Instead, in Hirkoawa, the agent is provided the goal configuration and uses a cosine function to determine a reward between 0 and 1. In the rejection, the Examiner equates cos (θ - π) as a distance. However, this function provides a reward value between 0 and 1, where, when the observed angle of the pendulum is closer to the target, the value is closer to 1. At best, this value would be inverse to the distance that the distal end of the pendulum is away from its target.” 
(b) Examiner answers: Hirkoawa teaches that the top is the target angle. Assume that the top angle is ±180°, that the bottom (0°) is the initial angle of the pendulum, and that r1 = (1 + cos (θ - π)) / 2. For example:

(2) If r1 = 0.5, then cos (θ - π) = 0 and θ – π = 90°, meaning that θ and π are orthogonal, i.e. cos (θ - π) indicates that the target and the pendulum are in orthogonal positions.
(3) If r1 = 1, then cos (θ - π) = 1 and θ – π = 0°, meaning that θ and π are exactly the same, i.e. cos (θ - π) indicates that the distance between the target and the pendulum is zero.
That is, the value of cos (θ - π) varies with the distance between the target and the pendulum. The larger the value of cos (θ - π) is, the smaller the distance between the target and the pendulum is, and vice versa.
Under the broadest reasonable interpretation (BRI), the claim requires converting state features into a function to determine how far the agent (the pendulum) is from the goal (the target). θ is the state feature used in cos (θ - π), and cos (θ - π) represents a distance function that determines how far the pendulum is from the target, as explained above. Therefore, Hirkoawa teaches the claimed invention.
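For illustration only, and not as part of the cited reference, the formula r1 = (1 + cos (θ - π)) / 2 discussed above can be evaluated numerically. The sketch below assumes that θ is expressed in radians, with 0 at the bottom and π at the top (the target), and that θ stays within [0, 2π]. The function names reward_r1 and angular_separation are hypothetical and do not appear in Hirkoawa.

```python
import math


def reward_r1(theta: float) -> float:
    """Evaluate r1 = (1 + cos(theta - pi)) / 2 for an angle theta in radians."""
    return (1.0 + math.cos(theta - math.pi)) / 2.0


def angular_separation(theta: float) -> float:
    """Absolute angle between the pendulum and the target at the top (pi).

    Assumption: theta lies in [0, 2*pi], so no wrap-around handling is needed.
    """
    return abs(theta - math.pi)


if __name__ == "__main__":
    # Bottom (0), horizontal (pi/2), and top (pi) positions of the pendulum.
    for theta in (0.0, math.pi / 2, math.pi):
        print(
            f"theta={theta:.4f} rad  "
            f"separation={angular_separation(theta):.4f} rad  "
            f"r1={reward_r1(theta):.4f}"
        )
```

Under these assumptions, θ = π gives r1 = 1 with zero separation from the target, θ = π/2 gives r1 = 0.5, and θ = 0 gives r1 = 0.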

Conclusion
The art made of record and not relied upon is considered pertinent to applicant's disclosure.  
Večerík ("Leveraging Demonstrations for Deep Reinforcement Learning on Robotics Problems with Sparse Rewards") teaches the second reward function including the distance from the goal on pages 4-5.
Katyal ("Leveraging Deep Reinforcement Learning for Reaching Robotic Tasks") teaches that the reward function is based on the Euclidean distance between the end effector position and the target object.

THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SU-TING CHUANG whose telephone number is (408)918-7519.  The examiner can normally be reached on Monday - Thursday 8-5 PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki can be reached on (571)272-3719.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-
/S.C./Examiner, Art Unit 2122
/LUIS A SITIRICHE/Primary Examiner, Art Unit 2126