DETAILED ACTION
Claims 1-20 are pending and have been examined.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The information disclosure statements (IDS) submitted on 07/30/2018, 08/20/2021, 11/03/2021 and 03/08/2022 are in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statements are being considered by the examiner.

Claim Objections
Claims 1 and 9 are objected to because of the following informalities: 
Claims 1 and 9, the limitation “learning, by a processor device…” should be “learning, by the processor device…”
Appropriate correction is required.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1, 5-9 and 13-17 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
In regard to claims 1, 9 and 17,
Step 1:  Is the claim to a process, machine, manufacture, or composition of matter? 
 Yes, claims 1 and 9 recite a method and a computer product with a non-transitory computer-readable storage medium having instructions, and therefore recite a process and article of manufacture respectively, which is a statutory category of invention; claim 17 recites a system comprising a memory and a processor device, and therefore is a machine, which is a statutory category of invention. 

Step 2A, prong One: Does the claim recite an abstract idea, law of nature or natural phenomenon? 
Yes, claims 1, 9 and 17 recite “learning… a sequence of constraints corresponding to the sequence of tasks by repeating, for each of the tasks in the sequence, reinforcement learning and supervised learning with a set of good samples and a set of bad samples and by applying an obtained constraint for a current task to a next task.” 
Under broadest reasonable interpretation the limitation of (1) learning a sequence of constraints corresponding to tasks (an evaluation or a judgement) and (2) by repeating reinforcement learning and supervised learning with good and bad samples and by applying a constraint (a decision, a judgement) is a mental process.

If a claim limitation, under its broadest reasonable interpretation, covers mathematics or performance in the human mind, then it falls within the mental processes grouping of abstract ideas. Accordingly, claims 1, 9 and 17 recite an abstract idea.

Step 2A, prong Two: Does the claim recite additional elements that integrate the judicial exception into a practical application? 
No, the judicial exception is not integrated into a practical application. In particular, claims 1, 9 and 17 recite “obtaining… a sequence of tasks based on hierarchical relations between the tasks, the tasks constituting the target task.” Further, claim 1 recites “a processor device”; claim 9 recites “a computer program product, a non-transitory computer readable storage medium, program instructions, a computer, a processor device”; and claim 17 recites “a computer processing system, a memory, program code, a processor,” which are all generally linked to the abstract idea.

Claims 1, 9 and 17: The limitation “obtaining… a sequence of tasks based on hierarchical relations between the tasks, the tasks constituting the target task” in claims 1, 9 and 17, taken as a whole, represents data gathering. A step of gathering data for use in a claimed process is a pre-solution activity; therefore the obtaining step is an insignificant extra-solution activity – see MPEP 2106.05(g).

The limitation “… by repeating… reinforcement learning and supervised learning” describes the use of reinforcement learning and supervised learning in claims 1, 9 and 17. The use of software (i.e. “applying it” with the judicial exception) or of reinforcement learning and supervised learning (i.e. specifying a particular technological environment or field) is not sufficient for eligibility – see MPEP 2106.05(f) or 2106.05(h).

The use of “a processor device, a computer program product, a non-transitory computer readable storage medium, program instructions, a computer, a computer processing system, a memory, program code, a processor” amounts to an attempt to generally link the use of a judicial exception to a computer technological environment or field of use - see MPEP 2106.05(h), or it can be viewed as mere instructions to implement an abstract idea on a computer – see MPEP 2106.05(f). 

Accordingly, these additional elements do not provide a meaningful limitation to transform the abstract idea into a patent eligible application of the abstract idea. The claims 1, 9 and 17 as a whole, considering all additional elements both individually and in combination, are directed to an abstract idea.

Step 2B: Does the claim recite additional elements that amount to significantly more than the judicial exception? 
No, the claims 1, 9 and 17 do not recite additional elements that amount to an inventive concept (significantly more) than the recited judicial exception. 
Claim 1:  These additional elements, as explained above: “a processor device, a computer program product, a non-transitory computer readable storage medium, program instructions, a computer, a computer processing system, a memory, program code, a processor” are generally linking the use of a judicial exception to a computer environment; the obtaining step is an insignificant extra-solution activity; and the by repeating reinforcement learning and supervised learning step is using a model to make a prediction (i.e. “applying it” or specifying a particular field). 

Further, the following limitations are well-understood, routine and conventional (WURC):
“obtaining… a sequence of tasks based on hierarchical relations between the tasks, the tasks constituting the target task;” is receiving or transmitting data over a network – see MPEP 2106.05(d).  

Accordingly, the additional elements, considered both individually and in combination with the claim as a whole, do not provide significantly more than the abstract idea. These independent claims are not patent eligible.

Dependent claims 5 and 13 recite “applying a respective reward as an input to a non-antagonistic reinforcement learning neural network and an antagonist reinforcement learning neural network, and applying an output of each of the non-antagonistic reinforcement learning and antagonistic reinforcement learning neural networks to a supervised learning neural network to obtain a respective one of the constraints of the sequence.” In step 2A prong One, the claim recites more specifics of the abstract idea (i.e. applying an input and an output), and therefore is a mental process (a judgement) in one of the groups of abstract ideas. In step 2A prong Two and step 2B, the additional elements, a non-antagonistic reinforcement learning neural network and an antagonist reinforcement learning neural network, amount to the use of a particular technological environment or field – see MPEP 2106.05(h).

Dependent claims 6 and 14 recite “the constraints are used to prioritize the tasks by imposing a particular execution order on the tasks.” In step 2A prong One, the claim recites more specifics of the abstract idea (i.e. the use of constraints is to prioritize tasks), and therefore is a mental process (a judgement) in one of the groups of abstract ideas. Further, in step 2A prong Two and step 2B, the claim does not recite additional elements that integrate the exception into a practical application or amount to significantly more than the judicial exception.

Dependent claims 7 and 15 recite “disabling one or more of the constraints based on a current state of a value function, the value function indicating a maximum expected future reward an agent will get at a given state.” In step 2A prong One, the limitation of disabling constraints based on a value is an evaluation or a judgement, and therefore is a mental process; and a maximum expected future reward is interpreted as more specifics of the abstract idea. In step 2A prong Two and step 2B, the claim does not recite additional elements that integrate the exception into a practical application or amount to significantly more than the judicial exception.

Dependent claims 8 and 16 recite “each of the tasks in the sequence corresponds to a respective hierarchy level.” In step 2A prong One, the claim recites more specifics of the abstract idea (i.e. a task corresponding to a level), and therefore is a mental process (a judgement) in one of the groups of abstract ideas. Further, in step 2A prong Two and step 2B, the claim does not recite additional elements that integrate the exception into a practical application or amount to significantly more than the judicial exception.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1, 2, 6, 8, 9, 10, 14 and 16-18 are rejected under 35 U.S.C. 103 as being unpatentable over Le ("Hierarchical Imitation and Reinforcement Learning") in view of Dietterich ("Hierarchical reinforcement learning with the MAXQ value function decomposition").


In regard to claims 1, 9 and 17, Le teaches: A computer-implemented method for Hierarchical Reinforcement Learning (HRL) with a target task, comprising: (Le, p. 2 right col., learning about HI level policy and LO level policy [Hierarchical Reinforcement Learning] along with the goal G [a target task] and subgoals g.) 


obtaining, by a processor device, a sequence of tasks based on hierarchical relations between the tasks, the tasks constituting the target task; and (Le, p. 2 right col. "Subtasks, which we also call subgoals, are denoted as g ∈ G, and the primitive actions are denoted as a ∈ A. An agent (also referred to as learner) acts by iteratively choosing a subgoal g, carrying it out by executing a sequence of actions a until completion, and then picking a new sub-goal... We assume that the horizon at the HI level is H_HI, i.e., a trajectory uses at most H_HI subgoals, and the horizon at the LO level is H_LO, i.e., after at most H_LO primitive actions, the agent either accomplishes the subgoal..."; a sequence of subgoals in the HI and LO levels is a sequence of tasks based on hierarchical relations, and G is the target task; p. 8 "Figure 3. Montezuma’s revenge: hg-DAgger/Q versus h-DQN. (Left) Screenshot of Montezuma’s Revenge in black-and-white with color-coded subgoals"; p. 1 right col. "Code and experimental setups are available at https://sites.google.com/view/hierarchical-il-rl"; code and experimental setups inherently teach that the implementation is executed by a processor device.)
learning, by a processor device... the sequence of tasks by repeating, for each of the tasks in the sequence, reinforcement learning and supervised learning with a set of good samples and a set of bad samples… (Le, p. 3 right col., section 4. Hierarchically Guided Imitation Learning "... We instantiate this framework first within passive learning from demonstrations, obtaining hierarchical behavioral cloning (Algorithm 1), and then within interactive imitation learning, obtaining hierarchically guided DAgger (Algorithm 2), our best-performing algorithm... Train (lines 8–9) to find the subpolicies pi_g that best predict a*_l from s*_l, and meta-controller u that best predicts g*_h from s*_h, respectively. Train can generally be any supervised learning subroutine [supervised learning], such as stochastic optimization for neural networks or some batch training procedure..."; p. 2 right col. "We assume access to an expert, endowed with a meta-controller u*, subpolicies pi*_g, and termination functions beta*_g, who can provide one or several types of supervision"; expert supervision building on top of reinforcement learning; p. 3 left col. "HierDemo(s): hierarchical demonstration... Inspect_LO (τ; g): LO-level inspection. Instead of annotating every state of a trajectory with a good action, the expert only verifies whether a subgoal g was accomplished, returning either Pass or Fail [good and bad samples]... Inspect_FULL(FULL): full inspection. The expert verifies whether the agent’s overall goal was accomplished, returning either Pass or Fail.")
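For illustration only, the Pass or Fail inspection quoted above can be read as a labeling step that sorts trajectories into a good set and a bad set for a later supervised learner. The following Python sketch is not taken from Le; the function names, the trajectory representation (a list of integer states), and the toy inspection rule are all assumptions made for this example.

```python
def split_by_inspection(trajectories, subgoal, inspect):
    """Partition trajectories into (good, bad) using a Pass/Fail verdict."""
    good, bad = [], []
    for trajectory in trajectories:
        if inspect(trajectory, subgoal) == "Pass":
            good.append(trajectory)
        else:
            bad.append(trajectory)
    return good, bad


def toy_inspect(trajectory, subgoal):
    # Hypothetical stand-in for the expert: a trajectory passes only if it
    # is non-empty and its final state equals the subgoal.
    return "Pass" if trajectory and trajectory[-1] == subgoal else "Fail"


trajectories = [[0, 1, 2], [0, 1, 1], [2], []]
good_samples, bad_samples = split_by_inspection(trajectories, 2, toy_inspect)
```

Under these assumptions, the two returned lists play the role of the good and bad sample sets; any supervised learner could then be trained on them.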


Le does not teach, but Dietterich teaches: learning... a sequence of constraints corresponding to the sequence of tasks by repeating, for each of the tasks in the sequence, reinforcement learning... by applying an obtained constraint for a current task to a next task. (Dietterich, p. 247 "In the MAXQ method, the constraints take two forms. First, within a subtask, only some of the possible primitive actions may be permitted. For example, in the taxi task, during a Navigate(t), only the North, South, East, and West actions are available - the Pickup and Putdown actions are not allowed. Second, consider a Max node Mj with child nodes {Mj1, ..., Mjk}. The policy learned for Mj must involve executing the learned policies of these child nodes. When the policy for child node Mji is executed, it will run until it enters a state in Tji. Hence, any policy learned for Mj must pass through some subset of these terminal state sets {Tj1, ..., Tjk}."; p. 247 "The HAM method shares these same two constraints and in addition, it imposes a partial policy on each node, so that the policy for any subtask Mi must be a deterministic refinement of the given non-deterministic initial policy for node i."; constraints are passed through from the Max node to its child nodes, i.e. constraints are applied from a current task to a next task.)
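As a hedged illustration of the first form of constraint quoted from Dietterich (only some primitive actions are permitted within a subtask), the Python sketch below masks the action set before a greedy choice. Only the Navigate entry reflects the quoted passage; the dictionary layout, the function names, and the Q-values are assumptions for this example and do not reproduce the MAXQ algorithm.

```python
PRIMITIVE_ACTIONS = ["North", "South", "East", "West", "Pickup", "Putdown"]

# Per the quoted taxi example, Navigate permits only the four move actions.
# Subtasks missing from this table are treated as unconstrained (assumption).
PERMITTED = {"Navigate": {"North", "South", "East", "West"}}


def permitted_actions(subtask):
    """Return the primitive actions allowed while executing the subtask."""
    allowed = PERMITTED.get(subtask, set(PRIMITIVE_ACTIONS))
    return [a for a in PRIMITIVE_ACTIONS if a in allowed]


def constrained_greedy(q_values, subtask):
    """Pick the highest-valued action among those the subtask permits."""
    return max(permitted_actions(subtask), key=lambda a: q_values[a])


# Illustrative Q-values: Pickup scores highest but is masked out in Navigate.
q = {"North": 0.1, "South": 0.2, "East": 0.0, "West": -1.0,
     "Pickup": 5.0, "Putdown": 4.0}
choice = constrained_greedy(q, "Navigate")
```

The mask shrinks the set of actions the learner has to search, which is the stated purpose of the constraints in the passage relied upon.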

It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Le to incorporate the teachings of Dietterich by including constraints on the policy. Doing so would incorporate prior knowledge and thereby reduce the size of the space that must be searched to find a good policy. (Dietterich, p. 247 "The purpose of imposing these constraints on the policy is to incorporate prior knowledge and thereby reduce the size of the space that must be searched to find a good policy.")

Claims 9 and 17 recite substantially the same limitations as claim 1; therefore the rejection applied to claim 1 also applies to claims 9 and 17. Further, Le teaches: (claim 9) the computer program product comprising a non-transitory computer readable storage medium having program instructions; (claim 17) a memory for storing program code; and a processor device operatively coupled to the memory for running the program code (Le, p. 1 right col. "Code and experimental setups are available at https://sites.google.com/view/hierarchical-il-rl"; code and experimental setups inherently teach the implementation with a processor, a memory, a program product, etc.)

In regard to claims 2, 10 and 18, reference is made to the rejection of claim 1, and further Le teaches: wherein each performance of the repeating step comprises training a neural network to predict a constraint corresponding to the current task with the set of good samples and the set of bad samples. (Le, p. 3 right col., section 4. Hierarchically Guided Imitation Learning "... Train can generally be any supervised learning subroutine, such as stochastic optimization for neural networks or some batch training procedure..."; p. 3 left col. "... Inspect_LO (τ; g): LO-level inspection. Instead of annotating every state of a trajectory with a good action, the expert only verifies whether a subgoal g was accomplished, returning either Pass or Fail [good and bad samples]... Inspect_FULL(FULL): full inspection. The expert verifies whether the agent’s overall goal was accomplished, returning either Pass or Fail..."; Dietterich teaches constraints, see claim 1.) 
The rationale for combining the teachings of Le and Dietterich is the same as set forth in the rejection of claim 1.

In regard to claims 6 and 14, reference is made to the rejection of claim 1, and further Le does not teach, but Dietterich teaches: wherein the constraints are used to prioritize the tasks by imposing a particular execution order on the tasks. (Dietterich, p. 233 "An action order, denoted w, is a total order over the actions within an MDP. That is, w is an anti-symmetric, transitive relation such that w (a1, a2) is true iff a1 is strictly preferred to a2..."; p. 252 "Definition 9 An ordered GLIE policy is a GLIE policy (Greedy in the Limit with Infinite Exploration) that converges in the limit to an ordered greedy policy, which is a greedy policy that imposes an arbitrary fixed order w on the available actions and breaks ties in favor of the action a that appears earliest in that order.")
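The ordered greedy policy quoted from Dietterich (a greedy policy that breaks ties in favor of the action appearing earliest in a fixed order) can be sketched as follows. This is a minimal illustration under assumed names and values, not Dietterich's implementation.

```python
def ordered_greedy(q_values, order):
    """Greedy choice that breaks ties by the fixed action order.

    q_values maps action name to estimated value; order lists the actions
    from most preferred to least preferred.
    """
    best_value = max(q_values[a] for a in order)
    for action in order:  # the earliest action attaining the best value wins
        if q_values[action] == best_value:
            return action


# Illustrative values: a2 and a3 tie, and a2 comes first in the order.
action_order = ["a1", "a2", "a3"]
picked = ordered_greedy({"a1": 1.0, "a2": 2.0, "a3": 2.0}, action_order)
```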

The rationale for combining the teachings of Le and Dietterich is the same as set forth in the rejection of claim 1.

In regard to claims 8 and 16, reference is made to the rejection of claim 1, and further Le teaches: wherein each of the tasks in the sequence corresponds to a respective hierarchy level. (Le, p. 2 right col. "We assume that the horizon at the HI level is H_HI, i.e., a trajectory uses at most H_HI subgoals, and the horizon at the LO level is H_LO, i.e., after at most H_LO primitive actions, the agent either accomplishes the subgoal or needs to decide on a new subgoal. The total number of primitive actions in a trajectory is thus at most H_FULL := H_HI · H_LO")
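To make the two-level horizon in the quoted passage concrete, the following Python sketch runs an outer loop of at most H_HI subgoals and an inner loop of at most H_LO primitive actions, so the trajectory never exceeds H_HI · H_LO primitive actions. The number-line environment, the callback names, and the use of None to signal completion are all hypothetical choices for this example and are not part of Le.

```python
def hierarchical_rollout(meta_controller, subpolicy, step, reached,
                         state, h_hi, h_lo):
    """Run a two-level rollout and return (final_state, executed_pairs)."""
    executed = []
    for _ in range(h_hi):                  # at most H_HI subgoals
        goal = meta_controller(state)
        if goal is None:                   # assumed signal: overall task done
            break
        for _ in range(h_lo):              # at most H_LO primitive actions
            if reached(state, goal):
                break
            action = subpolicy(state, goal)
            state = step(state, action)
            executed.append((goal, action))
    return state, executed


# Toy setting: walk along the integers from 0 to a target of 5,
# with subgoals placed at most two steps ahead.
TARGET = 5
final_state, pairs = hierarchical_rollout(
    meta_controller=lambda s: None if s >= TARGET else min(s + 2, TARGET),
    subpolicy=lambda s, g: 1 if s < g else -1,
    step=lambda s, a: s + a,
    reached=lambda s, g: s == g,
    state=0, h_hi=3, h_lo=2,
)
```

In this toy run the agent uses three subgoals (2, 4 and 5) and five primitive actions, which is within the bound of 3 · 2 = 6.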

Claims 3, 5, 11, 13 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Le in view of Dietterich in further view of Yang ("Physical Human-Robot Adversarial Gameplay").

In regard to claims 3, 11 and 19, reference is made to the rejection of claim 2, and further Le and Dietterich do not teach, but Yang teaches: wherein the each performance of the repeating step further comprises training a protagonist policy and an antagonist policy by reinforcement learning by restricting predicted actions according to the trained neural network. (Yang, p. 1 right col. "we propose to use the Robust Adversarial Reinforcement Learning (RARL) method... The two agents are trained in an alternating manner, where the robot (protagonist) faces off against the human (antagonist). However, in simulation the human is modeled as another robot so that there is a simple method for controlling the antagonist. The protagonist will first be trained by collecting trajectories that result from playing against an adversary with a static policy... The adversary will then be trained against the protagonist with a static policy in order to find a policy that the protagonist’s policy is not robust to...")
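The alternating schedule described in the quoted passage (train the protagonist against a static adversary, then train the adversary against a static protagonist) can be illustrated with a deliberately simplified stand-in. The Python sketch below replaces reinforcement learning with exact best responses on a small, made-up payoff matrix; it is an assumption-laden toy showing only the alternation, not the RARL method of Yang or Pinto.

```python
def best_response_protagonist(payoff, adversary_choice):
    """Protagonist maximizes the payoff while the adversary is held fixed."""
    return max(range(len(payoff)), key=lambda i: payoff[i][adversary_choice])


def best_response_adversary(payoff, protagonist_choice):
    """Adversary minimizes the payoff while the protagonist is held fixed."""
    row = payoff[protagonist_choice]
    return min(range(len(row)), key=lambda j: row[j])


def alternate(payoff, rounds, protagonist=0, adversary=0):
    """Alternate the two updates for a fixed number of rounds."""
    for _ in range(rounds):
        protagonist = best_response_protagonist(payoff, adversary)
        adversary = best_response_adversary(payoff, protagonist)
    return protagonist, adversary


# Hypothetical payoff to the protagonist: rows are its options,
# columns are the adversary's options.
PAYOFF = [[3, 1],
          [4, 2]]
final_protagonist, final_adversary = alternate(PAYOFF, rounds=3)
```

With this matrix the alternation settles on row 1 and column 1, where neither side can improve by changing its choice alone.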

It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Le and Dietterich to incorporate the teachings of Yang by including robust adversarial reinforcement learning (RARL). Doing so would reinforce the jointly trained adversary – that is, it learns an optimal destabilization policy. (Pinto, "This paper proposes the idea of robust adversarial reinforcement learning (RARL), where we train an agent to operate in the presence of a destabilizing adversary that applies disturbance forces to the system. The jointly trained adversary is reinforced – that is, it learns an optimal destabilization policy.")

In regard to claims 5 and 13, reference is made to the rejection of claim 1, and further Le and Dietterich do not teach, but Yang teaches: wherein said learning step comprises applying a respective reward as an input to a non-antagonistic reinforcement learning neural network and an antagonist reinforcement learning neural network, and applying an output of each of the non-antagonistic reinforcement learning and antagonistic reinforcement learning neural networks to a supervised learning neural network to obtain a respective one of the constraints of the sequence. (Yang, p. 1 left col. "A physical human-robot game can be described by a system state St and game score rt. For time t, St= (s1t…skt) is a set of k continuous values, and rt = (r1t, r2t) is a tuple of continuous values... r1t and r2t are the robot and human’s instant reward at time t respectively... The robot will try to maximize long term reward R from start time to final time ... while the human will try to minimize it... The protagonist will first be trained by collecting trajectories that result from playing against an adversary with a static policy... The adversary will then be trained against the protagonist"; rewards r1t and r2t are the respective inputs, and the trajectories from the protagonist and antagonist are the respective outputs; Le and Dietterich teach reinforcement learning with neural network, supervised learning, constraints etc., see claim 1.)

The rationale for combining the teachings of Le, Dietterich and Yang is the same as set forth in the rejection of claim 3.

Claims 4, 12 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Le in view of Dietterich in further view of Todorov ("From task parameters to motor synergies: A hierarchical framework for approximately optimal control of redundant manipulators").

In regard to claims 4, 12 and 20, reference is made to the rejection of claim 2, and further Le and Dietterich do not teach, but Todorov teaches: wherein the constraint is an inequality constraint. (Todorov, p. 692 left col. "Here, we propose a hierarchical control scheme inspired by this general organization of the sensorimotor system, as well as by prior work on hierarchical control in robotics... Thus the low-level controller does not solve a specific subtask as usually assumed in hierarchical reinforcement learning but instead performs an instantaneous feedback transformation."; p. 695, 3. HIERARCHICAL CONTROL FRAMEWORK "Key to our framework is the low-level controller u (v, x) - whose design we address first, assuming that the high-level parameters y(x) and their desired dynamics have been given... The control u is thus defined at each time t as the solution to the following constrained optimization problem:... In addition to the above equality constraint, we can incorporate inequality constraints on u.")
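As a generic illustration of an inequality constraint on a control u, the Python sketch below solves a stand-in problem: minimize the squared distance between u and a desired control subject to componentwise bounds u_min <= u <= u_max. The objective, the bounds, and the function name are assumptions for this example; the optimization problem actually stated by Todorov is elided in the quotation above and is not reproduced here.

```python
def constrained_control(u_desired, u_min, u_max):
    """Closest control to u_desired that satisfies u_min <= u <= u_max.

    For a separable squared-distance objective with box bounds, the
    minimizer is obtained by clipping each component to its interval.
    """
    return [min(max(u, lo), hi)
            for u, lo, hi in zip(u_desired, u_min, u_max)]


# Illustrative numbers: two components violate their bounds and get clipped.
u = constrained_control([1.5, -2.0, 0.3],
                        [-1.0, -1.0, -1.0],
                        [1.0, 1.0, 1.0])
```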

It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Le and Dietterich to incorporate the teachings of Todorov by including an instantaneous feedback transformation with inequality constraints. Doing so would provide a framework more appropriate for control of complex redundant manipulators and a more plausible model of biological sensorimotor control in HRL. (Todorov, p. 700 "In contrast our low-level controller performs an instantaneous feedback transformation of plant dynamics, and is continuously driven by high-level commands. We believe this is more appropriate for control of complex redundant manipulators and is a more plausible model of biological sensorimotor control while hierarchical reinforcement learning is more appropriate for non-articulated 'agents' solving navigation problems.")

Claims 7 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Le in view of Dietterich in further view of Hengst ("Safe state abstraction and discounting in hierarchical reinforcement learning").

In regard to claims 7 and 15, reference is made to the rejection of claim 1, and further Le and Dietterich do not teach, but Hengst teaches: further comprising disabling one or more of the constraints based on a current state of a value function, the value function indicating a maximum expected future reward an agent will get at a given state. (Hengst, p. 59 "State abstraction refers to the aggregation of base level states to capture some invariant structure of the problem. For example the same position in each room of the original problem may be aggregated into one abstract position-in room state. This type of state abstraction is related to eliminating irrelevant variables [4] and model minimization [5]. The room identity is irrelevant to the navigation policies inside rooms... Reinforcement learning uses an optimality criterion such as maximising the sum of future rewards from any state. In HRL this state value function may be decomposed over the task hierarchy"; p.62 "… This is the key property that makes it possible to safely abstract subtasks... We define the completion value function, E, as the expected value of future rewards after termination of the subtask associated with abstract action a and define the discount function, D, as the expected discount to be applied to the completion value. That is:.. Equation 6 can be succinctly written as V [value function]"; Safe state abstraction in reinforcement learning allows an agent to ignore aspects of its current state that are irrelevant to its current decision e.g. irrelevant variables, policies, or constraints, therefore safe state abstraction teaches disabling constraints based on V value function, which includes D and E indicating maximum expected value of future rewards.)

It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Le and Dietterich to incorporate the teachings of Hengst by including state abstraction. Doing so would eliminate irrelevant variables, enable model minimization, and provide a method for scaling in HRL. (Hengst, p. 59 "This type of state abstraction is related to eliminating irrelevant variables [4] and model minimization [5]... State abstraction has been shown to be important for scaling in HRL [6,7,4,8].")

Conclusion
The art made of record and not relied upon is considered pertinent to applicant's disclosure.
Pinto (“Robust adversarial reinforcement learning”) teaches Robust Adversarial Reinforcement Learning (RARL).

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SU-TING CHUANG whose telephone number is (408)918-7519.  The examiner can normally be reached on Monday - Thursday 8-5 PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki can be reached on (571)272-3719.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/S.C./Examiner, Art Unit 2122                 


/ERIC NILSSON/Primary Examiner, Art Unit 2122