DETAILED ACTION
This Office Action is in response to Applicant's Communication received on 06/01/2022 for application number 15/960,809.  
Claims 1-20 are presented for examination.  Claims 1, 10, and 19 are independent claims.   

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment
The Amendment filed on 06/01/2022 has been entered.  
Claims 1, 10, and 19 are amended.  Claims 1-20 are pending in the application.  


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.



Claims 1-2, 4, 6-7, 9-11, 13, 15-16, and 18-19 are rejected under 35 U.S.C. 103 as being unpatentable over Bakker, “Hierarchical Reinforcement Learning Based on Subgoal Discovery and Subpolicy Specialization” (2004) in view of Ponulak et al. (US 9,008,840 B1 hereinafter Ponulak) and Bollapragada et al. (US 2011/0125539 A1 hereinafter Bollapragada).

Regarding Claim 1, Bakker discloses a system for executing composite tasks based on computational learning techniques comprising:  a processor (Bakker, page 2, introduction, lines 1 to 19 - autonomous systems (agents) learn behavior based on reinforcement learning on the basis of reward signals provided when the agent reaches desired goals; hierarchical reinforcement learning to solve complex tasks) to:  
detect a plurality of subtasks corresponding to the composite task based on unsupervised data without a label, wherein the plurality of subtasks are identified by a top-level dialog policy (Bakker, page 2, introduction, lines 1 to 19 - high-level policies identify subgoals that precede overall goals; higher-level policies solve the overall task, consider only few high-level observations and actions; page 5, lines 29 to 37 - producing high level observations using unsupervised learning vector quantization technique);
detect a plurality of actions, wherein each action is to complete one of the subtasks, and wherein each action is identified by a low-level dialog policy corresponding to the subtasks identified by the top-level dialog policy (Bakker, page 2, introduction, lines 1 to 19 - low-level policies, which emit the actual, primitive actions, solve parts of the overall task; low-level policies reach subgoals set by the higher level); 
update a dialog manager based on a completion of each action corresponding to the subtasks (Bakker, page 2, introduction, lines 1 to 19 - autonomous systems (agents) (dialogue manager) learn behavior based on reinforcement learning; high-level policies identify subgoals that precede overall goals; higher-level policies solve the overall task; low-level policies solve parts of the overall task; low-level policies reach subgoals set by the higher level; page 4, lines 15 to 28 - real rewards received via interaction with the environment; at each tL, the currently active low-level policy executes one learning step, and it selects a new low-level, primitive action; at each tH, the active low-level policy receives a reward (i.e., updating completion of action corresponding to subtask); page 6, lines 34 to 36 - episode ends once the agent reaches the goal - thus, the dialog manager is updated based on action corresponding to the subtasks), wherein the dialog manager comprises a global state tracker that stores an intrinsic value indicating a sub-cost to execute each action corresponding to each subtask and an extrinsic value indicating a global cost to execute a plurality of actions that perform the composite task (Bakker, page 3, lines 21 to 32 - following a high-level observation change, a low-level policy is selected based on the C-values of all low-level policies; it then attempts to reach the current subgoal; the low-level policy receives a positive reward if it reaches the subgoal (i.e., intrinsic value indicating a sub-cost to execute each action corresponding to each subtask); page 4, lines 15 to 28 - real rewards received via interaction with the environment; at each tL, the currently active low-level policy executes one learning step, and it selects a new low-level, primitive action; at each tH, the active low-level policy receives a reward (i.e., updating completion of action corresponding to subtask); next, the high-level policy performs one learning step, and it selects a new subgoal; page 6, lines 34 to 36 - episode ends once the agent reaches the goal (where it receives the only reward r = 4) (i.e., extrinsic value indicating a global cost).  Thus, the dialog manager keeps track of the states of the subgoals and overall goals (i.e., comprising a global state tracker)).
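
By way of illustration only, the mapping above (a tracker holding a per-subtask intrinsic value alongside a single extrinsic value for the overall task) can be sketched as follows. The class, field, and subtask names are hypothetical and appear in neither the claims nor Bakker; the only value taken from the cited passage is the goal reward r = 4.

```python
from dataclasses import dataclass, field


@dataclass
class GlobalStateTracker:
    """Hypothetical tracker; all names here are illustrative assumptions."""

    # Intrinsic value per subtask: reward a low-level policy earns for
    # reaching the subgoal assigned to it.
    intrinsic: dict = field(default_factory=dict)
    # Extrinsic value: reward received from the environment for the overall task.
    extrinsic: float = 0.0
    # Subtasks recorded as completed, in order of completion.
    completed: list = field(default_factory=list)

    def record_subtask(self, subtask: str, intrinsic_reward: float) -> None:
        """Update the tracker when an action completes a subtask."""
        self.intrinsic[subtask] = self.intrinsic.get(subtask, 0.0) + intrinsic_reward
        self.completed.append(subtask)

    def record_episode_end(self, extrinsic_reward: float) -> None:
        """Update the tracker when the overall goal is reached."""
        self.extrinsic += extrinsic_reward


tracker = GlobalStateTracker()
tracker.record_subtask("subgoal_a", 1.0)  # assumed intrinsic reward
tracker.record_subtask("subgoal_b", 1.0)  # assumed intrinsic reward
tracker.record_episode_end(4.0)           # r = 4 at the goal, per the cited passage
```

The sketch keeps the two values in separate fields only to make the intrinsic/extrinsic distinction visible; it is not offered as the structure used by any reference.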
 
However, Bakker fails to expressly teach wherein execute instructions based on a policy identified by the dialog manager, wherein the executed instructions implement the policy with a lowest global cost corresponding to the composite task provided by the user.
In the same field of endeavor, Ponulak teaches wherein execute instructions based on a policy identified by the dialog manager, wherein the executed instructions implement the policy with a lowest global cost corresponding to the composite task (column 17, lines 28 to 32 - cost function minimization, the cost function comprise a global minimum and one or more local minima disposed within the state-space; column 19, lines 13 to 17 - controller (i.e., dialogue manager) follow target control policy; the target control policy configured based on a minimization of a given cost function based on the input signal x; the output of such combined system comprise an optimal output given the cost function and the input (i.e., lowest global cost corresponding to the composite task); column 19, lines 53 to 56 - solve individual sub-tasks of a composite task; column 31, lines 17 to 36 - predictor corresponding to a composite task, predictor operation comprise determining which lower level (within the hierarchy) predictors are to be activated, and/or plant control output is to be generated; predictor operation comprise generation of control output; no additional subtasks remain, task output is generated in accordance with the sensory input and outcomes of the sub-task predictor operations).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have incorporated wherein execute instructions based on a policy identified by the dialog manager, wherein the executed instructions implement the policy with a lowest global cost corresponding to the composite task, as suggested in Ponulak into Bakker.  Doing so would be desirable because it would enable a more efficient use of controller computational resources for a given task set (Ponulak, column 39, lines 1 to 6).  
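
By way of illustration only, selecting the policy with the lowest global cost can be sketched as a minimization over candidate policies, where each policy's global cost is taken to be the sum of its per-subtask costs. The additive cost model, the function name, and the sample figures are assumptions made for this sketch and are not drawn from Ponulak or from the claims.

```python
def select_lowest_cost_policy(policy_costs):
    """Return the name of the policy whose summed sub-costs are smallest.

    policy_costs maps a policy name to a list of per-subtask costs.
    Summing sub-costs into a global cost is an assumption of this sketch.
    """
    if not policy_costs:
        raise ValueError("no candidate policies")
    totals = {name: sum(costs) for name, costs in policy_costs.items()}
    return min(totals, key=totals.get)


# Hypothetical candidates, each with three subtask costs.
candidates = {
    "policy_a": [2.0, 3.0, 1.0],  # global cost 6.0
    "policy_b": [1.5, 1.5, 1.0],  # global cost 4.0
    "policy_c": [4.0, 0.5, 0.5],  # global cost 5.0
}
best = select_lowest_cost_policy(candidates)
```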
 
However, Bakker and Ponulak fail to expressly teach wherein the processor is to detect a composite task from a user and the global state tracker is to ensure that a cross-subtask constraint is satisfied.
In a similar field of endeavor, Bollapragada teaches wherein detect a composite task from a user ([0011]-[0012] a multi-resource scheduler system to schedule resources involved in an exam,  inpatient and outpatient appointments; [0086] the referring provider's office calls to schedule an outpatient appointment) and the global state tracker is to ensure that a cross-subtask constraint is satisfied ([0011]-[0012] a multi-resource scheduler system to schedule resources involved in an exam,  inpatient and outpatient appointments; identify a slot for a task defined by a scheduled task duration and one or more resources; the task includes a plurality of sub-tasks, and each sub-task has a sub-task duration utilizing one or more of the one or more resources; each sub-task is to be performed consecutively based on resource constraints; the scheduler engine (global state tracker) is to identify and select a time slot for the task based on resource availability, the plurality of sub-tasks in the task, and a duration associated with each sub-task; [0074] accommodate temporal interdependencies by treating them as a constraint to help ensure resource availability at the task level upon a schedule request - thus, ensuring that a cross-subtask constraint is satisfied).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have incorporated wherein detect a composite task from a user and  the global state tracker is to ensure that a cross-subtask constraint is satisfied, as suggested in Bollapragada into Bakker and Ponulak.  Doing so would be desirable because it would provide for efficient and comprehensive use of resources (Bollapragada [0090]-[0091]).  
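
By way of illustration only, the kind of cross-subtask constraint discussed above (sub-tasks performed consecutively, each needing its resource to be free for its own duration) can be sketched as a search for the earliest feasible start slot. The discrete time units, the function name, and the sample resources are assumptions made for this sketch and are not taken from Bollapragada.

```python
def find_start_slot(subtasks, busy, horizon):
    """Return the earliest start time at which all subtasks fit back to back.

    subtasks: list of (resource, duration) pairs, run consecutively in order.
    busy:     dict mapping a resource to the set of time units already taken.
    horizon:  number of time units available for scheduling.
    Returns None when no start time satisfies every subtask's constraint.
    """
    total = sum(duration for _, duration in subtasks)
    for start in range(horizon - total + 1):
        t = start
        feasible = True
        for resource, duration in subtasks:
            occupied = busy.get(resource, set())
            # The resource must be free for the whole duration of this subtask.
            if any(unit in occupied for unit in range(t, t + duration)):
                feasible = False
                break
            t += duration
        if feasible:
            return start
    return None


# Hypothetical exam: one unit with a nurse, then two units on a scanner.
exam = [("nurse", 1), ("scanner", 2)]
busy = {"nurse": {0}, "scanner": {2}}
slot = find_start_slot(exam, busy, horizon=8)
```

In this example a start at time 0 conflicts with the nurse and a start at time 1 pushes the scan onto an occupied unit, so the constraint is first satisfied at time 2.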

As to dependent Claim 2, Bakker, Ponulak, and Bollapragada teach all the limitations of claim 1.  Bakker further teaches wherein the action is a multi-step action (Bakker, page 7, lines 10 to 13 - the time-out value for low-level policies is 100, i.e. a low-level policy may execute at most 100 low-level actions to reach its assigned subgoal - thus, the action is a multi-step action).

As to dependent Claim 4, Bakker, Ponulak, and Bollapragada teach all the limitations of claim 1.  Bakker further teaches wherein select each action corresponding to each subtask based on the extrinsic value corresponding to previous identified actions executed in previous states (Bakker, page 3, lines 12 to 21 - each action of a higher-level policy is the selection of a subgoal to be reached by a lower-level policy; at any time step of its discrete time scale, the high-level policy receives the current high-level observation as its input; its action is the selection of another high-level observation as the current subgoal; it selects the high-level observation that it wants to see next; page 2, introduction, lines 1 to 19 - high-level policies identify subgoals that precede overall goals; higher-level policies solve the overall task, consider only few high-level observations and actions; page 4, lines 25 to 30 - at each tH, the active low-level policy receives a reward; next, the high-level policy selects a new subgoal - thus, based on the extrinsic value corresponding to previous identified actions executed in previous states).

As to dependent Claim 6, Bakker, Ponulak, and Bollapragada teach all the limitations of claim 1.  Ponulak further teaches wherein determine an order of the subtasks based on temporal constraints for each of the subtasks (Ponulak, column 30, lines 63 to 67 and column 31, lines 1 to 3 - a given task may be partitioned into two (or more) sub-tasks; such as a task of training of a robotic manipulator to grasp a particular object (e.g., a cup), the subtasks may correspond to identifying the cup (among other objects), approaching the cup, avoiding other objects (e.g., glasses, bottles), grasping the cup (i.e., an order of the subtasks based on temporal constraints). A subtask predictor may comprise action indication predictor).

As to dependent Claim 7, Bakker, Ponulak, and Bollapragada teach all the limitations of claim 1.  Ponulak further teaches wherein generate a first neural network for the high level dialog and a second neural network for the low level dialog (Ponulak, column 16, lines 5 to 6 -predictor comprise a spiking neuron network;  column 31, lines 10 to 25 - the trained predictor configuration  comprise one or more of a neuron network configuration (e.g., number and/or type of neurons and/or connectivity), neuron states (excitability), connectivity (e.g., efficacy of connections), and/or other information; a predictor corresponding to a composite task, predictor operation comprise determining which lower level (within the hierarchy) predictors (i.e. neural networks for the low level and high level dialog) are to be activated; predictor corresponding to the lowest level task, predictor operation comprise generation of control output).

As to dependent Claim 9, Bakker, Ponulak, and Bollapragada teach all the limitations of claim 8.  Bakker further teaches wherein the plurality of actions comprise transmitting data to a plurality of databases corresponding to the subtasks of the composite task (Bakker, page 3, lines 17 to 32 - job of the low-level policies to reach the subgoal; every low level policy contains a table of C-values (i.e., database); following a high-level observation change, one low-level policy is selected based on the C-values of all low-level policies for the current (oHs; oHg) pair; this one then attempts to reach the current subgoal - thus, transmitting data to a plurality of databases corresponding to the subtasks of the composite task).

Claims 10-11, 13, 15-16, and 18 are method claims similar to the system claims 1-2, 4, 6-7, and 9 above and therefore, rejected for the same reasons.

Claim 19 is a medium claim similar to the system claim 1 above and therefore, rejected for the same reasons.  Ponulak further teaches one or more computer-readable storage media comprising a plurality of instructions that, in response to execution by a processor (column 27, lines 48 to 52 – processing device executing operations of method using instructions stored electronically on an electronic storage medium).

Claims 3, 12, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Bakker in view of Ponulak and Bollapragada, further in view of Sermanet (US 2019/0332920 A1).

As to dependent Claim 3, Bakker, Ponulak, and Bollapragada teach all the limitations of claim 2.  However, Bakker, Ponulak, and Bollapragada fail to expressly teach wherein detect a number of the plurality of subtasks based on a predetermined upper limit on a maximum number of allowed segmentations.
In the same field of endeavor, Sermanet teaches wherein detect a number of the plurality of subtasks based on a predetermined upper limit on a maximum number of allowed segmentations ([0032] partitioning of the reinforcement task into multiple subtasks; [0034] the total number of partitions is fixed to a predetermined number).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have incorporated wherein detect a number of the plurality of subtasks based on a predetermined upper limit on a maximum number of allowed segmentations, as suggested in Sermanet into Bakker, Ponulak, and Bollapragada.  Doing so would be desirable because it would allow the agent to make maximal use of information contained in a demonstration during training without requiring costly and often infeasible labelling of the demonstration data (Sermanet [0006]).  

Claim 12 is a method claim similar to the system claim 3 above and therefore, rejected for the same reasons.  

Claim 20 is a medium claim similar to the system claim 3 above and therefore, rejected for the same reasons.  

Claims 5 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Bakker in view of Ponulak and Bollapragada, further in view of Scahill (US 2010/0115048 A1).

As to dependent Claim 5, Bakker, Ponulak, and Bollapragada teach all the limitations of claim 1.  However, Bakker, Ponulak, and Bollapragada fail to expressly teach wherein calculate a probability that each of the subtasks is to output a termination symbol; and terminate at least one of the subtasks in response to detecting the probability of outputting the termination symbol is above a threshold value.
In the same field of endeavor, Scahill teaches wherein calculate a probability that each of the subtasks is to output a termination symbol; and terminate at least one of the subtasks in response to detecting the probability of outputting the termination symbol is above a threshold value ([0178] The Application Server 24 assesses the probability that the client application 12a will successfully execute and, if this exceeds a predetermined threshold, removes the current task listing from the ST database 28 (i.e., terminating subtask when the probability that each of the subtasks is to output a termination symbol is above threshold)).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have incorporated wherein calculate a probability that each of the subtasks is to output a termination symbol; and terminate at least one of the subtasks in response to detecting the probability of outputting the termination symbol is above a threshold value, as suggested in Scahill into Bakker, Ponulak, and Bollapragada.  Doing so would be desirable because it would minimize computer processor unit (CPU) load thereby reducing the impact on any foreground applications that the user is interacting with (Scahill [0179]).  
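
By way of illustration only, the threshold test recited in this limitation can be sketched as a filter over per-subtask termination probabilities. The function name, the subtask names, and the probability and threshold values are hypothetical; they are not taken from Scahill or from the claims.

```python
def subtasks_to_terminate(termination_probs, threshold):
    """Return the subtasks whose termination probability exceeds the threshold.

    termination_probs maps a subtask name to the estimated probability that
    the subtask outputs a termination symbol (values assumed to lie in [0, 1]).
    """
    return [name for name, p in termination_probs.items() if p > threshold]


# Hypothetical probabilities for three subtasks of one composite task.
probs = {"book_flight": 0.92, "book_hotel": 0.40, "rent_car": 0.75}
terminated = subtasks_to_terminate(probs, threshold=0.7)
```

Here "book_flight" and "rent_car" exceed the assumed threshold of 0.7 and would be terminated, while "book_hotel" would continue.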

Claim 14 is a method claim similar to the system claim 5 above and therefore, rejected for the same reasons.  

Claims 8 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Bakker in view of Ponulak and Bollapragada, further in view of Commons (US 9,015,093 B1).

As to dependent Claim 8, Bakker, Ponulak, and Bollapragada teach all the limitations of claim 1.  However, Bakker, Ponulak, and Bollapragada fail to expressly teach wherein detect the composite task from a natural language dialog request.
In the same field of endeavor, Commons teaches wherein detect the composite task from a natural language dialog request (column 41, lines 3 to 6 - Step 610 involves receiving a question about which a report is to be written; the user could speak the question into a computer).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have incorporated wherein detect the composite task from a natural language dialog request, as suggested in Commons into Bakker, Ponulak, and Bollapragada.  Doing so would be desirable because it would provide linked but architecturally distinct hierarchical stacked neural networks that simulate the brain's capacity to organize lower-order actions hierarchically by combining, ordering, and transforming the actions to produce new, more complex higher-stage actions (Commons, column 19, lines 60 to 65).  

Claim 17 is a method claim similar to the system claim 8 above and therefore, rejected for the same reasons.  

Response to Arguments
35 U.S.C. §101: Applicant’s arguments with respect to the 101 rejections have been fully considered and are persuasive. The 101 rejection of claims 19-20 has been withdrawn.  However, the applicant is encouraged to include “non-transitory computer-readable storage medium” in the claim language.

35 U.S.C. §103: In the remarks, Applicant argues that cited references fail to teach “wherein the dialog manager comprises a global state tracker” and “wherein the global state tracker is to ensure that a cross-subtask constraint is satisfied”, as recited in amended independent claims 1, 10, and 19. 

Applicant's arguments with respect to the 103 rejections have been considered, but are moot in view of the new ground of rejection made under 35 U.S.C. § 103.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to REJI KARTHOLY whose telephone number is (571) 272-3432.  The examiner can normally be reached on Monday - Thursday 7:30 am - 3:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jennifer Welch, can be reached at 571-272-7212.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/R.K./
Examiner, Art Unit 2143

/BEAU D SPRATT/
Primary Examiner, Art Unit 2143