DETAILED ACTION
This action is in response to the claims filed 01/31/2018 for application 15/885,737. Claims 1-25 are currently pending. 
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1-25 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or, for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.
The terms “terminal, intermediate, and top” in independent claims 1, 20, 22, and 23 are relative terms which render the claims indefinite. The terms “terminal, intermediate, and top” are not defined by the claims, the specification does not provide a standard for ascertaining the requisite degree, and one of ordinary skill in the art would not be reasonably apprised of the scope of the invention. For rejections under prior art, the examiner will interpret the terminal policy to be the first action that the reinforcement agent is trained on, the intermediate policy to be the subsequent action that the agent is trained on, and the top policy to be the action after the intermediate action.
Claims 2-19, 21, 24, and 25 are rejected as being dependent on a rejected base claim without curing any of the deficiencies.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1-25 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.

Regarding claim 1, 
Step 1 Analysis: Claim 1 is directed to a process, which falls within one of the four statutory categories. 
Step 2A Prong 1 Analysis: Claim 1 recites, in part, to accomplish the objective by traversal of the hierarchical policy network, decomposition of one or more tasks in the top task set into tasks in the intermediate task set, and further decomposition of one or more tasks in the intermediate task set into tasks in the terminal task set; and wherein, during the decomposition, a current task in a current task set is executed by executing a previously-learned task selected from a corresponding base task set governed by a corresponding base policy, or performing a primitive action selected from a library of primitive actions. These limitations, as drafted, are processes that, under their broadest reasonable interpretation, cover performance of the limitations in the mind and therefore fall within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.
Step 2A Prong 2 Analysis: This judicial exception is not integrated into a practical application. In particular, the claim recites the additional elements “a hierarchical policy network”, “a terminal policy learned by training the agent”, “an intermediate policy learned by training the agent”, and “a top policy learned by training the agent”. These recited elements are only generally linked to the judicial exception. Additionally, the claim recites the additional elements “parallel processors”, “memory”, and “agent”. These elements are recited at a high level of generality (i.e., as a generic processor performing a generic computer function of generating an index) such that they amount to no more than mere instructions to apply the exception using a generic computer component. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.
The claim further recites: wherein the terminal policy serves as a base policy of the intermediate policy and the terminal task set serves as a base task set of the intermediate task set, and the intermediate policy serves as a base policy of the top policy and the intermediate task set serves as a base task set of the top task set. These limitations are more specifics of the judicial exception and thus do not integrate the abstract idea into a practical application, because they do not impose any meaningful limitations on practicing the abstract idea. The claim as a whole is directed to an abstract idea.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional elements of utilizing a hierarchical policy network, a terminal policy learned by training the agent, an intermediate policy learned by training the agent, and a top policy learned by training the agent to perform the steps of the claimed process amount to no more than generally linking the elements to the judicial exception. Additionally, the additional elements of utilizing parallel processors, memory, and an agent to perform the steps of the claimed process amount to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. The claim is not patent eligible.
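For illustration only, and not as part of the claim language or the record, the decomposition recited in claim 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the class `Policy`, the task names, and the primitive actions are all hypothetical, and a fixed lookup table (`plans`) stands in for each policy that the claim describes as learned by training the agent.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical library of primitive actions (not taken from the claims).
PRIMITIVE_ACTIONS = {"move", "pick", "place"}


@dataclass
class Policy:
    """One level of the hierarchy; `plans` is a stand-in for a learned policy."""
    name: str
    base: Optional["Policy"] = None  # base policy one level below, if any
    plans: Dict[str, List[str]] = field(default_factory=dict)

    def execute(self, task: str, trace: List[str]) -> List[str]:
        """Decompose `task`, appending each primitive action performed to `trace`."""
        for step in self.plans[task]:
            if self.base is not None and step in self.base.plans:
                # Execute a previously-learned task from the base task set.
                self.base.execute(step, trace)
            elif step in PRIMITIVE_ACTIONS:
                # Otherwise perform a primitive action from the library.
                trace.append(step)
            else:
                raise ValueError(f"{self.name}: cannot execute step {step!r}")
        return trace


# Terminal tasks use primitives only; each higher level builds on the level below.
terminal = Policy("terminal", plans={
    "get item": ["move", "pick"],
    "put item": ["move", "place"],
})
intermediate = Policy("intermediate", base=terminal, plans={
    "stack item": ["get item", "put item"],
})
top = Policy("top", base=intermediate, plans={
    "stack two items": ["stack item", "move", "stack item"],
})

trace = top.execute("stack two items", [])
print(trace)
```

In this sketch the top task mixes two base tasks with one directly performed primitive action, which mirrors the two alternatives recited for executing a current task.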

Regarding claim 2, the rejection of claim 1 is further incorporated, and further, the claim recites: wherein the selected primitive action is a novel primitive action that is performed when the current task is a novel task. This limitation amounts to more specifics of the judicial exception identified in the rejection of claim 1 above.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible. 

Regarding claim 3, the rejection of claim 1 is further incorporated, and further, the claim recites: a visual encoder trained to extract feature maps from an image of an environment view of the agent, and encode the features maps in a visual representation; 
an instruction encoder trained to encode a natural language instruction specifying the current task into embedded vectors, and combine the embedded vectors into a bag-of-words (abbreviated BOW) representation; 
a fusion layer that concatenates the visual representation and the BOW representation and outputs a fused representation; 
a long short-term memory (abbreviated LSTM) trained to process the fused representation and output a hidden representation; 
a switch policy classifier trained to process the hidden representation and determine whether to execute the current task by executing the previously-learned task or by performing the primitive action; 
an instruction policy classifier trained to process the hidden representation when the switch policy classifier determines that the current task is to be executed by executing the previously-learned task, and select the previously-learned task from the corresponding base task set and emit a natural language description of the selected previously-learned task; 
an augmented flat policy classifier trained to process the hidden representation when the switch policy classifier determines that the current task is to be executed by performing the primitive action, and select the primitive action from the library of primitive actions; and 
an action processor that, based on the switch policy classifier's determination, implements one or more primitive actions of the selected previously-learned task or the selected primitive action. These limitations amount to additional mental steps and more specifics of the judicial exception identified in the rejection of claim 1 above.
The claim does recite the additional elements of “visual encoder”, “an instruction encoder”, “fusion layer”, “LSTM”, “switch policy classifier”, “instruction policy classifier”, “an augmented flat policy classifier”, and “an action processor”; however, they do not amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception, for the reasons set forth in connection with the rejection of claim 1 above. The claim is not patent eligible.
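For illustration only, the data flow recited in claim 3 (visual representation and BOW representation, concatenated, passed through a recurrent layer, then routed by a switch to one of two classifiers) can be sketched as follows. This is a rough sketch, not the applicant's implementation: every name, dimension, and vocabulary entry is hypothetical, the weights are random and untrained, channel averaging stands in for the trained visual encoder, and a plain tanh recurrent cell is substituted for the LSTM to keep the example short.

```python
import math
import random

random.seed(0)  # fixed seed so the untrained stand-in weights are reproducible

# All of the following values are hypothetical and chosen only for the sketch.
VOCAB = ["get", "put", "stack", "blue", "red"]
BASE_TASKS = ["get blue", "put blue"]
PRIMITIVE_ACTIONS = ["move", "pick", "place"]
EMBED_DIM, VISUAL_DIM, HIDDEN_DIM = 4, 3, 5


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


EMBED = {w: [random.uniform(-1, 1) for _ in range(EMBED_DIM)] for w in VOCAB}
W_IN = rand_matrix(HIDDEN_DIM, VISUAL_DIM + EMBED_DIM)
W_REC = rand_matrix(HIDDEN_DIM, HIDDEN_DIM)
W_SWITCH = rand_matrix(2, HIDDEN_DIM)                    # FC layer before softmax
W_INSTR = rand_matrix(len(BASE_TASKS), HIDDEN_DIM)
W_FLAT = rand_matrix(len(PRIMITIVE_ACTIONS), HIDDEN_DIM)


def encode_visual(image):
    # Stand-in for a trained visual encoder: average each image channel.
    return [sum(channel) / len(channel) for channel in image]


def encode_instruction(text):
    # Bag-of-words: sum the word embeddings, ignoring word order.
    vectors = [EMBED[w] for w in text.split() if w in EMBED]
    return [sum(col) for col in zip(*vectors)] if vectors else [0.0] * EMBED_DIM


def fuse(visual, bow):
    # Fusion by concatenation.
    return visual + bow


def recurrent_step(fused, hidden):
    # Simplified tanh recurrent cell, used here in place of an LSTM.
    return [math.tanh(a + b)
            for a, b in zip(matvec(W_IN, fused), matvec(W_REC, hidden))]


def step(image, instruction, hidden):
    fused = fuse(encode_visual(image), encode_instruction(instruction))
    hidden = recurrent_step(fused, hidden)
    switch = softmax(matvec(W_SWITCH, hidden))
    if argmax(switch) == 0:
        # Switch selects the instruction head: pick a previously-learned task.
        probs = softmax(matvec(W_INSTR, hidden))
        choice = ("base_task", BASE_TASKS[argmax(probs)])
    else:
        # Switch selects the flat head: pick a primitive action.
        probs = softmax(matvec(W_FLAT, hidden))
        choice = ("primitive", PRIMITIVE_ACTIONS[argmax(probs)])
    return choice, switch, hidden


image = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # three channels, two pixels each
choice, switch, hidden = step(image, "stack blue", [0.0] * HIDDEN_DIM)
print(choice, switch)
```

Because the weights are untrained, the particular choice printed carries no meaning; the sketch only shows how the recited components pass data to one another.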

Regarding claim 4, the rejection of claim 1 is further incorporated, and further, the claim recites: wherein the switch policy classifier, the instruction policy classifier and the augmented flat policy classifier are jointly trained using reinforcement learning that includes evaluation of a binary variable from the switch policy classifier that determines whether to execute the current task by executing the previously-learned task or by performing the selected primitive action. This limitation amounts to additional mental steps and more specifics of the judicial exception identified in the rejection of claim 1 above.
The claim does recite the additional element of “jointly trained using reinforcement learning”; however, it does not amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception, for the reasons set forth in connection with the rejection of claim 1 above. The claim is not patent eligible.

Regarding claim 5, the rejection of claim 1 is further incorporated, and further, the claim recites: wherein the hierarchical policy network is learned by training the agent on a progression of task sets, beginning with the terminal task set and continuing with the intermediate task set and with the top task set. This limitation amounts to more specifics of the judicial exception identified in the rejection of claim 1 above.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible.

Regarding claim 6, the rejection of claim 5 is further incorporated, and further, the claim recites: wherein the terminal task set is formulated by selecting a set of primitive actions from the library of primitive actions, the intermediate task set is formulated by making available the formulated terminal task set as the base task set of the intermediate task set, and the top task set is formulated by making available the formulated intermediate task set as the base task set of the top task set. This limitation amounts to more specifics of the judicial exception identified in the rejection of claim 1 above.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible.

Regarding claim 7, the rejection of claim 6 is further incorporated, and further, the claim recites: wherein task complexity varies between the terminal task set, the intermediate task set, and the top task set. This limitation amounts to more specifics of the judicial exception identified in the rejection of claim 1 above.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible.

Regarding claim 8, the rejection of claim 7 is further incorporated, and further, the claim recites: wherein the task complexity increases from the terminal task set to the intermediate task set and the top task set. This limitation amounts to more specifics of the judicial exception identified in the rejection of claim 1 above.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible.

Regarding claim 9, the rejection of claim 8 is further incorporated, and further, the claim recites: wherein respective tasks of the terminal task set, the intermediate task set, and the top task set are randomly selected. This limitation amounts to more specifics of the judicial exception identified in the rejection of claim 1 above.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible.

Regarding claim 10, the rejection of claim 1 is further incorporated, and further, the claim recites: wherein the hierarchical policy network comprises a plurality of intermediate policies learned by training the agent on a plurality of intermediate task sets. This limitation amounts to more specifics of the judicial exception identified in the rejection of claim 1 above.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible.

Regarding claim 11, the rejection of claim 10 is further incorporated, and further, the claim recites: wherein a lower intermediate policy serves as a base policy of a higher intermediate policy and a lower intermediate task set serves as a base task set of a higher intermediate task set. This limitation amounts to more specifics of the judicial exception identified in the rejection of claim 1 above.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible.

Regarding claim 12, the rejection of claim 3 is further incorporated, and further, the claim recites: wherein the visual encoder includes a convolutional neural network (abbreviated CNN). This limitation amounts to more specifics of the judicial exception identified in the rejection of claim 1 above.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible.

Regarding claim 13, the rejection of claim 3 is further incorporated, and further, the claim recites: wherein the instruction encoder includes an embedding network and a BOW network. This limitation amounts to more specifics of the judicial exception identified in the rejection of claim 1 above.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible.

Regarding claim 14, the rejection of claim 3 is further incorporated, and further, the claim recites: wherein the switch policy classifier includes a fully-connected (abbreviated FC) network, followed by a softmax classification layer. This limitation amounts to more specifics of the judicial exception identified in the rejection of claim 1 above.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible.

Regarding claim 15, the rejection of claim 3 is further incorporated, and further, the claim recites: wherein the instruction policy classifier includes a first pair of a FC network and a successive softmax classification layer for selecting the previously-learned task from the corresponding base task set, and a second pair of a FC network and a successive softmax classification layer for emitting the natural language description of the selected previously-learned task. This limitation amounts to more specifics of the judicial exception identified in the rejection of claim 1 above.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible.

Regarding claim 16, the rejection of claim 3 is further incorporated, and further, the claim recites: wherein the augmented flat policy classifier includes a FC network, followed by a softmax classification layer. This limitation amounts to more specifics of the judicial exception identified in the rejection of claim 1 above.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible.

Regarding claim 17, the rejection of claim 1 is further incorporated, and further, the claim recites: wherein the terminal policy is learned by training the agent on the terminal task set over twenty thousand episodes. This limitation amounts to more specifics of the judicial exception identified in the rejection of claim 1 above.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible.

Regarding claim 18, the rejection of claim 1 is further incorporated, and further, the claim recites: wherein the intermediate policy is learned by training the agent on the intermediate task set over twenty thousand episodes. This limitation amounts to more specifics of the judicial exception identified in the rejection of claim 1 above.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible.

Regarding claim 19, the rejection of claim 1 is further incorporated, and further, the claim recites: wherein the top policy is learned by training the agent on the top task set over twenty thousand episodes. This limitation amounts to more specifics of the judicial exception identified in the rejection of claim 1 above.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible.

Regarding claim 20, 
Step 1 Analysis: Claim 20 is directed to a process, which falls within one of the four statutory categories. 
Step 2A Prong 1 Analysis: Claim 20 recites, in part, accomplishing the objective by traversal of the hierarchical policy network, decomposing one or more tasks in the top task set into tasks in the intermediate task set, and further decomposing one or more tasks in the intermediate task set into tasks in the terminal task set; and wherein, during the decomposing, a current task in a current task set is executed by executing a previously-learned task selected from a corresponding base task set governed by a corresponding base policy, or performing a primitive action selected from a library of primitive actions. These limitations, as drafted, are processes that, under their broadest reasonable interpretation, cover performance of the limitations in the mind and therefore fall within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.
Step 2A Prong 2 Analysis: This judicial exception is not integrated into a practical application. In particular, the claim recites the additional elements “a hierarchical policy network”, “a terminal policy is learned by training the agent”, “an intermediate policy is learned by training the agent”, and “a top policy is learned by training the agent”. These recited elements are only generally linked to the judicial exception. Additionally, the claim recites the additional element “agent”. This element is recited at a high level of generality (i.e., as a generic processor performing a generic computer function of generating an index) such that it amounts to no more than mere instructions to apply the exception using a generic computer component. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.
The claim further recites: wherein the terminal policy serves as a base policy of the intermediate policy and the terminal task set serves as a base task set of the intermediate task set, and the intermediate policy serves as a base policy of the top policy and the intermediate task set serves as a base task set of the top task set. These limitations are more specifics of the judicial exception and thus do not integrate the abstract idea into a practical application, because they do not impose any meaningful limitations on practicing the abstract idea. The claim as a whole is directed to an abstract idea.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional elements of utilizing a hierarchical policy network, a terminal policy learned by training the agent, an intermediate policy learned by training the agent, and a top policy learned by training the agent to perform the steps of the claimed process amount to no more than generally linking the elements to the judicial exception. Additionally, the additional element of utilizing an agent to perform the steps of the claimed process amounts to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. The claim is not patent eligible.

Regarding claim 21, the rejection of claim 20 is further incorporated, and further, the claim recites: wherein the selected primitive action is a novel primitive action that is performed when the current task is a novel task. This limitation amounts to more specifics of the judicial exception identified in the rejection of claim 20 above.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible.

Regarding claim 22, 
Step 1 Analysis: Claim 22 is directed to a process, which falls within one of the four statutory categories. 
Step 2A Prong 1 Analysis: Claim 22 recites, in part, accomplishing the objective by traversal of the hierarchical policy network, decomposing one or more tasks in the top task set into tasks in the intermediate task set, and further decomposing one or more tasks in the intermediate task set into tasks in the terminal task set; and wherein, during the decomposing, a current task in a current task set is executed by executing a previously-learned task selected from a corresponding base task set governed by a corresponding base policy, or performing a primitive action selected from a library of primitive actions. These limitations, as drafted, are processes that, under their broadest reasonable interpretation, cover performance of the limitations in the mind and therefore fall within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.
Step 2A Prong 2 Analysis: This judicial exception is not integrated into a practical application. In particular, the claim recites the additional elements “a hierarchical policy network”, “a terminal policy is learned by training the agent”, “an intermediate policy is learned by training the agent”, and “a top policy is learned by training the agent”. These recited elements are only generally linked to the judicial exception. Additionally, the claim recites the additional elements “non-transitory computer readable storage medium”, “processor”, and “agent”. These elements are recited at a high level of generality (i.e., as a generic processor performing a generic computer function of generating an index) such that they amount to no more than mere instructions to apply the exception using a generic computer component. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.
The claim further recites: wherein the terminal policy serves as a base policy of the intermediate policy and the terminal task set serves as a base task set of the intermediate task set, and the intermediate policy serves as a base policy of the top policy and the intermediate task set serves as a base task set of the top task set. These limitations are more specifics of the judicial exception and thus do not integrate the abstract idea into a practical application, because they do not impose any meaningful limitations on practicing the abstract idea. The claim as a whole is directed to an abstract idea.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional elements of utilizing a hierarchical policy network, a terminal policy learned by training the agent, an intermediate policy learned by training the agent, and a top policy learned by training the agent to perform the steps of the claimed process amount to no more than generally linking the elements to the judicial exception. Additionally, the additional elements of utilizing a non-transitory computer readable storage medium, a processor, and an agent to perform the steps of the claimed process amount to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. The claim is not patent eligible.

Regarding claim 23, 
Step 1 Analysis: Claim 23 is directed to a process, which falls within one of the four statutory categories. 
Step 2A Prong 1 Analysis: Claim 23 recites, in part, an input path that indicates whether to use a previously learned task or to apply an augmented flat policy…, wherein previously learned tasks are arranged in a hierarchy… and has a natural language label applied to the task and to a branch node under which the task is organized, receives a natural language label applied to the newly discovered primitive action and to a branch node under which the newly discovered primitive action is organized, and a query responder that receives a request for a plan of execution and articulates the natural language labels as a plan for consideration and approval or rejection. The limitations of an input path that indicates whether to use a previously learned task or to apply an augmented flat policy…, wherein previously learned tasks are arranged in a hierarchy… and has a natural language label applied to the task and to a branch node under which the task is organized, receives a natural language label applied to the newly discovered primitive action and to a branch node under which the newly discovered primitive action is organized, and a query responder that articulates the natural language labels as a plan for consideration and approval or rejection, as drafted, are processes that, under their broadest reasonable interpretation, cover performance of the limitations in the mind and therefore fall within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.
Step 2A Prong 2 Analysis: This judicial exception is not integrated into a practical application. In particular, the claim recites the additional elements “task plan articulation subsystem”, “hierarchical task processing system”, “processor”, and “memory”. These elements are recited at a high level of generality (i.e., as a generic processor performing a generic computer function of generating an index) such that they amount to no more than mere instructions to apply the exception using a generic computer component. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.
The claim further recites: an input path that receives a selected output and a query responder that receives a request for a plan of execution under the branch node of the selected output. These limitations are mere data gathering steps and thus are insignificant extra-solution activities. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limitations on practicing the abstract idea. The claim as a whole is directed to an abstract idea.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional elements of utilizing a task plan articulation subsystem, hierarchical task processing system, processor, and memory to perform the steps of the claimed process amount to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. Additionally, the limitations of an input path that receives a selected output and a query responder that receives a request for a plan of execution under the branch node of the selected output are well-understood, routine, and conventional, as evidenced by MPEP §2106.05(d)(II)(iv), “Storing and retrieving information in memory”. These limitations therefore remain insignificant extra-solution activity even upon reconsideration, and do not amount to significantly more. Even when considered in combination, these additional elements amount to mere instructions to apply the exception using generic computer components and insignificant extra-solution activity, which cannot provide an inventive concept. The claim is not patent eligible.

Regarding claim 24, the rejection of claim 23 is further incorporated, and further, the claim recites: wherein the hierarchical task processing system interacts with a supplementary stochastic temporal grammar model that uses history of switches and instructions in positive episodes to modulate when to use the previously learned task and when to discover the primitive action. This limitation amounts to more specifics of the judicial exception identified in the rejection of claim 23 above.
The claim does recite the additional element of “a supplementary stochastic temporal grammar model”; however, this element does not amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception, because the element is only generally linked to the judicial exception. The claim is not patent eligible.

Regarding claim 25, the rejection of claim 23 is further incorporated, and further, the claim recites: wherein the hierarchical task processing system is trained using a two-phase curriculum learning. This limitation amounts to more specifics of the judicial exception identified in the rejection of claim 23 above.
The claim does recite the additional element of “two-phase curriculum learning”; however, this element does not amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception, because the element is only generally linked to the judicial exception. The claim is not patent eligible.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1, 2, 5-8, 10, 11, and 17-22 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Andreas et al. ("Modular Multitask Reinforcement Learning with Policy Sketches", cited by the Applicant in the IDS filed 07/26/2018, hereinafter "Andreas").


Regarding claim 1, Andreas teaches A hierarchical policy network, running on numerous parallel processors coupled to memory, for use by an agent running on a processor to accomplish an objective that requires execution of multiple tasks (“Our contributions are: • A general paradigm for multitask, hierarchical, deep reinforcement learning guided by abstract sketches of task-specific policies.” [pg. 2, left col, first bullet; See further: “We consider three families of tasks: a 2-D Minecraft inspired crafting game (Figure 3a), in which the agent must acquire particular resources by finding raw ingredients, combining them together in the proper order, and in some cases building intermediate tools that enable the agent to alter the environment itself; a 2-D maze navigation task that requires the agent to collect keys and open doors, and a 3-D locomotion task (Figure 3b) in which a quadrupedal robot must actuate its joints to traverse a narrow winding cliff” [pg. 2, right col, ¶2; games/robots implies use of processors and memory]]), comprising: 
a terminal policy learned by training the agent on a terminal task set, an intermediate policy learned by training the agent on an intermediate task set, and a top policy learned by training the agent on a top task set (“Given a fixed sketch (b1, b2, . . .), a task-specific policy Πτ is formed by concatenating its associated subpolicies in sequence. In particular, the high-level policy maintains a subpolicy index i (initially 0), and executes actions from πbi until the STOP symbol is emitted, at which point control is passed to πbi” [pg. 4, left col, ¶3; As noted under the 112(b) rejection, the examiner will interpret the terminal policy to be the first action that the reinforcement agent is trained on, the intermediate policy as the subsequent action the agent is being trained on, and the top policy as the action after the intermediate action. See Figure 3(a)/(b)]); 
wherein the terminal policy serves as a base policy of the intermediate policy and the terminal task set serves as a base task set of the intermediate task set, and the intermediate policy serves as a base policy of the top policy and the intermediate task set serves as a base task set of the top task set (“[Image: media_image1.png]” [pg. 6, Figure 3 Caption; Examiner interprets Figure 3(a)/(b) to be equivalent to the training of the agent on a terminal/intermediate/top task set.]); 
wherein the agent is configurable to accomplish the objective by traversal of the hierarchical policy network, decomposition of one or more tasks in the top task set into tasks in the intermediate task set, and further decomposition of one or more tasks in the intermediate task set into tasks in the terminal task set (“This makes it possible to evaluate our approach under a variety of different data conditions: (1) learning the full collection of tasks jointly via reinforcement, (2) in a zero-shot setting where a policy sketch is available for a held-out task, and (3) in a adaptation setting, where sketches are hidden and the agent must learn to adapt a pretrained policy to reuse high-level actions in a new task. In all cases, our approach substantially outperforms previous approaches based on explicit decomposition of the Q function along subtasks” [pg. 2, right col, top para; See further: “The agent is placed in a discrete world consisting of a series of rooms, some of which are connected by doors. Some doors require that the agent first pick up a key to open them. For our experiments, each task corresponds to a goal room (always at the same position relative to the agent’s starting position) that the agent must reach by navigating through a sequence of intermediate rooms. The agent has one sensor on each side of its body, which reports the distance to keys, closed doors, and open doors in the corresponding direction. Sketches specify a particular sequence of directions for the agent to traverse between rooms to reach the goal. The sketch always corresponds to a viable traversal from the start to the goal position, but other (possibly shorter) traversals may also exist” [pg. 6, § The maze environment]]); and 
wherein, during the decomposition, a current task in a current task set is executed by executing a previously-learned task selected from a corresponding base task set governed by a corresponding base policy (“(3) in an adaptation setting, where sketches are hidden and the agent must learn to adapt a pretrained policy to reuse high-level actions in a new task” [pg. 2, right col, top para; note: The following claim limitation recites “or” thus the examiner is interpreting the citation to correspond to, in part, a current task in a current task set is executed by executing a previously learned task…]), or 
performing a primitive action selected from a library of primitive actions (“The learning problem we describe in this paper is in some sense the direct dual to the problem of learning these meta-level policies: there, the agent begins with an inventory of complex primitives and must learn to model their behavior and select among them;” [pg. 3, left col, ¶2; As noted above, this limitation is not required under the BRI of the claim.]).
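
For illustration only, the sketch-following behavior quoted from Andreas above (a subpolicy index that starts at 0 and advances each time the STOP symbol is emitted) may be pictured with the following minimal sketch. The names run_sketch, fixed_length, and STOP, the toy subpolicies, and the step cap are hypothetical and do not appear in Andreas; the learned subpolicies of the reference are replaced here by fixed toy functions.

```python
from typing import Callable, Dict, List

STOP = "STOP"  # hypothetical sentinel standing in for the STOP symbol

# Assumption: a subpolicy maps the number of steps taken under it to an action.
Subpolicy = Callable[[int], str]


def run_sketch(sketch: List[str],
               subpolicies: Dict[str, Subpolicy],
               max_steps: int = 100) -> List[str]:
    """Run the subpolicies named in `sketch` in order, advancing on STOP."""
    actions: List[str] = []
    i = 0        # subpolicy index, initially 0
    local_t = 0  # steps taken under the current subpolicy
    for _ in range(max_steps):
        if i >= len(sketch):
            break  # every subpolicy in the sketch has emitted STOP
        action = subpolicies[sketch[i]](local_t)
        if action == STOP:
            i += 1       # control passes to the next subpolicy
            local_t = 0
        else:
            actions.append(action)
            local_t += 1
    return actions


def fixed_length(name: str, n: int) -> Subpolicy:
    """Toy subpolicy: emits `name` for n steps, then STOP."""
    return lambda t: name if t < n else STOP


subpolicies = {
    "get_wood": fixed_length("move", 2),
    "use_workbench": fixed_length("use", 1),
}
trace = run_sketch(["get_wood", "use_workbench"], subpolicies)
```

In this toy run the first subpolicy contributes two actions and the second contributes one, after which the sketch is exhausted.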

Regarding claim 2, Andreas teaches The hierarchical policy network of claim 1, wherein the selected primitive action is a novel primitive action that is performed when the current task is a novel task (“Experiments show that using our approach to learn policies guided by sketches gives better performance than existing techniques for learning task-specific or shared policies, while naturally inducing a library of interpretable primitive behaviors that can be recombined to rapidly adapt to new tasks.” [Abstract]).

Regarding claim 5, Andreas teaches The hierarchical policy network of claim 1, wherein the hierarchical policy network is learned by training the agent on a progression of task sets, beginning with the terminal task set and continuing with the intermediate task set and with the top task set (“[Image: media_image1.png]” [pg. 6, Figure 3 Caption; Figure 3(a)/(b) shows a progression of tasks.]).

Regarding claim 6, Andreas teaches The hierarchical policy network of claim 5, wherein the terminal task set is formulated by selecting a set of primitive actions from the library of primitive actions (“The learning problem we describe in this paper is in some sense the direct dual to the problem of learning these meta-level policies: there, the agent begins with an inventory of complex primitives and must learn to model their behavior and select among them;” [pg. 3, left col, ¶2]), the intermediate task set is formulated by making available the formulated terminal task set as the base task set of the intermediate task set, and the top task set is formulated by making available the formulated intermediate task set as the base task set of the top task set (“Given a fixed sketch (b1, b2, . . .), a task-specific policy Πτ is formed by concatenating its associated subpolicies in sequence. In particular, the high-level policy maintains a subpolicy index i (initially 0), and executes actions from πbi until the STOP symbol is emitted, at which point control is passed to πbi” [pg. 4, left col, ¶3; Concatenating subpolicies would include using the terminal task as the base task for the intermediate task set and then using the intermediate task as the base task for the top task. See Figure 3(a)/(b)]).

Regarding claim 7, Andreas teaches The hierarchical policy network of claim 6, wherein task complexity varies between the terminal task set, the intermediate task set, and the top task set (“Learning curves for baselines and the modular model are shown in Figure 4. It can be seen that in all environments, our approach substantially outperforms the baselines: it induces policies with substantially higher average reward and converges more quickly than the policy gradient baselines. It can further be seen in Figure 4c that after policies have been learned on simple tasks, the model is able to rapidly adapt to more complex ones, even when the longer tasks involve high-level actions not required for any of the short tasks” [pg. 7, § 4.3. Multi Task Learning, ¶5]).

Regarding claim 8, Andreas teaches The hierarchical policy network of claim 7, wherein the task complexity increases from the terminal task set to the intermediate task set and the top task set (“It can further be seen in Figure 4c that after policies have been learned on simple tasks, the model is able to rapidly adapt to more complex ones, even when the longer tasks involve high-level actions not required for any of the short tasks” [pg. 7, § 4.3. Multi Task Learning, ¶5]).

Regarding claim 10, Andreas teaches The hierarchical policy network of claim 1, wherein the hierarchical policy network comprises a plurality of intermediate policies learned by training the agent on a plurality of intermediate task sets (“Interacting with raw materials initially scattered around the environment causes them to be added to an inventory. Interacting with different crafting stations causes objects in the agent’s inventory to be combined or transformed. Each task in this game corresponds to some crafted object the agent must produce; the most complicated goals require the agent to also craft intermediate ingredients, and in some cases build tools (like a pickaxe and a bridge) to reach ingredients located in initially inaccessible regions of the environment.” [pg. 6, § 4.2. Environments, ¶1; See further §3.2. Policy Optimization, Figure 2 for subpolicies of a task]).

Regarding claim 11, Andreas teaches The hierarchical policy network of claim 10, wherein a lower intermediate policy serves as a base policy of a higher intermediate policy and a lower intermediate task set serves as a base task set of a higher intermediate task set (“For complex tasks, like the one depicted in Figure 3b, it is difficult for the agent to discover any states with positive reward until many subpolicy behaviors have already been learned. It is thus a better use of the learner’s time to focus on “easy” tasks, where many rollouts will result in high reward from which appropriate subpolicy behavior can be inferred. But there is a fundamental tradeoff involved here: if the learner spends too much time on easy tasks before being made aware of the existence of harder ones, it may overfit and learn subpolicies that no longer generalize or exhibit the desired structural properties… To avoid both of these problems, we use a curriculum learning scheme (Bengio et al., 2009) that allows the model to smoothly scale up from easy tasks to more difficult ones while avoiding overfitting. Initially the model is presented with tasks associated with short sketches. Once average reward on all these tasks reaches a certain threshold, the length limit is incremented.” [pg. 5, 3.3. Curriculum Learning, ¶1; See further Appendix A. Tasks and Sketches]). 
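
For illustration only, the curriculum scheme quoted from Andreas above (tasks with short sketches first, with the length limit incremented once average reward reaches a threshold) may be pictured with the following minimal sketch. The function names, the task names, the reward values, and the threshold of 0.8 are hypothetical. The sketch also assumes that the threshold must be met by each eligible task individually, which is one possible reading of the quoted passage.

```python
from typing import Dict, List


def eligible_tasks(sketch_lengths: Dict[str, int], length_limit: int) -> List[str]:
    """Tasks whose sketch length is within the current length limit."""
    return sorted(t for t, n in sketch_lengths.items() if n <= length_limit)


def update_length_limit(length_limit: int,
                        avg_rewards: Dict[str, float],
                        sketch_lengths: Dict[str, int],
                        threshold: float) -> int:
    """Increment the limit once every eligible task reaches the threshold."""
    active = eligible_tasks(sketch_lengths, length_limit)
    # Assumption: a task with no recorded reward counts as reward 0.0.
    if active and all(avg_rewards.get(t, 0.0) >= threshold for t in active):
        return length_limit + 1
    return length_limit


# Hypothetical tasks, keyed by name, with the length of each task's sketch.
sketch_lengths = {"task_a": 1, "task_b": 2, "task_c": 3}

limit = 1
limit = update_length_limit(limit, {"task_a": 0.9}, sketch_lengths, threshold=0.8)
after_first = limit   # task_a met the threshold, so the limit grows to 2

limit = update_length_limit(limit, {"task_a": 0.9, "task_b": 0.5},
                            sketch_lengths, threshold=0.8)
after_second = limit  # task_b is below the threshold, so the limit stays at 2
```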

Regarding claim 17, Andreas teaches The hierarchical policy network of claim 1, wherein the terminal policy is learned by training the agent on the terminal task set over twenty thousand episodes (“Results are shown in Figure 5a. Introducing both state and task dependence into the baseline leads to faster convergence of the model: the approach with a constant baseline achieves less than half the overall performance of the full critic after 3 million episodes. Introducing task and state dependence independently improve this performance; combining them gives the best result.” [pg. 8, left col, top para; See further: Figure 5]).

Regarding claim 18, Andreas teaches The hierarchical policy network of claim 1, wherein the intermediate policy is learned by training the agent on the intermediate task set over twenty thousand episodes (“Results are shown in Figure 5a. Introducing both state and task dependence into the baseline leads to faster convergence of the model: the approach with a constant baseline achieves less than half the overall performance of the full critic after 3 million episodes. Introducing task and state dependence independently improve this performance; combining them gives the best result.” [pg. 8, left col, top para; See further: Figure 5]).

Regarding claim 19, Andreas teaches The hierarchical policy network of claim 1, wherein the top policy is learned by training the agent on the top task set over twenty thousand episodes (“Results are shown in Figure 5a. Introducing both state and task dependence into the baseline leads to faster convergence of the model: the approach with a constant baseline achieves less than half the overall performance of the full critic after 3 million episodes. Introducing task and state dependence independently improve this performance; combining them gives the best result.” [pg. 8, left col, top para; See further: Figure 5]).

Regarding claim 20, Andreas teaches A method of accomplishing, through an agent, an objective that requires execution of multiple tasks (“Our contributions are: • A general paradigm for multitask, hierarchical, deep reinforcement learning guided by abstract sketches of task-specific policies.” [pg. 2, left col, first bullet]), including: 
accessing a hierarchical policy network that comprises a terminal policy, an intermediate policy, and a top policy (“Given a fixed sketch (b1, b2, . . .), a task-specific policy Πτ is formed by concatenating its associated subpolicies in sequence. In particular, the high-level policy maintains a subpolicy index i (initially 0), and executes actions from πbi until the STOP symbol is emitted, at which point control is passed to πbi” [pg. 4, left col, ¶3; As noted under the 112(b) rejection, the examiner will interpret the terminal policy to be the first action that the reinforcement agent is trained on, the intermediate policy as the subsequent action the agent is being trained on, and the top policy as the action after the intermediate action. See Figure 3(a)/(b)]), 
wherein the terminal policy is learned by training the agent on a terminal task set, the intermediate policy is learned by training the agent on an intermediate task set, and the top policy is learned by training the agent on a top task set (“[Image: media_image1.png]” [pg. 6, Figure 3 Caption; Examiner interprets Figure 3(a)/(b) to be equivalent to the training of the agent on a terminal/intermediate/top task set.]); 
wherein the terminal policy serves as a base policy of the intermediate policy and the terminal task set serves as a base task set of the intermediate task set, and the intermediate policy serves as a base policy of the top policy and the intermediate task set serves as a base task set of the top task set (“Given a fixed sketch (b1, b2, . . .), a task-specific policy Πτ is formed by concatenating its associated subpolicies in sequence. In particular, the high-level policy maintains a subpolicy index i (initially 0), and executes actions from πbi until the STOP symbol is emitted, at which point control is passed to πbi” [pg. 4, left col, ¶3; Concatenating subpolicies would include using the terminal task as the base task for the intermediate task set and then using the intermediate task as the base task for the top task. See Figure 3(a)/(b)]); 
accomplishing the objective by traversing the hierarchical policy network, decomposing one or more tasks in the top task set into tasks in the intermediate task set, and further decomposing one or more tasks in the intermediate task set into tasks in the terminal task set (“This makes it possible to evaluate our approach under a variety of different data conditions: (1) learning the full collection of tasks jointly via reinforcement, (2) in a zero-shot setting where a policy sketch is available for a held-out task, and (3) in a adaptation setting, where sketches are hidden and the agent must learn to adapt a pretrained policy to reuse high-level actions in a new task. In all cases, our approach substantially outperforms previous approaches based on explicit decomposition of the Q function along subtasks” [pg. 2, right col, top para; See further: “The agent is placed in a discrete world consisting of a series of rooms, some of which are connected by doors. Some doors require that the agent first pick up a key to open them. For our experiments, each task corresponds to a goal room (always at the same position relative to the agent’s starting position) that the agent must reach by navigating through a sequence of intermediate rooms. The agent has one sensor on each side of its body, which reports the distance to keys, closed doors, and open doors in the corresponding direction. Sketches specify a particular sequence of directions for the agent to traverse between rooms to reach the goal. The sketch always corresponds to a viable traversal from the start to the goal position, but other (possibly shorter) traversals may also exist” [pg. 6, § The maze environment]]); and 
during the decomposing, executing a current task in a current task set by executing a previously-learned task selected from a corresponding base task set governed by a corresponding base policy (“(3) in an adaptation setting, where sketches are hidden and the agent must learn to adapt a pretrained policy to reuse high-level actions in a new task” [pg. 2, right col, top para; note: The following claim limitation recites “or” thus the examiner is interpreting the citation to correspond to, in part, a current task in a current task set is executed by executing a previously learned task…]), or 
performing a primitive action selected from a library of primitive actions (“The learning problem we describe in this paper is in some sense the direct dual to the problem of learning these meta-level policies: there, the agent begins with an inventory of complex primitives and must learn to model their behavior and select among them;” [pg. 3, left col, ¶2; As noted above, this limitation is not required under the BRI of the claim.]).
Regarding claim 21, Andreas teaches The method of claim 20, wherein the selected primitive action is a novel primitive action that is performed when the current task is a novel task (“Experiments show that using our approach to learn policies guided by sketches gives better performance than existing techniques for learning task-specific or shared policies, while naturally inducing a library of interpretable primitive behaviors that can be recombined to rapidly adapt to new tasks.” [Abstract]).

Regarding claim 22, Andreas teaches A non-transitory computer readable storage medium impressed with computer program instructions to accomplish, through an agent, an objective that requires execution of multiple tasks, the instructions, when executed on a processor (“Our contributions are: • A general paradigm for multitask, hierarchical, deep reinforcement learning guided by abstract sketches of task-specific policies.” [pg. 2, left col, first bullet; See further: “We consider three families of tasks: a 2-D Minecraft inspired crafting game (Figure 3a), in which the agent must acquire particular resources by finding raw ingredients, combining them together in the proper order, and in some cases building intermediate tools that enable the agent to alter the environment itself; a 2-D maze navigation task that requires the agent to collect keys and open doors, and a 3-D locomotion task (Figure 3b) in which a quadrupedal robot must actuate its joints to traverse a narrow winding cliff” [pg. 2, right col, ¶2; games/robots implies use of processors and memory]]), implement a method comprising: 
accessing a hierarchical policy network that comprises a terminal policy, an intermediate policy, and a top policy (“Given a fixed sketch (b1, b2, . . .), a task-specific policy Πτ is formed by concatenating its associated subpolicies in sequence. In particular, the high-level policy maintains a subpolicy index i (initially 0), and executes actions from πbi until the STOP symbol is emitted, at which point control is passed to πbi” [pg. 4, left col, ¶3; As noted under the 112(b) rejection, the examiner will interpret the terminal policy to be the first action that the reinforcement agent is trained on, the intermediate policy as the subsequent action the agent is being trained on, and the top policy as the action after the intermediate action. See Figure 3(a)/(b)]), 
wherein the terminal policy is learned by training the agent on a terminal task set, the intermediate policy is learned by training the agent on an intermediate task set, and the top policy is learned by training the agent on a top task set (“[Image: media_image1.png]” [pg. 6, Figure 3 Caption; Examiner interprets Figure 3(a)/(b) to be equivalent to the training of the agent on a terminal/intermediate/top task set.]); 
wherein the terminal policy serves as a base policy of the intermediate policy and the terminal task set serves as a base task set of the intermediate task set, and the intermediate policy serves as a base policy of the top policy and the intermediate task set serves as a base task set of the top task set (“Given a fixed sketch (b1, b2, . . .), a task-specific policy Πτ is formed by concatenating its associated subpolicies in sequence. In particular, the high-level policy maintains a subpolicy index i (initially 0), and executes actions from πbi until the STOP symbol is emitted, at which point control is passed to πbi” [pg. 4, left col, ¶3; Concatenating subpolicies would include using the terminal task as the base task for the intermediate task set and then using the intermediate task as the base task for the top task. See Figure 3(a)/(b)]); 
accomplishing the objective by traversing the hierarchical policy network, decomposing one or more tasks in the top task set into tasks in the intermediate task set, and further decomposing one or more tasks in the intermediate task set into tasks in the terminal task set (“This makes it possible to evaluate our approach under a variety of different data conditions: (1) learning the full collection of tasks jointly via reinforcement, (2) in a zero-shot setting where a policy sketch is available for a held-out task, and (3) in a adaptation setting, where sketches are hidden and the agent must learn to adapt a pretrained policy to reuse high-level actions in a new task. In all cases, our approach substantially outperforms previous approaches based on explicit decomposition of the Q function along subtasks” [pg. 2, right col, top para; See further: “The agent is placed in a discrete world consisting of a series of rooms, some of which are connected by doors. Some doors require that the agent first pick up a key to open them. For our experiments, each task corresponds to a goal room (always at the same position relative to the agent’s starting position) that the agent must reach by navigating through a sequence of intermediate rooms. The agent has one sensor on each side of its body, which reports the distance to keys, closed doors, and open doors in the corresponding direction. Sketches specify a particular sequence of directions for the agent to traverse between rooms to reach the goal. The sketch always corresponds to a viable traversal from the start to the goal position, but other (possibly shorter) traversals may also exist” [pg. 6, § The maze environment]]); and 
during the decomposing, executing a current task in a current task set by executing a previously-learned task selected from a corresponding base task set governed by a corresponding base policy (“(3) in an adaptation setting, where sketches are hidden and the agent must learn to adapt a pretrained policy to reuse high-level actions in a new task” [pg. 2, right col, top para; note: The following claim limitation recites “or” thus the examiner is interpreting the citation to correspond to, in part, a current task in a current task set is executed by executing a previously learned task…]), or 
performing a primitive action selected from a library of primitive actions (“The learning problem we describe in this paper is in some sense the direct dual to the problem of learning these meta-level policies: there, the agent begins with an inventory of complex primitives and must learn to model their behavior and select among them;” [pg. 3, left col, ¶2; As noted above, this limitation is not required under the BRI of the claim.]).

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 3, 4, 12, 13, 15, and 23-25 are rejected under 35 U.S.C. 103 as being unpatentable over Andreas in view of Hermann et al. ("Grounded Language Learning in a Simulated 3D World" cited by Applicant in the IDS filed 07/26/2018, hereinafter "Hermann") and further in view of Si et al. ("Unsupervised Learning of Event AND-OR Grammar and Semantics from Video" cited by Applicant in the IDS filed 07/26/2018, hereinafter "Si").

Regarding claim 3, Andreas teaches The hierarchical policy network of claim 1, where Andreas further teaches an augmented flat policy classifier trained to process the hidden representation when the current task is to be executed by performing the primitive action (“Training details in the crafting domain. (a) Critics: lines labeled “task” include a baseline that varies with task identity, while lines labeled “state” include a baseline that varies with state identity. Estimating a baseline that depends on both the representation of the current state and the identity of the current task is better than either alone or a constant baseline” [pg. 8, Figure 5: Caption]), and select the primitive action from the library of primitive actions (“The learning problem we describe in this paper is in some sense the direct dual to the problem of learning these meta-level policies: there, the agent begins with an inventory of complex primitives and must learn to model their behavior and select among them;” [pg. 3, left col, ¶2]);
However, Andreas fails to explicitly teach the hierarchical policy network further comprising: 
a visual encoder trained to extract feature maps from an image of an environment view of the agent, and encode the features maps in a visual representation; 
an instruction encoder trained to encode a natural language instruction specifying the current task into embedded vectors, and combine the embedded vectors into a bag-of-words (abbreviated BOW) representation; 
a fusion layer that concatenates the visual representation and the BOW representation and outputs a fused representation; 
a long short-term memory (abbreviated LSTM) trained to process the fused representation and output a hidden representation; 
a switch policy classifier trained to process the hidden representation and determine whether to execute the current task by executing the previously-learned task or by performing the primitive action; 
an instruction policy classifier trained to process the hidden representation when the switch policy classifier determines that the current task is to be executed by executing the previously- learned task, and select the previously-learned task from the corresponding base task set and emit a natural language description of the selected previously-learned task; and
 an action processor that, based on the switch policy classifier's determination, implements one or more primitive actions of the selected previously-learned task or the selected primitive action.
Hermann teaches a visual encoder trained to extract feature maps from an image of an environment view of the agent (“At every time-step t the vision module V receives an 84 × 84 pixel RGB representation of the agent’s (first person) view of the environment (xvt ∈ R3×84×84), which is then processed with a three-layer convolutional neural network” [pg. 20, § A.1 Agent core, ¶1]), and encode the features maps in a visual representation (“The temporal autoencoder auxiliary task tAE is designed to illicit intuitions in our agent about how the perceptible world will change as a consequence of its actions. The objective is to predict the visual environment vt+1 conditioned on the prior visual input vt and the action at (Oh et al., 2015). Our implementation reuses the standard visual module V and combines the representation of vt with an embedded representation of at.” [pg. 6-7, §Temporal autoencoding, ¶1; See further: “The feature representation is then transformed using the action ai” [pg. 20, A.2 Auxiliary Networks, ¶1]]); 
an instruction encoder trained to encode a natural language instruction specifying the current task into embedded vectors (“The language module receives an input xlt ∈ Ns, where s is the maximum instruction length with words represented as indices in a dictionary. For tasks that require sensitivity to the order of words in the language instruction, the language module L encodes xlt with a recurrent (LSTM) architecture” [pg. 20, A.1 Agent core, ¶2]), and combine the embedded vectors into a bag-of-words (abbreviated BOW) representation (“For other tasks, we applied a simpler bag-of-words (BOW) encoder, in which an instruction is represented as the sum of the embeddings of its constituent words, as this resulted in faster training. Both the LSTM and BOW encoders use word embeddings of dimension 128, and the hidden layer of the LSTM is also of dimension 128, resulting in both cases in an output representation lt ∈ R128.” [pg. 20, A.1 Agent core, ¶2]);
a fusion layer that concatenates the visual representation and the BOW representation and outputs a fused representation (“In the mixing module M, outputs vt and lt are combined by flattening vt into a single vector and concatenating the two resultant vectors into a shared representation mt. The output from M at each time-step is fed to the action module A which maintains the agent state ht ∈ Rd. ht is updated using an LSTM network combining output mt from M and ht−1 from the previous time-step. By default we set d = 256 in all our experiments.” [pg. 20, A.1 Agent core, ¶2]);
a long short-term memory (abbreviated LSTM) trained to process the fused representation and output a hidden representation (“A mixing module M determines how these signals are combined before they are passed to a two-layer LSTM action module A. The hidden state st of the upper LSTM in A is fed to a policy function, which computes a probability distribution over possible motor actions π(at|st), and a state-value function approximator Val(st), which computes a scalar estimate of the agent value function for optimization” [pg. 4, § 4. Agent design, ¶1]);
an instruction policy classifier trained to process the hidden representation when the current task is to be executed by executing the previously-learned task, and select the previously-learned task from the corresponding base task set and emit a natural language description of the selected previously-learned task (“In this environment, the Selection task involved instructions of the form pick the X object or pick all X, where X denotes a colour term. The Next to task involved instructions of the form pick the X object next to the Y object, where X and Y refer to objects. Finally, the In room task involved instructions of the form pick the X in the Y room, where Y referred to the colour of the floor in the target room. Both the Next to and the In room task employed large degrees of ambiguity, i.e. a given Next to level may contain several objects X and Y , but in a constellation that only one X would be located next to a Y… By learning these tasks, this agent demonstrates an ability to ground language referring not only to single (concrete) objects, but also to (more abstract) sequences of actions, plans and inter-entity relationships. Moreover, in mastering the Next to and In room tasks, the agent exhibits sensitivity to a critical facet of many natural languages, namely the dependence of utterance meaning on word order. The ability to solve more complex tasks by curriculum training emphasises the generality of the emergent semantic representations acquired by the agent, allowing it to transfer learning from one scenario to a related but more complex environment.” [pg. 15-16, § 5.5 Multi-Task Learning, ¶2-5; Examiner is interpreting this to be equivalent to the function of an “instruction policy classifier”. See Fig. 6: “For the agent to learn to retrieve an object in a particular room as instructed, a four-lesson training curriculum was required. 
Each lesson involved a more complex layout or a wider selection of objects and words, and was only solved by an agent that had successfully solved the previous lesson.”]);
Andreas and Hermann are both in the same field of endeavor of reinforcement learning and thus are analogous. Andreas discloses multitask reinforcement learning with policy sketches. Hermann discloses training a reinforcement learning agent in a 3D world. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Andreas’ teachings with Hermann’s visual and instruction encoding modules. One would have been motivated to make this modification in order to instruct and guide artificial agents with human language. [Abstract, Hermann]
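For illustration only, the mixing step quoted from Hermann above (flattening vt and concatenating it with lt into a shared representation mt) can be sketched in Python as follows. This is a minimal sketch and not Hermann's implementation: the function names `flatten` and `fuse` and the toy dimensions are assumptions introduced here.

```python
def flatten(feature_maps):
    """Flatten a channels x height x width nested list into one flat list."""
    return [value for channel in feature_maps for row in channel for value in row]


def fuse(feature_maps, instruction_vec):
    """Concatenate the flattened visual features with the instruction vector."""
    return flatten(feature_maps) + list(instruction_vec)


# Toy example (hypothetical sizes): 2 channels of 2x2 features, 3-dim instruction.
v_t = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
l_t = [0.1, 0.2, 0.3]
m_t = fuse(v_t, l_t)  # fused vector of length 8 + 3
```

In the cited passage the fused vector mt is then passed to the LSTM action module; that recurrent step is omitted from this sketch.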
Andreas/Hermann fails to explicitly teach a switch policy classifier trained to process the hidden representation and determine whether to execute the current task by executing the previously-learned task or by performing the primitive action; 
an instruction policy classifier trained to process the hidden representation when the switch policy classifier determines that the current task is to be executed by executing the previously-learned task, and select the previously-learned task from the corresponding base task set and emit a natural language description of the selected previously-learned task; and
 an action processor that, based on the switch policy classifier's determination, implements one or more primitive actions of the selected previously-learned task or the selected primitive action.
Although Andreas/Hermann teaches training the agent on a previously-learned task and selecting the previously-learned task, a current task, performing the primitive action and selecting the primitive actions from a library of actions, the references do not go into details of using a switch policy classifier to make a determination.
Si teaches a switch policy classifier trained to process the hidden representation and determine whether to execute the current task by executing the previously-learned task or by performing the primitive action (“The location of the agent is computed by combining foreground segmentation and skin color detection that locates the head and hands of the agent. The real valued location is then quantized into a categorical variable by finding its nearest region of interest (e.g. desk, door). The agent pose is inferred by a nearest neighbor classifier using both pixels and foreground segmentation map within the estimated bounding box for the agent. An illustration of four poses using segmented foreground mask is shown in Fig. 3. The binary relations touch(agent, keyboard) and touch(agent, phone) are detected by checking whether there is enough skin color within the designated area for the laptop and phone, which are static objects in the office environment. The relation touch(agent, mug) is also detected using skin color, and the unique color and shape of the mug. The background relation occlude(soccer match, screen) is determined by checking whether there is large amount of green color occluding the laptop. Using the techniques described above, we detect grounded relations for every video frame. The detection result is organized as a spatial temporal table where each row corresponds to a time frame. Each column corresponds to a grounded relation.” [pg. 43, left col, ¶3; See table 3]);
when the switch policy classifier determines (“The agent pose is inferred by a nearest neighbor classifier using both pixels and foreground segmentation map within the estimated bounding box for the agent.” [pg. 43, left col, ¶3])
an action processor that, based on the switch policy classifier's determination, implements one or more primitive actions of the selected previously-learned task or the selected primitive action (“The agent pose is inferred by a nearest neighbor classifier using both pixels and foreground segmentation map within the estimated bounding box for the agent. An illustration of four poses using segmented foreground mask is shown in Fig. 3. The binary relations touch(agent, keyboard) and touch(agent, phone) are detected by checking whether there is enough skin color within the designated area for the laptop and phone, which are static objects in the office environment. The relation touch(agent, mug) is also detected using skin color, and the unique color and shape of the mug. The background relation occlude(soccer match, screen) is determined by checking whether there is large amount of green color occluding the laptop. Using the techniques described above, we detect grounded relations for every video frame. The detection result is organized as a spatial temporal table where each row corresponds to a time frame. Each column corresponds to a grounded relation.” [pg. 43, left col, ¶3]).
Andreas, Hermann and Si are all in the same field of endeavor of reinforcement learning and thus are analogous. Andreas discloses multitask reinforcement learning with policy sketches. Hermann discloses training a reinforcement learning agent in a 3D world. Si discloses unsupervised learning of event grammar and semantics from videos. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Andreas’/Hermann’s teachings with Si’s classifier. One would have been motivated to make this modification in order to train the agent to learn from certain environments and organize actions in a hierarchical manner. [pg. 48, § Conclusion, Si]
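For reference, the switch determination recited in claim 3 (choosing between executing a previously-learned task and performing a primitive action) can be pictured with the short Python sketch below. It illustrates the claim limitation as recited and is not drawn from Andreas, Hermann, or Si. The names `switch_decision` and `act`, the probability `p_reuse`, and the example task strings are hypothetical.

```python
import random


def switch_decision(p_reuse, rng):
    """Binary switch: True means reuse a previously-learned task."""
    return rng.random() < p_reuse


def act(p_reuse, base_tasks, task_probs, primitive_action, rng):
    """Route to a base task or a primitive action based on the switch."""
    if switch_decision(p_reuse, rng):
        # Instruction policy (assumed): pick one base task by its probability.
        task = rng.choices(base_tasks, weights=task_probs, k=1)[0]
        return ("base_task", task)
    # Otherwise fall back to the primitive action chosen elsewhere.
    return ("primitive", primitive_action)


rng = random.Random(0)
reuse = act(1.0, ["find x", "get x"], [0.0, 1.0], "move forward", rng)
primitive = act(0.0, ["find x", "get x"], [0.0, 1.0], "move forward", rng)
```

Here `p_reuse` stands in for the output of a trained switch policy classifier; in the claim that output is computed from the LSTM hidden representation rather than supplied as a constant.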

Regarding claim 4, Andreas/Hermann/Si teaches The hierarchical policy network of claim 3, where Si teaches wherein the switch policy classifier, the instruction policy classifier and the augmented flat policy classifier are jointly trained using reinforcement learning that includes evaluation of a binary variable from the switch policy classifier that determines whether to execute the current task by executing the previously-learned task or by performing the selected primitive action (“The agent pose is inferred by a nearest neighbor classifier using both pixels and foreground segmentation map within the estimated bounding box for the agent. An illustration of four poses using segmented foreground mask is shown in Fig. 3. The binary relations touch(agent, keyboard) and touch(agent, phone) are detected by checking whether there is enough skin color within the designated area for the laptop and phone, which are static objects in the office environment. The relation touch(agent, mug) is also detected using skin color, and the unique color and shape of the mug. The background relation occlude(soccer match, screen) is determined by checking whether there is large amount of green color occluding the laptop. Using the techniques described above, we detect grounded relations for every video frame. The detection result is organized as a spatial temporal table where each row corresponds to a time frame. Each column corresponds to a grounded relation.” [pg. 43, left col, ¶3; See Table 3]).
Andreas, Hermann and Si are all in the same field of endeavor of reinforcement learning and thus are analogous. Andreas discloses multitask reinforcement learning with policy sketches. Hermann discloses training a reinforcement learning agent in a 3D world. Si discloses unsupervised learning of event grammar and semantics from videos. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Andreas’/Hermann’s teachings with Si’s classifier. One would have been motivated to make this modification in order to train the agent to learn from certain environments and organize actions in a hierarchical manner. [pg. 48, § Conclusion, Si]

Regarding claim 12, Andreas/Hermann/Si teaches The hierarchical policy network of claim 3, where Hermann teaches wherein the visual encoder includes a convolutional neural network (abbreviated CNN) (“At every time-step t the vision module V receives an 84 × 84 pixel RGB representation of the agent’s (first person) view of the environment (xvt ∈ R3×84×84), which is then processed with a three-layer convolutional neural network” [pg. 20, § A.1 Agent core, ¶1]).
Andreas, Hermann and Si are all in the same field of endeavor of reinforcement learning and thus are analogous. Andreas discloses multitask reinforcement learning with policy sketches. Hermann discloses training a reinforcement learning agent in a 3D world. Si discloses unsupervised learning of event grammar and semantics from videos. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Andreas’/Si’s teachings by implementing a CNN as a visual encoder as taught by Hermann. One would have been motivated to make this modification in order to yield predictable results.

Regarding claim 13, Andreas/Hermann/Si teaches The hierarchical policy network of claim 3, where Hermann teaches wherein the instruction encoder includes an embedding network and a BOW network (“For other tasks, we applied a simpler bag-of-words (BOW) encoder, in which an instruction is represented as the sum of the embeddings of its constituent words, as this resulted in faster training. Both the LSTM and BOW encoders use word embeddings of dimension 128, and the hidden layer of the LSTM is also of dimension 128, resulting in both cases in an output representation lt ∈ R128” [pg. 20, § A.1 Agent core, ¶2]).
Andreas, Hermann and Si are all in the same field of endeavor of reinforcement learning and thus are analogous. Andreas discloses multitask reinforcement learning with policy sketches. Hermann discloses training a reinforcement learning agent in a 3D world. Si discloses unsupervised learning of event grammar and semantics from videos. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Andreas’/Si’s teachings by implementing an embedding network and a BOW network as taught by Hermann. One would have been motivated to make this modification in order to yield predictable results.
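For illustration, the bag-of-words encoder quoted from Hermann (an instruction represented as the sum of the embeddings of its constituent words, with embedding dimension 128) can be sketched as below. This is a minimal sketch under stated assumptions: the randomly initialised embedding table, the whitespace tokenisation, and the names `build_embeddings` and `bow_encode` are introduced here and do not come from the reference, where the embeddings are learned.

```python
import random

EMBED_DIM = 128  # embedding dimension stated in the quoted Hermann passage


def build_embeddings(vocab, dim=EMBED_DIM, seed=0):
    """Assumed stand-in for a trained embedding table: random vectors per word."""
    rng = random.Random(seed)
    return {word: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for word in vocab}


def bow_encode(instruction, embeddings, dim=EMBED_DIM):
    """Sum the embeddings of the instruction's words; word order is ignored."""
    total = [0.0] * dim
    for word in instruction.lower().split():
        total = [t + v for t, v in zip(total, embeddings[word])]
    return total


embeddings = build_embeddings(["pick", "red", "object"])
l_t = bow_encode("pick red", embeddings)
```

Because the encoding is a plain sum, "pick red" and "red pick" map to the same vector, which is consistent with the quoted passage reserving the LSTM encoder for tasks that are sensitive to word order.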

Regarding claim 15, Andreas/Hermann/Si teaches The hierarchical policy network of claim 3, where Hermann teaches wherein the instruction policy classifier includes a first pair of a FC network and a successive softmax classification layer for selecting the previously-learned task from the corresponding base task set (“Our agent consists of four inter-connected modules optimised as a single neural network… [image: media_image2.png]” [pg. 4-6, § 4. Agent Design, ¶1-5]), and a second pair of a FC network and a successive softmax classification layer for emitting the natural language description of the selected previously-learned task (“The output activations are fed through a Softmax activation function to yield a probability distribution over words in the vocabulary, and the negative log likelihood of the instruction word lt is computed as the loss. Note that this objective requires a single meaningful word to be extracted from the instruction as the target.” [pg. 20-21, § Language Prediction, ¶1]).
Andreas, Hermann and Si are all in the same field of endeavor of reinforcement learning and thus are analogous. Andreas discloses multitask reinforcement learning with policy sketches. Hermann discloses training a reinforcement learning agent in a 3D world. Si discloses unsupervised learning of event grammar and semantics from videos. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Andreas’/Si’s teachings by implementing a FC network followed by a softmax classification layer as taught by Hermann. One would have been motivated to make this modification in order to yield predictable results.

Regarding claim 23, Andreas teaches A task plan articulation subsystem that articulates a plan formulated by a hierarchical task processing system, including: 
a processor and memory coupled to the processor (“Our contributions are: • A general paradigm for multitask, hierarchical, deep reinforcement learning guided by abstract sketches of task-specific policies.” [pg. 2, left col, first bullet; See further: “We consider three families of tasks: a 2-D Minecraft inspired crafting game (Figure 3a), in which the agent must acquire particular resources by finding raw ingredients, combining them together in the proper order, and in some cases building intermediate tools that enable the agent to alter the environment itself; a 2-D maze navigation task that requires the agent to collect keys and open doors, and a 3-D locomotion task (Figure 3b) in which a quadrupedal robot must actuate its joints to traverse a narrow winding cliff” [pg. 2, right col, ¶2; games/robots implies use of processors and memory]]); 
wherein previously learned tasks are arranged in a hierarchy comprising top tasks, intermediate tasks, and terminal tasks (“Given a fixed sketch (b1, b2, . . .), a task-specific policy Πτ is formed by concatenating its associated subpolicies in sequence. In particular, the high-level policy maintains a subpolicy index i (initially 0), and executes actions from πbi until the STOP symbol is emitted, at which point control is passed to πbi+1” [pg. 4, left col, ¶3; As noted under the 112(b) rejection, the examiner will interpret the terminal policy to be the first action that the reinforcement agent is trained on, the intermediate policy as the subsequent action the agent is being trained on, and the top policy as the action after the intermediate action. See Figure 3(a)/(b)]),
However, Andreas fails to explicitly teach an input path that receives a selected output that indicates whether to use a previously learned task or to apply an augmented flat policy to discover a primitive action in order to respond to a natural language instruction that specifies an objective that requires execution of multiple tasks to accomplish;
Hermann teaches an input path that receives a selected output that indicates whether to use a previously learned task or to apply an augmented flat policy to discover a primitive action in order to respond to a natural language instruction that specifies an objective that requires execution of multiple tasks to accomplish (“In this environment, the Selection task involved instructions of the form pick the X object or pick all X, where X denotes a colour term. The Next to task involved instructions of the form pick the X object next to the Y object, where X and Y refer to objects. Finally, the In room task involved instructions of the form pick the X in the Y room, where Y referred to the colour of the floor in the target room. Both the Next to and the In room task employed large degrees of ambiguity, i.e. a given Next to level may contain several objects X and Y , but in a constellation that only one X would be located next to a Y… By learning these tasks, this agent demonstrates an ability to ground language referring not only to single (concrete) objects, but also to (more abstract) sequences of actions, plans and inter-entity relationships. Moreover, in mastering the Next to and In room tasks, the agent exhibits sensitivity to a critical facet of many natural languages, namely the dependence of utterance meaning on word order. The ability to solve more complex tasks by curriculum training emphasises the generality of the emergent semantic representations acquired by the agent, allowing it to transfer learning from one scenario to a related but more complex environment.” [pg. 15-16, § 5.5 Multi-Task Learning, ¶2-5; under BRI, the claim recites “or”, thus the examiner has provided a corresponding citation to an input path that receives a selected output that indicates whether to use a previously learned task.])
and each previously learned task in the hierarchy has a natural language label applied to the task (“Two important facets of natural language understanding are the ability to compose the meanings of known words to interpret otherwise unfamiliar phrases, and the ability to generalise linguistic knowledge learned in one setting to make sense of new situations. To examine these capacities in our agent, we trained it in settings where its (linguistic or visual) experience was constrained to a training set, and simultaneously as it learned from the training set, tested the performance of the agent on situations outside of this set (Figure 5).” [pg. 10, §5.3 One-shot learning experiments, ¶1])
and a query responder that receives a request for a plan of execution and articulates the natural language labels for the branch node and the tasks and primitive actions under the branch node of the selected output as a plan for consideration and approval or rejection (“We have taken an important step towards this goal by describing an agent that learns to execute a large number of multi-word instructions in a simulated three-dimensional world, with no pre-programming or hard-coded knowledge. The agent learns simple language by making predictions about the world in which that language occurs, and by discovering which combinations of words, perceptual cues and action decisions result in positive outcomes.” [pg. 16, §6. Conclusion, ¶1; See further: “It is important to emphasise the complexity of the learning challenge faced by the agent, even for a simple reference task such as this. To obtain positive rewards across multiple training episodes, the agent must learn to efficiently explore the environment and inspect candidate objects (requiring the execution of hundreds of inter-dependent actions) while simultaneously learning the (compositional) meanings of multi-word expressions and how they pertain to visual features of different objects (Figure 1)” [pg. 4, § 3. The 3D language learning environment, ¶3]])
Andreas and Hermann are both in the same field of endeavor of reinforcement learning and thus are analogous. Andreas discloses multitask reinforcement learning with policy sketches. Hermann discloses training a reinforcement learning agent in a 3D world. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Andreas’ teachings with Hermann’s visual and instruction encoding modules. One would have been motivated to make this modification in order to instruct and guide artificial agents with human language. [Abstract, Hermann]
Andreas/Hermann fails to explicitly teach and to a branch node under which the task is organized;
wherein a newly discovered primitive action receives a natural language label applied to the newly discovered primitive action and to a branch node under which the newly discovered primitive action is organized
Si teaches and to a branch node under which the task is organized (See Figure 7, pg. 46);
wherein a newly discovered primitive action receives a natural language label applied to the newly discovered primitive action and to a branch node under which the newly discovered primitive action is organized (“The learning of event grammar is carried out into two stages. (1) Learn a set of terminal nodes as blocks on the data matrix of grounded relations. These terminal nodes account for atomic actions which directly specify spatial temporal configurations of grounded relations. This is done by clustering. (2) Learn non-terminal nodes as blocks on the data matrix of atomic actions, to account for longer events composed of atomic actions. This is done by step-wise pursuit.” [pg. 44, right col, top para; See Figure 7]).
Andreas, Hermann and Si are all in the same field of endeavor of reinforcement learning and thus are analogous. Andreas discloses multitask reinforcement learning with policy sketches. Hermann discloses training a reinforcement learning agent in a 3D world. Si discloses unsupervised learning of event grammar and semantics from videos. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Andreas’/Hermann’s teachings with Si’s event grammar model. One would have been motivated to make this modification in order to train the agent to learn from certain environments and organize actions in a hierarchical manner. [pg. 48, § Conclusion, Si]

Regarding claim 24, Andreas/Hermann/Si teaches The task plan articulation subsystem of claim 23, where Si teaches wherein the hierarchical task processing system interacts with a supplementary stochastic temporal grammar model (“The unsupervised learning of stochastic event grammar is conducted under the information projection and minimum description length principle.” [pg. 43, § 3.1. Information projection, ¶1])
Although Si teaches using a stochastic temporal grammar model, the reference doesn’t go into details of using a history of switches/instructions in positive episodes.
Hermann teaches that uses history of switches and instructions in positive episodes to modulate when to use the previously learned task and when to discover the primitive action (“We have taken an important step towards this goal by describing an agent that learns to execute a large number of multi-word instructions in a simulated three-dimensional world, with no pre-programming or hard-coded knowledge. The agent learns simple language by making predictions about the world in which that language occurs, and by discovering which combinations of words, perceptual cues and action decisions result in positive outcomes. Its knowledge is distributed across language, vision and policy networks, and pertains to modifiers, relational concepts and actions, as well as concrete objects. Its semantic representations enable the agent to productively interpret novel word combinations, to apply known relations and modifiers to unfamiliar objects and to re-use knowledge pertinent to the concepts it already has in the process of acquiring new concepts.” [pg. 16, § 6. Conclusion, ¶1]).
Andreas, Hermann and Si are all in the same field of endeavor of reinforcement learning and thus are analogous. Andreas discloses multitask reinforcement learning with policy sketches. Hermann discloses training a reinforcement learning agent in a 3D world. Si discloses unsupervised learning of event grammar and semantics from videos. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Andreas’/Hermann’s teachings with Si’s event grammar model. One would have been motivated to make this modification in order to train the agent to learn from certain environments and organize actions in a hierarchical manner. [pg. 48, § Conclusion, Si]

Regarding claim 25, Andreas/Hermann/Si teaches The task plan articulation subsystem of claim 23, where Hermann teaches wherein the hierarchical task processing system is trained using a two-phase curriculum learning (“Figure 8: Multi-task learning via an efficient curriculum of two steps. A single agent can learn to solve a number of different tasks following a two-lesson training curriculum. The different tasks cannot be distinguished based on visual information alone, but require the agent to use the language input to identify the task in question.” [pg. 15, Figure 8 Caption; note: Examiner is interpreting two-lesson training curriculum to be equivalent to using a two-phase curriculum learning.]).
Andreas, Hermann and Si are all in the same field of endeavor of reinforcement learning and thus are analogous. Andreas discloses multitask reinforcement learning with policy sketches. Hermann discloses training a reinforcement learning agent in a 3D world. Si discloses unsupervised learning of event grammar and semantics from videos. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Andreas’/Si’s teachings with Hermann’s two-lesson curriculum learning. One would have been motivated to make this modification in order to instruct and guide artificial agents with human language. [Abstract, Hermann]


Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Andreas in view of Hermann.

Regarding claim 9, Andreas teaches The hierarchical policy network of claim 8, but fails to explicitly teach wherein respective tasks of the terminal task set, the intermediate task set, and the top task set are randomly selected.
Hermann teaches wherein respective tasks of the terminal task set, the intermediate task set, and the top task set are randomly selected (“Objects and instructions were sampled at random from the full set of factors available in the simulation environment.” [pg. 7, § 5.1 Role of unsupervised learning, ¶1; See further: “the training instructions were either unigrams or bigrams. Possible unigrams were the 40 shape and the 13 colour terms listed in Appendix B. The possible bigrams were any colour-shape combination except those containing the shapes ice lolly, ladder, mug, pencil, suitcase or the colours red, magenta, grey, purple (subsets selected randomly).” [pg. 10, 5.3 One-shot learning experiments, ¶2]]).
Andreas and Hermann are both in the same field of endeavor of reinforcement learning and thus are analogous. Andreas discloses multitask reinforcement learning with policy sketches. Hermann discloses training a reinforcement learning agent in a 3D world. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Andreas’ teachings with Hermann’s random sampling method. One would have been motivated to make this modification in order to instruct and guide artificial agents with human language. [Abstract, Hermann]

Claims 14 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Andreas in view of Hermann and Si and further in view of Yin et al. ("Knowledge Transfer for Deep Reinforcement Learning with Hierarchical Experience Replay", hereinafter "Yin").

Regarding claim 14, Andreas/Hermann/Si teaches The hierarchical policy network of claim 3, where Si teaches the switch policy classifier (“The agent pose is inferred by a nearest neighbor classifier using both pixels and foreground segmentation map within the estimated bounding box for the agent.” [pg. 43, left col, ¶3]) 
However Andreas/Hermann/Si fails to explicitly teach wherein the switch policy classifier includes a fully-connected (abbreviated FC) network, followed by a softmax classification layer 
Yin teaches wherein the switch policy classifier includes a fully-connected (abbreviated FC) network (“In this paper, we propose a new multi-task policy distillation architecture as shown in Figure 1. In the new architecture, each task preserves its own convolutional filters to generate task-specific high-level features. Each task-specific part consists of three convolutional layers with each followed by a rectifier layer. The outputs of the last rectifier layer are used as the inputs to the multi-task policy network. A set of fully-connected layers are defined as the shared multitask policy layers” [pg. 1642, § Proposed Multi-task Policy Distillation, ¶1]), followed by a softmax classification layer (“f(·) is the softmax function, τ is the temperature to soften the distribution, and · is the dot product.” [pg. 1642, § Policy Distillation, ¶1]).
Andreas, Hermann, Si, and Yin are all in the same field of endeavor of reinforcement learning and thus are analogous. Andreas discloses multitask reinforcement learning with policy sketches. Hermann discloses training a reinforcement learning agent in a 3D world. Si discloses unsupervised learning of event grammar and semantics from videos. Yin discloses a multi-task policy distillation architecture. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Andreas’/Hermann’s/Si’s teachings to include a FC network followed by a softmax classification layer as taught by Yin. Fully connected networks and softmax classification layers are well-known in machine learning and thus one would have been motivated to make this modification in order to yield predictable results.
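For illustration, a fully-connected layer followed by a softmax with temperature τ, as referenced in the Yin citation above, can be sketched as follows. This is a minimal sketch and not Yin's architecture: the function names, the toy weights, and the two-way output are assumptions introduced here.

```python
import math


def fully_connected(x, weights, bias):
    """One dense layer: each output is a weighted sum of the inputs plus a bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature; larger values soften it."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example (hypothetical weights): identity layer over a 2-dim hidden vector.
hidden = [1.0, 2.0]
logits = fully_connected(hidden, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
probs = softmax(logits)
soft_probs = softmax(logits, temperature=5.0)
```

Raising the temperature moves the two probabilities closer together, which matches the quoted description of τ as softening the distribution.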

	Regarding claim 16, Andreas/Hermann/Si teaches The hierarchical policy network of claim 3, where Andreas teaches the augmented flat policy classifier (“Training details in the crafting domain. (a) Critics: lines labeled “task” include a baseline that varies with task identity, while lines labeled “state” include a baseline that varies with state identity. Estimating a baseline that depends on both the representation of the current state and the identity of the current task is better than either alone or a constant baseline” [pg. 8, Figure 5: Caption]).
However, Andreas/Hermann/Si fails to explicitly teach wherein the augmented flat policy classifier includes a FC network, followed by a softmax classification layer.
Yin teaches wherein the augmented flat policy classifier includes a FC network (“In this paper, we propose a new multi-task policy distillation architecture as shown in Figure 1. In the new architecture, each task preserves its own convolutional filters to generate task-specific high-level features. Each task-specific part consists of three convolutional layers with each followed by a rectifier layer. The outputs of the last rectifier layer are used as the inputs to the multi-task policy network. A set of fully-connected layers are defined as the shared multitask policy layers” [pg. 1642, § Proposed Multi-task Policy Distillation, ¶1]), followed by a softmax classification layer (“f(·) is the softmax function, τ is the temperature to soften the distribution, and · is the dot product.” [pg. 1642, § Policy Distillation, ¶1]).
	Andreas, Hermann, Si, and Yin are all in the same field of endeavor of reinforcement learning and thus are analogous. Andreas discloses multitask reinforcement learning with policy sketches. Hermann discloses training a reinforcement learning agent in a 3D world. Si discloses unsupervised learning of event grammar and semantics from videos. Yin discloses a multi-task policy distillation architecture. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Andreas’/Hermann’s/Si’s teachings to include an FC network followed by a softmax classification layer as taught by Yin. Fully connected networks and softmax classification layers are well known in machine learning, and thus one would have been motivated to make this modification in order to yield predictable results.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHAEL H HOANG whose telephone number is (571)272-8491. The examiner can normally be reached Mon-Fri 8:30AM-4:30PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki, can be reached at (571) 272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/M.H.H./Examiner, Art Unit 2122

/KAKALI CHAKI/Supervisory Patent Examiner, Art Unit 2122