DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.


Information Disclosure Statement
The information disclosure statements (IDS) submitted on 02/01/2019 and 02/10/2020 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.


Claim Objections
Claim 13 is objected to because of the following informalities:  
Claim 13 recites “cloud brain”. The term “cloud brain” is not a term of art, and the specification does not define it. The examiner suggests clarifying what “cloud brain” means. Appropriate correction is required.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claim(s) 1-2, 7-12, 14 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Mnih et al. (US Pat No. 9679258 B2) in view of Berenson et al. (“A Robot Path Planning Framework that Learns from Experience”). 
Regarding claim 1
Mnih teaches a computer-implemented method for generating an action for a robot, (abstract “Training data is generated by operating on the system with a succession of actions and used to train a second neural network.”)
the method comprising: collecting a first experience for the robot, (col 13 lines 53-59 “In the above algorithms we store the agent's experiences at each timestep, e_t = (s_t, a_t, r_t, s_{t+1}), in a data-set D = e_1, ..., e_N, pooled over many episodes into a replay memory. During the inner loop of the algorithm, Q-learning updates, or minibatch updates, are applied to samples of experience, e ~ D, drawn at random from the pool of stored samples.”)
the first experience representing: a first state of the robot at a first time, (col 4 lines 32-40 “In embodiments the experience data is used in conjunction with the first neural network for training the second neural network. More particularly a transition comprising a first, starting state, action, and next state is sampled from the stored experience data. This is used to generate a target action-value parameter (Q-value) from the first neural network (which is, in embodiments, a previously made copy of the second neural network), for training the second neural network.” also see col 9 lines 12-17 “Since the agent only observes images of the current screen, the task is partially observed and many emulator states are perceptually aliased, i.e. it is impossible to fully understand the current situation from only the current screen x_t. We therefore consider sequences of actions and observations, s_t = x_1, a_1, x_2, ..., a_{t-1}, x_t, and learn game strategies that depend upon these sequences.”)
a first action taken by the robot at the first time, (col 8 lines 66-67 “At each time-step the agent selects an action a_t from the set of legal game actions.”)
a first reward received by the robot in response to the first action, (col 11 lines 28-35 “The procedure then again inputs state (image sequence) data and stores experience data (S204). The stored experience data comprises the before and after states, the action taken, and the reward earned. At step S206, the procedure draws a transition from the stored experience data, either randomly or according to a prioritised strategy, and provides the end, after state of the transition to the first neural network (neural network 0).”)
and a second state of the robot in response to the first action at a second time after the first time; (Examiner notes that Mnih teaches having multiple states and actions to move from one state to the next [corresponds to second state]; see col 8 lines 66-67 “At each time-step the agent selects an action a_t from the set of legal game actions.” and also see col 4 lines 18-23 “In embodiments the experience data includes reward (or cost) data relating to a reward (or cost) of the action in moving from a current state to a subsequent state [corresponds to second state]. The reward/cost may be measured from the system, for example by inputting data defining a reward or cost collected/incurred by the action.”)
…
pruning the plurality of experiences in the memory based on the degree of … to form a pruned plurality of experiences stored in the memory; (col 11 lines 1-10 “The procedure begins by inputting state data from a controlled system (S200). For the test system of an Atari™ game emulator this comprised a sequence of image frames from the game. As described later, in this test environment frame-skipping was employed, and the captured images were down-sampled to reduce [corresponds to pruned] the quantity of data to be processed. One of the advantages of the approach we describe is that the procedure is able to accept image pixel data as an input rather than relying on a hand-constructed representation of the system under control.” also see col 12 lines 1-4 “The procedure then loops back from step S212 to step S202 to select a further action. In embodiments the size of the experience data store is limited and therefore, as new experience data is stored, older experience data may be discarded, for example using a FIFO (first in first out) strategy.”)
training a neural network associated with the robot with the pruned plurality of experiences; (col 12 lines 1-11 “the experience data store is limited and therefore, as new experience data is stored, older experience data may be discarded, for example using a FIFO (first in first out) strategy. After a defined number of training steps, for example every 10, 10, or 10 steps, the weights from the second, trained neural network are copied across to the first neural network (S214) so that, in effect, the neural network for the Q-values becomes the neural network for the Q-values, and the training of the second neural network proceeds. The training procedure may continue indefinitely or, in other approaches, the training may be terminated, for example after a predetermined number of training steps and/or based on a training metric such as an average predicted state-value function for a defined set of states.” Also see col 16 lines 16-24 “During a learning phase module 110 samples the transition from the experience data store 108 and adjusts the weights of neural network 150 (neural network 1) based on a target from neural network 0, an earlier copy of neural network 1 having weights stored in module 110. Thus in embodiments the actions selected by neural network 1 provide stored experience data from which neural network 0 draws, to provide targets for training neural network 1.”)
and generating a second action for the robot using the neural network. (Col 15 lines 20-29 “Thus instead preferred embodiments employed an architecture in which there is a separate output unit for each possible action, and only the state representation is an input to the neural network. The outputs correspond to the predicted Q-values of the individual action for the input state, as shown schematically for neural network 150b in FIG. 3b. One advantage of this type of architecture is the ability to compute Q-values for all possible actions in a given state with only a single forward pass through the network.” Also see col 8 lines 64-67 “We consider tasks in which an agent interacts with an environment E, in this case the Atari emulator, in a sequence of actions, observations and rewards. At each time-step the agent selects an action a_t from the set of legal game actions A = {1, ..., K}.”)
Mnih does not teach determining a degree of similarity between the first experience and a plurality of experiences stored in a memory for the robot; 
…
…plurality of experiences in the memory based on the degree of similarity between the first experience and the plurality of experiences.
Berenson teaches determining a degree of similarity between the first experience  (pg. 1 left col “Our framework, which we call Lightning for its ability to plan quickly, leverages the generality of PFS to produce solutions in new situations and the efficiency of re-using previous experience in situations similar to previously-encountered ones. After a path is generated, a library manager decides whether to store the path based on the computation times of the two modules and the generated path’s similarity to the retrieved path.”)
and a plurality of experiences stored in a memory for the robot; (pg. 1 right col “The Lightning framework (see Figure 1) consists of two main modules, which are run in parallel: PFS, and a module that retrieves and repairs paths stored in a path library, which we call Retrieve-Repair (RR). Given a new query, both modules are started simultaneously and the first path produced by either module is executed on the robot while the other module is stopped. After a path is generated, a library manager decides whether to store the path based on the computation times of the two modules and the generated path’s similarity to the retrieved path.”)
…
…plurality of experiences in the memory based on the degree of similarity between the first experience and the plurality of experiences. (Pg. 1 left col “Our framework, which we call Lightning for its ability to plan quickly, leverages the generality of PFS to produce solutions in new situations and the efficiency of re-using previous experience in situations similar to previously-encountered ones. After a path is generated, a library manager decides whether to store the path based on the computation times of the two modules and the generated path’s similarity to the retrieved path.”)
Mnih and Berenson are analogous art because they are both directed to learning from robot/agent experience.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the reinforcement learning of Mnih, which operates on a subject system having multiple states, to include determining a degree of similarity between the first experience and a plurality of experiences, as taught by Berenson, in order to reduce the computation time needed to generate a new path, as disclosed by Berenson (abstract “After a path is generated for a new query, a library manager decides whether to store the path based on computation time and the generated path’s similarity to the retrieved path. To retrieve an appropriate path from the library we use two heuristics that exploit two key aspects of the problem: (i) A correlation between the amount a path violates constraints and the amount of time needed to repair that path, and (ii) the implicit division of constraints into those that vary across environments in which the robot operates and those that do not.”).
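
For illustration only, the combination discussed above for claim 1 can be sketched as a replay memory that discards the oldest experience when full (FIFO) and declines to store a new experience that is too similar to one already stored. This is a minimal sketch under stated assumptions, not code from Mnih, Berenson, or the application: the names Experience, ReplayMemory, and min_distance are hypothetical, and Euclidean distance between states is an assumed stand-in for a "degree of similarity".

```python
import math
import random
from collections import deque
from typing import List, NamedTuple, Tuple


class Experience(NamedTuple):
    """One transition: (state, action, reward, next state). Hypothetical type."""
    state: Tuple[float, ...]
    action: int
    reward: float
    next_state: Tuple[float, ...]


def state_distance(a: Experience, b: Experience) -> float:
    # Assumed similarity measure: Euclidean distance between starting states.
    return math.dist(a.state, b.state)


class ReplayMemory:
    """Bounded experience store with FIFO discard and a similarity gate."""

    def __init__(self, capacity: int, min_distance: float) -> None:
        # deque(maxlen=...) silently drops the oldest item when full (FIFO).
        self.buffer = deque(maxlen=capacity)
        self.min_distance = min_distance

    def add(self, exp: Experience) -> bool:
        # Skip storage when a stored experience is closer than the threshold.
        if any(state_distance(exp, old) < self.min_distance for old in self.buffer):
            return False
        self.buffer.append(exp)
        return True

    def sample(self, batch_size: int) -> List[Experience]:
        # Draw a random minibatch of stored experiences for training.
        pool = list(self.buffer)
        return random.sample(pool, min(batch_size, len(pool)))
```

In this sketch the threshold test plays the role of the store-or-not decision, and the bounded deque plays the role of discarding older experience data; a trained network would then consume the batches returned by sample.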

Regarding claim 2
Mnih in view of Berenson teaches the computer-implemented method of claim 1. 
Mnih further teaches and removing a second experience from the memory based on the comparison, (col 12 lines 1-5 “The procedure then loops back from step S212 to step S202 to select a further action. In embodiments the size of the experience data store is limited and therefore, as new experience data is stored, older experience data may be discarded, for example using a FIFO (first in first out) strategy.”)
the second experience being at least one of the first experience and an experience from the plurality of experiences. (Examiner notes that the procedure loops back from step S212 to step S202, and the second time it loops back the system gets another input from S200, which corresponds to the second experience; see col 12 lines 1-5 “The procedure then loops back from step S212 to step S202 to select a further action. In embodiments the size of the experience data store is limited and therefore, as new experience data is stored, older experience data may be discarded, for example using a FIFO (first in first out) strategy.”)
Berenson further teaches wherein the pruning further comprises: for each experience in the plurality of experiences: computing a distance from the first experience; (pg. 1 right col “These features allow us to formulate two heuristics to retrieve a path from the library: the first quickly selects n candidate paths from the library by measuring the distance between the endpoints of the path and the query start and goal, and the second retrieves the path from the n candidates which has the least constraint violation.”)
and comparing the distance to another distance of that experience from each other experience in the plurality of experiences; (pg. 3 right col “To do this, we first define a function which evaluates the distance between a path p and the set P(t)… where l is a line segment between the two input configurations. The sum of the lengths of the line segments that need to be added to p to achieve the task can be used as a distance function between p and t:”)
Mnih and Berenson are analogous art because they are both directed to learning from robot/agent experience.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the reinforcement learning of Mnih, which operates on a subject system having multiple states, to include determining a degree of similarity between the first experience and a plurality of experiences, as taught by Berenson, in order to reduce the computation time needed to generate a new path, as disclosed by Berenson (abstract “After a path is generated for a new query, a library manager decides whether to store the path based on computation time and the generated path’s similarity to the retrieved path. To retrieve an appropriate path from the library we use two heuristics that exploit two key aspects of the problem: (i) A correlation between the amount a path violates constraints and the amount of time needed to repair that path, and (ii) the implicit division of constraints into those that vary across environments in which the robot operates and those that do not.”).
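
For illustration only, one possible reading of the claim 2 limitations (computing each stored experience's distance from the first experience, comparing it with distances among the stored experiences, and removing an experience based on the comparison) is sketched below. The function name prune_by_distance, the use of Euclidean distance, and the specific removal rule are all assumptions made for this sketch; neither reference nor the application is represented as disclosing this exact procedure.

```python
import math
from typing import List, Sequence

State = Sequence[float]


def prune_by_distance(stored: List[State], new: State) -> List[State]:
    """Return the stored set after considering `new`, removing one redundant item.

    Assumed rule: if `new` is at least as close to some stored item as the
    closest stored pair is to each other, `new` is the item removed.
    Otherwise one member of the closest stored pair is removed and `new` kept.
    """
    if len(stored) < 2:
        return list(stored) + [new]

    # Distance of each stored experience from the new (first) experience.
    d_new = [math.dist(s, new) for s in stored]

    # Closest pair among the stored experiences themselves.
    best_j, d_pair = 1, math.inf
    for i in range(len(stored)):
        for j in range(i + 1, len(stored)):
            d = math.dist(stored[i], stored[j])
            if d < d_pair:
                best_j, d_pair = j, d

    if min(d_new) <= d_pair:
        # The new experience is the most redundant one: do not keep it.
        return list(stored)

    # Otherwise drop the later member of the closest stored pair, keep the new one.
    kept = [s for k, s in enumerate(stored) if k != best_j]
    return kept + [new]
```

Under this assumed rule the set size stays constant once it holds two or more items, and the item removed may be either the first experience or one of the stored experiences, matching the two alternatives recited in the claim.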

Regarding claim 14
Claim 14 recites analogous limitations to claim 2 and therefore is rejected on the same ground as claim 2.



Regarding claim 7
Mnih in view of Berenson teaches the computer-implemented method of claim 1.
Mnih further teaches wherein at a first input state the neural network generates an output based at least in part on the pruned plurality of experiences. (Col 15 lines 56-65 “In the illustrated example the first set of 4x16 8x8 pixel filters (kernels) operating on the set of (84x84) (x4) input frames generates a set of 16 20x20 feature maps for each set of 4 frames, and the second set of 16x32 4x4 pixel filters operating on these generates 32 9x9 feature maps for each frame. The neural network structure of FIG. 4 corresponds to the arrangement shown in FIG. 3b, in which state data 152 presented at the input of the neural network generates a set of Q-value outputs on output units 164, one for each action.”)
Regarding claim 19
Claim 19 recites analogous limitations to claim 8 and therefore is rejected on the same ground as claim 8.

Regarding claim 8
Mnih in view of Berenson teaches the computer-implemented method of claim 1.
Mnih further teaches wherein the pruned plurality of experiences includes a diverse set of states of the robot. (Col 11 lines 28-35 “The procedure then again inputs state (image sequence) data and stores experience data (S204). The stored experience data comprises the before and after states, the action taken, and the reward earned. At step S206, the procedure draws a transition from the stored experience data, either randomly or according to a prioritised strategy, and provides the end, after state of the transition to the first neural network (neural network 0).”)

Regarding claim 9
Mnih in view of Berenson teaches the computer-implemented method of claim 1.
Mnih further teaches wherein the generating the second action for the robot includes determining that the robot is in the first state (col 4 lines 40-46 “Thus the next state, resulting from the action, is input to the first neural network and the maximum (or minimum) action-value parameter (Q-value) is identified, is optionally discounted by a discount factor between 0 and 1, and the reward in moving from the starting state to the next state is added (or the cost subtracted) to generate a target action-value parameter for the starting state given the action.”)
and selecting the second action to be different than the first action. (Col 8 lines 64-67 “We consider tasks in which an agent interacts with an environment E, in this case the Atari emulator, in a sequence of actions, observations and rewards. At each time-step the agent selects an action a_t from the set of legal game actions A = {1, ..., K}.” Also see col 14 lines 53-62 “Embodiments also use a frame-skipping technique: The agent sees and selects actions on every kth frame instead of every frame, and its last action is repeated on skipped frames. It is coincidental that the number of skipped frames is the same as the number of frames constituting a state representation: this need not be the case. Since running the emulator forward for one step requires much less computation than having the agent select an action, this technique allows the agent to play roughly k times more games without significantly increasing the runtime.”)
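
For illustration only, the frame-skipping behavior described in the passage quoted from col 14 (an action is selected on every kth frame and repeated on the skipped frames) can be sketched as follows. The function name apply_frame_skip and its parameters are hypothetical, and select_action stands in for whatever policy the agent uses; this is a sketch of the quoted description, not code from Mnih.

```python
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")
Action = TypeVar("Action")


def apply_frame_skip(
    select_action: Callable[[Frame], Action],
    frames: Sequence[Frame],
    k: int,
) -> List[Action]:
    """Return the action applied at each frame when acting only every k-th frame."""
    if k < 1:
        raise ValueError("k must be at least 1")
    applied: List[Action] = []
    for index, frame in enumerate(frames):
        if index % k == 0:
            # The agent sees this frame and selects a fresh action.
            last_action = select_action(frame)
        # On skipped frames the most recent action is repeated.
        applied.append(last_action)
    return applied
```

With k = 1 the policy is consulted on every frame; larger k reduces the number of policy evaluations by roughly a factor of k, which is the runtime saving the quoted passage describes.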

Regarding claim 10
Mnih in view of Berenson teaches the computer-implemented method of claim 9.
Mnih further teaches the computer-implemented method further comprising: receiving a second reward by the robot in response to the second action. (Col 4 lines 40-46 “Thus the next state, resulting from the action, is input to the first neural network and the maximum (or minimum) action-value parameter (Q-value) is identified, is optionally discounted by a discount factor between 0 and 1, and the reward in moving from the starting state to the next state is added (or the cost subtracted) to generate a target action-value parameter for the starting state given the action.”)
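
For illustration only, the target computation described in the passage quoted from col 4 (input the next state to the first neural network, take the maximum Q-value, discount it, and add the reward) can be sketched as follows. The function name target_q_value and the callable frozen_q, which stands in for the first neural network and returns one Q-value per action, are hypothetical; the default discount of 0.99 is an assumed value within the quoted range of 0 to 1.

```python
from typing import Callable, Sequence


def target_q_value(
    reward: float,
    next_state: Sequence[float],
    frozen_q: Callable[[Sequence[float]], Sequence[float]],
    discount: float = 0.99,
) -> float:
    """Target for the starting state and action: reward + discount * max Q(next state)."""
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must lie between 0 and 1")
    # One forward pass yields a Q-value for every possible action.
    next_q_values = frozen_q(next_state)
    return reward + discount * max(next_q_values)
```

For example, with a reward of 1.0, a discount of 0.5, and next-state Q-values of 1.0, 3.0, and 2.0, the target is 1.0 + 0.5 x 3.0 = 2.5.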

Regarding claim 11
Mnih in view of Berenson teaches the computer-implemented method of claim 1.
Mnih further teaches the computer-implemented method further comprising: collecting a second experience for the robot, (col 5 lines 60-67 “Thus in a related aspect the invention provides a processor configured to perform reinforcement learning, the system comprising: an input to receive training data from a system having a plurality of states and, for each state, a set of actions to move from one of said states to next said state; wherein said training data is generated by operating on said system with a succession of said actions and comprises starting state data, action data and next state data defining, respectively…”)
the second experience representing: a second state of the robot, the second action taken by the robot in response to the second state, (col 9 lines 12-17 “Since the agent only observes images of the current screen, the task is partially observed and many emulator states are perceptually aliased, i.e. it is impossible to fully understand the current situation from only the current screen x_t. We therefore consider sequences of actions and observations, s_t = x_1, a_1, x_2, ..., a_{t-1}, x_t, and learn game strategies that depend upon these sequences.”)
a second reward received by the robot in response to the second action, (col 8 lines 64-67 “We consider tasks in which an agent interacts with an environment E, in this case the Atari emulator, in a sequence of actions, observations and rewards. At each time-step the agent selects an action a_t from the set of legal game actions.” also see col 11 lines 29-35 “The procedure then again inputs state (image sequence) data and stores experience data (S204). The stored experience data comprises the before and after states, the action taken, and the reward earned. At step S206, the procedure draws a transition from the stored experience data, either randomly or according to a prioritised strategy, and provides the end, after state of the transition to the first neural network (neural network 0).”)
and a third state of the robot in response to the second action; (Examiner notes that Mnih teaches having multiple states and actions to move from one state to the next; see col 8 lines 66-67 “At each time-step the agent selects an action a_t from the set of legal game actions.” and also see col 4 lines 18-23 “In embodiments the experience data includes reward (or cost) data relating to a reward (or cost) of the action in moving from a current state to a subsequent state. The reward/cost may be measured from the system, for example by inputting data defining a reward or cost collected/incurred by the action.” Also see col 9 lines 12-17 “Since the agent only observes images of the current screen, the task is partially observed and many emulator states are perceptually aliased, i.e. it is impossible to fully understand the current situation from only the current screen x_t. We therefore consider sequences of actions and observations, s_t = x_1, a_1, x_2, ..., a_{t-1}, x_t, and learn game strategies that depend upon these sequences.”)
Berenson further teaches determining a degree of similarity between the second experience and the pruned plurality of experiences; (pg. 1 left col “Our framework, which we call Lightning for its ability to plan quickly, leverages the generality of PFS to produce solutions in new situations and the efficiency of re-using previous experience in situations similar to previously-encountered ones. After a path is generated, a library manager decides whether to store the path based on the computation times of the two modules and the generated path’s similarity to the retrieved path.”)
and …experiences in the memory based on the degree of similarity between the second experience and the …plurality of experiences. (Pg. 1 left col “Our framework, which we call Lightning for its ability to plan quickly, leverages the generality of PFS to produce solutions in new situations and the efficiency of re-using previous experience in situations similar to previously-encountered ones. After a path is generated, a library manager decides whether to store the path based on the computation times of the two modules and the generated path’s similarity to the retrieved path.”)
Mnih and Berenson are analogous art because they are both directed to learning from robot/agent experience.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the reinforcement learning of Mnih, which operates on a subject system having multiple states, to include determining a degree of similarity between the first experience and a plurality of experiences, as taught by Berenson, in order to reduce the computation time needed to generate a new path, as disclosed by Berenson (abstract “After a path is generated for a new query, a library manager decides whether to store the path based on computation time and the generated path’s similarity to the retrieved path. To retrieve an appropriate path from the library we use two heuristics that exploit two key aspects of the problem: (i) A correlation between the amount a path violates constraints and the amount of time needed to repair that path, and (ii) the implicit division of constraints into those that vary across environments in which the robot operates and those that do not.”).

Regarding claim 12
Mnih teaches a system for generating a second action for a robot, (col 8 lines 64-67 “We consider tasks in which an agent interacts with an environment E, in this case the Atari emulator, in a sequence of actions, observations and rewards. At each time-step the agent selects an action a_t from the set of legal game actions.”)
the system comprising: an interface to collect a first experience for the robot, (col 13 lines 53-59 “In the above algorithms we store the agent's experiences at each timestep, e_t = (s_t, a_t, r_t, s_{t+1}), in a data-set D = e_1, ..., e_N, pooled over many episodes into a replay memory. During the inner loop of the algorithm, Q-learning updates, or minibatch updates, are applied to samples of experience, e ~ D, drawn at random from the pool of stored samples.”)
the first experience representing: a first state of the robot at a first time, (col 4 lines 32-40 “In embodiments the experience data is used in conjunction with the first neural network for training the second neural network. More particularly a transition comprising a first, starting state, action, and next state is sampled from the stored experience data. This is used to generate a target action-value parameter (Q-value) from the first neural network (which is, in embodiments, a previously made copy of the second neural network), for training the second neural network.” also see col 9 lines 12-17 “Since the agent only observes images of the current screen, the task is partially observed and many emulator states are perceptually aliased, i.e. it is impossible to fully understand the current situation from only the current screen x_t. We therefore consider sequences of actions and observations, s_t = x_1, a_1, x_2, ..., a_{t-1}, x_t, and learn game strategies that depend upon these sequences.”)
a first action taken by the robot at the first time, (col 8 lines 66-67 “At each time-step the agent selects an action a_t from the set of legal game actions.”)
a first reward received by the robot in response to the first action, (col 11 lines 28-35 “The procedure then again inputs state (image sequence) data and stores experience data (S204). The stored experience data comprises the before and after states, the action taken, and the reward earned. At step S206, the procedure draws a transition from the stored experience data, either randomly or according to a prioritised strategy, and provides the end, after state of the transition to the first neural network (neural network 0).”)
and a second state of the robot in response to the first action at a second time after the first time; (Examiner notes that Mnih teaches having multiple states and actions to move from one state to the next [corresponds to second state]; see col 8 lines 66-67 “At each time-step the agent selects an action a_t from the set of legal game actions.” and also see col 4 lines 18-23 “In embodiments the experience data includes reward (or cost) data relating to a reward (or cost) of the action in moving from a current state to a subsequent state [corresponds to second state]. The reward/cost may be measured from the system, for example by inputting data defining a reward or cost collected/incurred by the action.”)
a memory to store at least one of a plurality of experiences and a pruned plurality of experiences for the robot; (col 11 lines 28-31 “The procedure then again inputs state (image sequence) data and stores experience data (S204). The stored experience data comprises the before and after states, the action taken, and the reward earned.” Also see col 13 lines 53-56 “In the above algorithms we store the agent's experiences at each time-step, e_t = (s_t, a_t, r_t, s_{t+1}), in a data… pooled over many episodes into a replay memory”)
a processor, (col 15 lines 66-67 “FIG. 5a shows a schematic block diagram of a data processor 100 configured to implement a neural network based reinforcement learning”)
in digital communication with the interface and the memory, (col 16 lines 24-34 “FIG. 5b shows a general purpose computer system 100 programmed to implement corresponding functions to those illustrated in FIG. 5a. Thus the system comprises a deep Q-learner 122 incorporating a processor, working memory, and non-volatile program memory 124. The program memory stores, inter alia, neural network code, action select code, experience store code, target Q generation code and weight update code. Parameter memory 126 stores the weights of the neural networks and the experience data. The code 124 may be provided on a physical carrier medium such as disk 128.”)
…
update the memory to store the pruned plurality of experiences; (col 13 lines 56-61 “During the inner loop of the algorithm, Q-learning updates, or minibatch updates, are applied to samples of experience, e ~ D, drawn at random from the pool of stored samples. After performing experience replay, the agent selects and executes an action according to an ε-greedy policy (where 0 ≤ ε ≤ 1 and may change over time).”)
train a neural network associated with the robot with the pruned plurality of experiences; (col 12 lines 1-11 “the experience data store is limited and therefore, as new experience data is stored, older experience data may be discarded, for example using a FIFO (first in first out) strategy. After a defined number of training steps, for example every 10, 10, or 10 steps, the weights from the second, trained neural network are copied across to the first neural network (S214) so that, in effect, the neural network for the Q-values becomes the neural network for the Q-values, and the training of the second neural network proceeds. The training procedure may continue indefinitely or, in other approaches, the training may be terminated, for example after a predetermined number of training steps and/or based on a training metric such as an average predicted state-value function for a defined set of states.” Also see col 16 lines 16-24 “During a learning phase module 110 samples the transition from the experience data store 108 and adjusts the weights of neural network 150 (neural network 1) based on a target from neural network 0, an earlier copy of neural network 1 having weights stored in module 110. Thus in embodiments the actions selected by neural network 1 provide stored experience data from which neural network 0 draws, to provide targets for training neural network 1.”)
and generate the second action for the robot using the neural network. (Col 15 lines 20-29 “Thus instead preferred embodiments employed an architecture in which there is a separate output unit for each possible action, and only the state representation is an input to the neural network. The outputs correspond to the predicted Q-values of the individual action for the input state, as shown schematically for neural network 150b in FIG. 3b. One advantage of this type of architecture is the ability to compute Q-values for all possible actions in a given state with only a single forward pass through the network.” Also see col 8 lines 64-67 “We consider tasks in which an agent interacts with an environment E, in this case the Atari emulator, in a sequence of actions, observations and rewards. At each time-step the agent selects an action a_t from the set of legal game actions A = {1, ..., K}.”)
Mnih does not teach to: determine a degree of similarity between the first experience and the plurality of experiences stored in the memory; 
prune the plurality of experiences in the memory based on the degree of similarity between the first experience and the plurality of experiences to form the pruned plurality of experiences.
Berenson teaches to: determine a degree of similarity between the first experience (pg. 1 left col “Our framework, which we call Lightning for its ability to plan quickly, leverages the generality of PFS to produce solutions in new situations and the efficiency of re-using previous experience in situations similar to previously-encountered ones. After a path is generated, a library manager decides whether to store the path based on the computation times of the two modules and the generated path’s similarity to the retrieved path.”)
and the plurality of experiences stored in the memory; (pg. 1 right col “The Lightning framework (see Figure 1) consists of two main modules, which are run in parallel: PFS, and a module that retrieves and repairs paths stored in a path library, which we call Retrieve-Repair (RR). Given a new query, both modules are started simultaneously and the first path produced by either module is executed on the robot while the other module is stopped. After a path is generated, a library manager decides whether to store the path based on the computation times of the two modules and the generated path’s similarity to the retrieved path.”)
…plurality of experiences in the memory based on the degree of similarity between the first experience and the plurality of experiences to form the …plurality of experiences; (Pg. 1 left col “Our framework, which we call Lightning for its ability to plan quickly, leverages the generality of PFS to produce solutions in new situations and the efficiency of re-using previous experience in situations similar to previously-encountered ones. After a path is generated, a library manager decides whether to store the path based on the computation times of the two modules and the generated path’s similarity to the retrieved path.”)
Mnih and Berenson are analogous art because they are both directed to learning robot/agent experience. 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified reinforcement learning for subject system having multiple states of Mnih to include determining a degree of similarity between the first experience and a plurality of experiences of Berenson in order to improve the computation time in generating new path quickly as disclosed by Berenson (abstract “After a path is generated for a new query, a library manager decides whether to store the path based on computation time and the generated path’s similarity to the retrieved path. To retrieve an appropriate path from the library we use two heuristics that exploit two key aspects of the problem: (i) A correlation between the amount a path violates constraints and the amount of time needed to repair that path, and (ii) the implicit division of constraints into those that vary across environments in which the robot operates and those that do not.”). 
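
For illustration only, the proposed combination (a bounded, first-in-first-out experience store as quoted from Mnih, gated by a similarity check in the manner of Berenson's library manager) might be sketched as follows. The sketch is not taken from either reference: the class name `PrunedReplayMemory`, the Euclidean distance measure, and the `min_distance` parameter are assumptions chosen for the example.

```python
import math
from collections import deque


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class PrunedReplayMemory:
    """Hypothetical bounded experience store with a similarity gate."""

    def __init__(self, capacity, min_distance):
        # deque(maxlen=...) discards the oldest entry when full (FIFO).
        self.buffer = deque(maxlen=capacity)
        self.min_distance = min_distance

    def add(self, experience):
        """Store the experience unless a stored one is closer than min_distance."""
        for stored in self.buffer:
            if distance(stored, experience) < self.min_distance:
                return False  # too similar to an existing experience
        self.buffer.append(experience)
        return True
```

Under these assumptions, an experience that nearly duplicates a stored one is skipped, while a dissimilar experience is stored and, once capacity is reached, displaces the oldest entry.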

Claims 3 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Mnih et al. (US Pat No. 9679258 B2) in view of Berenson et al. (“A Robot Path Planning Framework that Learns from Experience”) and further in view of Emery-Montemerlo et al. (“Game Theoretic Control for Robot Teams”, hereinafter: Emery-Montemerlo).
Regarding claim 3
Mnih in view of Berenson teaches the computer-implemented method of claim 2. 
Mnih in view of Berenson does not teach the computer-implemented method further comprising removing the second experience from the memory based on a probability that the distance of the second experience from the first experience and each experience in the plurality of experiences is less than a user-defined threshold.  
Emery-Montemerlo teaches the computer-implemented method further comprising removing the second experience from the memory based on a probability that the distance of the second experience from the first experience and each experience in the plurality of experiences is less than a user-defined threshold. (Pg. 1166 “In this approach, we initially order the single-history clusters at random. Then, we make a single pass through the list of clusters. For each cluster, we test whether its probability is below a threshold; if so, we remove it from the list and merge it with its nearest remaining neighbor as determined by the worst-case expected loss between their representative histories.”)
Mnih, Berenson and Emery-Montemerlo are analogous art because they are all directed to learning robot/agent experience. 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified reinforcement learning for subject system having multiple states of Mnih in view of Berenson to include removing the second experience from the memory based on a probability of Emery-Montemerlo in order to quickly filter out unnecessary experiences and reduce computation time as disclosed by Emery-Montemerlo (pg. 1166 “In this approach, we initially order the single-history clusters at random. Then, we make a single pass through the list of clusters. For each cluster, we test whether its probability is below a threshold; if so, we remove it from the list and merge it with its nearest remaining neighbor as determined by the worst-case expected loss between their representative histories.”).
Regarding claim 15
Claim 15 recites analogous limitations to claim 3 and therefore is rejected on the same ground as claim 3.
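
For illustration only, the single-pass threshold test quoted from Emery-Montemerlo might be sketched as follows. The dictionary keys `rep` and `prob`, the function name, and the caller-supplied `loss` function are assumptions; in particular, `loss` stands in for the worst-case expected loss named in the reference, and the initial random ordering is left to the caller.

```python
def prune_low_probability(clusters, threshold, loss):
    """Single pass: drop clusters below the threshold, merging each into its nearest neighbor.

    clusters: list of dicts with keys "rep" (representative) and "prob" (hypothetical layout).
    loss: function(rep_a, rep_b) -> float, an assumed stand-in for the reference's loss.
    """
    remaining = [dict(c) for c in clusters]  # copy so the caller's data is untouched
    i = 0
    while i < len(remaining):
        cluster = remaining[i]
        if cluster["prob"] < threshold and len(remaining) > 1:
            remaining.pop(i)
            nearest = min(remaining, key=lambda other: loss(cluster["rep"], other["rep"]))
            nearest["prob"] += cluster["prob"]  # merge probability mass
            # do not advance i: the next cluster has shifted into position i
        else:
            i += 1
    return remaining
```

With a scalar representative and absolute difference as the assumed loss, a low-probability cluster is absorbed by whichever surviving cluster is closest to it.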
Claims 4 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Mnih et al. (US Pat No. 9679258 B2) in view of Berenson et al. (“A Robot Path Planning Framework that Learns from Experience”) and further in view of Schaul et al. (US 2017/0140269 A1).
Regarding claim 4
Mnih in view of Berenson teaches the computer-implemented method of claim 1.
Mnih in view of Berenson does not teach where the pruning further includes ranking the first experience and each experience in the plurality of experiences.  
Schaul teaches where the pruning further includes ranking the first experience and each experience in the plurality of experiences. (Para [0053] “In some other implementations, the priority for a piece of experience data is a fraction having a predetermined positive value as a numerator and a rank of the piece of experience data in a ranking of the pieces of experience data in the replay memory according to their expected learning progress measures as a denominator.”)
Mnih, Berenson and Schaul are analogous art because they are all directed to learning robot/agent experience. 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified reinforcement learning for subject system having multiple states of Mnih in view of Berenson to include determining a degree of similarity between the first experience and a plurality of experiences of Schaul in order to improve the computation time in generating new path quickly as disclosed by Schaul (abstract “After a path is generated for a new query, a library manager decides whether to store the path based on computation time and the generated path’s similarity to the retrieved path. To retrieve an appropriate path from the library we use two heuristics that exploit two key aspects of the problem: (i) A correlation between the amount a path violates constraints and the amount of time needed to repair that path, and (ii) the implicit division of constraints into those that vary across environments in which the robot operates and those that do not.”). 
Regarding claim 16
Claim 16 recites analogous limitations to claim 4 and therefore is rejected on the same ground as claim 4.
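
For illustration only, the rank-based priority quoted from Schaul at para [0053] (a predetermined positive numerator divided by the rank of the piece of experience data) might be sketched as follows. The function name and the convention that the largest expected learning progress measure receives rank 1 are assumptions made for the example.

```python
def rank_based_priorities(progress_measures, numerator=1.0):
    """Return numerator / rank for each item, ranked by its progress measure.

    Assumption: the item with the largest measure is given rank 1.
    """
    order = sorted(
        range(len(progress_measures)),
        key=lambda i: progress_measures[i],
        reverse=True,
    )
    priorities = [0.0] * len(progress_measures)
    for rank, index in enumerate(order, start=1):
        priorities[index] = numerator / rank
    return priorities
```

Under these assumptions, three experiences with measures 0.2, 0.9, and 0.5 would receive priorities 1/3, 1, and 1/2, respectively.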

Claims 5 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Mnih et al. (US Pat No. 9679258 B2) in view of Berenson et al. (“A Robot Path Planning Framework that Learns from Experience”) and further in view of Schaul et al. (US 2017/0140269 A1) and further in view of Alnajjar et al. (“A Hierarchical Autonomous Robot Controller for Learning and Memory: Adaptation in a Dynamic Environment”, hereinafter: Alnajjar).
Regarding claim 5
Mnih in view of Berenson with Schaul teaches the computer-implemented method of claim 4. 
Mnih further teaches … automatically discarding the first experience (col 11 lines 1-7 “The procedure begins by inputting state data from a controlled system (S200). For the test system of an Atari™ game emulator this comprised a sequence of image frames from the game. As described later, in this test environment frame-skipping was employed, and the captured images were down-sampled to reduce the quantity of data to be processed.”)
Mnih in view of Berenson with Schaul does not teach wherein the ranking includes creating a plurality of clusters based at least in part on synaptic weights and …the first experience upon determining that the first experience fits one of the plurality of clusters.  
Alnajjar teaches wherein the ranking includes creating a plurality of clusters based at least in part on synaptic weights (pg. 187 left col “If after deleting all the existing UCTN the memory is still full, the dynamic clustering mechanism starts to operate, that is, the connections between the nodes[corresponds to synaptic weights] are reorganized. It starts clustering all similar networks in the level ESM1. The network that has a longer experience time and better fitness than others will survive”)
and …the first experience upon determining that the first experience fits one of the plurality of clusters. (Pg. 193 right col “in our model we developed a hierarchical adaptive controller with dynamic memory that can: (a) remember the maximum possible experiences that the robot has gone through; (b) learn from the relation between stored experiences to predict advance synaptic weights that are close to the optimal ones, to speed up the adaptation time for new environments; (c) forget only the networks that are not well trained or that initially have bad synaptic weights and take a long time to train; (d) cluster the trained networks that are, to some degree, performing similar behaviors with an online changeable threshold value (ESM) that is based on the current memory capacity.”)
Mnih, Berenson, Schaul and Alnajjar are analogous art because they are all directed to learning robot/agent experience. 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified reinforcement learning for subject system having multiple states of Mnih in view of Berenson with Schaul to include creating a plurality of clusters based on synaptic weights of Alnajjar in order to improve the computation time by clustering similar networks as disclosed by Alnajjar (pg. 187 left col “If after deleting all the existing UCTN the memory is still full, the dynamic clustering mechanism starts to operate, that is, the connections between the nodes are reorganized. It starts clustering all similar networks in the level ESM1.”).
Regarding claim 17
Claim 17 recites analogous limitations to claim 5 and therefore is rejected on the same ground as claim 5.

Claims 6 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Mnih et al. (US Pat No. 9679258 B2) in view of Berenson et al. (“A Robot Path Planning Framework that Learns from Experience”) in view of Schaul et al. (US 2017/0140269 A1) in view of Alnajjar et al. (“A Hierarchical Autonomous Robot Controller for Learning and Memory: Adaptation in a Dynamic Environment”, hereinafter: Alnajjar) and further in view of Chebotar et al. (“Learning Robot Tactile Sensing for Object Manipulation”, hereinafter: Chebotar).
Regarding claim 6
Mnih in view of Berenson with Schaul and Alnajjar teaches the computer-implemented method of claim 5. 
Mnih in view of Berenson with Schaul and Alnajjar does not teach wherein the ranking includes encoding each experience in the plurality of experiences, encoding the first experience, and comparing the encoded experiences to the plurality of clusters.  
Chebotar teaches wherein the ranking includes encoding each experience in the plurality of experiences, encoding the first experience, (pg. 3374 left col “For each principal component, we learned three weights for the three phases of the scraping action. The robot movements were encoded in a three dimensional Cartesian space with a separate DMP for each dimension.”)
and comparing the encoded experiences to the plurality of clusters. (Pg. 3371 right col “Further reduction of the number of tactile feedback weights can be achieved by dividing the action into phases and learning only a single weight for each phase [28]. Recognition of the action phases can be accomplished by clustering tactile images based on their similarity. Each action phase will have similar tactile images belonging to a specific cluster.”)
Mnih, Berenson, Schaul, Alnajjar and Chebotar are analogous art because they are all directed to learning robot/agent experience. 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified reinforcement learning for subject system having multiple states of Mnih in view of Berenson with Schaul and Alnajjar to include encoding each experience in the plurality of experiences and comparing the encoded experiences of Chebotar in order to quickly determine a desired tactile trajectory from the encoded information and to reduce the dimensionality of high-dimensional data to save computation time as disclosed by Chebotar (pg. 3368 “In particular, we employ dynamic motor primitives (DMPs) [3] for learning a movement from human demonstration by kinesthetic teach-in. The tactile information is encoded as a desired tactile trajectory and tactile feedback is added to the system through perceptual coupling [4]. Third, the parameters of tactile feedback are learned with the relative entropy policy search (REPS) reinforcement learning algorithm [5]. We face the problem of high dimensionality of the tactile data and therefore perform dimensionality reduction with spectral clustering [6] and principal component analysis [7].”). 
Regarding claim 18
Claim 18 recites analogous limitations to claim 6 and therefore is rejected on the same ground as claim 6.

Claim 13 is rejected under 35 U.S.C. 103 as being unpatentable over Mnih et al. (US Pat No. 9679258 B2) in view of Berenson et al. (“A Robot Path Planning Framework that Learns from Experience”) and further in view of Meier et al. (US Pat No. 9440352 B2).
Regarding claim 13
Mnih in view of Berenson teaches claim 12. 
Mnih further teaches … and the robot, to transmit the second action to the robot. (Col 8 lines 64-67 “We consider tasks in which an agent interacts with an environment E, in this case the Atari emulator, in a sequence of actions, observations and rewards. At each time-step the agent selects an action a_t from the set of legal game actions.”)
Mnih in view of Berenson does not teach the system further comprising: a cloud brain, in digital communication with the processor.
Meier teaches the system further comprising: a cloud brain, in digital communication with the processor. (Col 14 lines 21-29 “State information may be shared among a plurality of users. In some implementations, such as illustrated in FIG. 12, a cloud-based repository 1200 of robotic device “brain images” (e.g., neural network State information) may be introduced. The repository may comprise cloud server depository 1206. In FIG. 12, one or more remote user devices 1210 may connect via a remote link 1214 to the depository 1206 in order to save, load, update, and/or perform other operation on a network configuration.”)
Mnih, Berenson and Meier are analogous art because they are all directed to learning robot/agent experience. 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified reinforcement learning for subject system having multiple states of Mnih in view of Berenson to include the cloud-based repository in digital communication with a processor of Meier in order to provide remote access to a storage repository and allow users to save, load, update, and/or perform other operations on a network configuration as disclosed by Meier (Col 14 lines 21-29 “State information may be shared among a plurality of users. In some implementations, such as illustrated in FIG. 12, a cloud-based repository 1200 of robotic device “brain images” (e.g., neural network State information) may be introduced. The repository may comprise cloud server depository 1206. In FIG. 12, one or more remote user devices 1210 may connect via a remote link 1214 to the depository 1206 in order to save, load, update, and/or perform other operation on a network configuration.”).

Claim 20 is rejected under 35 U.S.C. 103 as being unpatentable over Mnih et al. (US Pat No. 9679258 B2) in view of Alnajjar et al. (“A Hierarchical Autonomous Robot Controller for Learning and Memory: Adaptation in a Dynamic Environment”, hereinafter: Alnajjar).
Regarding claim 20
Mnih teaches a computer-implemented method for updating a memory, (col 13 lines 45-52 “The stored experience may then still be used to update the first neural network, which in turn generates targets for training the second neural network. In practice it is convenient to store the experience of actions selected by the second neural network so that the first neural network can draw from these later, thus providing a self-contained system, but this is not essential.”)
the memory storing a plurality of experiences received from a computer-based application, (col 13 lines 53-59 “In the above algorithms we store the agent's experiences at each timestep, e_t = (s_t, a_t, r_t, s_{t+1}), in a data-set D = e_1, ..., e_N, pooled over many episodes into a replay memory. During the inner loop of the algorithm, Q-learning updates, or minibatch updates, are applied to samples of experience, e ~ D, drawn at random from the pool of stored samples.”)
the method comprising: receiving a new experience from the computer-based application; (col 13 lines 53-59 “In the above algorithms we store the agent's experiences at each timestep, e_t = (s_t, a_t, r_t, s_{t+1}), in a data-set D = e_1, ..., e_N, pooled over many episodes into a replay memory. During the inner loop of the algorithm, Q-learning updates, or minibatch updates, are applied to samples of experience, e ~ D, drawn at random from the pool of stored samples.”)
…
removing at least one of the new experience (col 11 lines 1-10 “The procedure begins by inputting state data from a controlled system (S200). For the test system of an Atari™ game emulator this comprised a sequence of image frames from the game. As described later, in this test environment frame-skipping was employed, and the captured images were down-sampled to reduce the quantity of data to be processed. One of the advantages of the approach we describe is that the procedure is able to accept image pixel data as an input rather than relying on a hand-constructed representation of the system under control.” Also see col 12 lines 1-4 “The procedure then loops back from step S212 to step S202 to select a further action. In embodiments the size of the experience data store is limited and therefore, as new experience data is stored, older experience data may be discarded, for example using a FIFO (first in first out) strategy.”)
…
and sending an updated version of the plurality of experiences to the computer-based application. (Col 3 lines 1-5 “Potentially in a locally connected network different portions of the network could be updated at different times, but this is less preferable. In one embodiment the first neural network is updated after a defined number of actions, for example every 10 steps.” Also see col 12 lines 22-28 “An example algorithm for deep Q-learning with experience replay is shown below. In order to improve the stability of the algorithm we decouple the network used to generate the targets y_j from the network being trained. More precisely, a copy of the Q network being trained is made after every L parameter updates and used to generate the targets y_j for the next L training updates.”)
Mnih does not teach determining a degree of similarity between the new experience and the plurality of experiences; 
adding the new experience based on the degree of similarity; 
…
and an experience from the plurality of experiences based on the degree of similarity. 
Alnajjar teaches determining a degree of similarity between the new experience and the plurality of experiences; (pg. 185 right col “The robot therefore needs to control its memory size by knowing what to forget, what to remember, and how to manage its experiences by clustering similar ones to minimize its need for storage space. Information that has greater correlation with others can be mentally connected to the existing related information in the memory. This correlation can be measured by ESM (Equations 4 and 5).”)
adding the new experience based on the degree of similarity; (pg. 187 left col “If after deleting all the existing UCTN the memory is still full, the dynamic clustering mechanism starts to operate, that is, the connections between the nodes are reorganized. It starts clustering all similar networks in the level ESM1. The network that has a longer experience time and better fitness than others will survive. The clustering range gradually grows wider (ESMn + 1) as the memory gets full, that is, more trained networks with less correlated environments could cluster together to form a group of networks with a wider range of adaptation.”)
…
and an experience from the plurality of experiences based on the degree of similarity. (Pg. 193 right col “(d) cluster the trained networks that are, to some degree, performing similar behaviors with an online changeable threshold value (ESM) that is based on the current memory capacity. Although our algorithm guarantees to find a space for huge numbers of new environments”)
Mnih and Alnajjar are analogous art because they are both directed to learning robot/agent experience. 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified reinforcement learning for subject system having multiple states of Mnih to include clustering experiences based on a degree of similarity between the new experience and the plurality of experiences of Alnajjar in order to improve the computation time by clustering similar networks as disclosed by Alnajjar (pg. 187 left col “If after deleting all the existing UCTN the memory is still full, the dynamic clustering mechanism starts to operate, that is, the connections between the nodes are reorganized. It starts clustering all similar networks in the level ESM1.”).
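
For illustration only, the capacity-driven clustering quoted from Alnajjar (similar trained networks are clustered, the fitter one survives, and the clustering range widens as the memory fills) might be sketched as follows. The keys `env` and `fitness`, the scalar environment descriptor, and the `radius` and `step` parameters are assumptions; the reference's ESM correlation measure and its experience-time criterion are not reproduced here.

```python
def cluster_to_capacity(networks, capacity, radius, step):
    """Merge similar entries, keeping the fittest, until at most `capacity` remain.

    networks: list of dicts with hypothetical keys "env" (float) and "fitness" (float).
    radius: entries whose "env" values differ by no more than this are treated as similar.
    step: amount by which the radius widens after each pass that leaves memory too full.
    """
    if capacity < 1 or step <= 0:
        raise ValueError("capacity must be >= 1 and step must be > 0")
    # Visit the fittest entries first so they survive any merge.
    kept = sorted(networks, key=lambda n: n["fitness"], reverse=True)
    while len(kept) > capacity:
        survivors = []
        for net in kept:
            if all(abs(net["env"] - s["env"]) > radius for s in survivors):
                survivors.append(net)
        kept = survivors
        radius += step  # widen the clustering range for the next pass
    return kept
```

Under these assumptions, each pass keeps only one entry per neighborhood, and the loop ends once the store fits within its capacity.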


Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to VAN C MANG whose telephone number is (571)270-7598. The examiner can normally be reached Mon - Fri 8:00-5:00pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ann Lo, can be reached at (571)272-9767. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/V.M./Examiner, Art Unit 2126       
/ANN J LO/Supervisory Patent Examiner, Art Unit 2126