Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Status of Claims
This action is in response to the amendments filed 06/09/2022. Claims 1, 12, and 20 have been amended; claims 1-4, 6-7, 9, 11-15, 17-18, and 20 are currently pending.
	
Response to Arguments
Applicant’s arguments regarding the prior art rejection have been fully considered but they are not persuasive. Applicant argues on pages 10-12 that the prior art does not teach the concept of an agent dying. MPEP 2111.01(I) recites “Under a broadest reasonable interpretation (BRI), words of the claim must be given their plain meaning, unless such meaning is inconsistent with the specification. The plain meaning of a term means the ordinary and customary meaning given to the term by those of ordinary skill in the art at the time of the invention”. MPEP 2111.01(IV) recites “The only exceptions to giving the words in a claim their ordinary and customary meaning in the art are (1) when the applicant acts as their own lexicographer; and (2) when the applicant disavows or disclaims the full scope of a claim term in the specification. To act as their own lexicographer, the applicant must clearly set forth a special definition of a claim term in the specification that differs from the plain and ordinary meaning it would otherwise possess. The specification may also include an intentional disclaimer, or disavowal, of claim scope”. Applicant’s specification does not provide a special definition for an agent dying or being in an “unrecoverable state”. Neither the claims nor the specification provide a specific definition of what it means for an agent to die or a definition for what it means to be in an “unrecoverable state”. Neither the claims nor the specification describe a “catastrophic car accident with grave injuries”, nor do they describe that an agent that has fallen in a hole could be retrieved. Paragraph [0019] of the specification merely states “the term "critical events" refer to respective sets of steps that directly or indirectly result in a reward that is well below (e.g., by a threshold amount) an average award amount. In an embodiment, steps that result in the agent dying can be considered critical events. 
Of course, other outcomes can be involved in critical events, depending upon the implementation.” Paragraphs [0041] and [0065] provide an example of an agent dying as falling in a hole, but this is not a limiting definition. The specification and the claims (specifically claim 1 and claim 9) use both the terms “critical event” and “agent dying” to describe an experience with a reward that falls below an average award amount by a threshold amount without distinguishing further differences between the two terms. Lastly, Applicant’s argument on page 10 that Hans categorizes actions as fatal and not the state of the agent is not clear, given that the claims are describing excluding storing “events where an agent dies” and “avoid[ing] the action becoming an event where the agent dies”. One of ordinary skill would find it obvious that both Hans and the claim language are categorizing actions regarding the state of the observed agent. Therefore, the broadest reasonable interpretations of “an agent dying” and an “unrecoverable state” include any kind of critical event or failure state that is determined when a reward for the action or experience is considered to be low based on a threshold.
Applicant also argues on page 12 that the graph disclosed in Hans is not equivalent to the buffers as recited in claim 1. MPEP 2111.01(I) recites “Under a broadest reasonable interpretation (BRI), words of the claim must be given their plain meaning, unless such meaning is inconsistent with the specification. The plain meaning of a term means the ordinary and customary meaning given to the term by those of ordinary skill in the art at the time of the invention”. MPEP 2111.01(IV) recites “The only exceptions to giving the words in a claim their ordinary and customary meaning in the art are (1) when the applicant acts as their own lexicographer; and (2) when the applicant disavows or disclaims the full scope of a claim term in the specification. To act as their own lexicographer, the applicant must clearly set forth a special definition of a claim term in the specification that differs from the plain and ordinary meaning it would otherwise possess. The specification may also include an intentional disclaimer, or disavowal, of claim scope”. Applicant’s specification does not provide a special definition for a buffer; paragraphs [0021], [0057], and fig. 2 provide the only definitions of the experience and event buffers. Using these definitions, Examiner interprets that these buffers are sections of a computer memory used to store {state, action, reward} triples. Therefore, one of ordinary skill would understand that the buffers in the claims are referring to a segment of computer memory. Given this interpretation, one of ordinary skill would understand that when Hans stores state data in a graph (see col. 3 lines 42-52 and col. 4 lines 36-40), this data is being stored in a segment of computer memory. Alternatively, one of ordinary skill could look to the combination of Hans and Yoshiike, wherein Yoshiike teaches a history storage unit and a model storage unit in relation to an expanded Hidden Markov Model to store state data in memory.
The prior art rejections have been updated to include the amended limitations and to clarify the reasoning given for the limitations that were not amended.

Claim Objections
Claims 1, 12, and 20 are objected to because the scope of the preamble is not consistent with the scope of the body of the claim for each of these claims. Claim 1 recites “A computer-implemented method for reinforcement learning training”, claim 12 recites “A computer program product for reinforcement learning training”, and claim 20 recites “A computer processing system for reinforcement learning training”, but each of these claims includes an exploration step and a step to perform an action that fall outside the scope of reinforcement learning training. As per page 3 of the previous office action “the claims do not explain how reinforcement learning training occurs. The claims are directed to deciding when and what kind of training data to store in experience and event buffers, which are sampled by a learning loop that trains the model. Figure 3 and paragraph [0038] show this process as three parts - a populating buffers stage 391, an exploration stage 392, and a learning loop 393. Therefore, the claims are directed to determining what kind of data to store in the buffers and not the reinforcement learning training which uses the buffers”. Applicant argues on page 9 that “The trained RL model is incorporated into the vehicle such as in an Advanced Driving Assistance System (ADAS) in order to control the vehicle. The buffers are populated during the training stage with driving experiences some of which may be copied into the event buffer for future use during the exploration stage. The RL system is part of the vehicle in order to learn from vehicle experiences and apply those experiences in the exploration stage in order to avoid an agent dying”; however, this explanation is not present in either the claims or the specification. Examiner is interpreting that this action is not taken by the reinforcement learning system, but that the reinforcement learning system transmits a command with the selected action to the vehicle and the vehicle performs the action.
Appropriate correction is required.

Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

The following is a quotation of the first paragraph of pre-AIA  35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.

Claims 1-4, 6-7, 9, 11-15, 17-18, and 20 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement. The claims contain subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention. Claims 1, 12, and 20 refer to “the agent being in an unrecoverable state” as a condition for excluding an event from the experience buffer; however, this condition is not supported by Applicant’s original disclosure. The disclosure refers only to “the term "critical events" refer to respective sets of steps that directly or indirectly result in a reward that is well below (e.g., by a threshold amount) an average award amount. In an embodiment, steps that result in the agent dying can be considered critical events. Of course, other outcomes can be involved in critical events, depending upon the implementation” (see paragraph [0019] of Applicant’s specification); wherein the “event buffer is a memory that stores the experiences about (known) critical events, where the agent did not die” (see paragraph [0021]); “Every experience {q, a, r} is inserted into the experience buffer 381, except the experiences where the agent dies (e.g., falling in the hole). If a similar event (same q) exist in the event buffer 382 with different action or reward, then this experience is also inserted into the event buffer 382. If the agent dies and a similar experience exists in the experience buffer 381, then the similar experience is copied to event buffer 382 so that another agent (or the same agent, if revived) can survive if the other agent (or the same revived agent) faces the (same) situation at a subsequent time” (see paragraphs [0041]-[0042]).
For purposes of prior art examination, Examiner is interpreting that an unrecoverable state refers to a part of the calculation to determine the value of the reward for a given action when the reward is below a threshold (i.e. a critical or fatal event).  
Dependent claims 2-4, 6-7, 9, 11, 13-15, and 17-18 are also rejected because they fail to correct the deficiencies of independent claims 1 and 12 on which they depend.
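For illustration only, the buffer handling quoted above from paragraphs [0041]-[0042] of Applicant's specification can be sketched as follows. This is a minimal reading of the quoted text, not Applicant's implementation: the names Experience, record, and the agent_died flag are hypothetical, and the sketch assumes that "similar" means an identical state q and that whether the agent died is supplied by the environment, since the quoted passages do not define how that is determined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Experience:
    state: str                # q (assumed hashable label for the state)
    action: str               # a
    reward: float             # r
    agent_died: bool = False  # hypothetical flag; the source does not say how death is detected


def record(exp, experience_buffer, event_buffer):
    """Store one experience following the rules quoted from [0041]-[0042]."""
    if exp.agent_died:
        # Experiences where the agent dies are excluded from the experience buffer.
        # Similar (same-state) experiences already stored are copied to the event buffer.
        for prior in experience_buffer:
            if prior.state == exp.state and prior not in event_buffer:
                event_buffer.append(prior)
        return

    # Every other experience {q, a, r} goes into the experience buffer.
    experience_buffer.append(exp)

    # If a similar event (same q) exists in the event buffer with a different
    # action or reward, the experience is also inserted into the event buffer.
    if any(
        e.state == exp.state and (e.action != exp.action or e.reward != exp.reward)
        for e in event_buffer
    ):
        event_buffer.append(exp)
```

Under these assumptions, a surviving experience in state q is copied to the event buffer only after a later experience in the same state ends with the agent dying.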
	
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-4, 6-7, 9, 12-15, 17-18, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Hans et al (US 8494980 B2, herein Hans) in view of Yoshiike et al (US 20100318478 A1, herein Yoshiike), in further view of Shalev-Shwartz et al (US 20180032082 A1, herein Shalev-Shwartz).
Regarding claim 1, Hans teaches a computer-implemented method (col. 1 lines 61-67 recite the object of the invention is therefore to create a method for computer-assisted exploration of states of a technical system with which the assumption of unsafe states can be avoided and simultaneously for the state space to be run so that a good database for executing a subsequent method for determining an optimum adjustment of the technical system is created) for reinforcement learning training (col. 3 lines 14-22 recite instead of or in addition to a predetermined rule, the function, the states in accordance with the backup policy will backup policy can also be determined with a reinforcement learning process, taking account of the rewards of the actions. The reinforcement learning process is preferably based in such cases on an optimality criterion in accordance with which the minimum of the expected value of all future rewards is maximized. In this way it is ensured that the backup policy rapidly returns the system to the states that are known and safe) performed by a processor, the method comprising:
obtaining, from an environment, a given experience that includes an action, a state and a reward (col. 5 lines 22-30 recite the policy is defined in a Markov decision process consisting of a state space S, a set of actions A and a dynamic. The latter is produced from the transition probabilities Ps,s'a: S×A×S → [0,1] from the current state s to the new follow-up state s' as well as the action a, which leads to the follow-up state s'. With each corresponding transition the agent receives the reward already mentioned above. The rewards likewise follow a distribution and are only deterministic in special cases so that R represents a function. Col. 5 lines 8-12 recite action selection rule is described with the aid of a so-called reward function which allocates a reward to the action executed as a function of the state in which the action is executed as well as of the subsequent state resulting there from, Rs,s'a, which corresponds to a reward as defined in the claims (i.e. an experience including an action, a state, and a reward is obtained from an environment));
during training, storing the given experience in an experience buffer responsive to a value of the reward included in the given experience not being below an average award amount for a plurality of experiences by a first threshold amount, while excluding from the experience buffer events where an agent dies corresponding to the value of the reward included in the given experience being below the average award amount by the first threshold amount and the agent being in an unrecoverable state (col. 3 lines 42-54 recite a graph is constructed while the states are being run of which the nodes correspond to the states run and of which the edges correspond to the actions executed and in which, for each node, the category of the corresponding state is stored, or whereby on reaching a state in which all possible actions have already been explored, i.e. executed and/or classified with the safety function as impermissible a search is made in the graph for a path to a state in the same category in which actions can still be explored and whenever such a path is found this state is reached via this path. In the event of no path to a state in the same category being found in which there are still actions to be explored, the states of the subsequent category are run (i.e. experiences are stored if they are found to be safe when compared to the safety function and excluded if they are found to be impermissible or unsafe). Col. 6 lines 20-27 recite the safety function used in the method must provide information for a state-action pair about their safety status which can be divided up into the categories "safe", "critical" and "hypercritical". In addition an action can be divided into the categories "fatal" and "not fatal". A non-fatal action for transition from a state s into a state s' is present if the following applies: Rs,s’a>=T, with T being a predetermined limit value. By contrast an action is fatal if the following applies: Rs,s’a<T (i.e. the safety function determines safe experiences based on the reward associated with the action. Examiner’s Note: the broadest reasonable interpretation of an “unrecoverable state” would include an action categorized as “fatal”, see MPEP 2111.01, the Examiner’s Response to arguments, and the 112(a) interpretation of claim 1));
responsive to obtaining another experience having another reward that is less than or equal to the average award amount by the first threshold amount, searching the experience buffer for a candidate experience with a similar state to the other experience (col. 6 lines 47-50 recite from the above definitions of safe, critical and hypercritical it emerges that an agent can be transferred from critical states-with safe executions of subsequent actions-back into safe states. Col. 7 lines 31-36 recite the task of the backup policy is to return the agent used in the execution of the method to a known area if the agent can no longer make any secure decision because it has got into a new state in which it cannot sufficiently well estimate the safety of individual actions. The backup policy in this case may not itself lead into critical states (i.e. finding a similar safe state when an action corresponds to a low reward)); 
However, Hans does not teach copying the candidate experience into an event buffer, and during exploration, selecting an action to be taken to the environment from the event buffer with a predetermined probability.
Yoshiike teaches copying the candidate experience into an event buffer (para. [0113] recites the state transition probability of the HMM (Hidden Markov Model) is expanded to state transition probability for each action performed by the agent, and the HMM of which the state transition probability is thus expanded (hereafter, also referred to as "expanded HMM") is employed as a learning object by the learning unit 21. Para. [0114] recites that the model storage unit 22 stores (the state transition probability, observation probability, and the like that are model parameters stipulating) the expanded HMM (i.e. the event buffer, as compared to the history storage unit 14 from fig. 4 which stores non-critical experiences));
and during exploration, selecting an action to be taken to the environment from the event buffer with a predetermined probability (fig. 47 step 345 and para. [0738] recite that the action determining unit 24 determines the next action from the candidates for the next action, based on the action suitability regarding the actions Um obtained for each of the one or more current state series candidates from the state recognizing unit 23 (i.e. selecting an action with a predetermined probability from the event buffer during exploration)).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine these teachings by using the expanded HMM (hidden Markov model) storage from Yoshiike to separate critical experiences from non-critical experiences, so that the safe alternatives determined by the backup policy from Hans are stored instead. Hans and Yoshiike are both directed to using reinforcement learning to transition an agent through a plurality of states. While Hans is directed towards exploring states of a technical system, its abstract recites “The method allows a large number of states and actions relating to the technical system to be collected and may be used for any technical system”. One of ordinary skill would benefit from this combination: since Hans does not teach storing any failed states when applying its safety function and backup policies, the method of selecting and switching to a more suitable action stored in the expanded HMM from Yoshiike would improve the performance of the backup policy from Hans.
However, the combination of Hans and Yoshiike does not teach performing the action taken from the event buffer to avoid the action becoming an event where the agent dies, the action being a controlling of a motor vehicle for accident avoidance.
Shalev-Shwartz teaches performing the action taken from the event buffer to avoid the action becoming an event where the agent dies, the action being a controlling of a motor vehicle for accident avoidance (para. [0221] recites considering a reward function for which R(s) = -r for trajectories that represent a rare "corner" event to be avoided (e.g., such as an accident), and R(s) ∈ [-1, 1] for the rest of the trajectories, one goal for the learning system may be to learn to perform an overtake maneuver. Normally, in an accident free trajectory, R(s) would reward successful, smooth, takeovers and penalize staying in a lane without completing the takeover-hence the range [-1, 1]. If a sequence, S, represents an accident, the reward, -r, should provide a sufficiently high penalty to discourage such an occurrence (i.e. performing an action to avoid a critical event). Para. [0332] recites after a selection is made from among potential actions in response to a sensed navigational state, at step 1627, the at least one processor may cause at least one adjustment of a navigational actuator of the host vehicle in response to the selected potential navigational action. The navigational actuator may include any suitable device for controlling at least one aspect of the host vehicle. For example, the navigational actuator may include at least one of a steering mechanism, a brake, or an accelerator (i.e. controlling a motor vehicle)).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine these teachings by using the methods from Hans (as modified by Yoshiike) to help determine the best navigation action in Shalev-Shwartz. Hans and Shalev-Shwartz are both directed to using reinforcement learning to determine a next best action, but Hans does not teach that the action involves controlling a motor vehicle. One of ordinary skill would benefit from using the methods from Hans to help determine the desired navigational action from Shalev-Shwartz in order to avoid suggesting navigational actions that might cause an accident or other harm.
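For illustration only, the two reward tests discussed in the rejection of claim 1 can be placed side by side. The first function follows the classification quoted from Hans col. 6 (non-fatal when the reward is at least T, fatal when it is below T). The second follows the claim language as read above (reward below the average award amount by the first threshold amount). The function names are hypothetical, and taking the average as the arithmetic mean of previously observed rewards is an assumption.

```python
from statistics import mean


def hans_is_fatal(reward, limit_t):
    """Hans col. 6 as quoted: an action is fatal if R < T, non-fatal if R >= T."""
    return reward < limit_t


def claim_is_below_average(reward, prior_rewards, first_threshold):
    """Claim 1 as read above: reward is below the average reward by the first threshold.

    Assumes the average is the arithmetic mean of previously observed rewards.
    """
    return reward < mean(prior_rewards) - first_threshold
```

In this sketch the two tests return the same result whenever the limit T is set equal to the average reward minus the first threshold; that choice of T is an assumption made here for comparison.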
Regarding claim 2, the combination of Hans, Yoshiike and Shalev-Shwartz teaches the method according to claim 1, wherein said storing step comprises:
searching the event buffer for a similar experience having a same state as the given experience (Yoshiike fig. 48 shows a method to “search for situation similar to current situation from own structured knowledge” (i.e. searching the event buffer for a similar experience). Yoshiike para. [0744] recites that in fig. 48, the action determining unit 24 determines, as the next action, an action wherein there is generated state transition from the last state s_t of one or more current state series candidates from the state recognizing unit 23, to an immediately preceding state s_{t-1} immediately before the last state s_t); and storing the similar experience into the event buffer responsive to any of the action and the reward of the similar experience being different from those of the given experience (Hans col. 7 lines 31-36 recite the task of the backup policy is to return the agent used in the execution of the method to a known area if the agent can no longer make any secure decision because it has got into a new state in which it cannot sufficiently well estimate the safety of individual actions. The backup policy in this case may not itself lead into critical states. Hans col. 7 lines 46-55 recite the safety of actions is expressed via a corresponding reward, with actions with rewards smaller than the limit value corresponding to a fatal transition. In learning the backup policy from exploration data the backup policy must thus take account of the reward. In a possible variant of the invention the backup policy is determined by means of a conventional RL method, with the value function defined at the start now not being used however since the optimum policy determined therefrom is generally not also simultaneously safe (i.e. determining that two actions are similar but have different rewards). Yoshiike para. [0115] recites that the state recognizing unit 23 recognizes the current situation of the agent based on the expanded HMM stored in the model storage unit 22 (i.e. the similar experience found by searching the event buffer) using the action series and the observation value series stored in the history storage unit 14, and obtains (recognizes) the current state that is the state of the expanded HMM corresponding to the current situation thereof (i.e. stores the similar experience back into the event buffer). Yoshiike para. [0741] recites that in the event of determining an action following the first strategy, the agent performs an action which the agent has performed under a known situation similar to the current situation).
Regarding claim 3, the combination of Hans, Yoshiike and Shalev-Shwartz teaches the method according to claim 1, further comprising stopping the selecting of the action from the event buffer after a pre-defined number of steps during a training stage of the reinforcement learning (Yoshiike fig. 5 and para. [0146] recite that in the case that determination is made in step S17 that the agent has performed an action by already specified number of times, i.e., in the case that the point-in-time t is equal to the already specified number of times, the processing in the reflective action mode ends (i.e. stopping the selection of the action after a pre-defined number of steps)).
Regarding claim 4, the combination of Hans, Yoshiike and Shalev-Shwartz teaches the method according to claim 3, further comprising performing random exploration responsive to said stopping step (Yoshiike fig. 4 and para. [0129] recite that the random target generating unit 35 selects one state out of the states of the expanded HMM stored in the model storage unit 22 at random as a random target, and supplies the random target thereof to the target selecting unit 31 as the internal target serving as the target state).
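For illustration only, the exploration behavior recited across claims 1, 3, and 4 (selecting an action from the event buffer with a predetermined probability, stopping that selection after a pre-defined number of steps, and then exploring randomly) can be sketched as below. The function and parameter names are hypothetical, and falling back to a uniformly random action whenever the event buffer is not used is an assumption; neither the claims nor the cited references specify this exact procedure.

```python
import random


def choose_action(step, event_actions, all_actions, probability, max_event_steps, rng=random):
    """Pick the next action during exploration.

    step             -- index of the current exploration step (0-based)
    event_actions    -- actions taken from experiences held in the event buffer
    all_actions      -- every action available in the environment
    probability      -- predetermined probability of drawing from the event buffer
    max_event_steps  -- pre-defined number of steps after which the event buffer is no longer used
    """
    use_event_buffer = (
        step < max_event_steps
        and bool(event_actions)
        and rng.random() < probability
    )
    if use_event_buffer:
        return rng.choice(event_actions)
    # After the pre-defined number of steps (or otherwise), explore randomly.
    return rng.choice(all_actions)
```

Passing a seeded random.Random instance as rng makes the selection reproducible when comparing runs.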
Regarding claim 6, the combination of Hans, Yoshiike and Shalev-Shwartz teaches the method according to claim 1, further comprising storing in the experience buffer any experiences previously observed (Yoshiike fig. 4 and para. [0106] recite that the series of the observation values (observation value series), and the series of the actions (action series) are stored in the history storage unit 14 (i.e. the experience buffer)) except for the experiences that resulted in a corresponding reward that fails to exceed the first threshold (Yoshiike para. [0736] recites the action determining unit 24 sets the action suitability obtained regarding an action Um of which the action suitability is below a threshold to 0.0 (i.e. experiences where the corresponding reward fails to exceed a first threshold), thereby eliminating actions Um of which the action suitability is below a threshold from candidates for the next action to be performed).
Regarding claim 7, the combination of Hans, Yoshiike and Shalev-Shwartz teaches the method according to claim 1, wherein the state represents a local state in a low-dimensional space (Yoshiike para. [0153] recites that the state transition probability of a common HMM can be represented by a two-dimensional table (i.e. a low-dimensional space) where the state transition probability aij of the state transition from the state Si to the state Sj is disposed at the i'th from the top and the j'th from the left).
Regarding claim 9, the combination of Hans, Yoshiike and Shalev-Shwartz teaches the method according to claim 1, wherein the method is applied to the plurality of experiences, wherein the first threshold is used to identify critical events (Hans col. 5 lines 7-13 recite the optimum action selection rule is described with the aid of a so-called reward function which allocates a reward to the action executed as a function of the state in which the action is executed as well as of the subsequent state resulting therefrom, Rs,s'a which corresponds to a reward as defined in the claims. Col. 6 lines 22-26 recite an action can be divided into the categories "fatal" and "not fatal". A non-fatal action for transition from a state s into a state s' is present if the following applies: Rs,s’a >= T, with T being a predetermined limit value (i.e. a threshold). By contrast an action is fatal if the following applies: Rs,s’a < T. Col. 6 lines 50-55 recite an action which is classified as safe (i.e. of which the follow-up state is safe) can always be executed in the exploration of the state space since it always has a reward which is greater than T. If rewards occur with values below the limit value T, as a rule this generally leads to damage or to an incorrect operation of the technical system (i.e. a threshold is used to identify critical events)), and wherein any of the plurality of experiences unrelated to the critical events are stored in the experience buffer and any of the plurality of experiences related to the critical events are stored in the event buffer (Yoshiike para. [0115] recites that the state recognizing unit 23 recognizes the current situation of the agent based on the expanded HMM stored in the model storage unit 22 (i.e. the event buffer) using the action series and the observation value series stored in the history storage unit 14 (i.e. the experience buffer), and obtains (recognizes) the current state that is the state of the expanded HMM corresponding to the current situation thereof).
Claim 12 is a non-transitory computer readable storage medium claim and its limitation is included in claim 1. The only difference is that claim 12 requires a non-transitory computer readable storage medium (Hans col. 4 lines 36-39 recite the invention further relates to a computer program product with program code stored on a machine-readable medium for executing the inventive method when the program runs on a computer). Therefore, claim 12 is rejected for the same reasons as claim 1.
Claim 13 is a non-transitory computer readable storage medium claim and its limitation is included in claim 2. Claim 13 is rejected for the same reasons as claim 2.
Claim 14 is a non-transitory computer readable storage medium claim and its limitation is included in claim 3. Claim 14 is rejected for the same reasons as claim 3.
Claim 15 is a non-transitory computer readable storage medium claim and its limitation is included in claim 4. Claim 15 is rejected for the same reasons as claim 4.
Claim 17 is a non-transitory computer readable storage medium claim and its limitation is included in claim 6. Claim 17 is rejected for the same reasons as claim 6.
Claim 18 is a non-transitory computer readable storage medium claim and its limitation is included in claim 7. Claim 18 is rejected for the same reasons as claim 7.
Claim 20 is a system claim and its limitation is included in claim 1. The only difference is that claim 20 requires a system (Yoshiike fig. 4 and para. [0089] recite a configuration example of an embodiment of the agent to which the information processing device (i.e. a system) according to the present invention). Therefore, claim 20 is rejected for the same reasons as claim 1.
	
Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Hans et al (US 8494980 B2, herein Hans) in view of Yoshiike et al (US 20100318478 A1, herein Yoshiike), in further view of Shalev-Shwartz et al (US 20180032082 A1, herein Shalev-Shwartz), in further view of Mnih et al (US 20150100530 A1, herein Mnih).
Regarding claim 11, the combination of Hans, Yoshiike and Shalev-Shwartz teaches the computer-implemented method of claim 1.
However, the combination of Hans, Yoshiike and Shalev-Shwartz does not explicitly teach plotting, on a display device, a plurality of experiences in a visualization, each of the plurality of experiences having a respective reward to form a plurality of rewards across the plurality of experiences, wherein said plotting step uses the plurality of rewards as weights for the visualization.
Mnih teaches plotting, on a display device, a plurality of experiences in a visualization, each of the plurality of experiences having a respective reward to form a plurality of rewards across the plurality of experiences, wherein said plotting step uses the plurality of rewards as weights for the visualization (para. [0088] describes a series of figures: fig. 6a and 6c recite visualizations of average reward per episode during training (i.e. plotting each experience in a visualization using the respective reward as a weight for the visualization), whereas fig. 6b and 6d recite visualizations of the average maximum predicted action-value of a set of states).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine these teachings by using the visualization method from Mnih to plot the experiences from Hans (as modified by Yoshiike and Shalev-Shwartz) based on their respective rewards. The specification from Hans includes a visualization (i.e. fig. 2) to show how the rewards of a given experience determine which actions the agent takes, but does not describe if or how these visualizations are provided to the user. Using the method from Mnih to plot the experiences from Hans would provide the user with additional context to understand why the agent follows a specific strategy or whether the agent is repeatedly making similar mistakes, which would allow one of ordinary skill to correct errors or improve overall performance.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
US 20120084237 A1 (Hasuo et al) teaches using reinforcement learning to enable an agent that can autonomously perform various actions to efficiently perform learning of an unknown environment.
US 20090327011 A1 (Petroff) teaches using reinforcement learning to dispatch and optimize path planning for a plurality of vehicles.
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to LEAH M FEITL whose telephone number is (571)272-8350. The examiner can normally be reached on M-F 0800-1700.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li B. Zhen can be reached on (571) 272-3768. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/L.M.F./ Examiner, Art Unit 2121

/DANIEL T PELLETT/Primary Examiner, Art Unit 2121