Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Status of Claims
This action is in response to the amendments filed 08/02/2021. Claims 1, 2, 4, 9, 12, 13, and 20 have been amended, and claims 5, 8, 10, 16, and 19 have been cancelled. Claims 1-4, 6-7, 9, 11-15, 17-18, and 20 are currently pending.

Response to Arguments
In light of Applicant’s amendments to the specification, the objection to paragraph [0054] has been withdrawn.
In light of Applicant’s amendment, the objection to claim 1 has been withdrawn.
In light of Applicant’s amendment and arguments regarding the written description rejection, the 112(a) rejection of claims 1-4, 6-7, 9, 11-15, 17-18, and 20 has been withdrawn.
Claim 10 has been cancelled; therefore, the 112(a) rejection of claim 10 no longer stands.
In light of Applicant’s amendment, the 112(b) rejections of claims 2, 4, 13, and 15 have been withdrawn.
Applicant’s amendments and arguments regarding the 101 rejection have been fully considered but they are not persuasive. Applicant argues (page 10) that the claims are 
In addition, Applicant argues (pages 10-11) that the claims recite significantly more than an abstract idea in that claim 1 recites performing an action taken from the event buffer, the action being controlling a motor vehicle. Applicant also argues that this performing of an action rises to a practical application. However, paragraph [0056] of the specification recites “The object can be controlled to perform the next action {a} itself or can be controlled to perform another action in response to the next action {a}. The object can be an agent, a vehicle, a robot, and so forth. In the case of the object being a vehicle, the next action {a} can involve obstacle avoidance maneuvers (braking, steering) being automatically performed (by a machine, and not the vehicle operator).” The claims and the specification do not explain how a reinforcement learning system would perform this action, so Examiner is interpreting that this action is not taken by the reinforcement learning system, but that the reinforcement learning system transmits a command with the selected action to the vehicle and the vehicle performs the action.
In light of the amendments made to claims 1, 2, 4, 9, 12, 13, and 20, Applicant has argued (pages 12-15) that the prior art does not teach the amended features. Applicant’s argument is not persuasive and the rejection has been modified in light of the amendments.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-4, 6-7, 9, 11-15, 17-18, and 20 are rejected under 35 U.S.C. 101. Claims 1-4, 6-7, 9, and 11 are directed to a method, claims 12-15 and 17-18 are directed to a non-transitory computer-readable medium, and claim 20 is directed to a system; therefore, claims 1-4, 6-7, 9, 11-15, 17-18, and 20 fall within one of the four statutory categories (i.e., process, machine, manufacture, or composition of matter). However, claims 1-4, 6-7, 9, 11-15, 17-18, and 20 fall within the judicial exception of an abstract idea, specifically the abstract ideas of “Mental Processes” (including observation, evaluation, and opinion) and “Mathematical Concepts” (including mathematical calculations and relationships).
	Claim 1:
Step 1: Claim 1 is directed to a method; therefore the claim does fall within one of the four statutory categories (i.e., process, machine, manufacture, or composition of matter).
Step 2A, Prong 1: Claim 1 recites the following abstract ideas:
obtaining, from an environment, a given experience that includes an action, a state and a reward (mental process directed to observation);

during exploration, selecting an action to be taken to the environment from the event buffer with a predetermined probability (mental process directed to evaluation).
Step 2A, Prong 2: Claim 1 recites the following additional elements:
a processor (generic computer components), 
during training, storing the given experience in an experience buffer responsive to a value of the reward included in the given experience not being below an average award amount for a plurality of experiences by a first threshold amount, while excluding from the experience buffer events where an agent dies corresponding to the value of the reward included in the given experience being below the average award amount by the first threshold (receiving and transmitting data), 
 copying the candidate experience into an event buffer (receiving and transmitting data), 
and performing the action taken from the event buffer to avoid the action becoming an event where the agent dies, the action being a controlling of a motor vehicle for accident avoidance (receiving and transmitting data – see response to arguments). 
These elements do not integrate the abstract idea into a practical application.
Step 2B: Claim 1 recites the following additional elements:
a processor (generic computer components), 

 copying the candidate experience into an event buffer (receiving and transmitting data), 
and performing the action taken from the event buffer to avoid the action becoming an event where the agent dies, the action being a controlling of a motor vehicle for accident avoidance (receiving and transmitting data – see response to arguments). 
These elements do not amount to significantly more (see MPEP 2106.05(f) and MPEP 2106.05(d)(II)).
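For illustration only, and not as part of the grounds of rejection, the buffer handling recited in claim 1 may be sketched in Python as follows. The sketch assumes that the average reward is supplied by the caller, that two states are "similar" when they are equal, and that "not below the average by the threshold" means a reward greater than the average minus the threshold. The names Experience, route_experience, select_action, and p_event are hypothetical and appear in neither the claims nor the cited references.

```python
import random
from dataclasses import dataclass
from typing import Hashable, List, Sequence


@dataclass(frozen=True)
class Experience:
    state: Hashable
    action: str
    reward: float


def route_experience(exp: Experience,
                     experience_buffer: List[Experience],
                     event_buffer: List[Experience],
                     average_reward: float,
                     threshold: float) -> None:
    """Store exp, or copy a similar stored experience into the event buffer."""
    # Assumed reading of the claim: keep the experience when its reward is
    # not below the average by the threshold amount.
    if exp.reward > average_reward - threshold:
        experience_buffer.append(exp)
        return
    # Low-reward experience (e.g. the agent dies): it is kept out of the
    # experience buffer. Search for a stored experience with the same state
    # and copy that candidate into the event buffer.
    for candidate in experience_buffer:
        if candidate.state == exp.state:
            event_buffer.append(candidate)
            break


def select_action(event_buffer: Sequence[Experience],
                  all_actions: Sequence[str],
                  p_event: float,
                  rng: random.Random) -> str:
    """During exploration, take an event-buffer action with probability p_event."""
    if event_buffer and rng.random() < p_event:
        return rng.choice(event_buffer).action
    # Otherwise fall back to a uniformly random action (an assumption).
    return rng.choice(all_actions)
```

Under these assumptions, a low-reward experience never enters the experience buffer itself; it only triggers the copy of a previously stored experience that shares its state.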
Claim 12 is a non-transitory computer readable storage medium claim and its limitation is included in claim 1. Claim 12 is rejected for the same reasons as claim 1. Claim 12 recites the following additional elements: a computer program product and program instructions executable by a computer having the processor to cause the computer to perform a method. These are interpreted as generic computer components, which do not amount to significantly more and do not integrate the abstract idea into a practical application (see MPEP 2106.05(f)).
	Claim 20 is a system claim and its limitation is included in claim 1. Claim 20 is rejected for the same reasons as claim 1. Claim 20 recites the following additional elements: a computer processing system, a memory for storing program code, and a processor, operatively coupled to 
	The independent claims are not patent eligible.
Dependent claims 2-4, 6-7, 9, 11, 13-15, and 17-18, when analyzed as a whole, are held to be patent ineligible under 35 U.S.C. 101 because the additional recited limitations fail to establish that the claims are not directed to an abstract idea, as they recite further embellishment of the judicial exception.
Claim 2:
Step 1: Claim 2 is directed to a method; therefore the claim does fall within one of the four statutory categories (i.e., process, machine, manufacture, or composition of matter).
Step 2A, Prong 1: Claim 2 recites the following abstract ideas:
searching the event buffer for a similar experience having a same state as the given experience (mental process directed to observation, evaluation).
Step 2A, Prong 2: Claim 2 recites the following additional elements:
storing the similar experience into the event buffer responsive to any of the action and the reward of the similar experience being different from those of the given experience. These are interpreted as transmitting and receiving data, which does not integrate the abstract idea into a practical application.
Step 2B: Claim 2 recites the following additional elements:
storing the similar experience into the event buffer responsive to any of the action and the reward of the similar experience being different from those of the given experience. These 
Claim 3:
Step 1: Claim 3 is directed to a method; therefore the claim does fall within one of the four statutory categories (i.e., process, machine, manufacture, or composition of matter).
Step 2A, Prong 1: Claim 3 recites the following abstract ideas:
stopping the selecting of the action from the event buffer after a pre-defined number of steps during a training stage of the reinforcement learning (mental process directed to observation, evaluation).
Step 2A, Prong 2: Claim 3 does not recite any additional elements and therefore does not integrate the abstract idea into a practical application.
Step 2B: Claim 3 does not recite any additional elements and therefore does not amount to significantly more.
Claim 4:
Step 1: Claim 4 is directed to a method; therefore the claim does fall within one of the four statutory categories (i.e., process, machine, manufacture, or composition of matter).
Step 2A, Prong 1: Claim 4 recites the following abstract ideas:
performing random exploration responsive to said stopping step (mental process directed to evaluation).
Step 2A, Prong 2: Claim 4 does not recite any additional elements and therefore does not integrate the abstract idea into a practical application.
Step 2B: Claim 4 does not recite any additional elements and therefore does not amount to significantly more.

	Claim 6:
Step 1: Claim 6 is directed to a method; therefore the claim does fall within one of the four statutory categories (i.e., process, machine, manufacture, or composition of matter).
Step 2A, Prong 1: Claim 6 recites the abstract ideas from claim 1 on which it depends.
	
Step 2A, Prong 2: Claim 6 recites the following additional elements:
storing in the experience buffer any experiences previously observed except for the experiences that resulted in a corresponding reward that fails to exceed the first threshold. These are interpreted as transmitting and receiving data, which does not integrate the abstract idea into a practical application.
Step 2B: Claim 6 recites the following additional elements:
storing in the experience buffer any experiences previously observed except for the experiences that resulted in a corresponding reward that fails to exceed the first threshold. These are interpreted as transmitting and receiving data, which does not amount to significantly more (see MPEP 2106.05(d)(II)).
	Claim 7:
Step 1: Claim 7 is directed to a method; therefore the claim does fall within one of the four statutory categories (i.e., process, machine, manufacture, or composition of matter).
Step 2A, Prong 1: Claim 7 recites the following abstract ideas:
the state represents a local state in a low-dimensional space (mathematical representation).
Step 2A, Prong 2: Claim 7 does not recite any additional elements and therefore does not integrate the abstract idea into a practical application.

Step 2B: Claim 7 does not recite any additional elements and therefore does not amount to significantly more.
	Claim 9:
Step 1: Claim 9 is directed to a method; therefore the claim does fall within one of the four statutory categories (i.e., process, machine, manufacture, or composition of matter).
Step 2A, Prong 1: Claim 9 recites the following abstract ideas:
the first threshold is used to identify critical events (mental process directed to observation, evaluation).
Step 2A, Prong 2: Claim 9 recites the following additional elements:
any of the plurality of experiences unrelated to the critical events are stored in the experience buffer and any of the plurality of experiences related to the critical events are stored in the event buffer. These are interpreted as transmitting and receiving data, which does not integrate the abstract idea into a practical application.
Step 2B: Claim 9 recites the following additional elements:
any of the plurality of experiences unrelated to the critical events are stored in the experience buffer and any of the plurality of experiences related to the critical events are stored in the event buffer. These are interpreted as transmitting and receiving data, which does not amount to significantly more (see MPEP 2106.05(d)(II)).
	Claim 11:
Step 1: Claim 11 is directed to a method; therefore the claim does fall within one of the four statutory categories (i.e., process, machine, manufacture, or composition of matter).

Step 2A, Prong 1: Claim 11 recites the abstract ideas from claim 1 on which it depends.
	
Step 2A, Prong 2: Claim 11 recites the following additional elements:
plotting, on a display device, a plurality of experiences in a visualization, each of the plurality of experiences having a respective reward to form a plurality of rewards across the plurality of experiences, wherein said plotting step uses the plurality of rewards as weights for the visualization. These are interpreted as generic computer components and transmitting and receiving data, which does not integrate the abstract idea into a practical application.
Step 2B: Claim 11 recites the following additional elements:
plotting, on a display device, a plurality of experiences in a visualization, each of the plurality of experiences having a respective reward to form a plurality of rewards across the plurality of experiences, wherein said plotting step uses the plurality of rewards as weights for the visualization. These are interpreted as generic computer components and transmitting and receiving data, which does not amount to significantly more (see MPEP 2106.05(f) and MPEP 2106.05(d)(II)).
Claim 13 is a non-transitory computer readable storage medium claim and its limitation is included in claim 2. Claim 13 is rejected for the same reasons as claim 2.
	Claim 14 is a non-transitory computer readable storage medium claim and its limitation is included in claim 3. Claim 14 is rejected for the same reasons as claim 3.
	Claim 15 is a non-transitory computer readable storage medium claim and its limitation is included in claim 4. Claim 15 is rejected for the same reasons as claim 4.
Claim 17 is a non-transitory computer readable storage medium claim and its limitation is included in claim 6. Claim 17 is rejected for the same reasons as claim 6.

Claim 18 is a non-transitory computer readable storage medium claim and its limitation is included in claim 7. Claim 18 is rejected for the same reasons as claim 7.
Viewed as a whole, these additional claim elements do not provide meaningful limitations to transform the abstract idea into a patent eligible application of the abstract idea such that the claims amount to significantly more than the abstract idea itself. Therefore, the claims are rejected under 35 U.S.C. 101 as being directed to non-statutory subject matter.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-4, 6-7, 9, 12-15, 17-18, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Yoshiike et al (US 20100318478 A1, herein Yoshiike) in view of Shalev-Shwartz et al (US 20180032082 A1, herein Shalev-Shwartz).
Regarding claim 1, Yoshiike teaches a computer-implemented method (para. [0009] recites an information processing method according to an embodiment of the present invention) for reinforcement learning training (para. [0091] recites that learning, recognition of situations, and planning of actions (determination of actions) that the agent performs can be applied to a problem that can be formulated with the framework of Markov decision process (MDP) that is commonly taken as a reinforcement learning problem) performed by a processor (para. [0827] recites that the program may be processed by a single computer (processor), or may be processed by decentralized processing by multiple computers), the method comprising:
obtaining, from an environment, a given experience that includes an action, a state and a reward (fig. 17 and para. [0392] recite the state transition probability aij(Um) (i.e. an experience) regarding each of the state S (i.e. a state), in the i-axis direction of the three-dimensional state transition probability table A made up of the i axis, j axis, and action axis for each of the action Um (i.e. an action). Para. [0735] recites that the action determining unit 24 obtains the sum of state transition probabilities arrayed in the j-axial direction (horizontal direction) on the state transition probability plane for each action Um, as the action suitability (i.e. a reward));
during training, storing the given experience in an experience buffer responsive to a value of the reward included in the given experience not being below an average award amount for a plurality of experiences by a first threshold amount (fig. 47 step 343 and para. [0736] recite selecting actions Um of which the action suitability is at or above the threshold as candidates for the next action to be performed following the first strategy. Fig. 4 and para. [0106] recite that the series of the observation values (observation value series), and the series of the actions (action series) are stored in the history storage unit 14 (i.e. the experience buffer)) while excluding from the experience buffer events where an agent dies corresponding to the value of the reward included in the given experience being below the average award amount by the first threshold (para. [0736] recites the action determining unit 24 sets the action suitability obtained regarding an action Um of which the action suitability is below a threshold to 0.0 (i.e. events where the corresponding reward is below the average award amount), thereby eliminating actions Um of which the action suitability is below a threshold from candidates for the next action to be performed);
responsive to obtaining another experience having another reward that is less than or equal to the average award amount by the first threshold amount (fig. 47 step 343 and para. [0736] recite that the action determining unit 24 sets the action suitability obtained regarding an action Um of which the action suitability is below a threshold to 0.0, thereby eliminating actions Um of which the action suitability is below a threshold from candidates for the next action to be performed following the first strategy with regard to the state series of interest (i.e. the experience is below a threshold). Para. [0743] recites if the agent is desired to return to a known location, or if the agent is desired to develop an unknown location, an action where the agent wanders through the action environment is far from desirable. Thus, the action determining unit 24 is arranged so as to be able to determine the next action based on, in addition to the first strategy, a second and third strategy which are described below (i.e. when the first strategy does not result in a desirable action, the action unit turns to a second or third strategy)), searching the experience buffer for a candidate experience with a similar state to the other experience (fig. 49 and para. [0747] recite the second strategy, wherein there is no state which immediately precedes the last state, the action determining unit 24 refers to the expanded HMM (or the state transition probability thereof) stored in the model storage unit 22 to obtain states for which the last state can serve as a transition destination of state transition (i.e. searching for a similar experience). Para. [0749] recites that the action determining unit 24 sets the action suitability for actions other than the action regarding which the action suitability is the greatest, to 0.0, consequently selecting the action with the greatest action suitability as a candidate for the next action to be performed) and copying the candidate experience into an event buffer (the state transition probability of the HMM (Hidden Markov Model) is expanded to state transition probability for each action performed by the agent, and the HMM of which the state transition probability is thus expanded (hereafter, also referred to as "expanded HMM") is employed as a learning object by the learning unit 21. Para. [0114] recites that the model storage unit 22 stores (the state transition probability, observation probability, and the like that are model parameters stipulating) the expanded HMM (i.e. the event buffer));
and during exploration, selecting an action to be taken to the environment from the event buffer with a predetermined probability (fig. 47 step 345 and para. [0738] recite that the action determining unit 24 determines the next action from the candidates for the next action (i.e. selects an action with a predetermined probability from the event buffer), based on the action suitability regarding the actions Um obtained for each of the one or more current state series candidates from the state recognizing unit 23).
However, Yoshiike does not teach performing the action taken from the event buffer to avoid the action becoming an event where the agent dies, the action being a controlling of a motor vehicle for accident avoidance.
Shalev-Shwartz teaches performing the action taken from the event buffer to avoid the action becoming an event where the agent dies, the action being a controlling of a motor vehicle for accident avoidance (para. [0221] recites considering a reward function for which R(s) = -r for trajectories that represent a rare "corner" event to be avoided (e.g., such as an accident), and R(s) ∈ [-1, 1] for the rest of the trajectories, one goal for the learning system may be to learn to perform an overtake maneuver. Normally, in an accident free trajectory, R(s) would reward successful, smooth, takeovers and penalize staying in a lane without completing the takeover, hence the range [-1, 1]. If a sequence, S, represents an accident, the reward, -r, should provide a sufficiently high penalty to discourage such an occurrence (i.e. performing an action to avoid a critical event). Para. [0332] recites after a selection is made from among potential actions in response to a sensed navigational state, at step 1627, the at least one processor may cause at least one adjustment of a navigational actuator of the host vehicle in response to the selected potential navigational action. The navigational actuator may include any suitable device for controlling at least one aspect of the host vehicle. For example, the navigational actuator may include at least one of a steering mechanism, a brake, or an accelerator (i.e. controlling a motor vehicle)).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine these teachings by using the methods from Yoshiike to help determine the best navigation action in Shalev-Shwartz. Yoshiike and Shalev-Shwartz are both directed to using reinforcement learning to determine a next best action, but while Yoshiike describes performing an action taken from the event buffer to avoid a critical event (Examiner’s note: one of ordinary skill would understand that the agent dying and a critical event are analogous, as noted in paragraph [0019] of the specification) Yoshiike does not teach that the action involves controlling a motor vehicle. One of ordinary skill would 
Regarding claim 2, the combination of Yoshiike and Shalev-Shwartz teaches the method according to claim 1, wherein said storing step comprises:
searching the event buffer for a similar experience having a same state as the given experience (Yoshiike fig. 48 shows a method to “search for situation similar to current situation from own structured knowledge” (i.e. searching the event buffer for a similar experience). Yoshiike para. [0744] recites that in fig. 48, the action determining unit 24 determines, as the next action, an action wherein there is generated state transition from the last state s_t of one or more current state series candidates from the state recognizing unit 23, to an immediately preceding state s_(t-1) immediately before the last state s_t); and storing the similar experience into the event buffer responsive to any of the action and the reward of the similar experience being different from those of the given experience (Yoshiike para. [0115] recites that the state recognizing unit 23 recognizes the current situation of the agent based on the expanded HMM stored in the model storage unit 22 (i.e. the similar experience found by searching the event buffer) using the action series and the observation value series stored in the history storage unit 14, and obtains (recognizes) the current state that is the state of the expanded HMM corresponding to the current situation thereof (i.e. stores the similar experience back into the event buffer). Yoshiike para. [0741] recites that in the event of determining an action following the first strategy, the agent performs an action which the agent has performed under a known situation similar to the current situation).
Regarding claim 3, the combination of Yoshiike and Shalev-Shwartz teaches the method according to claim 1, further comprising stopping the selecting of the action from the event buffer after a pre-defined number of steps during a training stage of the reinforcement learning (Yoshiike fig. 5 and para. [0146] recites that in the case that determination is made in step S17 that the agent has performed an action by already specified number of times, i.e., in the case that the point-in-time t is equal to the already specified number of times, the processing in the reflective action mode ends (i.e. stopping the selection of the action after a pre-defined number of steps)).
Regarding claim 4, the combination of Yoshiike and Shalev-Shwartz teaches the method according to claim 3, further comprising performing random exploration responsive to said stopping step (Yoshiike fig. 4 and para. [0129] recite that the random target generating unit 35 selects one state out of the states of the expanded HMM stored in the model storage unit 22 at random as a random target, and supplies the random target thereof to the target selecting unit 31 as the internal target serving as the target state).
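For illustration only, and not as part of the grounds of rejection, the limitations of claims 3 and 4 (stopping selection from the event buffer after a pre-defined number of steps, then performing random exploration) may be sketched in Python as follows. The function name explore and the parameters max_event_steps and p_event are hypothetical, and the uniform random fallback is an assumption rather than anything stated in the claims or in Yoshiike.

```python
import random
from typing import Sequence


def explore(step: int,
            max_event_steps: int,
            event_actions: Sequence[str],
            all_actions: Sequence[str],
            p_event: float,
            rng: random.Random) -> str:
    """Pick an exploration action for the given training step."""
    # Event-buffer selection is only allowed before the pre-defined step count.
    event_phase = step < max_event_steps
    if event_phase and event_actions and rng.random() < p_event:
        return rng.choice(event_actions)
    # After the stopping step (or when the event buffer is not used),
    # fall back to random exploration over all available actions.
    return rng.choice(all_actions)
```

With p_event set to 1.0, the sketch always draws from the event buffer until step reaches max_event_steps, after which every action comes from random exploration.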
Regarding claim 6, the combination of Yoshiike and Shalev-Shwartz teaches the method according to claim 1, further comprising storing in the experience buffer any experiences previously observed (Yoshiike fig. 4 and para. [0106] recite that the series of the observation values (observation value series), and the series of the actions (action series) are stored in the history storage unit 14 (i.e. the experience buffer)) except for the experiences that resulted in a corresponding reward that fails to exceed the first threshold (Yoshiike para. [0736] recites the action determining unit 24 sets the action suitability obtained regarding an action Um of which the action suitability is below a threshold to 0.0 (i.e. experiences where the corresponding reward fails to exceed a first threshold), thereby eliminating actions Um of which the action suitability is below a threshold from candidates for the next action to be performed).
Regarding claim 7, the combination of Yoshiike and Shalev-Shwartz teaches the method according to claim 1, wherein the state represents a local state in a low-dimensional space (Yoshiike para. [0153] recites that the state transition probability of a common HMM can be represented by a two-dimensional table (i.e. a low-dimensional space) where the state transition probability aij of the state transition from the state Si to the state Sj is disposed at the i'th from the top and the j'th from the left).
Regarding claim 9, the combination of Yoshiike and Shalev-Shwartz teaches the method according to claim 1, wherein the method is applied to the plurality of experiences, wherein the first threshold is used to identify critical events (Yoshiike para. [0365] recites comparing the state transition probability of a certain state, and the state transition probability of another state (i.e. a plurality of experiences) to which observation probability for observing the same observation value as with that state is assigned (a value other than (not regarded as) 0.0), a state is equivalent to the open edge wherein regardless of understanding that state transition to the next state can be performed when a certain action is performed, in this state this action has not been performed, and accordingly, state transition probability has not been assigned thereto (deemed to be 0.0), and state transition is incapable of being performed (i.e. a critical event)), and wherein any of the plurality of experiences unrelated to the critical events are stored in the experience buffer and any of the plurality of experiences related to the critical events are stored in the event buffer (Yoshiike para. [0115] recites that the state recognizing unit 23 recognizes the current situation of the agent based on the expanded HMM stored in the model storage unit 22 (i.e. the event buffer) using the action series and the observation value series stored in the history storage unit 14 (i.e. the experience buffer), and obtains (recognizes) the current state that is the state of the expanded HMM corresponding to the current situation thereof).
Claim 12 is a non-transitory computer readable storage medium claim and its limitation is included in claim 1. The only difference is that claim 12 requires a non-transitory computer readable storage medium (Yoshiike para. [0823] recites upon a command being input by an input unit 107 being operated by the user or the like via the input/output interface 110, the CPU 102 executes a program stored in ROM (Read Only Memory) 103, or loads a program stored in the hard disk 105 to RAM (Random Access Memory) 104 and executes the program (i.e. ROM and RAM are examples of non-transitory computer readable storage media)). Therefore, claim 12 is rejected for the same reasons as claim 1.
Claim 13 is a non-transitory computer readable storage medium claim and its limitation is included in claim 2. Claim 13 is rejected for the same reasons as claim 2.
Claim 14 is a non-transitory computer readable storage medium claim and its limitation is included in claim 3. Claim 14 is rejected for the same reasons as claim 3.
Claim 15 is a non-transitory computer readable storage medium claim and its limitation is included in claim 4. Claim 15 is rejected for the same reasons as claim 4.
Claim 17 is a non-transitory computer readable storage medium claim and its limitation is included in claim 6. Claim 17 is rejected for the same reasons as claim 6.
Claim 18 is a non-transitory computer readable storage medium claim and its limitation is included in claim 7. Claim 18 is rejected for the same reasons as claim 7.
Claim 20 is a system claim and its limitation is included in claim 1. The only difference is that claim 20 requires a system (Yoshiike fig. 4 and para. [0089] recite a configuration example of an embodiment of the agent to which the information processing device (i.e. a system) according to the present invention is applied). Therefore, claim 20 is rejected for the same reasons as claim 1.
	
Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Yoshiike et al (US 20100318478 A1, herein Yoshiike) in view of Shalev-Shwartz et al (US 20180032082 A1, herein Shalev-Shwartz) and further in view of Mnih et al (US 20150100530 A1, herein Mnih).
Regarding claim 11, the combination of Yoshiike and Shalev-Shwartz teaches the computer-implemented method of claim 1 (Yoshiike para. [0009] recites an information processing method according to an embodiment of the present invention. Yoshiike para. [0091] recites that learning, recognition of situations, and planning of actions (determination of actions) that the agent performs can be applied to a problem that can be formulated with the framework of Markov decision process (MDP) that is commonly taken as a reinforcement learning problem).
However, the combination of Yoshiike and Shalev-Shwartz does not explicitly teach plotting, on a display device, a plurality of experiences in a visualization, each of the plurality of experiences having a respective reward to form a plurality of rewards across the plurality of 
Mnih teaches plotting, on a display device, a plurality of experiences in a visualization, each of the plurality of experiences having a respective reward to form a plurality of rewards across the plurality of experiences, wherein said plotting step uses the plurality of rewards as weights for the visualization (para. [0088] describes a series of figures: fig. 6a and 6c recite visualizations of average reward per episode during training (i.e. plotting each experience in a visualization using the respective reward as a weight for the visualization), whereas fig. 6b and 6d recite visualizations of the average maximum predicted action-value of a set of states).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine these teachings by using the visualization method from Mnih to plot the experiences from Yoshiike (as modified by Shalev-Shwartz) based on their respective rewards. The specification from Yoshiike includes a number of visualizations to show how the rewards of a given experience determine which path the agent follows, but does not describe whether these visualizations are provided to the user. Using the method from Mnih to plot the experiences from Yoshiike (as modified by Shalev-Shwartz) would provide the user with additional context to understand why the agent follows a specific strategy or whether the agent is repeatedly making similar mistakes, which would allow one of ordinary skill to correct errors or improve overall performance.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
US 20190061147 A1 (Luciw et al) teaches using deep neural network based Q-learning to collect a new experience by an agent, compare the new experience to experiences stored in the agent's memory, and either discard the new experience or overwrite an experience in the memory with the new experience based on the comparison.
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to LEAH M FEITL whose telephone number is (571)272-8350. The examiner can normally be reached on M-F 0800-1700.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is 
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Li B. Zhen can be reached on (571) 272-3768. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
	/L.M.F./
             Examiner, Art Unit 2121   



/Li B. Zhen/
Supervisory Patent Examiner, Art Unit 2121