DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claims 1-20 are pending, of which claims 1, 5, and 13 are independent.

Acknowledgement of References Cited By Applicant
As required by MPEP 609 (c), the Applicants’ submission of the Information Disclosure Statement is acknowledged by the examiner and the cited references have been considered in the examination of the claims now pending. 
As required by MPEP 609 (c)(2), a copy of each PTOL-1449, initialed and dated by the Examiner, is attached to the instant office action. Applicant is respectfully reminded of the requirements of MPEP 609 (b)(1) and 37 CFR 1.97 listing the requirements for an Information Disclosure Statement. 

Examiner Notes
Examiner cites particular columns, paragraphs, figures and line numbers in the references as applied to the claims below for the convenience of the applicant. Although the specified citations are representative of the teachings in the art and are applied to the specific limitations within the individual claim, other passages and figures may apply as well. It is respectfully requested that, in preparing responses, the applicant fully consider the references in their entirety as potentially teaching all or part of the claimed invention, as well as the context of the passage as taught by the prior art or disclosed by the examiner. The entire reference is considered to provide disclosure relating to the claimed invention. The claims & only the claims form the metes & bounds of the invention. Office personnel are to give the claims their broadest reasonable interpretation in light of the supporting disclosure. Unclaimed limitations appearing in the specification are not read into the claim. Prior art was referenced using terminology familiar to one of ordinary skill in the art. Such an approach is broad in concept and can be either explicit or implicit in meaning. Examiner's Notes are provided with the cited references to assist the applicant to better understand how the examiner interprets the applied prior art. Such comments are entirely consistent with the intent & spirit of compact prosecution.

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claim(s) 1-4, 13-14, and 18-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Mitchell Spryn, et al., “Distributed Deep Reinforcement Learning on the Cloud for Autonomous Driving,” 2018 ACM/IEEE 1st International Workshop on Software Engineering for AI in Autonomous Systems, May 28, 2018.

Claim 1. Spryn discloses A computer-implemented method, comprising (Spryn, Abstract, the systems and method are for cloud computing and virtual machines):
running a simulation of a system in a simulation environment at a first compute node, the simulation of the system comprising an agent representing the system using a reinforcement learning model to operate within the simulation environment (Spryn, Abstract, “This paper proposes an architecture for leveraging cloud computing technology to reduce training time for deep reinforcement learning models for autonomous driving by distributing the training process across a pool of virtual machines.”; Fig. 2 Each agent (any one of which teaches a first compute node) runs a copy of the simulator as well as a local trainer.); 
obtaining data indicating how the agent performed in the simulation environment (Spryn, p. 17 column 2 paragraph 1 “For our problem, we define the state as a single RGB frame input from a front-facing web-cam on the car. Given this state information, the agent then takes the action of selecting a steering control signal from five possible values: hard left, soft left, straight ahead, soft right and hard right. Once the selection is made, the agent is then given a reward relative to its position in the environment. The details of the reward function are discussed in more detail in section 3.3.”; p. 19 section 3 EXPERIMENT DESIGN “We used Microsoft AirSim [12] as our simulator for the experiments presented here. In addition to having high-quality environments with realistic vehicle physics, it has a python API which allows for easy data extraction and control.” Examiner’s Note (EN): The data collected and given a reward is an indication of agent performance in the environment.); 
transmitting the data to a second compute node running a training application to train the reinforcement learning model to result in an updated reinforcement learning model (Spryn,
[Image: media_image1.png]

EN: The bidirectional arrows indicate the transmission of data to and from at least a first node to a second node. As indicated, the master model is updated by the Parameter server); and 
providing the updated reinforcement learning model to the agent to run the simulation of the system according to the updated model (Spryn, Fig. 2 updates from Parameter server to agents; p. 18 column 2 paragraph 3 “Our job distribution paradigm is shown in Figure 2. We start with a pool of virtual machine nodes. At the start of training, one node is designated the parameter server node and all other nodes are designated as agent nodes. The parameter server is responsible for keeping the master copy of the model, accepting asynchronous updates from each of the agent nodes, and controlling the annealing rate. The agent nodes are responsible for running the simulator and performing local model training. After an agent completes an episode, it performs a training iteration on its local copy of the model with the collected data. Once the training completes, it then computes the change in weights of each of the layers of the model (the gradient). It then sends the gradient to the parameter server, and waits for a response. When the parameter server receives the gradient from the agent node, it adds the gradient to the master copy of the model, and then sends the updated model back to the agent node.”).
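For illustration only, the exchange described in the quoted passage (an agent sends its change in weights, the parameter server adds it to the master copy and returns the updated model) can be sketched roughly as follows. This is a minimal sketch under stated assumptions: the names ParameterServer, apply_gradient, and weight_change, and the use of plain Python lists for layer weights, are hypothetical and do not come from Spryn.

```python
from typing import Dict, List

# Hypothetical representation: layer name -> flat list of weights.
Weights = Dict[str, List[float]]


class ParameterServer:
    """Holds the master copy of the model (illustrative names only)."""

    def __init__(self, weights: Weights) -> None:
        self.master = {name: list(values) for name, values in weights.items()}

    def apply_gradient(self, gradient: Weights) -> Weights:
        # Add the reported per-layer weight change to the master copy,
        # then return a copy of the updated model to the calling agent.
        for name, delta in gradient.items():
            layer = self.master[name]
            for i, d in enumerate(delta):
                layer[i] += d
        return {name: list(values) for name, values in self.master.items()}


def weight_change(before: Weights, after: Weights) -> Weights:
    # The "gradient" of the quoted passage: the change in weights of each
    # layer after one local training iteration.
    return {
        name: [a - b for a, b in zip(after[name], before[name])]
        for name in before
    }


server = ParameterServer({"dense": [0.0, 0.0]})
local_before = {"dense": [0.0, 0.0]}
local_after = {"dense": [0.5, -0.25]}  # stand-in for a local training step
updated = server.apply_gradient(weight_change(local_before, local_after))
```

Because the server returns the full master copy, the model an agent receives may already include changes reported by other agents.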

Claim 2. Spryn discloses The computer-implemented method of claim 1, wherein the data includes at least an initial state of the system, an action performed by the system in response to the initial state, a new state resulting from the action, and a reward value corresponding to the action performed and a change from the initial state to the new state (Spryn p. 17 column 2 paragraph 1 “During each iteration of the training process, the agent is presented with a set of actions A, from which it selects one to perform. This action takes the agent from its current state S to a new state S’. As a result of selecting this action, the environment provides the agent with a reward R(S, S’,A). These actions are repeated until a terminal state is reached in which there are no further choices for actions which the agent can take. This marks the end of an episode, after which the agent is placed into a new random state and the training continues.”).
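For illustration only, the four items of data discussed above (state S, action A, new state S’, and reward R(S, S’, A)) can be grouped into a single record. The sketch below is an assumption about one possible layout; the names Transition and ACTIONS are hypothetical, and the five action labels are the steering values quoted from Spryn for claim 1.

```python
from typing import NamedTuple, Tuple


class Transition(NamedTuple):
    """One step of experience (hypothetical field names)."""
    state: Tuple[float, ...]       # S, e.g. flattened image features
    action: int                    # A, index into ACTIONS
    next_state: Tuple[float, ...]  # S'
    reward: float                  # R(S, S', A)


# The five steering values named in the quoted passage.
ACTIONS = ("hard left", "soft left", "straight ahead", "soft right", "hard right")

t = Transition(state=(0.0,), action=2, next_state=(0.1,), reward=0.9)
```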

Claim 3. Spryn discloses The computer-implemented method of claim 1, wherein the method further comprises establishing, between the first compute node and the second compute node, a communication channel to allow transmission of the data to the second compute node and to obtain the updated reinforcement learning model (Spryn, Fig. 2 the bidirectional arrows represent a communication channel permitting transmission of data between nodes).

Claim 4. Spryn discloses The computer-implemented method of claim 1, wherein the method further comprises: obtaining second data indicating how the agent performed in the simulation environment, the agent running the simulation of the system according to the updated model; transmitting the second data to the second compute node to train the updated reinforcement learning model to result in a second updated reinforcement learning model 
[Image: media_image1.png]
; obtaining a notification from the second compute node that indicates that the second updated reinforcement learning model satisfies a simulation termination requirement; and terminating, in response to the notification, the simulation of the system (Spryn, p. 18 column 2 paragraph 3-p. 19 paragraph 1 “We start with a pool of virtual machine nodes. At the start of training, one node is designated the parameter server node and all other nodes are designated as agent nodes. The parameter server is responsible for keeping the master copy of the model, accepting asynchronous updates from each of the agent nodes, and controlling the annealing rate. The agent nodes are responsible for running the simulator and performing local model training. After an agent completes an episode, it performs a training iteration on its local copy of the model with the collected data. Once the training completes, it then computes the change in weights of each of the layers of the model (the gradient). It then sends the gradient to the parameter server, and waits for a response. When the parameter server receives the gradient from the agent node, it adds the gradient to the master copy of the model, and then sends the updated model back to the agent node. This may be a different model than the agent currently has, as it may include gradients received from other nodes as well. This process repeats until the parameter server has received a set number of iterations.” EN: The cessation of the parameter server (i.e. second node) is based on a notification that a count or number of iterations is reached.).
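For illustration only, a stopping condition based on a set number of received iterations might look like the sketch below. The quoted passage states only that the process repeats until the parameter server has received a set number of iterations; the class name IterationBudget and the boolean return value used here as a stand-in for a notification are assumptions, not details taken from Spryn.

```python
class IterationBudget:
    """Counts received updates against a fixed limit (hypothetical helper)."""

    def __init__(self, max_iterations: int) -> None:
        self.max_iterations = max_iterations
        self.received = 0

    def record_update(self) -> bool:
        # Count one received update; return True once the limit is reached,
        # which a caller could treat as the signal to stop the simulation.
        self.received += 1
        return self.received >= self.max_iterations


budget = IterationBudget(max_iterations=3)
flags = [budget.record_update() for _ in range(3)]
```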

Claim 13. Spryn discloses A non-transitory computer-readable storage medium having stored thereon executable instructions that, as a result of being executed by one or more processors of a computer system, cause the computer system to at least (Spryn, “Our experiments run on the Microsoft Azure cloud and utilize the NV-series virtual machines. These machines contain NVIDIA Tesla GPUs and are optimized for visualization tasks, which allow us to run our simulator to receive photo-realistic images for training. We used Azure Batch to manage the virtual machines and coordinate the distribution jobs.” The disclosed computing devices include standard non-transitory computer-readable storage media.): 
obtain, from a simulation of a second system using a reinforcement learning model in a simulation environment, data indicating how the simulation of the second system performed in the simulation environment (Spryn, Fig. 2 showing distributed system with a plurality of agents; p. 17 column 2 paragraph 1 “For our problem, we define the state as a single RGB frame input from a front-facing web-cam on the car. Given this state information, the agent then takes the action of selecting a steering control signal from five possible values: hard left, soft left, straight ahead, soft right and hard right. Once the selection is made, the agent is then given a reward relative to its position in the environment. The details of the reward function are discussed in more detail in section 3.3.”; p. 19 section 3 EXPERIMENT DESIGN “We used Microsoft AirSim [12] as our simulator for the experiments presented here. In addition to having high-quality environments with realistic vehicle physics, it has a python API which allows for easy data extraction and control.” EN: The data collected and given a reward is an indication of agent performance in the environment, and any subsequent agent after the first teaches a second compute system, a third, etc.); 
transmit the data to another computer system to cause the another computer system to train the reinforcement learning model (Fig. 2, the bidirectional arrows indicate transmission and as indicated each agent has local training); 
obtain, from the another computer system, an updated reinforcement learning model, the updated reinforcement learning model having incorporated the data (Spryn, Fig. 2 updates from Parameter server to agents; p. 18 column 2 paragraph 3 “Our job distribution paradigm is shown in Figure 2. We start with a pool of virtual machine nodes. At the start of training, one node is designated the parameter server node and all other nodes are designated as agent nodes. The parameter server is responsible for keeping the master copy of the model, accepting asynchronous updates from each of the agent nodes, and controlling the annealing rate. The agent nodes are responsible for running the simulator and performing local model training. After an agent completes an episode, it performs a training iteration on its local copy of the model with the collected data. Once the training completes, it then computes the change in weights of each of the layers of the model (the gradient). It then sends the gradient to the parameter server, and waits for a response. When the parameter server receives the gradient from the agent node, it adds the gradient to the master copy of the model, and then sends the updated model back to the agent node.” EN: The parameter server incorporates the data as part of the update sent back to the agent node.); and 
update the simulation of the second system to cause the simulation of the second system to utilize the updated reinforcement learning model (Spryn, Fig. 2 updates from Parameter server to agents; p. 19 column 1 paragraph 1 “This process repeats until the parameter server has received a set number of iterations.” EN: Thus, the updated agent utilizes the updated model.).

Claim 14. Spryn discloses The non-transitory computer-readable storage medium of claim 13, wherein the data includes at least an initial state of the second system in the simulation environment, an action performed by the second system in response to the initial state, a new state of the second system in the simulation environment resulting from the action, and a reward value corresponding to the action performed (Spryn p. 17 column 2 paragraph 1 “During each iteration of the training process, the agent is presented with a set of actions A, from which it selects one to perform. This action takes the agent from its current state S to a new state S’. As a result of selecting this action, the environment provides the agent with a reward R(S, S’,A). These actions are repeated until a terminal state is reached in which there are no further choices for actions which the agent can take. This marks the end of an episode, after which the agent is placed into a new random state and the training continues.”).

Claim 18. Spryn discloses The non-transitory computer-readable storage medium of claim 13, wherein the instructions that cause the computer system to obtain the data further cause the computer system to: select, from a set of pairings of states and actions, a pairing comprising a state of the second system in the simulation environment and an action performable in response to the state (Spryn, p. 17 column 2 paragraph 2 “During each iteration of the training process, the agent is presented with a set of actions A, from which it selects one to perform. This action takes the agent from its current state S to a new state S’. As a result of selecting this action, the environment provides the agent with a reward R(S, S’,A). These actions are repeated until a terminal state is reached in which there are no further choices for actions which the agent can take. This marks the end of an episode, after which the agent is placed into a new random state and the training continues.” EN: The state and action are paired for the episode.); 
utilize the pairing as input to the simulation to cause the simulation to perform the action in response to the state; and obtain, in response to the action, a reward value corresponding to performance of the action in the simulation environment in response to the state (Spryn, p. 17 column 2 paragraph 2 “For our problem, we define the state as a single RGB frame input from a front-facing web-cam on the car. Given this state information, the agent then takes the action of selecting a steering control signal from five possible values: hard left, soft left, straight ahead, soft right and hard right. Once the selection is made, the agent is then given a reward relative to its position in the environment. The details of the reward function are discussed in more detail in section 3.3.” EN: The reward value corresponds to performance to train the system to achieve the desired response (e.g. turning). P. 18 column 2 paragraph 2 “Over time, as the model improves, we decrease the amount of exploration and increase the amount of exploitation as the model improves. Towards the end of training, we are almost exclusively using an exploitative strategy, but we still occasionally choose to explore in case the model has converged to a suboptimal strategy.”).

Claim 19. Spryn discloses The non-transitory computer-readable storage medium of claim 13, wherein the instructions that cause the computer system to obtain the data further cause the computer system to: select, from a set of states, a state for the second system in the simulation environment (Spryn, p. 17 column 2 paragraph 2 “During each iteration of the training process, the agent is presented with a set of actions A, from which it selects one to perform. This action takes the agent from its current state S to a new state S’. As a result of selecting this action, the environment provides the agent with a reward R(S, S’,A). These actions are repeated until a terminal state is reached in which there are no further choices for actions which the agent can take. This marks the end of an episode, after which the agent is placed into a new random state and the training continues.” EN: Selecting an action in turn selects the state corresponding to that action for the simulation.); 
utilize the state as input to the simulation to cause the simulation to perform an action in response to the state; and obtain, in response to the action, a reward value corresponding to performance of the action in the simulation environment in response to the state (Spryn, p. 20 column 2 paragraph 1 “The design of the reward function is critical to the success of the model. For our experiments, we decided to make the reward a function of the distance of the car from the center of the road. We computed the reward using the equation:
R(S, S’, A) = exp(−β ∗ ||x_c − x_r||)    (2)
where x_c is the position of the car, x_r is the position of the road, and β is a positive scaling constant that controlled the shape of the function. This reward function has the attractive property that it is in the range [0, 1], making it easier for the model to learn than an unbounded function like the raw distance.” EN: Thus, the reward value is based on performance of the action in the simulation environment).
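For illustration only, equation (2) as quoted above can be evaluated with a few lines of Python. This is a minimal sketch: the function name reward, the parameter names car_pos, road_pos, and beta, and the use of Euclidean distance for the norm are assumptions made for the example and are not taken from Spryn.

```python
import math
from typing import Sequence


def reward(car_pos: Sequence[float], road_pos: Sequence[float], beta: float) -> float:
    """Evaluate exp(-beta * ||car_pos - road_pos||) for a positive beta."""
    # Euclidean distance between the car and the road reference point
    # (assumed choice of norm).
    distance = math.dist(car_pos, road_pos)
    return math.exp(-beta * distance)


on_center = reward((0.0, 0.0), (0.0, 0.0), beta=0.5)
off_center = reward((3.0, 4.0), (0.0, 0.0), beta=0.5)
```

With a positive beta the value equals 1 when the car is at the road reference point and decays toward 0 as the distance grows, which matches the bounded behavior described in the quoted passage.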

Claim 20. Spryn discloses The non-transitory computer-readable storage medium of claim 13, wherein the instructions further cause the computer system to insert, into the simulation, a reinforcement function, the reinforcement function defining a set of reward values corresponding to actions performable in the simulation environment (Spryn, p. 20 section 3.3. Reward Function, the reward value is used to learn how to perform a desired task (e.g., turning), and Spryn describes the training process (see section 2.2 Training Job Description) as proceeding through a set number of iterations. It is considered obvious to one of ordinary skill in the art to adjust thresholds and convergence criteria as needed as part of routine optimizations.).

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 5-12 are rejected under 35 U.S.C. 103 as being unpatentable over Mitchell Spryn, et al., “Distributed Deep Reinforcement Learning on the Cloud for Autonomous Driving,” 2018 ACM/IEEE 1st International Workshop on Software Engineering for AI in Autonomous Systems, May 28, 2018 in view of Fan et al., "SURREAL: Open-Source Reinforcement Learning Framework and Robot Manipulation Benchmark" (Oct. 2018, submitted in IDS dated 4/6/2022).
Claim 5. Spryn teaches A first system, comprising: 
one or more processors; and memory that stores computer-executable instructions that, if executed, cause the system to (Spryn, “Our experiments run on the Microsoft Azure cloud and utilize the NV-series virtual machines. These machines contain NVIDIA Tesla GPUs and are optimized for visualization tasks, which allow us to run our simulator to receive photo-realistic images for training. We used Azure Batch to manage the virtual machines and coordinate the distribution jobs.”): 
execute a simulation of in a simulation environment, the simulation comprising an agent representing a second system using a reinforcement learning model to operate within the simulation environment (Spryn, Abstract, “This paper proposes an architecture for leveraging cloud computing technology to reduce training time for deep reinforcement learning models for autonomous driving by distributing the training process across a pool of virtual machines.”; Fig. 2 Each agent runs a copy of the simulator as well as a local trainer.); 
obtain data indicating how the agent performed in the simulation environment (Spryn, p. 17 column 2 paragraph 1 “For our problem, we define the state as a single RGB frame input from a front-facing web-cam on the car. Given this state information, the agent then takes the action of selecting a steering control signal from five possible values: hard left, soft left, straight ahead, soft right and hard right. Once the selection is made, the agent is then given a reward relative to its position in the environment. The details of the reward function are discussed in more detail in section 3.3.”; p. 19 section 3 EXPERIMENT DESIGN “We used Microsoft AirSim [12] as our simulator for the experiments presented here. In addition to having high-quality environments with realistic vehicle physics, it has a python API which allows for easy data extraction and control.” Examiner’s Note (EN): The data collected and given a reward is an indication of agent performance in the environment.); 
transmit the data to another system to cause the other system to use the data to update the reinforcement learning model (Spryn,
[Image: media_image1.png]

EN: The bidirectional arrows indicate the transmission of data to and from at least a first node to a second node. As indicated, the master model is updated by the Parameter server); and 
obtain, from the other system, an updated reinforcement learning model (Spryn, Fig. 2 updates from Parameter server to agents; p. 18 column 2 paragraph 3 “Our job distribution paradigm is shown in Figure 2. We start with a pool of virtual machine nodes. At the start of training, one node is designated the parameter server node and all other nodes are designated as agent nodes. The parameter server is responsible for keeping the master copy of the model, accepting asynchronous updates from each of the agent nodes, and controlling the annealing rate. The agent nodes are responsible for running the simulator and performing local model training. After an agent completes an episode, it performs a training iteration on its local copy of the model with the collected data. Once the training completes, it then computes the change in weights of each of the layers of the model (the gradient). It then sends the gradient to the parameter server, and waits for a response. When the parameter server receives the gradient from the agent node, it adds the gradient to the master copy of the model, and then sends the updated model back to the agent node.”).
Spryn does not explicitly disclose a robotic device.
Fan teaches that such a simulation system may include a robotic device (Fan, Abstract “We also introduce SURREAL Robotics Suite, an accessible set of benchmarking tasks in physical simulation for reproducible robot manipulation research.”; p. 2 paragraph 2 “In this paper, we introduce the open-source framework SURREAL (Scalable Robotic REinforcement-learning ALgorithms). Standard approaches to accelerating deep RL training focus on parallelizing the gradient computation [27, 28]. SURREAL decomposes a distributed RL algorithm into four components: generation of experience (actors), storage of experience (buffer), updating parameters from experience (learner), and storage of parameters (parameter server).”). 
Accordingly, it would have been obvious to one of ordinary skill in the art at the time the invention was effectively filed to have combined Spryn (directed to simulation execution environments with reinforcement learning) and Fan (directed to simulation execution environments with reinforcement learning) and arrived at a simulation environment that includes simulations of a robotic device. One of ordinary skill in the art would have been motivated to make such a combination because tackling complex control problems may be pursued using reinforcement learning in simulated environments as taught in Fan (Introduction).

Claim 6. Modified Spryn teaches The first system of claim 5, wherein the computer-executable instructions further cause the first system to: execute a second simulation of the second system in the simulation environment, the agent representing the second system using the updated reinforcement learning to operate within the simulation environment; obtain second data indicating how the agent performed in the simulation environment; transmit the second data to the other system; obtain a notification from the other system that indicates that a termination requirement for the simulation has been satisfied; and make available the updated reinforcement learning model for optimizing an application of the second system in response to the notification (Spryn, Fig. 2 
[Image: media_image1.png]
EN: Any of the Agents, subsequent to one being the “first”, may be the second agent/simulator/trainer. p. 18 column 2 paragraph 3-p. 19 paragraph 1 “We start with a pool of virtual machine nodes. At the start of training, one node is designated the parameter server node and all other nodes are designated as agent nodes. The parameter server is responsible for keeping the master copy of the model, accepting asynchronous updates from each of the agent nodes, and controlling the annealing rate. The agent nodes are responsible for running the simulator and performing local model training. After an agent completes an episode, it performs a training iteration on its local copy of the model with the collected data. Once the training completes, it then computes the change in weights of each of the layers of the model (the gradient). It then sends the gradient to the parameter server, and waits for a response. When the parameter server receives the gradient from the agent node, it adds the gradient to the master copy of the model, and then sends the updated model back to the agent node. This may be a different model than the agent currently has, as it may include gradients received from other nodes as well. This process repeats until the parameter server has received a set number of iterations.” EN: The cessation of the parameter server (i.e. second node) is based on a notification that a count or number of iterations is reached.).

Claim 7. Modified Spryn teaches The first system of claim 6, wherein the termination requirement is satisfied as a result of a maximum number of simulations of the second system in the simulation environment having been performed to update the reinforcement learning model (Spryn, p. 18 column 2 paragraph 3-p. 19 paragraph 1 “We start with a pool of virtual machine nodes. At the start of training, one node is designated the parameter server node and all other nodes are designated as agent nodes. The parameter server is responsible for keeping the master copy of the model, accepting asynchronous updates from each of the agent nodes, and controlling the annealing rate. The agent nodes are responsible for running the simulator and performing local model training. After an agent completes an episode, it performs a training iteration on its local copy of the model with the collected data. Once the training completes, it then computes the change in weights of each of the layers of the model (the gradient). It then sends the gradient to the parameter server, and waits for a response. When the parameter server receives the gradient from the agent node, it adds the gradient to the master copy of the model, and then sends the updated model back to the agent node. This may be a different model than the agent currently has, as it may include gradients received from other nodes as well. This process repeats until the parameter server has received a set number of iterations.” EN: The cessation of the training at the second agent is a termination requirement.).

Claim 8. Modified Spryn teaches The first system of claim 6, wherein the termination requirement is satisfied as a result of an average reward value being obtained over performance of a minimum number of simulations of the second system in the simulation environment having been obtained (Spryn, p. 20 column 2 section 3.3 Reward Function describing that the design of the reward function is critical to the particular model. One of ordinary skill in the art designs the reward function to achieve the desired goals and is considered taught by the design of a reward function that is within the routine optimization of the engineer to teach the desired outcomes).

Claim 9. Modified Spryn teaches The first system of claim 5, wherein the data includes at least an initial state of the second system in the simulation environment, an action performed by the second system in response to the initial state, a new state of the second system in the simulation environment resulting from the action, and a reward value corresponding to the action performed (Spryn, Fig. 2 illustrating multiple agents in cloud environment where the agents are updated periodically; p. 17 column 2 paragraph 1 “During each iteration of the training process, the agent is presented with a set of actions A, from which it selects one to perform. This action takes the agent from its current state S to a new state S’. As a result of selecting this action, the environment provides the agent with a reward R(S, S’,A). These actions are repeated until a terminal state is reached in which there are no further choices for actions which the agent can take. This marks the end of an episode, after which the agent is placed into a new random state and the training continues.”).

Claim 10. Modified Spryn teaches The first system of claim 5, wherein the computer-executable instructions that cause the first system to execute the simulation further cause the first system to: select a state for the second system as input to the agent to cause the agent to perform an action in response to the state; and obtain, in response to the action, a reward value corresponding to performance of the action in the simulation environment in response to the state (Spryn, Fig. 2 illustrating multiple agents in cloud environment; p. 17 column 2 paragraph 1 “During each iteration of the training process, the agent is presented with a set of actions A, from which it selects one to perform. This action takes the agent from its current state S to a new state S’. As a result of selecting this action, the environment provides the agent with a reward R(S, S’,A). These actions are repeated until a terminal state is reached in which there are no further choices for actions which the agent can take. This marks the end of an episode, after which the agent is placed into a new random state and the training continues.”).

Claim 11. Modified Spryn teaches The first system of claim 5, wherein the computer-executable instructions that cause the first system to execute the simulation further cause the first system to: select a set of states and a set of actions for the second system as input to the agent; and obtain, from the agent, a reward value corresponding to performance of the set of actions in the simulation environment based on the set of states (Spryn, Fig. 2 illustrating multiple agents in cloud environment; p. 17 column 2 paragraph 1 “During each iteration of the training process, the agent is presented with a set of actions A, from which it selects one to perform. This action takes the agent from its current state S to a new state S’. As a result of selecting this action, the environment provides the agent with a reward R(S, S’,A). These actions are repeated until a terminal state is reached in which there are no further choices for actions which the agent can take. This marks the end of an episode, after which the agent is placed into a new random state and the training continues.”).

Claim 12. Modified Spryn teaches The first system of claim 5, wherein the computer-executable instructions further cause the first system to: obtain computer-executable code defining a reinforcement function for training the reinforcement learning model (Spryn, p. 18 column 2 paragraph 3-p. 19 paragraph 1 “We start with a pool of virtual machine nodes. At the start of training, one node is designated the parameter server node and all other nodes are designated as agent nodes. The parameter server is responsible for keeping the master copy of the model, accepting synchronous updates from each of the agent nodes, and controlling the annealing rate. The agent nodes are responsible for running the simulator and performing local model training. After an agent completes an episode, it performs a training iteration on its local copy of the model with the collected data.”); and inject the reinforcement function into the simulation of the second system to determine a reward value corresponding to actions performed by the agent in the simulation environment (Spryn, p. 18 column 2 paragraph 3-p. 19 paragraph 1 “Once the training completes, it then computes the change in weights of each of the layers of the model (the gradient). It then sends the gradient to the parameter server, and waits for a response. When the parameter server receives the gradient from the agent node, it adds the gradient to the master copy of the model, and then sends the updated model back to the agent node. This may be a different model than the agent currently has, as it may include gradients received from other nodes as well. This process repeats until the parameter server has received a set number of iterations.” EN: Sending the updates to the second system is construed as the injection.).

Claims 15-17 are rejected under 35 U.S.C. 103 as being unpatentable over Mitchell Spryn, et al., “Distributed Deep Reinforcement Learning on the Cloud for Autonomous Driving,” 2018 ACM/IEEE 1st International Workshop on Software Engineering for AI in Autonomous Systems, May 28, 2018 in view of Vilches et al., “robot gym: accelerated robot training through simulation in the cloud with ROS and Gazebo,” (IDS dated 4/6/2022) (Aug. 2018).
Claim 15. The non-transitory computer-readable storage medium of claim 13, wherein the instructions further comprise instructions that, as a result of being executed by the one or more processors, cause the computer system to: obtain, from a second simulation of the second system using the updated reinforcement learning model in the simulation environment, second data indicating how the second simulation of the second system performed in the simulation environment (Spryn, Fig. 2 showing distributed system with a plurality of agents; p. 17 column 2 paragraph 1 “For our problem, we define the state as a single RGB frame input from a front-facing web-cam on the car. Given this state information, the agent then takes the action of selecting a steering control signal from five possible values: hard left, soft left, straight ahead, soft right and hard right. Once the selection is made, the agent is then given a reward relative to its position in the environment. The details of the reward function are discussed in more detail in section 3.3.”; p. 19 section 3 EXPERIMENT DESIGN “We used Microsoft AirSim [12] as our simulator for the experiments presented here. In addition to having high-quality environments with realistic vehicle physics, it has a python API which allows for easy data extraction and control.” EN: The data collected and given a reward is an indication of agent performance in the environment, and any subsequent agent after the first teaches a second compute system and third, etc.); 
transmit the second data to the another computer system to cause the another computer system to train the updated reinforcement learning model (Spryn, Fig. 2 updates from Parameter server to agents; p. 19 column 1 paragraph 1 “When the parameter server receives the gradient from the agent node, it adds the gradient to the master copy of the model, and then sends the updated model back to the agent node.”); 
obtain, from the another computer system, an indication of convergence of the updated reinforcement learning model as a result of a termination condition having been satisfied (Spryn, p. 20 column 1 paragraphs 1-2 describing training the model until convergence, and after convergence transferring the weights to train the final two layers of the Deep-Q-network).
Spryn does not explicitly disclose store the updated reinforcement learning model to allow installation of the updated reinforcement learning model on to a fleet of second systems.
Vilches teaches store the updated reinforcement learning model to allow installation of the updated reinforcement learning model on to a fleet of second systems (Vilches, Fig. 2 stored in container hub for distribution to a fleet via the Internet by the global policy; p. 4 paragraph 1 “All the instances need to run exactly the same code and version, each instance will fetch the latest available container image from a common container hub. A container hub is a service where the container engines fetch the containers that they should run.”; paragraph 3 “Following the orchestration, the robot robot gym framework initializes a global policy that will be used by all workers”).
Accordingly, it would have been obvious to one of ordinary skill in the art at the time the invention was effectively filed to have combined Spryn (directed to simulation execution environments with reinforcement learning) and Vilches (directed to reinforcement learning to train distributed robotic devices) and arrived at a simulation environment that includes simulations that once converged and trained are maintained and distributed to a plurality of installations. One of ordinary skill in the art would have been motivated to make such a combination to effectively and efficiently distribute a replica in an “encrypted, protected and secure way” as taught by Vilches.

Claim 16. Modified Spryn teaches The non-transitory computer-readable storage medium of claim 15, wherein the termination condition is satisfied as a result of a maximum number of simulations of the second system in the simulation environment having been performed resulting in the convergence (Spryn, p. 19 column 1 paragraph 1 “This process repeats until the parameter server has received a set number of iterations.” EN: Having a set number of iterations, or a threshold condition for convergence, is within the ordinary skill in the art and is taught and suggested by Spryn.).

Claim 17. Modified Spryn teaches The non-transitory computer-readable storage medium of claim 15, wherein the termination condition is satisfied as a result of an average reward value having been attained over a previous number of iterations of the simulation (Spryn, p. 20 section 3.3. Reward Function, the reward value is used to learn how to perform a desired task and as mentioned above, the training goes through a set number of iterations. It is considered obvious to one of ordinary skill in the art to adjust thresholds and convergence criteria as needed as part of routine optimizations.).
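EN: For illustration only, a termination condition based on an average reward over a previous number of iterations can be sketched as a sliding-window check. This is a minimal sketch and not code from Spryn; the class name `AverageRewardStop`, the window size, and the threshold value are hypothetical choices of the kind an engineer would tune.

```python
from collections import deque


class AverageRewardStop:
    """Hypothetical stopping rule: mean reward over the last `window` iterations."""

    def __init__(self, window: int, threshold: float) -> None:
        self.rewards = deque(maxlen=window)  # keeps only the most recent rewards
        self.threshold = threshold

    def update(self, reward: float) -> bool:
        """Record one iteration's reward; return True when training may stop."""
        self.rewards.append(reward)
        window_full = len(self.rewards) == self.rewards.maxlen
        average = sum(self.rewards) / len(self.rewards)
        return window_full and average >= self.threshold


# Usage: the condition is met only once the last three rewards average 2.0 or more.
stop = AverageRewardStop(window=3, threshold=2.0)
decisions = [stop.update(r) for r in [0.0, 2.0, 2.0, 2.0]]
```

The window size plays the role of the "previous number of iterations" and the threshold plays the role of the average reward value to be attained.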


Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to STEVEN W CRABB whose telephone number is (571)270-5095. The examiner can normally be reached M-F (6-2:30).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Rehana Perveen, can be reached at 571-272-3676. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/STEVEN W CRABB/Primary Examiner, Art Unit 2148