DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claims 1-20 are pending.

Information Disclosure Statement
The references cited in the information disclosure statements (IDS) submitted on 31 May 2022 have been considered by the examiner.


Response to Amendment
The amendment, filed 27 October 2022, is fully responsive.

Applicant’s amendments to claims 7-9 and 17-19 have overcome each and every 112(b) rejection previously set forth. The 112(b) rejections of claims 7-9 and 17-19 have been withdrawn.


Response to Arguments
Applicant’s arguments (see Amendment pages 8-9) with respect to the 103 rejections contend that Milton and Di Cairano, individually or in combination, do not teach a method to “jointly control the machine and update the control policy.”
The claim specifically recites “wherein, for performing the joint control and update, the processor is configured to control the machine using the control policy to collect data including a sequence of control inputs generated using the control policy and a sequence of states of the machine corresponding to the sequence of control inputs; determine a reward for a quality of the control policy on the state of the machine using a reward function of the sequence of control inputs and the sequence of states of the machine augmented with an adaptation term determined as the minimum amount of effort needed for the machine having the state to remain within the CIS; and update the control policy that improves a cost function of operation of the machine according to the determined reward.” (emphasis added) Accordingly, the “control …, determine …, and update …” limitations together fulfill “performing the joint control and update.”
Examiner submits that Milton teaches: the processor is configured to control the machine using the control policy to collect data including a sequence of control inputs generated using the control policy and a sequence of states of the machine corresponding to the sequence of control inputs; (Milton: [0094] “The attention weights and the input sequence can then be used in a first feed forward layer of the encoder portion of the transformer. For example, the attention weights and an event log can then be used in a first feed forward layer to weight each event of the event log by their respective attention weight. The output of the feed forward layer can then be used by a decoder portion of the transformer, wherein the decoder portion may include one or more other multi-head self-attention layers having different weights and other values from the first multi-head self-attention layer. The decoder portion may also include one or more feed forward layers having different weights and other values from the first feed forward layer. The output of the decoder portion of transformer can be used to categorize an input or generate inferences. For example, if the input sequence is a time series based on the number of swerves during a drive, an agent executing a neural network having an attention mechanism may determine whether swerves are safe or risky in an interval based on current and past vehicle operations. 
In response to a determination that the number of risky swerves exceed a threshold number, and as further described below, the agent may induce the local computing layer to determine an adjustment value to change at least one of a LIDAR warning range, a steering wheel responsiveness, or an anti-lock braking system responsiveness.”) [The swerve time series based on the input sequence reads on “to collect data including a sequence of control inputs …”, and the sequence of control inputs based on the adjustment to responsiveness of the system reads on “generated using the control policy”. Determining the risks of the swerves and the number of risky swerves reads on “a sequence of states of the machine …”.]
determine a reward for a quality of the control policy on the state of the machine using a reward function of the sequence of control inputs and the sequence of states of the machine augmented with an adaptation term determined as the minimum amount of effort needed for the machine having the state to remain within the CIS; and (Milton: [0120] “Some embodiments may include elements of reinforcement learning having one or more policy gradient methods. An agent executing on a top-view computing layer may proceed through a Markov decision process, wherein the sensor data from one or more vehicles, other data such as weather, and computed results may be discretized and treated as a finite set of states. In some embodiments, the control-system adjustment values or parameters used during machine-learning operations may be treated by the agent as the available actions in each state. A history of the control-system adjustments and their resulting effects may be evaluated using a reward system when implementing the reinforcement learning method. For example, positive feedback from a vehicle operator resulting from a query or a decrease in accident rates may be used by a reward function to provide reinforcement for an adjustment value change. Similarly, negative feedback from a vehicle operator or an increase in accident rates may be used by the reward function to discourage use of the adjustment value change. 
In some embodiments, the reinforcements may be provided through simulations of vehicle behavior based on one or more road network graphs or through a third-party system instead of or in addition to physical changes.”; [0121] “In some embodiments, an agent executing a reinforcement learning operation may modify an ensemble learning method to balance weights used in or applied to different machine-learning operations, wherein the reinforcement function may reward vehicle performance improvements, vehicle safety parameters, or a computation speed to different combinations of machine-learning approaches. For example, the agent may use a reinforcement learning approach to determine which of a combination of neural network features are to be implemented at the vehicle computing layer and the local computing layer based on a reward function. In some embodiments, the reward function used may be based on an accuracy parameter and a response time parameter. Use of this reward function during a reinforcement learning operation may result in a selection of a CNN computation on the vehicle layer and a selection of a convolutional LSTM having an attention mechanism on the local computing layer. By performing this meta-analysis and similar meta-analysis of machine-learning systems in a multilayer vehicle learning infrastructure, the agent executing on the central computing layer may be used to further refine and optimize machine-learning operations.”; [0122] “Some embodiments may supplement the reinforcement learning operation with a policy gradient method. An agent executing the reinforcement learning operation may implement a policy to determine a trajectory for a control-system adjustment values and other changes to vehicle behavior, wherein the proposed changes follow a policy gradient between any timestep. In some embodiments, the policy gradient of the policy may be changed based on the REINFORCE method. 
For example, while executing a reinforcement learning operation, an agent may implement a policy gradient determination operation, wherein one or more observables or quantifiable values such as response time, number of swerves, number of collisions, vehicle sensor data values, or the like may be used to determine a general path likelihood ratio as a gradient estimate.”) [Upon operation of the vehicle, using the reward function of positive or negative feedback from the vehicle operator or the reward function of the accuracy parameter and the response time parameter reads on “determine a reward for a quality … using a reward function”. The adjustment value reads on “an adaptation term”, and encouraging or discouraging of the adjustment value change of the vehicle operation based on the reward function reads on “… the control policy on the state of the machine using a reward function”. The adjustment value to keep positive feedback or certain accuracy parameter or the response time parameter reads on “the state to remain within the CIS”.]
update the control policy that improves a cost function of operation of the machine according to the determined reward. (Milton: [0037] “Some embodiments may use one or more agents executing on the vehicle computing layer to perform a machine-learning operation. In some embodiments, the machine-learning operation may comprise a training operation to use sensor data as an input with the goal of minimizing an objective function. For example, some embodiments may apply a machine-learning operation trained to predict the response profile of a braking event in response to the detection of an object or an arrival at a particular geolocation using a gradient descent method (e.g. batch gradient descent, stochastic gradient descent, etc.) to minimize an objective function. As another example, some embodiments may apply a machine-learning operation trained to predict the riskiest or most operationally vulnerable portion of a vehicle based on sensor data comprising internal engine temperature and pressure sensors, motion-based sensors, and control-system sensors.”; [0120] “In some embodiments, the control-system adjustment values or parameters used during machine-learning operations may be treated by the agent as the available actions in each state. A history of the control-system adjustments and their resulting effects may be evaluated using a reward system when implementing the reinforcement learning method. For example, positive feedback from a vehicle operator resulting from a query or a decrease in accident rates may be used by a reward function to provide reinforcement for an adjustment value change. Similarly, negative feedback from a vehicle operator or an increase in accident rates may be used by the reward function to discourage use of the adjustment value change. 
In some embodiments, the reinforcements may be provided through simulations of vehicle behavior based on one or more road network graphs or through a third-party system instead of or in addition to physical changes.”; [0122] “Some embodiments may supplement the reinforcement learning operation with a policy gradient method. An agent executing the reinforcement learning operation may implement a policy to determine a trajectory for a control-system adjustment values and other changes to vehicle behavior, wherein the proposed changes follow a policy gradient between any timestep. In some embodiments, the policy gradient of the policy may be changed based on the REINFORCE method. For example, while executing a reinforcement learning operation, an agent may implement a policy gradient determination operation, wherein one or more observables or quantifiable values such as response time, number of swerves, number of collisions, vehicle sensor data values, or the like may be used to determine a general path likelihood ratio as a gradient estimate.”) [The accident rate reads on “a cost function”, and the decrease in accident rates reads on “improves a cost function”. Implementing the policy for the control-system adjustment values and other changes to vehicle based on reinforcement learning reads on “update the control policy”.]
Accordingly, Examiner submits that Milton teaches a method to “jointly control the machine and update the control policy.”
The 103 rejections are maintained.
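
For illustration only, and not as a characterization of the claims or of either reference, the three recited steps (control to collect data, determine an augmented reward, update the policy) can be sketched as below. The scalar machine, the interval used as the CIS, the linear feedback policy, the reward weights, and the random-search update are all hypothetical assumptions chosen to keep the sketch small; none of them is taken from the application, Milton, or Di Cairano.

```python
import random

A = 1.2      # hypothetical drift of a scalar machine x' = A*x + u
X_MAX = 1.0  # hypothetical CIS: the interval [-X_MAX, X_MAX]
U_MAX = 0.5  # hypothetical control input bound |u| <= U_MAX


def policy(k, x):
    """Linear feedback u = -k*x, clipped to the control input constraint."""
    return max(-U_MAX, min(U_MAX, -k * x))


def min_effort(x):
    """Smallest |u| that keeps A*x + u inside [-X_MAX, X_MAX]."""
    return max(0.0, abs(A * x) - X_MAX)


def rollout(k, x0, steps=20):
    """Control the machine with the policy and collect inputs and states."""
    xs, us = [x0], []
    x = x0
    for _ in range(steps):
        u = policy(k, x)
        x = A * x + u
        us.append(u)
        xs.append(x)
    return us, xs


def reward(us, xs, weight=1.0):
    """Quadratic stage reward augmented with the adaptation term."""
    total = 0.0
    for u, x in zip(us, xs):  # pairs each input with the state it was applied in
        total += -(x * x + 0.1 * u * u) - weight * min_effort(x)
    return total


def joint_control_and_update(k=0.3, x0=0.4, iterations=50, seed=0):
    """Alternate between running the policy and updating its gain."""
    rng = random.Random(seed)
    best = reward(*rollout(k, x0))
    for _ in range(iterations):
        candidate = k + rng.uniform(-0.1, 0.1)  # perturbed policy parameter
        r = reward(*rollout(candidate, x0))
        if r > best:  # keep the update only if the reward improves
            k, best = candidate, r
    return k, best
```

The random-search update stands in for whatever reinforcement learning update is actually used; it was chosen here only because it needs no gradient machinery.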


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-3, 5-6, 10-13, 15-16 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Milton (US 2020/0017117 A1), hereinafter ‘Milton’, in view of Di Cairano et al. (US 2018/0017971 A1), hereinafter ‘Di Cairano’.

Regarding claim 1, Milton teaches:
A system for controlling an operation of a machine subject to state constraints in continuous state space of the machine and subject to control input constraints in continuous control input space of the machine, comprising: (Milton: Abstract “Provided is a system configured to determine and push adjustments to vehicle operations using machine-learning systems across multiple computing layers.”; [0006] “Some aspects include a process that includes processing and transferring sensor data from vehicles across different computing levels for machine-learning operations to compute adjustments for the vehicle control systems.”)
an input interface to accept data indicative of a state of the machine; (Milton: [0006] “Some aspects include a process that includes processing and transferring sensor data from vehicles across different computing levels for machine-learning operations to compute adjustments for the vehicle control systems.”; [0062] “In some embodiments, the computing environment 200 includes the vehicle data analytics system 250 configured to receive data from any of the above-described components. The vehicle data analytics system 250 may be executed on a single server of a local computing layer, a distributed computing network operating as part of a top-view computing layer, a cloud-based application on the top-view computing layer, some combination thereof In some embodiments, the vehicle data analytics system 250 may be configured to determine and store attributes of vehicles, vehicle operators, vehicle passengers, and places visited by those vehicles based on this data.”) [The vehicle data analytics system 250 receiving inputs through the network, as illustrated in figure 2, reads on “an input interface”. The sensor data from vehicles reads on “data indicative of a state of the machine”.]
a processor configured to iteratively perform a reinforcement learning (RL) algorithm to jointly control the machine and update the control policy, (Milton: [0139] “Computing system 1000 may include one or more processors (e.g., processors 1010a-1010n) coupled to system memory 1020, an input/output I/O device interface 1030, and a network interface 1040 via an input/output (I/O) interface 1050. A processor may include a single processor or a plurality of processors (e.g., distributed processors).”) (Milton: [0007] “… updating the control-system adjustment value based on the top-view control-system adjustment value; transmitting the control-system adjustment value to the first vehicle; and adjusting a vehicle response to an operator-effected change in the first vehicle based on the control-system adjustment value.”; [0119] “In some embodiments, a top-view computing application being executed on the top-view computing layer may act as an overriding agent that can modify, stop, or expand on operations to train or use machine-learning systems in any of the layers of the multilayer vehicle learning infrastructure. In some embodiments, the top-view computing application may perform one or more recurring activities to enforce a cyclical learning hierarchy by initiating one or more learning tasks to continually pull additional training data from the vehicle computing layer or the local computing layer. In addition, the top-view computing application may modify learning behavior in other machine-learning operations by applying policy gradient reinforcement learning methods, as further described below. Furthermore, the top-view computing application may include program code to start a re-training operation, induce a switch to a different machine-learning system, or rely on non-learning automation systems when a data stream, component, or layer becomes inaccessible to other components in the multilayer vehicle learning infrastructure. 
While the above embodiment discloses the top-view computing application controlling other layers, some embodiments may restrict the top-view computing application from directly controlling applications executing on other layers.”) [Cyclical or re-training reads on “to iteratively perform …”. Updating and adjusting the vehicle response based on the control-system adjustment value reads on “to jointly control the machine and update the control policy”.]
wherein, for performing the joint control and update, the processor is configured to control the machine using the control policy to collect data including a sequence of control inputs generated using the control policy and a sequence of states of the machine corresponding to the sequence of control inputs; (Milton: [0094] “The attention weights and the input sequence can then be used in a first feed forward layer of the encoder portion of the transformer. For example, the attention weights and an event log can then be used in a first feed forward layer to weight each event of the event log by their respective attention weight. The output of the feed forward layer can then be used by a decoder portion of the transformer, wherein the decoder portion may include one or more other multi-head self-attention layers having different weights and other values from the first multi-head self-attention layer. The decoder portion may also include one or more feed forward layers having different weights and other values from the first feed forward layer. The output of the decoder portion of transformer can be used to categorize an input or generate inferences. For example, if the input sequence is a time series based on the number of swerves during a drive, an agent executing a neural network having an attention mechanism may determine whether swerves are safe or risky in an interval based on current and past vehicle operations. 
In response to a determination that the number of risky swerves exceed a threshold number, and as further described below, the agent may induce the local computing layer to determine an adjustment value to change at least one of a LIDAR warning range, a steering wheel responsiveness, or an anti-lock braking system responsiveness.”) [The swerve time series based on the input sequence reads on “to collect data including a sequence of control inputs …”, and the sequence of control inputs based on the adjustment to responsiveness of the system reads on “generated using the control policy”. Determining the risks of the swerves and the number of risky swerves reads on “a sequence of states of the machine …”.]
determine a reward for a quality of the control policy on the state of the machine using a reward function of the sequence of control inputs and the sequence of states of the machine augmented with an adaptation term determined as the minimum amount of effort needed for the machine having the state to remain within the CIS; and (Milton: [0120] “Some embodiments may include elements of reinforcement learning having one or more policy gradient methods. An agent executing on a top-view computing layer may proceed through a Markov decision process, wherein the sensor data from one or more vehicles, other data such as weather, and computed results may be discretized and treated as a finite set of states. In some embodiments, the control-system adjustment values or parameters used during machine-learning operations may be treated by the agent as the available actions in each state. A history of the control-system adjustments and their resulting effects may be evaluated using a reward system when implementing the reinforcement learning method. For example, positive feedback from a vehicle operator resulting from a query or a decrease in accident rates may be used by a reward function to provide reinforcement for an adjustment value change. Similarly, negative feedback from a vehicle operator or an increase in accident rates may be used by the reward function to discourage use of the adjustment value change. 
In some embodiments, the reinforcements may be provided through simulations of vehicle behavior based on one or more road network graphs or through a third-party system instead of or in addition to physical changes.”; [0121] “In some embodiments, an agent executing a reinforcement learning operation may modify an ensemble learning method to balance weights used in or applied to different machine-learning operations, wherein the reinforcement function may reward vehicle performance improvements, vehicle safety parameters, or a computation speed to different combinations of machine-learning approaches. For example, the agent may use a reinforcement learning approach to determine which of a combination of neural network features are to be implemented at the vehicle computing layer and the local computing layer based on a reward function. In some embodiments, the reward function used may be based on an accuracy parameter and a response time parameter. Use of this reward function during a reinforcement learning operation may result in a selection of a CNN computation on the vehicle layer and a selection of a convolutional LSTM having an attention mechanism on the local computing layer. By performing this meta-analysis and similar meta-analysis of machine-learning systems in a multilayer vehicle learning infrastructure, the agent executing on the central computing layer may be used to further refine and optimize machine-learning operations.”; [0122] “Some embodiments may supplement the reinforcement learning operation with a policy gradient method. An agent executing the reinforcement learning operation may implement a policy to determine a trajectory for a control-system adjustment values and other changes to vehicle behavior, wherein the proposed changes follow a policy gradient between any timestep. In some embodiments, the policy gradient of the policy may be changed based on the REINFORCE method. 
For example, while executing a reinforcement learning operation, an agent may implement a policy gradient determination operation, wherein one or more observables or quantifiable values such as response time, number of swerves, number of collisions, vehicle sensor data values, or the like may be used to determine a general path likelihood ratio as a gradient estimate.”) [Upon operation of the vehicle, using the reward function of positive or negative feedback from the vehicle operator or the reward function of the accuracy parameter and the response time parameter reads on “determine a reward for a quality … using a reward function”. The adjustment value reads on “an adaptation term”, and encouraging or discouraging of the adjustment value change of the vehicle operation based on the reward function reads on “… the control policy on the state of the machine using a reward function”. The adjustment value to keep positive feedback or certain accuracy parameter or the response time parameter reads on “the state to remain within the CIS”.]
update the control policy that improves a cost function of operation of the machine according to the determined reward. (Milton: [0037] “Some embodiments may use one or more agents executing on the vehicle computing layer to perform a machine-learning operation. In some embodiments, the machine-learning operation may comprise a training operation to use sensor data as an input with the goal of minimizing an objective function. For example, some embodiments may apply a machine-learning operation trained to predict the response profile of a braking event in response to the detection of an object or an arrival at a particular geolocation using a gradient descent method (e.g. batch gradient descent, stochastic gradient descent, etc.) to minimize an objective function. As another example, some embodiments may apply a machine-learning operation trained to predict the riskiest or most operationally vulnerable portion of a vehicle based on sensor data comprising internal engine temperature and pressure sensors, motion-based sensors, and control-system sensors.”; [0120] “In some embodiments, the control-system adjustment values or parameters used during machine-learning operations may be treated by the agent as the available actions in each state. A history of the control-system adjustments and their resulting effects may be evaluated using a reward system when implementing the reinforcement learning method. For example, positive feedback from a vehicle operator resulting from a query or a decrease in accident rates may be used by a reward function to provide reinforcement for an adjustment value change. Similarly, negative feedback from a vehicle operator or an increase in accident rates may be used by the reward function to discourage use of the adjustment value change. 
In some embodiments, the reinforcements may be provided through simulations of vehicle behavior based on one or more road network graphs or through a third-party system instead of or in addition to physical changes.”; [0122] “Some embodiments may supplement the reinforcement learning operation with a policy gradient method. An agent executing the reinforcement learning operation may implement a policy to determine a trajectory for a control-system adjustment values and other changes to vehicle behavior, wherein the proposed changes follow a policy gradient between any timestep. In some embodiments, the policy gradient of the policy may be changed based on the REINFORCE method. For example, while executing a reinforcement learning operation, an agent may implement a policy gradient determination operation, wherein one or more observables or quantifiable values such as response time, number of swerves, number of collisions, vehicle sensor data values, or the like may be used to determine a general path likelihood ratio as a gradient estimate.”) [The accident rate reads on “a cost function”, and the decrease in accident rates reads on “improves a cost function”. Implementing the policy for the control-system adjustment values and other changes to vehicle based on reinforcement learning reads on “update the control policy”.]
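
For illustration only, the likelihood-ratio gradient estimate associated with the REINFORCE method named in Milton [0122] can be sketched as follows. The two-action Bernoulli policy, the reward values, the learning rate, and all function names are hypothetical assumptions and are not drawn from Milton.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad_log_prob(theta, action):
    """d/dtheta of log pi(action) when P(action = 1) = sigmoid(theta)."""
    p = sigmoid(theta)
    return (1.0 - p) if action == 1 else -p


def reinforce_gradient(theta, reward_fn, episodes=1000, seed=0):
    """Average of grad-log-probability times reward over sampled actions."""
    rng = random.Random(seed)
    p = sigmoid(theta)
    total = 0.0
    for _ in range(episodes):
        action = 1 if rng.random() < p else 0
        total += grad_log_prob(theta, action) * reward_fn(action)
    return total / episodes


def train(theta=0.0, reward_fn=lambda a: 1.0 if a == 1 else 0.0,
          steps=200, lr=0.5):
    """Gradient ascent on the policy parameter using the estimate above."""
    for step in range(steps):
        theta += lr * reinforce_gradient(theta, reward_fn,
                                         episodes=200, seed=step)
    return theta
```

With the hypothetical reward above, which favors action 1, the parameter moves toward a policy that selects action 1 more often.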

Milton does not explicitly teach: a memory configured to store an optimization problem for computing a safety margin of a state and action pair satisfying the state constraints and a control policy mapping the state of the machine within a control invariant set (CIS) to a control input satisfying the control input constraints, wherein a control of the machine having the state within the CIS according to the control policy maintains the state of the machine within the CIS.
Di Cairano teaches:
a memory configured to store an optimization problem for computing a safety margin of a state and action pair satisfying the state constraints and a control policy mapping the state of the machine within a control invariant set (CIS) to a control input satisfying the control input constraints, wherein a control of the machine having the state within the CIS according to the control policy maintains the state of the machine within the CIS. (Di Cairano: Abstract “A method selects from a memory a first model of motion of vehicle, a second model of the motion of the vehicle, a first constraint on the first model for moving along a desired trajectory of the vehicle, and a control invariant set joining states of the first model with states of the second model. For each combination of the states within the control invariant subset there is at least one control action to the second model that maintains the state of the second model within the control invariant set for every modification of the state of the first model satisfying the first constraint. A portion of the desired trajectory satisfying the first constraint is determined using the first model while a sequence of commands for moving the vehicle along the portion of the desired trajectory is determined using the second model. The sequence of commands is determined to maintain the sequence of the states of the second model and a sequence of the states of the first model determined by the portion of the desired trajectory within the control invariant subset. The vehicle is controlled using at least one command from the sequence of commands.”; [0070] “For example, the optimization of the step 830 can be formulated as an optimization problem optimizing a performance of the vehicle subject to constraints including a combination of the future state of the second model of the motion of the vehicle and the first element from the sequence of the states of the first model belonging to the control invariant set. 
Examples of the performance of the vehicle include reducing lateral acceleration of the vehicle, reducing yaw rate of the vehicle, reducing lateral displacement from the desired trajectory, and reducing steering wheel actuation power. By solving the optimization problem, a command or a sequence of commands for gainfully moving the vehicle can be produced.”)
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Milton and Di Cairano before them, to modify the system for computing adjustments for the vehicle control system to incorporate determining the sequence of commands to maintain the sequence of the states determined by the portion of the desired control outcome of the vehicle within the control invariant subset.
One of ordinary skill in the art before the effective filing date of the claimed invention would have been motivated to make this modification because it would better satisfy the desired control outcome while maintaining the state of the vehicle within the control invariant set (Di Cairano: [0010] “To that end, some embodiments select a first constraint on a desired trajectory of the vehicle, and select a control invariant set joining states of the first model with states of the second model. The first constraint and the control invariant set are determined such that for each combination of the states within the control invariant subset there is at least one control action to the second model that maintains the state of the second model within the control invariant set for every modification of the state of the first model satisfying the first constraint. The first constraint and the control invariant set establish a mutual dependency that if SC generates a desired trajectory satisfying the first constraint, i.e., the property P, the VC can control the vehicle maintaining the state of the vehicle within the control invariant set, i.e., satisfying the measure of performance M.”).
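
For illustration only, the defining property of a control invariant set quoted above (for each state in the set there is at least one admissible control action that keeps the next state in the set) can be checked on a sampled grid as sketched below. The scalar dynamics x' = a*x + u, the interval set, and the parameter values are hypothetical assumptions, not Di Cairano's vehicle models.

```python
def has_invariance_preserving_input(x, a, x_max, u_max):
    """True if some |u| <= u_max gives a*x + u inside [-x_max, x_max]."""
    lo = max(-u_max, -x_max - a * x)  # smallest admissible, set-preserving input
    hi = min(u_max, x_max - a * x)    # largest admissible, set-preserving input
    return lo <= hi


def is_control_invariant(a, x_max, u_max, samples=201):
    """Sampled check of the interval [-x_max, x_max]; not a formal proof."""
    for i in range(samples):
        x = -x_max + 2.0 * x_max * i / (samples - 1)
        if not has_invariance_preserving_input(x, a, x_max, u_max):
            return False
    return True
```

Under these assumptions, the interval [-1, 1] passes the check for a = 1.2 with |u| <= 0.5, and fails for a = 2.0 with the same input bound, since the input authority can no longer cancel the drift at the boundary.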

Regarding claim 2, Milton and Di Cairano teach all the features of claim 1.
Milton further teaches:
wherein the RL algorithm is a deep-deterministic policy gradient (DDPG) algorithm. (Milton: [0119] “In addition, the top-view computing application may modify learning behavior in other machine-learning operations by applying policy gradient reinforcement learning methods, as further described below.”; [0120] “Some embodiments may include elements of reinforcement learning having one or more policy gradient methods. An agent executing on a top-view computing layer may proceed through a Markov decision process, wherein the sensor data from one or more vehicles, other data such as weather, and computed results may be discretized and treated as a finite set of states.”)

Regarding claim 3, Milton and Di Cairano teach all the features of claims 1-2.
Milton further teaches:
wherein the DDPG algorithm learns both a critic network to estimate long-term values for a given policy and an actor network to sample optimal actions according to the estimated long-term values. (Milton: [0122] “Some embodiments may supplement the reinforcement learning operation with a policy gradient method. An agent executing the reinforcement learning operation may implement a policy to determine a trajectory for a control-system adjustment values and other changes to vehicle behavior, wherein the proposed changes follow a policy gradient between any timestep. In some embodiments, the policy gradient of the policy may be changed based on the REINFORCE method. For example, while executing a reinforcement learning operation, an agent may implement a policy gradient determination operation, wherein one or more observables or quantifiable values such as response time, number of swerves, number of collisions, vehicle sensor data values, or the like may be used to determine a general path likelihood ratio as a gradient estimate.”) [Estimating the trajectory or the gradient reads on “to estimate long-term values”.]
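
For illustration only, the division of labor between a critic and a deterministic actor recited in this claim can be sketched with a single update step for each. The quadratic critic, the linear actor, and the step sizes are hypothetical assumptions; the replay buffer, target networks, and deep networks of a full DDPG implementation are omitted, and nothing here is taken from Milton.

```python
def q_value(w, s, a):
    """Hypothetical quadratic critic Q(s, a) = w0*s^2 + w1*s*a + w2*a^2."""
    return w[0] * s * s + w[1] * s * a + w[2] * a * a


def actor(theta, s):
    """Hypothetical deterministic linear actor mu(s) = theta * s."""
    return theta * s


def critic_step(w, s, a, r, s_next, theta, gamma=0.9, lr=0.5):
    """One temporal-difference update of the critic weights."""
    target = r + gamma * q_value(w, s_next, actor(theta, s_next))
    td_error = target - q_value(w, s, a)
    features = (s * s, s * a, a * a)
    return [wi + lr * td_error * fi for wi, fi in zip(w, features)]


def actor_step(theta, w, s, lr=0.1):
    """One deterministic policy gradient update: dQ/da times dmu/dtheta."""
    a = actor(theta, s)
    dq_da = w[1] * s + 2.0 * w[2] * a
    return theta + lr * dq_da * s
```

As a usage note, with critic weights (-4, 4, -1) the critic equals -(a - 2s)^2, so the actor step leaves theta = 2 unchanged and moves any other theta toward it.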

Regarding claim 5, Milton and Di Cairano teach all the features of claim 1.
Di Cairano further teaches:
wherein the memory includes a supervisor algorithm that obtains the state of the machine and computes a desired safety margin. (Di Cairano: [0005] “Some embodiments are based on recognition that performance of the motion of the vehicle following a trajectory depends on the objective of the motion. For example, performance of the motion of the vehicle following the trajectory needs to satisfy a measure of performance (M) that depends on the current objective of the motion. For example, in some situations the vehicle does not have to follow the desired trajectory exactly, but the maximum difference between the actual trajectory of the vehicle and the desired trajectory needs to be less than a threshold. Some embodiments are based on recognition that such a measure of performance can result from the actual practicalities of controlling the vehicle, but also can be accounted while generating the desired trajectory for the objective of the motion. For example, if the objective of the motion is lane keeping, all possible desired trajectories need to have a safety margin from the border of the lane equal to or greater than the threshold. Similarly, if the objective of the motion is the collision avoidance, all possible desired trajectories need to keep a safety distance margin from an obstacle, equal or greater than the threshold.”)
The motivation to combine Milton and Di Cairano, which teach the features of the present claim, as submitted in claim 1, is incorporated herein.

Regarding claim 6, Milton and Di Cairano teach all the features of claims 1 and 5.
Di Cairano further teaches:
wherein the supervisor generates a safe command when the RL algorithm generates a command that is deemed unsafe. (Di Cairano: [0054] “Because the motions that a vehicle can execute are limited by the mechanical and safety considerations, the trajectories generated by the SC can be limited also. In particular the limitations of the SC trajectories can be defined by ensuring that the first model, e.g., (5), (6) satisfies constraints”) [Defining the limitations of the SC trajectories by ensuring the first model satisfies the constraints reads on “a safe command when the RL algorithm generates a command that is deemed unsafe”.]
The motivation to combine Milton and Di Cairano, which teach the features of the present claim, as submitted in claim 1, is incorporated herein.
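
For illustration only, the supervisor behavior recited in claims 5 and 6 can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Milton, Di Cairano, or the application: the one-dimensional dynamics x_next = x + u, the safe interval [-1, 1], and the names Supervisor, safety_margin, is_safe, and safe_command are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Supervisor:
    """Hypothetical supervisor for a 1-D machine with assumed dynamics x_next = x + u."""

    x_min: float = -1.0  # assumed lower bound of the safe set
    x_max: float = 1.0   # assumed upper bound of the safe set

    def safety_margin(self, x: float) -> float:
        # Distance from the current state to the nearest safe-set boundary.
        return min(x - self.x_min, self.x_max - x)

    def is_safe(self, x: float, u: float) -> bool:
        # A command is deemed safe if the predicted next state stays in the safe set.
        return self.x_min <= x + u <= self.x_max

    def safe_command(self, x: float, u_rl: float) -> float:
        # Pass the RL command through when it is safe; otherwise substitute the
        # closest command that keeps the next state inside the safe set.
        if self.is_safe(x, u_rl):
            return u_rl
        return min(max(u_rl, self.x_min - x), self.x_max - x)


supervisor = Supervisor()
margin = supervisor.safety_margin(0.5)                      # distance to the upper bound
passed_u = supervisor.safe_command(x=0.5, u_rl=0.25)        # safe: passed through unchanged
overridden_u = supervisor.safe_command(x=0.5, u_rl=2.0)     # unsafe: replaced by a safe command
```

In this sketch the RL command is only replaced when it would drive the state out of the assumed safe set, which mirrors the claim language "generates a safe command when the RL algorithm generates a command that is deemed unsafe."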

Regarding claim 10, Milton and Di Cairano teach all the features of claim 1.
Milton further teaches:
wherein the machine is a suspension system of a vehicle. (Milton: [0030] “The vehicle 102 may include an onboard computing device 106 having one or more processors to perform computations on the sensor data. In some cases, the computing device is distributed on the vehicle, e.g., with a collection of processors operating on data from various sub-systems, like a braking sub-system, a transmission sub-system, a vehicle entertainment sub-system, a vehicle navigation sub-system, a suspension sub-system, and the like.”)

Regarding claim 11:
The claim recites similar limitations as corresponding claim 1 and is rejected using the same teachings and rationale.

Regarding claim 12, Milton and Di Cairano teach all the features of claim 11.
The claim recites similar limitations as corresponding claim 2 and is rejected using the same teachings and rationale.

Regarding claim 13, Milton and Di Cairano teach all the features of claims 11-12.
The claim recites similar limitations as corresponding claim 3 and is rejected using the same teachings and rationale.

Regarding claim 15, Milton and Di Cairano teach all the features of claim 11.
The claim recites similar limitations as corresponding claim 5 and is rejected using the same teachings and rationale.

Regarding claim 16, Milton and Di Cairano teach all the features of claims 11 and 15.
The claim recites similar limitations as corresponding claim 6 and is rejected using the same teachings and rationale.

Regarding claim 20, Milton and Di Cairano teach all the features of claim 11.
The claim recites similar limitations as corresponding claim 10 and is rejected using the same teachings and rationale.


Claims 4 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Milton, in view of Di Cairano, further in view of Mnih et al. (US 2019/0258938 A1), hereinafter ‘Mnih’.

Regarding claim 4, Milton and Di Cairano teach all the features of claim 1.
Milton and Di Cairano do not explicitly teach: wherein the reward function is modified to an updated reward by subtracting the cost function from the reward function, wherein the updated reward is expressed by [equation image: updated reward = r(t) − c(t)], where [symbol image] is the updated reward, r(t) is the reward function, c(t) is the cost function, and t is a current time of the system.
Mnih teaches:
wherein the reward function is modified to an updated reward by subtracting the cost function from the reward function, wherein the updated reward is expressed by [equation image: updated reward = r(t) − c(t)], where [symbol image] is the updated reward, r(t) is the reward function, c(t) is the cost function, and t is a current time of the system. (Mnih: [0017] “In some implementations, training the reward prediction neural network comprises: receiving an actual reward received with the next or a subsequent observation image; and training the immediate reward neural network to decrease a loss between the actual reward and the estimated reward, more particularly the value of a loss function dependent on a difference between the actual reward and the estimated reward. As described later training the reward prediction neural network may comprise sampling from sequences of observations stored in an experience replay memory, in particular so as to over-represent rewarding sequences/events, which can be advantageous when rewards in the environment are sparse.”) [The loss function reads on “the cost function”, the actual reward reads on “the reward function”, and the estimated reward reads on “the updated reward”.] [cost function = actual reward – estimated reward, and therefore, estimated reward = actual reward – cost function]
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Milton, Di Cairano and Mnih before them, to modify the system for computing adjustments for the vehicle control system using the reinforcement learning to incorporate updating the reward using the reward neural network.
One of ordinary skill in the art before the effective filing date of the claimed invention would have been motivated to make this modification because it would decrease the loss between the actual reward and the estimated reward (Mnih: [0017] “In some implementations, training the reward prediction neural network comprises: receiving an actual reward received with the next or a subsequent observation image; and training the immediate reward neural network to decrease a loss between the actual reward and the estimated reward, more particularly the value of a loss function dependent on a difference between the actual reward and the estimated reward. As described later training the reward prediction neural network may comprise sampling from sequences of observations stored in an experience replay memory, in particular so as to over-represent rewarding sequences/events, which can be advantageous when rewards in the environment are sparse.”).
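
For illustration only, the arithmetic behind the mapping in the bracketed note for claim 4 can be sketched as follows. This is a minimal sketch that assumes, as the note does, that the loss equals the plain difference between the actual reward and the estimated reward; the function name updated_reward and the numeric values are hypothetical and do not come from Mnih or the application.

```python
def updated_reward(r_t: float, c_t: float) -> float:
    """Updated reward obtained by subtracting the cost c(t) from the reward r(t)."""
    return r_t - c_t


# Hypothetical values chosen only to make the arithmetic easy to follow.
actual_reward = 1.0       # mapped to the reward function r(t)
estimated_reward = 0.25   # mapped to the updated reward

# Assumed mapping: cost function = actual reward - estimated reward.
loss = actual_reward - estimated_reward

# Rearranged: estimated reward = actual reward - cost function.
recovered = updated_reward(actual_reward, loss)
```

Under that assumption, subtracting the loss from the actual reward returns the estimated reward, which is the rearrangement the note relies on.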

Regarding claim 14, Milton and Di Cairano teach all the features of claim 11.
The claim recites similar limitations as corresponding claim 4 and is rejected using the same teachings and rationale.


Claims 9 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Milton, in view of Di Cairano, further in view of Ratliff et al. (Nathan D. Ratliff, J. Andrew Bagnell, Marin A. Zinkevich, 2006, Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA., pp. 729-736), hereinafter ‘Ratliff’.

Regarding claim 9, Milton and Di Cairano teach all the features of claim 1.
Milton and Di Cairano do not explicitly teach: wherein a maximum penalty G for performing the RL algorithm is about twice a value of cb: G≈2 cb, wherein cb are bounds on the cost function.
Ratliff teaches:
wherein a maximum penalty G for performing the RL algorithm is about twice a value of cb: G≈2 cb, wherein cb are bounds on the cost function. (Ratliff: page 733 section 3.4 “Finally, we may incorporate domain knowledge in the form of constraints on w: e.g., we may require that a certain area state have at least double the cost of another state.”)
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Milton, Di Cairano and Ratliff before them, to modify the system for computing adjustments for the vehicle control system using the reinforcement learning to incorporate having different costs for different states.
One of ordinary skill in the art before the effective filing date of the claimed invention would have been motivated to make this modification because it would improve the transfer of expert knowledge in machine learning, in the form of constraints on a state, by varying the cost for the state (Ratliff: pages 733-734 section 3.4 “All of these are powerful methods to transfer expert knowledge to the learner in addition to the use of training examples.”).

Regarding claim 19, Milton and Di Cairano teach all the features of claim 11.
The claim recites similar limitations as corresponding claim 9 and is rejected using the same teachings and rationale.


Allowable Subject Matter
Claims 7-8 and 17-18 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
As allowable subject matter has been indicated, applicant's reply must either comply with all formal requirements or specifically traverse each requirement not complied with.  See 37 CFR 1.111(b) and MPEP § 707.07(a).


Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHAEL W CHOI whose telephone number is (571)270-5069. The examiner can normally be reached Monday-Friday 8am-5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kenneth Lo, can be reached at 571-272-9774. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/MICHAEL W CHOI/            Primary Examiner, Art Unit 2116