DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The information disclosure statements (IDS) submitted on 07/20/2018 and 08/20/2021 are in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statements are being considered by the examiner.

Drawings
The drawings are objected to as failing to comply with 37 CFR 1.84(p)(4) because “Circle with horizontal hatching: target” is labeled 403 in Fig. 4, but is referred to as 402 in the specification. Similarly, “Circle with diagonal hatching: action range” is labeled 402 in Fig. 4, but is referred to as 403 in the specification. Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 7 and 14 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Claims 7 and 14 recite “wherein the first set and the second set of state-action tuples are used as action ranges by the supervised learning”. It is unclear how the first set and the second set of state-action tuples are to be used as action ranges. Paragraph [0016] of the specification explains that “the present invention uses good and bad examples to learn action ranges by supervised learning. These action ranges are then used to restrict exploration during reinforcement learning.” It is therefore unclear how the first and second sets of state-action tuples (established to represent good and bad examples) are to be used as action ranges by supervised learning when the specification suggests the action ranges are the result of supervised learning. For the purposes of prior art examination, Examiner is interpreting these claims as directed to the use of finite data sets to inform future actions in “continuous action spaces”.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-3, 7-10, and 14-17 are rejected under 35 U.S.C. 103 as being unpatentable over Dalal (Dalal et al., “Safe Exploration in Continuous Action Spaces”, arXiv, January 2018, 9 pages.), cited in the IDS submitted on 07/30/2018, in view of Hwang (Hwang et al., “Inverse Reinforcement Learning based on Critical State”, 16th World Congress of the International Fuzzy Systems Association (IFSA) 9th Conference of the European Society for Fuzzy Logic and Technology (EUSFLAT), June-July 2015, pages 771-775.).

With respect to claim 1, Dalal teaches A computer-implemented method for reinforcement learning, (Section 6, “We experiment with Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015) whose policy network directly outputs actions and not their probabilities.”; Section 6, “In terms of implementation, it consists of a few primitive arithmetic operations: vector products followed by a 'max' operation. The benefits of its simplicity are three-fold: i) it has a trivial, almost effortless software implementation; ii) its computational cost is negligible; and iii) it is differentiable (almost everywhere, as is ReLu).” DDPG – reinforcement learning algorithm, Software implementation – computer implemented)
comprising: obtaining, by a processor device, a first set and a second set of state-action tuples, each of the state-action tuples in the first set representing a respective good demonstration, and each of the state-action tuples in the second set representing a respective bad demonstration; (Dalal teaches obtaining sets of state-action tuples representing demonstrations. The limitation “a first set and a second set of state-action tuples, each of the state-action tuples in the first set representing a respective good demonstration, and each of the state-action tuples in the second set representing a respective bad demonstration” is addressed below with respect to Hwang. Section 5, “While it is attractive to simply approximate them with NNs that take (s, a) as inputs, we choose a more elegant approach that comes with significant advantages listed in Subsection 6.1. Namely, we perform the following linearization:
 
[Equation image: media_image1.png]
 
Where wi are weights of a NN, g(s;wi), that takes s as input and outputs a vector of the same dimension as a. This model is a first-order approximation to ci(s,a) with respect to a; i.e., an explicit representation of sensitivity of changes in the safety signal to the action using features of the state.”; Section 5, Figure 1. “Each safety signal ci(s,a) is approximated with a linear model with respect to a, whose coefficients are features of s, extracted with a NN.”; Section 5, 
[Image: media_image2.png]

)
training, by the processor device using supervised learning with the first set and the second set, a neural network which takes as input a state to provide an output (Section 5, “While it is attractive to simply approximate them with NNs that take (s, a) as inputs, we choose a more elegant approach that comes with significant advantages listed in Subsection 6.1. Namely, we perform the following linearization:
 
[Equation image: media_image1.png]
 
Where wi are weights of a NN, g(s;wi), that takes s as input and outputs a vector of the same dimension as a. This model is a first-order approximation to ci(s,a) with respect to a; i.e., an explicit representation of sensitivity of changes in the safety signal to the action using features of the state.”; Section 5, Figure 1. “Each safety signal ci(s,a) is approximated with a linear model with respect to a, whose coefficients are features of s, extracted with a NN.”)
the output being parameterized to obtain each of a plurality of real-valued constraint functions used for evaluation of each of a plurality of action constraints; (Section 5, “While it is attractive to simply approximate them with NNs that take (s, a) as inputs, we choose a more elegant approach that comes with significant advantages listed in Subsection 6.1. Namely, we perform the following linearization:
 
[Equation image: media_image1.png]
 
Where wi are weights of a NN, g(s;wi), that takes s as input and outputs a vector of the same dimension as a. This model is a first-order approximation to ci(s,a) with respect to a; i.e., an explicit representation of sensitivity of changes in the safety signal to the action using features of the state.”; Section 5, Figure 1. “Each safety signal ci(s,a) is approximated with a linear model with respect to a, whose coefficients are features of s, extracted with a NN.”)
and training, by the processor device, a policy using reinforcement learning by restricting actions predicted by the policy according to each of the plurality of action constraints with each of the plurality of real-valued constraint functions. (Section 1, “We then utilize this  model in a safety layer that is composed directly on top the agent’s policy to correct the action if needed; i.e., after every policy query, it solves an optimization problem for finding the minimal change to the action such that the safety constraints are met. Thanks to the linearity with respect to actions, the solution can be derived analytically in closed-form and amounts to basic arithmetic operations. Thus, our safety layer is both differentiable and has a trivial three-line software implementation. Note that relating to our safety mechanism as a ‘safety layer’ is purely a sematic choice; it merely is a simple calculation that is not limited to the nowadays popular deep policy networks and can be applied to any continuous-control algorithm (not necessarily RL-based).”)
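
For illustration only, the following is a minimal sketch of the kind of action correction the cited Section 1 passage describes (finding the minimal change to the action such that linear safety constraints are met). It assumes each constraint has the form c_bar[i] + g[i]·a <= limits[i] and that at most one constraint is active at a time. The function name, argument names, and this exact constraint form are hypothetical assumptions introduced for the example; they are not quoted from Dalal or from the claims.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def safety_layer(
    action: Sequence[float],
    c_bar: Sequence[float],
    g: Sequence[Sequence[float]],
    limits: Sequence[float],
) -> List[float]:
    """Return the action, minimally changed so the most violated constraint holds.

    Assumed (hypothetical) constraint form: c_bar[i] + g[i] . a <= limits[i].
    Only the constraint with the largest multiplier is corrected, which is
    exact when at most one constraint is active.
    """
    best_lambda = 0.0
    best_g = None
    for cb, gi, lim in zip(c_bar, g, limits):
        denom = dot(gi, gi)
        if denom == 0.0:
            continue  # this constraint does not depend on the action
        # Multiplier of the Euclidean projection onto the half-space,
        # clipped at zero (the "max" operation) when already satisfied.
        lam = max(0.0, (dot(gi, action) + cb - lim) / denom)
        if lam > best_lambda:
            best_lambda, best_g = lam, gi
    if best_g is None:
        return list(action)  # every constraint already satisfied
    return [a - best_lambda * gk for a, gk in zip(action, best_g)]


# Usage: the proposed action violates "a[0] <= 1" and is pulled back to the boundary.
corrected = safety_layer([2.0, 0.0], [0.0], [[1.0, 0.0]], [1.0])
```

The sketch uses only vector products and a max operation, which is consistent with the description quoted from Section 6, but it is not a reproduction of the reference's implementation.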

Dalal does not explicitly teach that each of the state-action tuples in the first set represents a respective good demonstration and that each of the state-action tuples in the second set represents a respective bad demonstration. Hwang, however, does teach comprising: obtaining, by a processor device, a first set and a second set of state-action tuples, (Section 3.4, "Based on the above concept, we propose an algorithm, Inverse Reinforcement Learning based on Critical State (IRLCS), which is able to do self-organization and search an appropriate reward function through the good and bad demonstrations. In the beginning of the algorithm, we have to provide two example traces, the good demonstration DG and the bad demonstration DB. In the same environment, it may not only one goal. For example, in a car driving problem, sometimes we want to avoid collision, and sometimes want to drive as fast as possible. Therefore, the good demonstration is relevant to the objective of task. On the contrary, the bad demonstration we can do a random operation. Moreover, we may do a bad operation to cause the mission to fail intentionally."; Fig 4., "DG : set of state-action pairs in a good demonstration DB : set of state-action pairs in a bad demonstration")
each of the state-action tuples in the first set representing a respective good demonstration, (Section 3.4, "Based on the above concept, we propose an algorithm, Inverse Reinforcement Learning based on Critical State (IRLCS), which is able to do self-organization and search an appropriate reward function through the good and bad demonstrations. In the beginning of the algorithm, we have to provide two example traces, the good demonstration DG and the bad demonstration DB. In the same environment, it may not only one goal. For example, in a car driving problem, sometimes we want to avoid collision, and sometimes want to drive as fast as possible. Therefore, the good demonstration is relevant to the objective of task. On the contrary, the bad demonstration we can do a random operation. Moreover, we may do a bad operation to cause the mission to fail intentionally."; Fig 4., "DG : set of state-action pairs in a good demonstration DB : set of state-action pairs in a bad demonstration")
and each of the state-action tuples in the second set representing a respective bad demonstration; (Section 3.4, "Based on the above concept, we propose an algorithm, Inverse Reinforcement Learning based on Critical State (IRLCS), which is able to do self-organization and search an appropriate reward function through the good and bad demonstrations. In the beginning of the algorithm, we have to provide two example traces, the good demonstration DG and the bad demonstration DB. In the same environment, it may not only one goal. For example, in a car driving problem, sometimes we want to avoid collision, and sometimes want to drive as fast as possible. Therefore, the good demonstration is relevant to the objective of task. On the contrary, the bad demonstration we can do a random operation. Moreover, we may do a bad operation to cause the mission to fail intentionally."; Fig 4., "DG : set of state-action pairs in a good demonstration DB : set of state-action pairs in a bad demonstration")
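It would have been obvious to an artisan of ordinary skill before the effective filing date of the claimed invention to combine the computer-implemented method for reinforcement learning of Dalal with obtaining, by a processor device, a first set and a second set of state-action tuples, each of the state-action tuples in the first set representing a respective good demonstration, and each of the state-action tuples in the second set representing a respective bad demonstration, as taught by Hwang. (Hwang, Introduction)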

With respect to claim 2, modified Dalal teaches the computer-implemented method for reinforcement learning of claim 1, and Hwang also teaches wherein the neural network is trained such that the first set satisfies each of the plurality of action constraints (Section 3.4, "In the beginning of the algorithm, we have to provide two example traces, the good demonstration DG and the bad demonstration DB. In the same environment, it may not only one goal. For example, in a car driving problem, sometimes we want to avoid collision, and sometimes want to drive as fast as possible. Therefore, the good demonstration is relevant to the objective of task. On the contrary, the bad demonstration we can do a random operation. Moreover, we may do a bad operation to cause the mission to fail intentionally." A good demonstration is directed toward the objective of the task and satisfies action constraints by avoiding collision.)
and the second set violates at least one of the plurality of action constraints, evaluated with each of the plurality of real-valued constraint functions. (Section 3.4, "In the beginning of the algorithm, we have to provide two example traces, the good demonstration DG and the bad demonstration DB. In the same environment, it may not only one goal. For example, in a car driving problem, sometimes we want to avoid collision, and sometimes want to drive as fast as possible. Therefore, the good demonstration is relevant to the objective of task. On the contrary, the bad demonstration we can do a random operation. Moreover, we may do a bad operation to cause the mission to fail intentionally." Failing intentionally is directed to violating constraints.)
It would have been obvious to an artisan of ordinary skill before the effective filing date of the claimed invention to combine the computer-implemented method for reinforcement learning of modified Dalal with training the neural network such that the first set satisfies each of the plurality of action constraints and the second set violates at least one of the plurality of action constraints, evaluated with each of the plurality of real-valued constraint functions, in order to allow the agent to learn good behaviors from incorrect demonstrations. (Hwang, Introduction)
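
For illustration only, the condition recited in claim 2 (the first set satisfies every action constraint, and the second set violates at least one) can be expressed as a simple check. The sketch below assumes, hypothetically, that a constraint is satisfied when its real-valued function returns a value less than or equal to zero; that sign convention and all names are assumptions made for the example and are not taken from the claims, Dalal, or Hwang.

```python
from typing import Callable, Iterable, Sequence, Tuple

State = Sequence[float]
Action = Sequence[float]
# Hypothetical convention: a constraint is satisfied when its value is <= 0.
Constraint = Callable[[State, Action], float]


def satisfies_all(constraints: Sequence[Constraint], s: State, a: Action) -> bool:
    """True when the state-action tuple satisfies every constraint."""
    return all(f(s, a) <= 0.0 for f in constraints)


def consistent_with_demonstrations(
    constraints: Sequence[Constraint],
    good: Iterable[Tuple[State, Action]],
    bad: Iterable[Tuple[State, Action]],
) -> bool:
    """True when every good tuple satisfies all constraints and every bad
    tuple violates at least one constraint."""
    good_ok = all(satisfies_all(constraints, s, a) for s, a in good)
    bad_ok = all(not satisfies_all(constraints, s, a) for s, a in bad)
    return good_ok and bad_ok


# Usage: two inequality constraints that restrict a 1-D action to [-1, 1].
constraints = [lambda s, a: a[0] - 1.0, lambda s, a: -a[0] - 1.0]
good_set = [((0.0,), (0.5,))]
bad_set = [((0.0,), (2.0,))]
result = consistent_with_demonstrations(constraints, good_set, bad_set)
```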

With respect to claim 3, modified Dalal teaches the computer-implemented method for reinforcement learning of claim 1, and Dalal also teaches wherein training the policy comprises calculating, by using each of the plurality of real-valued constraint functions, an action closest to the action predicted by the policy among actions which satisfy each of the plurality of action constraints (Section 1, “We then utilize this model in a safety layer that is composed directly on top the agent's policy to correct the action if needed; i.e., after every policy query, it solves an optimization problem for finding the minimal change to the action such that the safety constraints are met.”)
and executing the calculated action on an environment to obtain a reward for the reinforcement learning. (Section 7.1, “The action a = vB; i.e., it is to set the velocity of the ball. Actions are taken every 4 time-steps and remain constant in between. The dynamics are governed by Newton’s laws, with a small amount of damping. The reward has a maximum of 1 when the ball is exactly at the target and quickly diminishes to 0 away from it:
[Equation image: media_image3.png]
. Lastly, γ = 0.99. Our experiments are conducted on two tasks: Ball-1D where d = 1, Ball-3D where d = 3.”; Section 7.2, “In this domain, the state is the spaceship's location and velocities; the action a ∈ [-1, 1]^2 actuates two thrust engines in forward/backward and right/left directions; the transitions are governed by the rules of physics where damping is applied; the reward is sparse; 1000 points are obtained when reaching the target and 0 elsewhere; and γ = 0.99.” Dalal explains two distinct experiments involving calculating actions to be performed in simulated environments, resulting in rewards for the reinforcement learning process.)

With respect to claim 7, modified Dalal teaches the computer-implemented method for reinforcement learning of claim 1, and Dalal also teaches wherein the first set and the second set of state-action tuples are used as action ranges by the supervised learning. (Section 1, “Therefore, in this work, we define our goal to be maintaining zero-constraint-violations throughout the whole learning process. Note that accomplishing this goal for discrete action spaces is more straightforward than for continuous ones. For instance, one can pre-train constraint-violation classifiers on offline data for pruning unsafe actions. However, in our context, this goal becomes considerably more challenging due to the infinite number of candidate actions. Nevertheless, we indeed manage to accomplish this goal for continuous action spaces and show to never violate constraints throughout the whole learning process.”; Section 7, Figs. 3, 4 – experimental results demonstrating function in a continuous action space)

With respect to claim 8, it is substantially similar to claim 1 and is rejected in the same manner, the same art and reasoning applying. Further, Dalal also teaches A computer program product for reinforcement learning, (Section 6, “We experiment with Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015) whose policy network directly outputs actions and not their probabilities.”; Section 6, “In terms of implementation, it consists of a few primitive arithmetic operations: vector products followed by a 'max' operation. The benefits of its simplicity are three-fold: i) it has a trivial, almost effortless software implementation; ii) its computational cost is negligible; and iii) it is differentiable (almost everywhere, as is ReLu).” DDPG – reinforcement learning algorithm, Software implementation – computer implemented. Examiner asserts a computer program product is inherent in the implementation of a method using software implementation. Software implementation is well-understood in the art to involve a computer program product comprising a non-transitory computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computer to cause the computer to perform a method.)
 the computer program product comprising a non-transitory computer readable storage medium having program instructions embodied therewith, (Section 6, “We experiment with Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015) whose policy network directly outputs actions and not their probabilities.”; Section 6, “In terms of implementation, it consists of a few primitive arithmetic operations: vector products followed by a 'max' operation. The benefits of its simplicity are three-fold: i) it has a trivial, almost effortless software implementation; ii) its computational cost is negligible; and iii) it is differentiable (almost everywhere, as is ReLu).” DDPG – reinforcement learning algorithm, Software implementation – computer implemented. Examiner asserts a computer program product is inherent in the implementation of a method using software implementation. Software implementation is well-understood in the art to involve a computer program product comprising a non-transitory computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computer to cause the computer to perform a method.)

With respect to claim 9, it is substantially similar to claim 2 and is rejected in the same manner, the same art and reasoning applying.

With respect to claim 10, it is substantially similar to claim 3 and is rejected in the same manner, the same art and reasoning applying.

With respect to claim 14, it is substantially similar to claim 7 and is rejected in the same manner, the same art and reasoning applying.

With respect to claim 15, it is substantially similar to claim 1 and is rejected in the same manner, the same art and reasoning applying. Further, Dalal also teaches A computer processing system for reinforcement learning, (Section 6, “We experiment with Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015) whose policy network directly outputs actions and not their probabilities.”; Section 6, “In terms of implementation, it consists of a few primitive arithmetic operations: vector products followed by a 'max' operation. The benefits of its simplicity are three-fold: i) it has a trivial, almost effortless software implementation; ii) its computational cost is negligible; and iii) it is differentiable (almost everywhere, as is ReLu).” DDPG – reinforcement learning algorithm, Software implementation – computer implemented. Examiner asserts a computer processing system is inherent in the implementation of a method using software implementation. Software implementation is well-understood in the art to involve a computer processing system comprising a memory for storing program code and a processor device operatively coupled to the memory for running the program code.)
comprising: a memory for storing program code; (Section 6, “We experiment with Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015) whose policy network directly outputs actions and not their probabilities.”; Section 6, “In terms of implementation, it consists of a few primitive arithmetic operations: vector products followed by a 'max' operation. The benefits of its simplicity are three-fold: i) it has a trivial, almost effortless software implementation; ii) its computational cost is negligible; and iii) it is differentiable (almost everywhere, as is ReLu).” DDPG – reinforcement learning algorithm, Software implementation – computer implemented. Examiner asserts a computer processing system is inherent in the implementation of a method using software implementation. Software implementation is well-understood in the art to involve a computer processing system comprising a memory for storing program code and a processor device operatively coupled to the memory for running the program code.)
and a processor device operatively coupled to the memory for running the program code to (Section 6, “We experiment with Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015) whose policy network directly outputs actions and not their probabilities.”; Section 6, “In terms of implementation, it consists of a few primitive arithmetic operations: vector products followed by a 'max' operation. The benefits of its simplicity are three-fold: i) it has a trivial, almost effortless software implementation; ii) its computational cost is negligible; and iii) it is differentiable (almost everywhere, as is ReLu).” DDPG – reinforcement learning algorithm, Software implementation – computer implemented. Examiner asserts a computer processing system is inherent in the implementation of a method using software implementation. Software implementation is well-understood in the art to involve a computer processing system comprising a memory for storing program code and a processor device operatively coupled to the memory for running the program code.)

With respect to claim 16, it is substantially similar to claim 2 and is rejected in the same manner, the same art and reasoning applying.

With respect to claim 17, it is substantially similar to claim 3 and is rejected in the same manner, the same art and reasoning applying.

Claims 4, 5, 11, 12, 18, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Dalal (Dalal et al., “Safe Exploration in Continuous Action Spaces”, arXiv, January 2018, 9 pages.), cited in the IDS submitted on 07/30/2018, in view of Hwang (Hwang et al., “Inverse Reinforcement Learning based on Critical State”, 16th World Congress of the International Fuzzy Systems Association (IFSA) 9th Conference of the European Society for Fuzzy Logic and Technology (EUSFLAT), June-July 2015, pages 771-775.) and further in view of Chen (Chen et al., “Decentralized Non-communicating Multiagent Collision Avoidance with Deep Reinforcement Learning”, 2017 IEEE International Conference on Robotics and Automation (ICRA), May-June 2017, pages 285-292.).

With respect to claim 4, modified Dalal teaches the computer-implemented method for reinforcement learning of claim 1, but modified Dalal does not explicitly teach wherein each of the plurality of action constraints is an inequality constraint.
Chen, however, does teach wherein each of the plurality of action constraints is an inequality constraint. (Section 2A, 

[Image: media_image4.png]
[Image: media_image5.png]
Section 3D,
[Image: media_image6.png]
Collision avoidance constraints and kinematics constraints both constrain the possible actions of the agent; they are therefore action constraints. In Chen, all forms of action constraints are inequalities.)
It would have been obvious to an artisan of ordinary skill before the effective filing date of the claimed invention to combine the computer-implemented method for reinforcement learning of modified Dalal with each of the plurality of action constraints being an inequality constraint in order to improve solution quality over reaction-based methods. (Chen, Introduction)

With respect to claim 5, modified Dalal teaches the computer-implemented method for reinforcement learning of claim 1, but modified Dalal does not explicitly teach wherein the first set is relaxed to allow non-optimal demonstrations that are directed closer towards succeeding than failing.
Chen, however, does teach wherein the first set is relaxed to allow non-optimal demonstrations that are directed closer towards succeeding than failing. (Abstract, “Simulation results show more than 26% improvement in paths quality (i.e., time to reach the goal) when compared with optimal reciprocal collision avoidance (ORCA), a state-of-the-art collision avoidance strategy.”; Section 3C, "This work uses optimal reciprocal collision avoidance (ORCA) [11] to generate a training set of 500 trajectories, which contains approximately 20,000 state-value pairs. We make a few remarks about this initialization step. First, the training trajectories do not have to be optimal. For instance, two of the training trajectories generated by ORCA [11] are shown in Fig. 4a. The red agent was pushed away by the blue agent and followed a large arc before reaching its goal." The set of training trajectories constitutes a set of good demonstrations. By generating training trajectories using ORCA, they are directed towards succeeding even if they are not optimal.)
It would have been obvious to an artisan of ordinary skill before the effective filing date of the claimed invention to combine the computer-implemented method for reinforcement learning of modified Dalal with relaxing the first set to allow non-optimal demonstrations that are directed closer towards succeeding than failing, in order to improve solution quality over reaction-based methods. (Chen, Introduction)

With respect to claim 11, it is substantially similar to claim 4 and is rejected in the same manner, the same art and reasoning applying.

With respect to claim 12, it is substantially similar to claim 5 and is rejected in the same manner, the same art and reasoning applying.

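With respect to claim 18, it is substantially similar to claim 4 and is rejected in the same manner, the same art and reasoning applying.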

With respect to claim 19, it is substantially similar to claim 5 and is rejected in the same manner, the same art and reasoning applying.

Claims 6, 13, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Dalal (Dalal et al., “Safe Exploration in Continuous Action Spaces”, arXiv, January 2018, 9 pages.), cited in the IDS submitted on 07/30/2018, in view of Hwang (Hwang et al., “Inverse Reinforcement Learning based on Critical State”, 16th World Congress of the International Fuzzy Systems Association (IFSA) 9th Conference of the European Society for Fuzzy Logic and Technology (EUSFLAT), June-July 2015, pages 771-775.) and further in view of Johnson (Johnson et al., “Semi-Supervised Nonlinear Distance Metric Learning via Forests of Max-Margin Clustering Hierarchies”, arXiv, February 2014, 11 pages.).

With respect to claim 6, modified Dalal teaches the computer-implemented method for reinforcement learning of claim 1, but does not explicitly teach wherein the evaluation of each of the plurality of action constraints is performed relative to a violation margin and a satisfaction margin, wherein for a given one of the restricted actions, the violation margin represents a margin of violation between the action and the plurality of action constraints, and the satisfaction margin represents a margin of satisfaction between the action and the plurality of action constraints.
Johnson, however, does teach wherein the evaluation of each of the plurality of action constraints is performed relative to a violation margin (Section 3.2, “Standard SSMMC, however, will instead attempt to simultaneously satisfy the cannot-link constraints between all of these classes, which is impossible. As a result, the optimization algorithm may seek a compromise solution that weakly violates all or most of the constraints, rather than one that strongly satisfies a subset of the constraints and ignores the others (e.g. that separates apples and oranges from bicycles and motorcycles).”; Section 3.2.1,
[Image: media_image7.png]
Satisfaction margin; Section 3.2.1,
[Image: media_image8.png]
Highest-scoring constraint-violating – violation margin)
and a satisfaction margin, (Section 3.2, “Standard SSMMC, however, will instead attempt to simultaneously satisfy the cannot-link constraints between all of these classes, which is impossible. As a result, the optimization algorithm may seek a compromise solution that weakly violates all or most of the constraints, rather than one that strongly satisfies a subset of the constraints and ignores the others (e.g. that separates apples and oranges from bicycles and motorcycles).”; Section 3.2.1,
 
[Image: media_image7.png]
Satisfaction margin; Section 3.2.1,
[Image: media_image8.png]
Highest-scoring constraint-violating – violation margin)
wherein for a given one of the restricted actions, the violation margin represents a margin of violation between the action and the plurality of action constraints, (Section 3.2, “Standard SSMMC, however, will instead attempt to simultaneously satisfy the cannot-link constraints between all of these classes, which is impossible. As a result, the optimization algorithm may seek a compromise solution that weakly violates all or most of the constraints, rather than one that strongly satisfies a subset of the constraints and ignores the others (e.g. that separates apples and oranges from bicycles and motorcycles).”; Section 3.2.1,
 
[Image: media_image7.png]
Satisfaction margin; Section 3.2.1,
[Image: media_image8.png]
Highest-scoring constraint-violating – violation margin)
and the satisfaction margin represents a margin of satisfaction between the action and the plurality of action constraints. (Section 3.2, “Standard SSMMC, however, will instead attempt to simultaneously satisfy the cannot-link constraints between all of these classes, which is impossible. As a result, the optimization algorithm may seek a compromise solution that weakly violates all or most of the constraints, rather than one that strongly satisfies a subset of the constraints and ignores the others (e.g. that separates apples and oranges from bicycles and motorcycles).”; Section 3.2.1,
 
[Image: media_image7.png]
Satisfaction margin; Section 3.2.1,
[Image: media_image8.png]
Highest-scoring constraint-violating – violation margin)
It would have been obvious to an artisan of ordinary skill before the effective filing date of the claimed invention to combine the computer-implemented method for reinforcement learning of modified Dalal with the teaching of Johnson, wherein the evaluation of each of the plurality of action constraints is performed relative to a violation margin and a satisfaction margin, wherein for a given one of the restricted actions, the violation margin represents a margin of violation between the action and the plurality of action constraints, and the satisfaction margin represents a margin of satisfaction between the action and the plurality of action constraints, in order to take a relaxed approach to constraint satisfaction rather than attempting to simultaneously satisfy all constraints, leading to a more robust learning algorithm. (Johnson, Abstract)
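For illustration only, the following minimal sketch shows one possible reading of evaluating action constraints relative to a violation margin and a satisfaction margin. It is not taken from Johnson, Dalal, or the claims. The signed constraint score, the three status labels, the margin values, and the names `evaluate_constraint` and `evaluate_action` are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MarginResult:
    score: float   # signed constraint score (hypothetical: > 0 favors satisfaction)
    status: str    # "satisfied", "violated", or "undecided"


def evaluate_constraint(score: float,
                        satisfaction_margin: float,
                        violation_margin: float) -> MarginResult:
    """Classify one constraint score relative to two margins (assumed scheme)."""
    if satisfaction_margin < 0 or violation_margin < 0:
        raise ValueError("margins must be non-negative")
    if score >= satisfaction_margin:
        status = "satisfied"    # satisfied by at least the satisfaction margin
    elif score <= -violation_margin:
        status = "violated"     # violated by at least the violation margin
    else:
        status = "undecided"    # falls between the two margins
    return MarginResult(score, status)


def evaluate_action(scores: List[float],
                    satisfaction_margin: float,
                    violation_margin: float) -> List[MarginResult]:
    """Evaluate every constraint score of one action against both margins."""
    return [evaluate_constraint(s, satisfaction_margin, violation_margin)
            for s in scores]


# Hypothetical scores for one restricted action under three constraints.
results = evaluate_action([1.5, -0.2, -2.0],
                          satisfaction_margin=1.0,
                          violation_margin=1.0)
statuses = [r.status for r in results]
```

Under this assumed scheme, a constraint is treated as clearly satisfied or clearly violated only when its score clears the corresponding margin, and scores between the margins are left undecided. This loosely mirrors the relaxed approach described in the quoted Section 3.2 passage.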



With respect to claim 20, it is substantially similar to claim 13 and is rejected in the same manner, the same art and reasoning applying.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Judah (Judah et al., “Reinforcement Learning Via Practice and Critique Advice”, Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, July 2010, pages 481-486.)
Zhu (Zhu et al., “Combining Dynamic Reward Shaping and Action Shaping for Coordinating Multi-Agent Learning”, 2013 IEEE/WIC/ACM International Conferences on Web Intelligence (WI) and Intelligent Agent Technology (IAT), November 2013, pages 321-328.)
Datta (Datta et al., “Probabilistic Constraint Handling in the Framework of Joint Evolutionary-Classical Optimization with Engineering Applications”, Kanpur Genetic Algorithms Laboratory (KanGAL), March 2012, 8 pages.)

Any inquiry concerning this communication or earlier communications from the examiner should be directed to MARK J TURNER whose telephone number is (571)272-8469. The examiner can normally be reached Monday-Thursday 9am-7pm ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li B Zhen can be reached on 571-272-3768. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/M.J.T./Examiner, Art Unit 2121

/Li B. Zhen/Supervisory Patent Examiner, Art Unit 2121