DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-3, 5-8, 13-15, 17, 21 are rejected under 35 U.S.C. 103 as being unpatentable over King et al (US 20220121260 A1) in view of Venayagamoorthy et al (US 20130268131 A1) and Hasselt et al (“Reinforcement Learning in Continuous Action Spaces”, 2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, April 2007, pages 272-279).
Regarding claims 1, 2, 8, King et al discloses a method (figs. 1A-B) for controlling a power system (a method for managing the distribution of generated electrical power; the control system 40 first functions to sense and analyze the instantaneous power being generated and delivered to the system from renewable energy source 12; paragraph 0009, 0135), comprising: formulating a voltage control problem (identifying an identifier of the electrical power generation device and a first value corresponding to a voltage associated with the first unit of power and a second value corresponding to a current associated with the first unit of power; scheme for online voltage security; paragraph 0010, 0020) using the bus voltages of a power grid (a centralized grid 110; paragraph 0060) within a predefined zone before and after a disturbance (electrical disturbance, power loss; generate, by the controller, the first record object responsive to the determination that the voltage and the current satisfy the threshold; paragraph 0063-0065, 0233).
However, King et al does not specifically disclose the features of performing offline training with historical data to train the DRL agent; performing online retraining of the DRL agent using live PMU data; and providing autonomous control of the power system below a sub-second after training.
On the other hand, Venayagamoorthy et al, from the same field of endeavor, discloses the features of performing (DSOPF control scheme: a DSOPF controller, performs the function of AGC and RVC; performance charts for a power system over time as controlled by AGC controllers and DSOPF controllers; paragraph 0077, 0024-0025) offline training (the model network is trained offline to minimize the one-step-ahead prediction error over all the recorded data; paragraph 0080) with historical data to train the DRL agent (after the offline training, the model network is used to provide system-wide cross-coupling sensitivity signals over a wide operating range; in addition, the action network is trained to approximate the optimal control law by minimizing the partial derivative of J(k) with respect to u(k); paragraph 0079-0080; paragraph 0082-0083); performing (control performance of the Area DSOPF Controllers; the control performance of the area DSOPF controllers is compared with that of using only AGCs; paragraph 0095-0096, 0104) online retraining (the DHP critic network is trained online; the training of the critic network starts with a small discount factor; paragraph 0081-0082) of the DRL agent using live PMU data (the control state of an electrical grid is normally monitored by electronic collection of data corresponding to the conditions in the electrical grid; note that phasor measurement units (PMUs) are also used to sense and collect condition data; paragraph 0086); and providing autonomous control of the power system below a sub-second after training (a three-phase-to-ground fault 13 happens somewhere along line 2-5 at 400 s into the simulation; the DSOPF control algorithm coordinates both active and reactive power for a power system; furthermore, the model weights are continuously updated with a small learning rate to ensure tracking of new operating conditions; the random initial weights of both the critic and action networks are limited to small values such that the initial outputs of both the critic and action networks are close to zero; paragraph 0084-0085, 0088-0089).
Hasselt et al also discloses a new class of algorithms named “Continuous Actor Critic Learning Automaton” that can handle continuous states and actions. Note that reinforcement learning can be used to make an agent learn to interact with an environment. The goal is to optimize the behavior of the agent with respect to a reward signal that is provided by the environment. The actions of the agent can also affect the environment. Reinforcement learning can be used to find solutions for Markov Decision Processes. Furthermore, the “Continuous Actor Critic Learning Automaton” algorithm is model free, which Hasselt considers a good property since it avoids assuming that the agent has an a priori model of the environment (see page 272 for details). Note that, in the example of Hasselt, an agent follows a target. The positions of the agent and the target are real-valued two-dimensional vectors. The state description is a four-dimensional vector containing the positions of the agent and the target. It can be concluded that the new class of algorithms named “Continuous Actor Critic Learning Automaton” for the reinforcement learning framework can be extended to handle problems that involve continuous state and action spaces (pages 276-279). Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to apply the technique of Hasselt to the modified system of Venayagamoorthy and King in order to provide a reinforcement learning method that can be used to solve problems that can be modeled as a Markov Decision Process.
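For illustration only, the kind of model-free actor-critic update described for the Hasselt reference can be sketched as follows. This is a minimal sketch under stated assumptions, not a reproduction of the reference or of applicant's method: the linear actor and critic, the scalar action, the function name `cacla_step`, and the learning rates are all hypothetical, and the rule of updating the actor only when the temporal-difference error is positive reflects the commonly described form of the algorithm rather than any passage quoted in this action.

```python
def cacla_step(actor_w, critic_w, state, next_state, reward, action,
               alpha=0.1, beta=0.1, gamma=0.9):
    """One actor-critic update with linear function approximators.

    All names and hyperparameters are illustrative assumptions.
    """
    # Critic's value estimates for the current and next continuous states.
    v = sum(w * s for w, s in zip(critic_w, state))
    v_next = sum(w * s for w, s in zip(critic_w, next_state))
    td_error = reward + gamma * v_next - v

    # Critic: temporal-difference update of the state-value estimate.
    new_critic = [w + alpha * td_error * s for w, s in zip(critic_w, state)]

    # Actor: move its output toward the explored action only when that
    # action turned out better than expected (positive TD error).
    new_actor = list(actor_w)
    if td_error > 0:
        predicted = sum(w * s for w, s in zip(actor_w, state))
        new_actor = [w + beta * (action - predicted) * s
                     for w, s in zip(actor_w, state)]
    return new_actor, new_critic, td_error
```

No model of the environment appears anywhere in the update; only observed states, the explored action, and the reward are used, which is the sense in which such a method is model free.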
Regarding claim 3, King et al, as modified, discloses a method (figs. 1A-B) for controlling a power system (a method for managing the distribution of generated electrical power; the control system 40 first functions to sense and analyze the instantaneous power being generated and delivered to the system from renewable energy source 12; paragraph 0009, 0135), wherein representative operating conditions are collected or created, including random load changes (implement load shifting and peak shaving; if the controller determines that one of the devices has an increased load and requires more power, the controller can identify the path associated with the device as the identified path; paragraph 0109-0110), variations in renewable generation, generation dispatch patterns, major topology changes due to maintenance and contingencies (the controller can identify the path based on a status of each of the devices; the path can be identified by assessing the amount of electrical need at each of the devices; furthermore, the controller can identify the path in response to a condition of the storage device; the controller can identify the path in response to a signal received from the transmission device indicating a request for electrical power; paragraph 0110-0111, 0117, 0128).
Regarding claim 5, King et al, as modified, discloses a method (figs. 1A-B) for controlling a power system (a method for managing the distribution of generated electrical power; the control system 40 first functions to sense and analyze the instantaneous power being generated and delivered to the system from renewable energy source; paragraph 0009, 0135), comprising providing rewards to minimize the system loss or to balance multiple control objectives (power equipment of the load; control system 40; each objective function represents a loss of electrical power associated with the corresponding path; paragraph 0019, 0133-0134).
Regarding claim 6, King et al, as modified, discloses a method (figs. 1A-B) for controlling a power system (a method for managing the distribution of generated electrical power; the control system 40 first functions to sense and analyze the instantaneous power being generated and delivered to the system from renewable energy source; paragraph 0009, 0135), comprising defining states as a vector of voltage magnitudes, phase angles, and active and reactive power flows on branches (generate an electrical current and an electrical voltage to power the nanogrid 130, the microgrid 120, or the centralized grid 110; in addition, detect voltage and current values from each of the Flow Path P1, the Flow Path P2, or any other path or component of the architecture 100 using one or more voltage sensors and one or more current sensors; paragraph 0064-0065) directly provided by EMS or WAMS systems coordinated voltage control (the controller can measure, for example, the energy generated by the energy generation device 131 to determine when a unit of energy has been generated; the control signal manager 325 can generate a control signal to route the generated unit of power or energy from the identified source device and the identified destination device; furthermore, the path manager 320 can identify the path along which to transmit the generated unit of power or energy based on the objective functions corresponding to each of the plurality of paths; note that the path manager can access the power information (the voltage sensor information, the current sensor information, type of generation device, etc.) to determine the inputs to each function associated with each path; paragraph 0096-0098).
Regarding claim 7, King et al, as modified, discloses a method (figs. 1A-B) for controlling a power system (a method for managing the distribution of generated electrical power; the control system 40 first functions to sense and analyze the instantaneous power being generated and delivered to the system from renewable energy source; paragraph 0009, 0135), wherein for a power grid (electrical power grid with several power plants; a system for generating energy across parallel energy nano-grids) with N power plants used for voltage control, a total combination of control actions forms a space in the dimension of 5N (one centralized power transmission grid 112, and at least one centralized grid load 113; paragraph 0060, 0064-0065).
Regarding claims 13, 14, 21, King et al discloses a system (figs. 1A-B) for controlling a power system (a method for managing the distribution of generated electrical power; the control system 40 first functions to sense and analyze the instantaneous power being generated and delivered to the system from renewable energy source 12; paragraph 0009, 0135), comprising: a processor; power sensors (voltage sensors and one or more current sensors) coupled to the processor and a grid (one or more processors can be configured to generate a first record object responsive to the electrical power generation device generating a first unit of power; in addition, a first value corresponding to a voltage associated with the first unit of power, and a second value corresponding to a current associated with the first unit of power; paragraph 0010); using the bus voltages of a power grid (a centralized grid 110; paragraph 0060) within a predefined zone before and after a disturbance (electrical disturbance, power loss; paragraph 0063-0065, 0233).
However, King et al does not specifically disclose the features of performing offline training with historical data to train the DRL agent; performing online retraining of the DRL agent using live PMU data; and providing autonomous control of the power system below a sub-second after training.
On the other hand, Venayagamoorthy et al, from the same field of endeavor, discloses the features of performing (DSOPF control scheme: a DSOPF controller, performs the function of AGC and RVC; performance charts for a power system over time as controlled by AGC controllers and DSOPF controllers; paragraph 0077, 0024-0025) offline training (the model network is trained offline to minimize the one-step-ahead prediction error over all the recorded data; paragraph 0080) with historical data to train the DRL agent (after the offline training, the model network is used to provide system-wide cross-coupling sensitivity signals over a wide operating range; in addition, the action network is trained to approximate the optimal control law by minimizing the partial derivative of J(k) with respect to u(k); paragraph 0079-0080; paragraph 0082-0083); performing (control performance of the Area DSOPF Controllers; the control performance of the area DSOPF controllers is compared with that of using only AGCs; paragraph 0095-0096, 0104) online retraining (the DHP critic network is trained online; the training of the critic network starts with a small discount factor; paragraph 0081-0082) of the DRL agent using live PMU data (the control state of an electrical grid is normally monitored by electronic collection of data corresponding to the conditions in the electrical grid; note that phasor measurement units (PMUs) are also used to sense and collect condition data; paragraph 0086); and providing autonomous control of the power system below a sub-second after training (a three-phase-to-ground fault 13 happens somewhere along line 2-5 at 400 s into the simulation; the DSOPF control algorithm coordinates both active and reactive power for a power system; furthermore, the model weights are continuously updated with a small learning rate to ensure tracking of new operating conditions; the random initial weights of both the critic and action networks are limited to small values such that the initial outputs of both the critic and action networks are close to zero; paragraph 0084-0085, 0088-0089).
Hasselt et al also discloses a new class of algorithms named “Continuous Actor Critic Learning Automaton” that can handle continuous states and actions. Note that reinforcement learning can be used to make an agent learn to interact with an environment. The goal is to optimize the behavior of the agent with respect to a reward signal that is provided by the environment. The actions of the agent can also affect the environment. Reinforcement learning can be used to find solutions for Markov Decision Processes. Furthermore, the “Continuous Actor Critic Learning Automaton” algorithm is model free, which Hasselt considers a good property since it avoids assuming that the agent has an a priori model of the environment (see page 272 for details). Note that, in the example of Hasselt, an agent follows a target. The positions of the agent and the target are real-valued two-dimensional vectors. The state description is a four-dimensional vector containing the positions of the agent and the target. It can be concluded that the new class of algorithms named “Continuous Actor Critic Learning Automaton” for the reinforcement learning framework can be extended to handle problems that involve continuous state and action spaces (pages 276-279). Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to apply the technique of Hasselt to the modified system of Venayagamoorthy and King in order to provide a reinforcement learning method that can be used to solve problems that can be modeled as a Markov Decision Process.
Regarding claim 15, King et al, as modified, discloses a system (figs. 1A-B) for controlling a power system (a method for managing the distribution of generated electrical power; the control system 40 first functions to sense and analyze the instantaneous power being generated and delivered to the system from renewable energy source; paragraph 0009, 0135), wherein representative operating conditions are collected or created, including random load changes (implement load shifting and peak shaving; if the controller determines that one of the devices has an increased load and requires more power, the controller can identify the path associated with the device as the identified path; paragraph 0109-0110), variations in renewable generation, generation dispatch patterns, major topology changes due to maintenance and contingencies (the controller can identify the path based on a status of each of the devices; the path can be identified by assessing the amount of electrical need at each of the devices; furthermore, the controller can identify the path in response to a condition of the storage device; the controller can identify the path in response to a signal received from the transmission device indicating a request for electrical power; paragraph 0110-0111, 0117, 0128).
Regarding claim 17, King et al, as modified, discloses a system (figs. 1A-B) for controlling a power system (a method for managing the distribution of generated electrical power; the control system 40 first functions to sense and analyze the instantaneous power being generated and delivered to the system from renewable energy source 12; paragraph 0009, 0135), comprising code for providing rewards to minimize the system loss or to balance multiple control objectives (power equipment of the load; control system 40; each objective function represents a loss of electrical power associated with the corresponding path; paragraph 0019, 0133-0134).
Claims 9-10, 18-19 are rejected under 35 U.S.C. 103 as being unpatentable over King et al (US 20220121260 A1) in view of Venayagamoorthy et al (US 20130268131 A1) and Hasselt et al (“Reinforcement Learning in Continuous Action Spaces”, 2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, April 2007, pages 272-279) as applied to claims 1, 13 above, and further in view of Lillicrap et al (“Continuous control with deep reinforcement learning”, CoRR, July 2019, arxiv.org/abs/1509.02971, 14 pages).
Regarding claims 9-10, 18-19, King, Venayagamoorthy and Hasselt disclose everything claimed as explained above except the features of applying DQN reinforcement learning by combining Q-Learning with two or more deep neural networks for reinforcement learning in a high-dimensional environment, wherein parameters of the target network are fixed and periodically updated from an evaluation network.  
However, Lillicrap et al discloses the features of applying DQN reinforcement learning (adapting deep reinforcement learning methods such as DQN to continuous domains is to simply discretize the action space; a model free, off-policy actor-critic algorithm using deep function approximators that can learn policies in high-dimensional, continuous action spaces; deterministic policy gradient algorithm; see page 1 for details) by combining Q-Learning with two or more deep neural networks (using neural networks for reinforcement learning) for reinforcement learning in a high-dimensional environment (reinforcement learning set up consisting of an agent interacting with an environment E in discrete timesteps; apply Q-learning to continuous action spaces; furthermore, the DPG algorithm maintains a parameterized actor function which specifies the current policy by deterministically mapping states to a specific action; Q-learning applies in neural network function approximators; policy gradient; pages 3-4), wherein parameters of the target network are fixed and periodically updated from an evaluation network (Q-Learning algorithm; Reinforcement learning includes a class of learning methods, which can approach optimal control iteratively through online learning; scheme for online voltage security assessment; online voltage security assessment scheme using synchronized phasor measurements and periodically updated decision trees; synchronized critical attributes are obtained in real time from phasor measurements units; Deep Q-Networks: directly implementing Q learning with neural networks; construct stochastic neural network policies without decomposing problems into optimal control and supervised phases; deep dynamical model network along with model predictive control to solve the pendulum swing-up task from pixel input; evaluate policy periodically during training by testing it without exploration noise; performance after training across all environments; pages 4-8).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to apply the technique of Lillicrap to the modified system of Hasselt, Venayagamoorthy and King in order to provide deep learning and deep reinforcement learning methods that result in an algorithm that robustly solves challenging problems across a variety of domains with continuous action spaces.
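For illustration only, the limitation "parameters of the target network are fixed and periodically updated from an evaluation network" can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from any cited reference or from applicant's disclosure: small lookup tables stand in for the deep neural networks, and the function name `q_learning_with_target`, the transition format, and the parameter `sync_every` are hypothetical.

```python
def q_learning_with_target(transitions, n_states, n_actions,
                           alpha=0.5, gamma=0.9, sync_every=2):
    """Q-learning with a frozen target table synced every `sync_every` steps.

    Tables replace the deep networks purely to keep the sketch short.
    """
    eval_q = [[0.0] * n_actions for _ in range(n_states)]
    target_q = [row[:] for row in eval_q]  # frozen copy of the evaluation table

    for step, (s, a, r, s_next) in enumerate(transitions, start=1):
        # The bootstrap target uses the frozen parameters, not the live ones.
        td_target = r + gamma * max(target_q[s_next])
        eval_q[s][a] += alpha * (td_target - eval_q[s][a])

        # Periodic update: copy the evaluation parameters into the target.
        if step % sync_every == 0:
            target_q = [row[:] for row in eval_q]

    return eval_q, target_q
```

Between synchronizations the target values do not move while the evaluation values are being learned, which is the sense of "fixed and periodically updated" used in the claim language above.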
Allowable Subject Matter
Claims 4, 11-12, 16, 20 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MARCEAU MILORD whose telephone number is (571)272-7853. The examiner can normally be reached 10-6.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, CHARLES APPIAH, can be reached on 571-272-7904. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

MARCEAU MILORD
Examiner
Art Unit 2641



/MARCEAU MILORD/Primary Examiner, Art Unit 2641