DETAILED ACTION
This action is in response to the application filed January 31, 2019 which claims priority to PRO 62/624543 filed on January 31, 2018. Claims 1-20 are pending and have been considered. 
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 

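As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 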
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
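Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material, or acts to entirely perform the recited function.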
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitations use a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier.  Such claim limitations are:
a receiver configured to obtain in claim 1.
a data storage configured to maintain in claim 1.
a supervised classifier for training in claim 1.
an action execution processor configured to generate in claim 1. 
a state observer configured to monitor in claim 1.
a notification engine is configured to in claim 10.
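Because these claim limitations are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, they are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.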
If applicant does not intend to have these limitations interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may:  (1) amend the claim limitations to avoid them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitations recite sufficient structure to perform the claimed function so as to avoid them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claim 20 is rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter. The claim does not fall within at least one of the four categories of patent eligible subject matter because it can be interpreted as directed to a signal per se. The claim recites a computer readable medium; however, the specification does not explicitly state that a computer readable medium excludes transitory signals per se. Examiner suggests that applicant amend “computer readable medium” to recite “non-transitory computer readable medium.”

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1, 2, 8, 9, 11, 12, 18, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Hester et al. ("Deep Q-learning from Demonstrations", hereinafter "Hester") in view of Brys et al. ("Combining Multiple Correlated Reward and Shaping Signals by Measuring Confidence", hereinafter "Brys").

Regarding claim 1, Hester teaches A system for biasing a machine learning architecture using one or more demonstrator data sets, the machine learning architecture for controlling one or more actions conducted by an agent in an environment which transitions between one or more states (“While accurate simulators are difficult to find, most of these problems have data of the system operating under a previous controller (either human or machine) that performs reasonably well. In this work, we make use of this demonstration data to pre-train the agent so that it can perform well in the task from the start of learning, and then continue improving from its own self-generated data” [pg. 1, § Introduction, ¶2; note: It is implicit that a computer is performing the tasks; the claim limitations invoke 112f thus the examiner is interpreting the limitations of the claim to correspond to a generic computer to perform the same functions.]), the system comprising:
 a receiver configured to obtain one or more demonstrator data sets, each demonstrator data set including a data structure representing one or more state-action pairs observed in one or more interactions with the environment (“Therefore, we want the agent to learn as much as possible from the demonstration data before running on the real system. The goal of the pre-training phase is to learn to imitate the demonstrator with a value function that satisfies the Bellman equation so that it can be updated with TD updates once the agent starts interacting with the environment” [pg. 3, § Deep Q-Learning from Demonstrations, ¶1; See § Background, ¶1 for state action pairs]);
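a data storage configured to maintain, for each demonstrator data set or sub-portions thereof, associated with at least one state of the one or more states (“Second, the agent adds all of its experiences to a replay buffer Dreplay, which is then sampled uniformly to perform updates on the network.” [pg. 2, § Background, ¶1; See further Algorithm 1, pg. 4, step 11; Examiner is interpreting maintaining to be equivalent to storing datasets and updating them.]);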
 a supervised classifier for training using the one or more demonstrator data sets or sub-portions thereof (“During this pre-training phase, the agent samples mini-batches from the demonstration data and updates the network by applying four losses: the 1-step double Q-learning loss, an n-step double Q-learning loss, a supervised large margin classification loss, and an L2 regularization loss on the network weights and biases. The supervised loss is used for classification of the demonstrator’s actions” [pg. 3, § Deep Q-Learning from Demonstrations, ¶1; classifier is implicit to perform the classification of the demonstrator’s actions]); and
a state observer configured to monitor a new state resulting from the execution of the action and an associated reward outcome (“In each state s ∈ S, the agent takes an action a ∈ A. Upon taking this action, the agent receives a reward R(s, a) and reaches a new state s′” [pg. 2, § Background, ¶1]); and to update the internal policy function maintained by the machine learning architecture based at least on the observed reward outcome (“A policy π specifies for each state which action the agent will take. The goal of the agent is to find the policy π mapping states to actions that maximizes the expected discounted total reward over the agent’s lifetime. The value Qπ (s, a) of a given state-action pair (s, a) is an estimate of the expected future reward that can be obtained from (s, a) when following policy π. The optimal value function Q∗ (s, a) provides maximal values in all states and is determined by solving the Bellman equation” [pg. 2, Background, ¶1]).
Hester fails to explicitly teach one or more confidence data values 
an action execution processor configured to generate control signals for executing an action associated with an action-source selected from at least one of the one or more demonstrator data sets based on the supervised classifier or an internal policy function maintained by the machine learning architecture, the selecting based at least upon the one or more confidence data values; 
Brys teaches one or more confidence data values (“This technique makes a temporal difference learner estimate the Q-function for each objective in parallel, and introduces a way of measuring confidence in these estimates” [Abstract])
an action execution processor configured to generate control signals for executing an action associated with an action-source selected from at least one of the one or more demonstrator data sets based on the supervised classifier or an internal policy function maintained by the machine learning architecture (“This paper deals with such reinforcement learning (RL) problems, formulated as Correlated Multi-Objective Markov Decision Processes (CMOMDP). (Single-objective) MDPs describe a system as a set of potential observations of that system’s state S, a set of possible actions A, transition probabilities T for state-action state triplets, and a reward function R that probabilistically maps these transitions to a scalar reward indicating the utility of that transition. The goal of an RL agent operating in an MDP is to maximize the expected, discounted return of the reward function” [pg. 1687, § Introduction, ¶3; Examiner is interpreting taking an action to implicitly teach the existence of an action processor. Performing the action would correspond to generating a control signal.]), the selecting based at least upon the one or more confidence data values (“This technique makes a temporal difference learner estimate the Q-function for each objective in parallel, and introduces a way of measuring confidence in these estimates. This confidence metric is then used to choose which objective’s estimates to use for action selection.” [Abstract])
Hester and Brys are both in the same field of endeavor of reinforcement learning. Hester discloses Q-learning from using human demonstration data. Brys discloses a method for measuring confidence values to estimate a Q-function. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Hester’s Deep Q-learning algorithm with the confidence measurements as taught by Brys. One would have been motivated to make this modification since measuring confidence values to estimate the Q-function would help the agent decide the appropriate actions to perform. [pg. 1688, § Adaptive Objective Selection, Brys]
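For illustration only, the "supervised large margin classification loss" quoted from Hester in the mapping of the supervised classifier limitation can be sketched as follows. The sketch assumes the commonly stated form of that loss, max over actions of [Q(s, a) + l(a_E, a)] minus Q(s, a_E), where l is zero for the demonstrator's action a_E and a positive margin otherwise. The function name, the argument names, and the default margin of 0.8 are assumptions made for this example and are not taken from the record.

```python
from typing import Sequence


def large_margin_loss(q_values: Sequence[float],
                      expert_action: int,
                      margin: float = 0.8) -> float:
    """Illustrative large-margin classification loss for one state.

    q_values      -- estimated Q(s, a) for every action a (hypothetical input)
    expert_action -- index of the action the demonstrator took in state s
    margin        -- penalty added to every non-demonstrator action (assumed value)
    """
    # Add the margin to every action except the demonstrator's action.
    augmented = [
        q + (0.0 if action == expert_action else margin)
        for action, q in enumerate(q_values)
    ]
    # The loss is zero only when the demonstrator's action beats every
    # other action by at least the margin.
    return max(augmented) - q_values[expert_action]
```

Under these assumptions, the loss is positive whenever another action's value comes within the margin of the demonstrator's action, which pushes the network toward classifying the demonstrator's action as the preferred one.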

Regarding claim 2, the combination of Hester and Brys teaches The system of claim 1, Brys further teaches wherein the state observer is configured to update at least one of the confidence data values of the one or more confidence data values based on the observed reward outcome (“We define confidence as an estimation of the likelihood that the estimates are correct. Higher-variance reward distributions will make any estimate of the average reward less confident, and always selecting the objective whose estimates are most likely to be correct will maximize the likelihood of correctly ranking the action set.” [pg. 1688, § Adaptive Objective Selection, ¶2]).
Hester and Brys are both in the same field of endeavor of reinforcement learning. Hester discloses Q-learning from using human demonstration data. Brys discloses a method for measuring confidence values to estimate a Q-function. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Hester’s Deep Q-learning algorithm with the confidence measurements as taught by Brys. One would have been motivated to make this modification since measuring confidence values to estimate the Q-function would help the agent decide the appropriate actions to perform. [pg. 1688, § Adaptive Objective Selection, Brys]
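As a hedged illustration of the passage quoted from Brys, in which higher-variance reward distributions make an estimate less confident, the following sketch updates a confidence value each time a reward outcome is observed. The class name, the use of Welford's running-variance method, and the mapping 1 / (1 + variance) are assumptions chosen for this example; they are not represented to be the confidence measure of Brys or of the claimed invention.

```python
class RewardConfidence:
    """Hypothetical confidence value that falls as observed reward variance rises."""

    def __init__(self) -> None:
        self.count = 0      # number of rewards observed so far
        self.mean = 0.0     # running mean of observed rewards
        self.m2 = 0.0       # running sum of squared deviations (Welford)

    def update(self, reward: float) -> None:
        """Fold one observed reward outcome into the running statistics."""
        self.count += 1
        delta = reward - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (reward - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance; treated as unbounded until two rewards are seen.
        if self.count < 2:
            return float("inf")
        return self.m2 / (self.count - 1)

    @property
    def confidence(self) -> float:
        # Assumed mapping: 1.0 for zero variance, approaching 0.0 as variance grows.
        return 1.0 / (1.0 + self.variance)
```

One such object could be kept per action source, with update called after each observed reward, so that the stored confidence value changes based on the observed reward outcome.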

Regarding claim 8, the combination of Hester and Brys teaches The system of claim 1, where Brys further teaches wherein the selecting of the action-source is based upon an action selection mechanism including a soft-hard-ε decision architecture including an ε-greedy switch for greedily (“For example, if a multi-objectivized problem is solved using Q-learning with adaptive objective selection, then the Q-learning guarantees make sure the estimates for every objective converge to the true values, and given the shaping guarantees, the greedy policies for each objective will be the same, and thus optimal.” [pg. 1689, top left col, ¶2; Examiner is interpreting soft-hard-ε decision architecture to be Q-learning with ε-greedy policy (i.e. switch).]) exploiting a determined confidence value while performing probabilistic exploration (“To make this objective selection decision, we introduce the concept of confidence in learned estimates. We define confidence as an estimation of the likelihood that the estimates are correct.” [pg. 1688, § Adaptive Objective Selection, ¶2; Finding an estimation of the likelihood corresponds to probabilistic exploration.]).
Hester and Brys are both in the same field of endeavor of reinforcement learning. Hester discloses Q-learning from using human demonstration data. Brys discloses a method for measuring confidence values to estimate a Q-function. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Hester’s Deep Q-learning algorithm with the confidence measurements as taught by Brys. One would have been motivated to make this modification since measuring confidence values to estimate the Q-function would help the agent decide the appropriate actions to perform. [pg. 1688, § Adaptive Objective Selection, Brys]
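The interpretation above, an ε-greedy switch that greedily exploits a confidence value while exploring probabilistically, can be illustrated with a minimal sketch. The function name, the source labels, and the dictionary of confidence values are hypothetical and serve only to show the general ε-greedy pattern; the sketch is not drawn from Hester, Brys, or the specification.

```python
import random
from typing import Mapping, Optional


def select_action_source(confidences: Mapping[str, float],
                         epsilon: float,
                         rng: Optional[random.Random] = None) -> str:
    """Pick an action source with an epsilon-greedy rule over confidence values.

    confidences -- hypothetical mapping from source name to confidence value
    epsilon     -- probability of exploring a uniformly random source
    rng         -- optional random generator, useful for reproducible runs
    """
    rng = rng or random.Random()
    sources = list(confidences)
    # Explore: with probability epsilon, ignore confidence and pick any source.
    if rng.random() < epsilon:
        return rng.choice(sources)
    # Exploit: otherwise pick the source with the highest confidence value.
    return max(sources, key=lambda name: confidences[name])
```

For example, select_action_source({"demonstrator": 0.9, "policy": 0.4}, epsilon=0.0) always returns "demonstrator", while a larger epsilon lets the lower-confidence source be tried occasionally.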

Regarding claim 9, the combination of Hester and Brys teaches The system of claim 1, where Hester further teaches wherein the one or more demonstrator data sets are a plurality of demonstrator data sets, including at least a first demonstrator data set provided from a first demonstrator and a second demonstrator data set provided from a second demonstrator (“We ran experiments on a randomly selected subset of 42 Atari games. We had a human player play each game between three and twelve times. Each episode was played either until the game terminated or for 20 minutes. During game play, we logged the agent’s state, actions, rewards, and terminations. The human demonstrations range from 5,574 to 75,472 transitions per game. DQfD learns from a very small dataset compared to other similar work, as AlphaGo learns from 30 million human transitions, and DQN learns from over 200 million frames. DQfD’s smaller demonstration dataset makes it more difficult to learn a good representation without over-fitting. The demonstration scores for each game are shown in a table in the Appendix.” [pg. 5, left col, ¶3]), and the selection of the action-source includes selecting at least from between the first demonstrator data set and the second demonstrator data set (“The agent updates its network with a mix of demonstration and self-generated data. In practice, choosing the ratio between demonstration and self-generated data while learning is critical to improve the performance of the algorithm.” [pg. 2, top left col, ¶1; the human demonstration dataset corresponds to a first demonstrator dataset and DQfD’s demonstration dataset would correspond to a second demonstrator dataset. Learning would include the selecting of an action.]).

Regarding claim 11, Hester teaches A method of biasing a machine learning architecture using one or more demonstrator data sets, and the machine learning architecture for controlling one or more actions conducted by an agent in an environment which transitions between one or more states (“While accurate simulators are difficult to find, most of these problems have data of the system operating under a previous controller (either human or machine) that performs reasonably well. In this work, we make use of this demonstration data to pre-train the agent so that it can perform well in the task from the start of learning, and then continue improving from its own self-generated data” [pg. 1, § Introduction, ¶2; note: It is implicit that a computer is performing the tasks; the claim limitations invoke 112f thus the examiner is interpreting the limitations of the claim to correspond to a generic computer to perform the same functions.]), the method comprising:
 receiving the one or more demonstrator data sets, each demonstrator data set including a data structure representing one or more state-action pairs observed in one or more interactions with the environment (“Therefore, we want the agent to learn as much as possible from the demonstration data before running on the real system. The goal of the pre-training phase is to learn to imitate the demonstrator with a value function that satisfies the Bellman equation so that it can be updated with TD updates once the agent starts interacting with the environment” [pg. 3, § Deep Q-Learning from Demonstrations, ¶1; See § Background, ¶1 for state action pairs]);
for each demonstrator data set or sub-portions thereof, maintaining, associated with at least one state of the one or more states (“Second, the agent adds all of its experiences to a replay buffer Dreplay, which is then sampled uniformly to perform updates on the network.” [pg. 2, § Background, ¶1; See further Algorithm 1, pg. 4, step 11; Examiner is interpreting maintaining to be equivalent to storing datasets and updating them.]);
training a supervised classifier using the one or more demonstrator data sets or sub-portions thereof (“During this pre-training phase, the agent samples mini-batches from the demonstration data and updates the network by applying four losses: the 1-step double Q-learning loss, an n-step double Q-learning loss, a supervised large margin classification loss, and an L2 regularization loss on the network weights and biases. The supervised loss is used for classification of the demonstrator’s actions” [pg. 3, § Deep Q-Learning from Demonstrations, ¶1; classifier is implicit to perform the classification of the demonstrator’s actions]); and
observing a new state resulting from the execution of the action and an associated reward outcome (“In each state s ∈ S, the agent takes an action a ∈ A. Upon taking this action, the agent receives a reward R(s, a) and reaches a new state s′” [pg. 2, § Background, ¶1]); and updating the internal policy function maintained by the machine learning architecture based at least on the observed reward outcome (“A policy π specifies for each state which action the agent will take. The goal of the agent is to find the policy π mapping states to actions that maximizes the expected discounted total reward over the agent’s lifetime. The value Qπ (s, a) of a given state-action pair (s, a) is an estimate of the expected future reward that can be obtained from (s, a) when following policy π. The optimal value function Q∗ (s, a) provides maximal values in all states and is determined by solving the Bellman equation” [pg. 2, Background, ¶1]).
Hester fails to explicitly teach one or more confidence data values 
executing an action associated with an action-source selected from at least one of the one or more demonstrator data sets based on the supervised classifier or an internal policy function maintained by the machine learning architecture, the selecting based at least upon the one or more confidence data values; 
Brys teaches one or more confidence data values (“This technique makes a temporal difference learner estimate the Q-function for each objective in parallel, and introduces a way of measuring confidence in these estimates” [Abstract])
executing an action associated with an action-source selected from at least one of the one or more demonstrator data sets based on the supervised classifier or an internal policy function maintained by the machine learning architecture (“This paper deals with such reinforcement learning (RL) problems, formulated as Correlated Multi-Objective Markov Decision Processes (CMOMDP). (Single-objective) MDPs describe a system as a set of potential observations of that system’s state S, a set of possible actions A, transition probabilities T for state-action state triplets, and a reward function R that probabilistically maps these transitions to a scalar reward indicating the utility of that transition. The goal of an RL agent operating in an MDP is to maximize the expected, discounted return of the reward function” [pg. 1687, § Introduction, ¶3; Examiner is interpreting taking an action to implicitly teach the existence of an action processor. Performing the action would correspond to generating a control signal.]), the selecting based at least upon the one or more confidence data values (“This technique makes a temporal difference learner estimate the Q-function for each objective in parallel, and introduces a way of measuring confidence in these estimates. This confidence metric is then used to choose which objective’s estimates to use for action selection.” [Abstract])
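Hester and Brys are both in the same field of endeavor of reinforcement learning. Hester discloses Q-learning from using human demonstration data. Brys discloses a method for measuring confidence values to estimate a Q-function. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Hester’s Deep Q-learning algorithm with the confidence measurements as taught by Brys. One would have been motivated to make this modification since measuring confidence values to estimate the Q-function would help the agent decide the appropriate actions to perform. [pg. 1688, § Adaptive Objective Selection, Brys]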

Regarding claim 12, the combination of Hester and Brys teaches The method of claim 11, where Brys further teaches comprising: updating at least one of the confidence data values of the one or more confidence data values based on the observed reward outcome (“We define confidence as an estimation of the likelihood that the estimates are correct. Higher-variance reward distributions will make any estimate of the average reward less confident, and always selecting the objective whose estimates are most likely to be correct will maximize the likelihood of correctly ranking the action set.” [pg. 1688, § Adaptive Objective Selection, ¶2]).
Hester and Brys are both in the same field of endeavor of reinforcement learning. Hester discloses Q-learning from using human demonstration data. Brys discloses a method for measuring confidence values to estimate a Q-function. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Hester’s Deep Q-learning algorithm with the confidence measurements as taught by Brys. One would have been motivated to make this modification since measuring confidence values to estimate the Q-function would help the agent decide the appropriate actions to perform. [pg. 1688, § Adaptive Objective Selection, Brys]

Regarding claim 18, the combination of Hester and Brys teaches The method of claim 11, where Brys further teaches wherein the selecting of the action-source is based upon an action selection mechanism including a soft-hard-ε decision architecture including an ε-greedy switch for greedily (“For example, if a multi-objectivized problem is solved using Q-learning with adaptive objective selection, then the Q-learning guarantees make sure the estimates for every objective converge to the true values, and given the shaping guarantees, the greedy policies for each objective will be the same, and thus optimal.” [pg. 1689, top left col, ¶2; Examiner is interpreting soft-hard-ε decision architecture to be Q-learning with ε-greedy policy (i.e. switch).]) exploiting a determined confidence value while performing probabilistic exploration (“To make this objective selection decision, we introduce the concept of confidence in learned estimates. We define confidence as an estimation of the likelihood that the estimates are correct.” [pg. 1688, § Adaptive Objective Selection, ¶2; Finding an estimation of the likelihood corresponds to probabilistic exploration.]).
Hester and Brys are both in the same field of endeavor of reinforcement learning. Hester discloses Q-learning from using human demonstration data. Brys discloses a method for measuring confidence values to estimate a Q-function. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Hester’s Deep Q-learning algorithm with the confidence measurements as taught by Brys. One would have been motivated to make this modification since measuring confidence values to estimate the Q-function would help the agent decide the appropriate actions to perform. [pg. 1688, § Adaptive Objective Selection, Brys]

Regarding claim 20, Hester teaches A computer readable medium storing machine interpretable instructions, which when executed, cause a processor to perform a method of biasing a machine learning architecture using one or more demonstrator data sets, and the machine learning architecture for controlling one or more actions conducted by an agent in an environment which transitions between one or more states (“While accurate simulators are difficult to find, most of these problems have data of the system operating under a previous controller (either human or machine) that performs reasonably well. In this work, we make use of this demonstration data to pre-train the agent so that it can perform well in the task from the start of learning, and then continue improving from its own self-generated data” [pg. 1, § Introduction, ¶2; note: It is implicit that a computer is performing the tasks; the claim limitations invoke 112f thus the examiner is interpreting the limitations of the claim to correspond to a generic computer to perform the same functions.]), the method comprising:
 receiving the one or more demonstrator data sets, each demonstrator data set including a data structure representing one or more state-action pairs observed in one or more interactions with the environment (“Therefore, we want the agent to learn as much as possible from the demonstration data before running on the real system. The goal of the pre-training phase is to learn to imitate the demonstrator with a value function that satisfies the Bellman equation so that it can be updated with TD updates once the agent starts interacting with the environment” [pg. 3, § Deep Q-Learning from Demonstrations, ¶1; See § Background, ¶1 for state action pairs]);
for each demonstrator data set or sub-portions thereof, maintaining, associated with at least one state of the one or more states (“Second, the agent adds all of its experiences to a replay buffer Dreplay, which is then sampled uniformly to perform updates on the network.” [pg. 2, § Background, ¶1; See further Algorithm 1, pg. 4, step 11; Examiner is interpreting maintaining to be equivalent to storing datasets and updating them.]);
training a supervised classifier using the one or more demonstrator data sets or sub-portions thereof (“During this pre-training phase, the agent samples mini-batches from the demonstration data and updates the network by applying four losses: the 1-step double Q-learning loss, an n-step double Q-learning loss, a supervised large margin classification loss, and an L2 regularization loss on the network weights and biases. The supervised loss is used for classification of the demonstrator’s actions” [pg. 3, § Deep Q-Learning from Demonstrations, ¶1; classifier is implicit to perform the classification of the demonstrator’s actions]); and
observing a new state resulting from the execution of the action and an associated reward outcome (“In each state s ∈ S, the agent takes an action a ∈ A. Upon taking this action, the agent receives a reward R(s, a) and reaches a new state s′” [pg. 2, § Background, ¶1]); and updating the internal policy function maintained by the machine learning architecture based at least on the observed reward outcome (“A policy π specifies for each state which action the agent will take. The goal of the agent is to find the policy π mapping states to actions that maximizes the expected discounted total reward over the agent’s lifetime. The value Qπ (s, a) of a given state-action pair (s, a) is an estimate of the expected future reward that can be obtained from (s, a) when following policy π. The optimal value function Q∗ (s, a) provides maximal values in all states and is determined by solving the Bellman equation” [pg. 2, Background, ¶1]).
Hester fails to explicitly teach one or more confidence data values 
executing an action associated with an action-source selected from at least one of the one or more demonstrator data sets based on the supervised classifier or an internal policy function maintained by the machine learning architecture, the selecting based at least upon the one or more confidence data values; 
Brys teaches one or more confidence data values (“This technique makes a temporal difference learner estimate the Q-function for each objective in parallel, and introduces a way of measuring confidence in these estimates” [Abstract])
executing an action associated with an action-source selected from at least one of the one or more demonstrator data sets based on the supervised classifier or an internal policy function maintained by the machine learning architecture (“This paper deals with such reinforcement learning (RL) problems, formulated as Correlated Multi-Objective Markov Decision Processes (CMOMDP). (Single-objective) MDPs describe a system as a set of potential observations of that system’s state S, a set of possible actions A, transition probabilities T for state-action state triplets, and a reward function R that probabilistically maps these transitions to a scalar reward indicating the utility of that transition. The goal of an RL agent operating in an MDP is to maximize the expected, discounted return of the reward function” [pg. 1687, § Introduction, ¶3; Examiner is interpreting taking an action to implicitly teach the existence of an action processor. Performing the action would correspond to generating a control signal.]), the selecting based at least upon the one or more confidence data values (“This technique makes a temporal difference learner estimate the Q-function for each objective in parallel, and introduces a way of measuring confidence in these estimates. This confidence metric is then used to choose which objective’s estimates to use for action selection.” [Abstract])
Hester and Brys are both in the same field of endeavor of reinforcement learning. Hester discloses Q-learning from using human demonstration data. Brys discloses a method for measuring confidence values to estimate a Q-function. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Hester’s Deep Q-learning algorithm with the confidence measurements as taught by Brys. One would have been motivated to make this modification since measuring confidence values to estimate the Q-function would help the agent decide the appropriate actions to perform. [pg. 1688, § Adaptive Objective Selection, Brys]

Claims 6, 10, 16, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Hester in view of Brys and further in view of Wang et al. ("Effective Transfer via Demonstrations in Reinforcement Learning: A Preliminary Study", hereinafter "Wang").
Regarding claim 6, the combination of Hester and Brys teaches The system of claim 1, however the combination of Hester and Brys fails to explicitly teach wherein the selecting of the action-source is based upon an action selection mechanism including a hard decision architecture adapted for maximizing a current confidence expectation.
Wang teaches wherein the selecting of the action-source is based upon an action selection mechanism including a hard decision architecture adapted for maximizing a current confidence expectation (“The HAT algorithm can be briefly summarized in three steps. First, the source agent acts for a time in the task and the target agent records a set of demonstrations. Second, a decision tree learning method summarizes the demonstrated policy as a static mapping from states to actions. Third, these rules are used by the target agent as a bias in the early stages of its learning. The key component of HAT is that it uses the decision tree to bias its exploration. Initially, the target task agent follows the decision tree, attempting to mimic the source agent. Over time, it incorporates random exploration and exploitation of its learned knowledge with exploiting the decision tree, effectively improving its performance relative to the source agent.” [pg. 77, § Introduction, ¶2; Examiner is interpreting a hard decision architecture to be equivalent to a decision tree. Wang further discloses: “In this work, we introduce a refinement to HAT, an existing transfer learning method, by integrating the target agent’s confidence in its representation of the source agent’s policy” [Abstract]]).
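Hester, Brys, and Wang are all in the same field of endeavor of reinforcement learning. Hester discloses Q-learning from using human demonstration data. Brys discloses a method for measuring confidence values to estimate a Q-function. Wang discloses a human transfer learning method using decision tree method. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Hester’s Deep Q-learning algorithm and the confidence measurements as taught by Brys with the decision tree disclosed by Wang. One would have been motivated to use a decision tree learning method in order to improve the performance of an agent learning from a demonstrator. [pg. 77, § Introduction, ¶2, Wang]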

Regarding claim 10, the combination of Hester and Brys teaches The system of claim 1, however the combination of Hester and Brys fails to explicitly teach wherein upon the confidence data value associated with a specific demonstrator data set or a portion thereof is detected to be reduced beyond a threshold value, a notification engine is configured to generate a notification requesting improved demonstration data sets from an associated source of demonstration data sets 
Wang teaches wherein upon the confidence data value associated with a specific demonstrator data set or a portion thereof is detected to be reduced beyond a threshold value (“Generally speaking, higher threshold of confidence could guarantee the cautiousness of taking a suggested action, while a lower one could bring more randomness. What should be noticed is that this kind of randomness is not uniform — in the Keepaway domain, randomness results in executing the pass action more often, often resulting in worse performance than holding the ball (see Figure 6 and Table 2). But if the threshold is set too high (e.g., 0.99), the player will hardly pass the ball, which will prevent itself from reinforcement learning. In order to guarantee the effectiveness of this GP method, we need an appropriate confidence threshold. To achieve this, we could select from different thresholds according to the jumpstart performance and this parameter tuning could be realized by agent itself, trying different values.” [pg. 81-82, § Tuning GPHAT’s Confidence Threshold, ¶2; Examiner is interpreting a lower threshold to be equivalent to reducing beyond a threshold value.]), a notification engine is configured to generate a notification requesting improved demonstration data sets from an associated source of demonstration data sets (“Consider an even worse case where we have a demonstration with an average performance of 7.45s with a standard deviation of 2.2s, which is only slightly better than a random policy. This “Novice” demonstration will often provide poor actions” [pg. 81, § GPHAT, ¶2; See further: “We conclude that with Gaussian process, the transferred prior knowledge is more robust even if the quality of human demonstration is limited. We also see that the performance of using prior knowledge is better than the original performance of the teacher.” [pg. 81, § GPHAT, ¶4; It is implicit that if the teacher data (i.e. demonstrator data) is limited, using prior knowledge would include improved demonstration data sets.]]).
Hester, Brys, and Wang are all in the same field of endeavor of reinforcement learning. Hester discloses Q-learning from using human demonstration data. Brys discloses a method for measuring confidence values to estimate a Q-function. Wang discloses a human transfer learning method using decision tree method. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Hester’s Deep Q-learning algorithm and the confidence measurements as taught by Brys with the confidence threshold method disclosed by Wang. One would have been [pg. 81, § Tuning GPHAT’s Confidence Threshold, ¶2, Wang]

Regarding claim 16, the combination of Hester and Brys teaches The method of claim 11, however the combination of Hester and Brys fails to explicitly teach wherein the selecting of the action-source is based upon an action selection mechanism including a hard decision architecture adapted for maximizing a current confidence expectation.
Wang teaches wherein the selecting of the action-source is based upon an action selection mechanism including a hard decision architecture adapted for maximizing a current confidence expectation (“The HAT algorithm can be briefly summarized in three steps. First, the source agent acts for a time in the task and the target agent records a set of demonstrations. Second, a decision tree learning method summarizes the demonstrated policy as a static mapping from states to actions. Third, these rules are used by the target agent as a bias in the early stages of its learning. The key component of HAT is that it uses the decision tree to bias its exploration. Initially, the target task agent follows the decision tree, attempting to mimic the source agent. Over time, it incorporates random exploration and exploitation of its learned knowledge with exploiting the decision tree, effectively improving its performance relative to the source agent.” [pg. 77, § Introduction, ¶2; Examiner is interpreting a hard decision architecture to be equivalent to a decision tree. Wang further discloses: “In this work, we introduce a refinement to HAT, an existing transfer learning method, by integrating the target agent’s confidence in its representation of the source agent’s policy” [Abstract]]).
Hester, Brys, and Wang are all in the same field of endeavor of reinforcement learning. Hester discloses Q-learning from using human demonstration data. Brys discloses a method for measuring confidence values to estimate a Q-function. Wang discloses a human transfer learning method using decision tree method. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Hester’s Deep Q-learning algorithm and the confidence measurements as taught by Brys with the decision tree disclosed by Wang. One would have been motivated to use a decision tree learning method in order to improve the performance of an agent learning from a demonstrator. [pg. 77, § Introduction, ¶2, Wang]

Regarding claim 19, the combination of Hester and Brys teaches The method of claim 11, however the combination of Hester and Brys fails to explicitly teach wherein if the confidence data value associated with a specific demonstrator data set or a portion thereof is reduced beyond a threshold value, the method comprises generating a notification requesting improved demonstration data sets from an associated source of demonstration data sets.
Wang teaches wherein if the confidence data value associated with a specific demonstrator data set or a portion thereof is reduced beyond a threshold value (“Generally speaking, higher threshold of confidence could guarantee the cautiousness of taking a suggested action, while a lower one could bring more randomness. What should be noticed is that this kind of randomness is not uniform — in the Keepaway domain, randomness results in executing the pass action more often, often resulting in worse performance than holding the ball (see Figure 6 and Table 2). But if the threshold is set too high (e.g., 0.99), the player will hardly pass the ball, which will prevent itself from reinforcement learning. In order to guarantee the effectiveness of this GP method, we need an appropriate confidence threshold. To achieve this, we could select from different thresholds according to the jumpstart performance and this parameter tuning could be realized by agent itself, trying different values.” [pg. 81-82, § Tuning GPHAT’s Confidence Threshold, ¶2; Examiner is interpreting a lower threshold to be equivalent to reducing beyond a threshold value.]), the method comprises generating a notification requesting improved demonstration data sets from an associated source of demonstration data sets (“Consider an even worse case where we have a demonstration with an average performance of 7.45s with a standard deviation of 2.2s, which is only slightly better than a random policy. This “Novice” demonstration will often provide poor actions” [pg. 81, § GPHAT, ¶2; See further: “We conclude that with Gaussian process, the transferred prior knowledge is more robust even if the quality of human demonstration is limited. We also see that the performance of using prior knowledge is better than the original performance of the teacher.” [pg. 81, § GPHAT, ¶4; It is implicit that if the teacher data (i.e. demonstrator data) is limited, using prior knowledge would include improved demonstration data sets.]]).
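Hester, Brys, and Wang are all in the same field of endeavor of reinforcement learning. Hester discloses Q-learning from using human demonstration data. Brys discloses a method for measuring confidence values to estimate a Q-function. Wang discloses a human transfer learning method using decision tree method. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Hester’s Deep Q-learning algorithm and the confidence measurements as taught by Brys with the confidence threshold method disclosed by Wang. One would have been [pg. 81, § Tuning GPHAT’s Confidence Threshold, ¶2, Wang]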
Allowable Subject Matter
Claims 3-5, 7, 13-15, and 17 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims. 
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Suay et al. ("Learning from demonstration for shaping through inverse reinforcement learning") discloses reinforcement learning from demonstrations.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHAEL H HOANG whose telephone number is (571)272-8491.  The examiner can normally be reached on Mon-Fri 8:30AM-4:30PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/M.H.H./ Examiner, Art Unit 2122

/ERIC NILSSON/ Primary Examiner, Art Unit 2122